forked from radii/undup
undup - compress files by consolidating duplicate data

undup tries to compress an input stream by watching for blocks that have
previously appeared and replacing each duplicate with a backreference.
Integrity is ensured by validating a SHA-256 hash of the entire stream at
reconstruction time.

undup is intended to be pipelined with a general-purpose compressor such as
gzip, bzip2, or xz.
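
The idea described above can be illustrated with a minimal Python sketch.
This is not undup's implementation or stream format: the fixed 4096-byte
block size, the in-memory list of operations, and the function names dedup
and restore are all assumptions made for illustration. It shows only the
general technique of replacing repeated blocks with backreferences and
checking a whole-stream SHA-256 on reconstruction.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed for illustration; not taken from undup


def dedup(data: bytes):
    """Split data into fixed-size blocks and replace repeats with backreferences.

    Returns (ops, digest): ops holds one ("lit", bytes) or ("ref", index)
    entry per block, and digest is the SHA-256 of the whole input.
    """
    seen = {}  # block hash -> index of the first block with that content
    ops = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        key = hashlib.sha256(block).digest()
        if key in seen:
            ops.append(("ref", seen[key]))  # duplicate: point at earlier block
        else:
            seen[key] = len(ops)            # first occurrence: remember its index
            ops.append(("lit", block))
    return ops, hashlib.sha256(data).hexdigest()


def restore(ops, expected_digest: str) -> bytes:
    """Rebuild the stream from ops and verify the whole-stream SHA-256."""
    blocks = []
    for kind, value in ops:
        # A "ref" always points at an earlier, already rebuilt block.
        blocks.append(value if kind == "lit" else blocks[value])
    data = b"".join(blocks)
    if hashlib.sha256(data).hexdigest() != expected_digest:
        raise ValueError("SHA-256 mismatch: reconstructed stream is corrupt")
    return data


if __name__ == "__main__":
    chunk = bytes(range(256)) * 16            # one 4096-byte block
    stream = chunk * 3 + b"\x00" * BLOCK_SIZE
    ops, digest = dedup(stream)
    print([kind for kind, _ in ops])
    assert restore(ops, digest) == stream
```

The sketch keeps one hash per distinct block in memory, which is why a
deduplicator of this kind needs RAM roughly proportional to the amount of
unique input it has seen.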

USAGE
-----

    tar cf - dir | undup | xz > dir.tar.undup.xz
    xzcat dir.tar.undup.xz | undup -d | tar xvf -

SAMPLE RESULTS
--------------

    % for r in 3.0 3.1 3.2 3.3-rc1; do
        git archive --format=tar --prefix=linux-$r/ v$r | tar -C /tmp/linuxes -xf -
      done
    % tar -C /tmp -cf linuxes.tar linuxes
    % du -shc /tmp/linuxes/*
    500M    /tmp/linuxes/linux-3.0
    504M    /tmp/linuxes/linux-3.1
    511M    /tmp/linuxes/linux-3.2
    518M    /tmp/linuxes/linux-3.3-rc1
    2.0G    total

File sizes:

    1833635840 linuxes.tar
     937173504 linuxes.tar.undp
     404399664 linuxes.tar.gz
     316914845 linuxes.tar.bz2
     270460412 linuxes.tar.xz
     203023371 linuxes.tar.undp.gz
     167099750 linuxes.tar.lrz
     159673153 linuxes.tar.undp.bz2
     138929420 linuxes.tar.undp.xz

    format   ratio   pipelined w/ undup
    ------   -----   ------------------
    undp      1.95
    gzip      4.53    9.03
    bzip2     5.78   11.48
    xz        6.78   13.19
    lrzip    10.97
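
The ratio columns appear to be the size of linuxes.tar divided by the size
of each output file. A minimal Python sketch that recomputes them from the
file sizes listed above follows; the dictionary keys are labels chosen here
for illustration. The results agree with the table to within 0.01 (some of
the published figures look truncated rather than rounded).

```python
ORIGINAL = 1833635840  # size of linuxes.tar in bytes

# Output sizes in bytes, copied from the file size listing.
SIZES = {
    "undp": 937173504,
    "gz": 404399664,
    "bz2": 316914845,
    "xz": 270460412,
    "undp.gz": 203023371,
    "lrz": 167099750,
    "undp.bz2": 159673153,
    "undp.xz": 138929420,
}


def ratio(compressed: int, original: int = ORIGINAL) -> float:
    """Compression ratio: original size divided by compressed size."""
    return original / compressed


if __name__ == "__main__":
    for name, size in SIZES.items():
        print(f"{name:10s} {ratio(size):6.2f}")
```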

Timings for undup + compressors on Core i7 L 640 @ 2.13GHz (2.9 GHz Turbo)

First, we time the undup phase. undup consumes a significant amount of
memory (undup 0.2 needs about 105 MB of RAM to store hashes for the
1.8 GB linuxes.tar). It can be pipelined with a compressor, but to get
the most reproducible timing results, we ran each phase separately.

    undup linuxes.tar   47.26s user 4.15s system 97% cpu 52.885 total

Second, we compare the times various compressors take to compress
linuxes.tar.undp.

    gzip     35.81s user 0.72s system 96% cpu 37.817 total
    bzip2   117.79s user 0.45s system 99% cpu 1:58.66 total
    xz      606.51s user 1.31s system 99% cpu 10:09.72 total

undup + bzip2 achieves an 11.48x compression ratio while consuming only
about 165 seconds of user CPU time (47.26s for undup plus 117.79s for
bzip2). Running the two as a single pipeline takes 3:14.76 of elapsed time
and somewhat more CPU:

    undup    59.64s user 3.93s system 32% cpu 3:14.76 total
    bzip2   138.65s user 1.05s system 71% cpu 3:14.73 total

This compares favorably to lrzip 0.608, which achieves a 10.97x ratio after
consuming 913 seconds of user CPU time (lrzip is multithreaded by default):

    lrzip -v -w 10 linuxes.tar   913.08s user 14.99s system 298% cpu 5:10.78 total
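
The CPU-time comparison above can be reproduced from the timing lines
themselves. The following Python sketch parses lines in the
"user / system / cpu / total" layout shown in this README and sums the user
times; the regular expression and the function names are assumptions written
for this layout only, not part of undup.

```python
import re

# Matches e.g. "117.79s user 0.45s system 99% cpu 1:58.66 total"
TIME_RE = re.compile(
    r"(?P<user>[\d.]+)s\s+user\s+(?P<system>[\d.]+)s\s+system\s+"
    r"\d+%\s+cpu\s+(?P<total>[\d:.]+)\s+total"
)


def parse_elapsed(text: str) -> float:
    """Convert "52.885", "1:58.66" or "1:02:03.4" to seconds."""
    seconds = 0.0
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def parse_time_line(line: str):
    """Return (user, system, elapsed) in seconds from one timing line."""
    match = TIME_RE.search(line)
    if match is None:
        raise ValueError(f"not a timing line: {line!r}")
    return (float(match["user"]), float(match["system"]),
            parse_elapsed(match["total"]))


if __name__ == "__main__":
    undup = parse_time_line("undup linuxes.tar 47.26s user 4.15s system 97% cpu 52.885 total")
    bzip2 = parse_time_line("bzip2 117.79s user 0.45s system 99% cpu 1:58.66 total")
    lrzip = parse_time_line("lrzip -v -w 10 linuxes.tar 913.08s user 14.99s system 298% cpu 5:10.78 total")
    print(f"undup + bzip2 user CPU: {undup[0] + bzip2[0]:.2f}s")
    print(f"lrzip user CPU:         {lrzip[0]:.2f}s")
```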