lrzip/doc/README.benchmarks

The first comparison is of a linux kernel tarball (2.6.31). In all cases
the default options were used. Three other common compression apps were used
for comparison: 7z, which is an excellent all-round lzma based compression app;
gzip, which is the benchmark fast standard that has good compression; and
bzip2, which is the most commonly used compression on linux.

In the following tables, lrzip means lrzip default options, lrzip -l means
lrzip using the lzo backend, lrzip -g means using the gzip backend,
lrzip -b means using the bzip2 backend and lrzip -z means using the zpaq
backend.


linux-2.6.31.tar

These are benchmarks performed on a 3GHz quad core Intel Core2 with 8GB ram
using lrzip v0.42.

Compression	Size		Percentage	Compress	Decompress
None		365711360	100
7z		53315279	14.6		2m4s		0m5.4s
lrzip		52372722	14.3		2m48s		0m8.3s
lrzip -z	43455498	11.9		10m11s		10m14s
lrzip -l	112151676	30.7		0m14s		0m5.1s
lrzip -g	73476127	20.1		0m29s		0m5.6s
lrzip -b	60851152	16.6		0m43s		0m12.2s
bzip2		62416571	17.1		0m44s		0m9.8s
gzip		80563601	22.0		0m14s		0m2.8s


These results are interesting. Note that the compression of lrzip by default
is only slightly better than 7z (lzma), but at some cost in time at both the
compress and decompress ends. Clearly zpaq compresses much better than any
other algorithm tested here, but the speed cost on both compression and
decompression is extreme. At this size, lzo is interesting because it's faster
than simply copying the file but only offers modest compression. What lrzip
offers at this end of the spectrum is extreme compression if desired.
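
For anyone wanting to re-derive the Percentage column or compare speeds, here
is a minimal Python sketch. The sizes and times are copied from the table
above; the helper names (to_seconds, summarise) and the choice of decimal
megabytes per second are my own assumptions and not part of lrzip.

```python
import re

ORIGINAL_SIZE = 365711360  # bytes, linux-2.6.31.tar from the table above

# (label, compressed bytes, compress time, decompress time) copied from the table
ROWS = [
    ("7z", 53315279, "2m4s", "0m5.4s"),
    ("lrzip", 52372722, "2m48s", "0m8.3s"),
    ("lrzip -z", 43455498, "10m11s", "10m14s"),
    ("lrzip -l", 112151676, "0m14s", "0m5.1s"),
    ("lrzip -g", 73476127, "0m29s", "0m5.6s"),
    ("lrzip -b", 60851152, "0m43s", "0m12.2s"),
    ("bzip2", 62416571, "0m44s", "0m9.8s"),
    ("gzip", 80563601, "0m14s", "0m2.8s"),
]


def to_seconds(text):
    """Convert a time such as '2m48s' or '0m5.4s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", text)
    if match is None:
        raise ValueError("unrecognised time: " + text)
    return int(match.group(1)) * 60 + float(match.group(2))


def summarise(rows, original_size):
    """Return (label, percent of original size, compress MB/s) per row."""
    out = []
    for label, size, compress, _decompress in rows:
        percent = round(100.0 * size / original_size, 1)
        mb_per_s = original_size / 1e6 / to_seconds(compress)  # decimal MB
        out.append((label, percent, round(mb_per_s, 1)))
    return out


for label, percent, speed in summarise(ROWS, ORIGINAL_SIZE):
    print(f"{label:10s} {percent:5.1f}%  {speed:6.1f} MB/s")
```

The speed figure is input bytes divided by compress time, so it is only
meaningful for comparing rows measured on the same machine.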


Let's take six kernel trees one version apart as a tarball, linux-2.6.31 to
linux-2.6.36. These contain lots of redundant information, but hundreds of
megabytes apart, which lrzip is very good at compressing. For simplicity, only
7z will be compared since that's by far the best general purpose compressor at
the moment.

These are benchmarks performed on a 2.53GHz dual core Intel Core2 with 4GB ram
using lrzip v0.5.1. Note that it was running with a 32 bit userspace so only
2GB addressing was possible. However the benchmark was run with the -U option
allowing the whole file to be treated as one large compression window.

Tarball of 6 consecutive kernel trees.

Compression	Size		Percentage	Compress	Decompress
None		2373713920	100
7z		344088002	14.5		17m26s		1m22s
lrzip		104874109	4.4		11m37s		56s
lrzip -l	223130711	9.4		05m21s		1m01s
lrzip -U	73356070	3.1		08m53s		43s
lrzip -Ul	158851141	6.7		04m31s		35s
lrzip -Uz	62614573	2.6		24m42s		25m30s

Things start getting very interesting now, as lrzip is really starting to
shine. Note how the archive is not that much larger for 6 kernel trees than it
was for one. That's because all the similar data across the kernel trees is
being compressed as one copy and only the differences really make up the extra
size. All compression software does this, but not over such large distances.
If you copy the same data over multiple times, the resulting lrzip archive
doesn't get much larger at all. You might find this example interesting because
the -U option is actually faster as well as providing better compression. The
reason is that the window is not much larger than the amount of ram addressable
(2GB), and it compresses so much more in the rzip stage that it makes up the
time by not needing to compress anywhere near as much data with the backend
compressor.
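
The effect of match distance can be seen in miniature with Python's standard
library. This is only an analogy and does not use lrzip or the rzip algorithm:
zlib can only look back 32KB, while the lzma module's default preset uses a
dictionary of several megabytes. The helper name and the 256KB block size are
arbitrary choices for illustration.

```python
import lzma
import random
import zlib


def growth_when_doubled(compress, data):
    """Ratio of compressed size of data+data to compressed size of data."""
    return len(compress(data + data)) / len(compress(data))


# 256 KiB of pseudo-random bytes: incompressible on its own, so any saving on
# the doubled input can only come from matching the second copy to the first.
SIZE = 256 * 1024
block = random.Random(0).getrandbits(8 * SIZE).to_bytes(SIZE, "little")

short_window = growth_when_doubled(zlib.compress, block)  # 32 KiB window
long_window = growth_when_doubled(lzma.compress, block)   # multi-MiB dictionary

print(f"zlib: doubled input compresses to {short_window:.2f}x the single copy")
print(f"lzma: doubled input compresses to {long_window:.2f}x the single copy")
```

The second copy sits 256KB behind the first, which is out of reach for zlib but
well within reach for lzma. lrzip's rzip stage plays the same trick at
distances of hundreds of megabytes or more.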


Using the first example (linux-2.6.31.tar) and simply copying the data multiple
times over gives these results with lrzip(lzo):

Copies		Size		Compressed	Compress	Decompress
1		365711360	112151676	0m14.9s		0m5.1s
2		731422720	112151829	0m16.2s		0m6.5s
3		1097134080	112151832	0m17.5s		0m8.1s
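
To make the point explicit, this small Python sketch works out what each extra
copy costs, using only the figures from the table above. The function name is
made up for the example.

```python
# (copies, input bytes, compressed bytes, compress seconds) from the table above
RUNS = [
    (1, 365711360, 112151676, 14.9),
    (2, 731422720, 112151829, 16.2),
    (3, 1097134080, 112151832, 17.5),
]


def marginal_costs(runs):
    """Extra archive bytes and extra compress seconds for each added copy."""
    costs = []
    for prev, cur in zip(runs, runs[1:]):
        extra_bytes = cur[2] - prev[2]
        extra_seconds = round(cur[3] - prev[3], 1)
        costs.append((cur[0], extra_bytes, extra_seconds))
    return costs


for copies, extra_bytes, extra_seconds in marginal_costs(RUNS):
    print(f"copy {copies}: +{extra_bytes} bytes, +{extra_seconds}s")
```

Each additional 365711360 bytes of input adds at most a few hundred bytes to
the archive and a little over a second to the compression time.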


I had the amusing thought that this compression software could be used as a
bullshit detector if you were to compress people's speeches because if their
talks were full of catchphrases and not much actual content, it would all be
compressed down. So the larger the final archive, the less bullshit =)

Now let's move on to the other special feature of lrzip, the ability to
compress massive amounts of data on huge ram machines by using massive
compression windows. This is a 10GB virtual image of an installed operating
system and some basic working software on it. The default options on the
8GB machine meant that it was using a 5 GB window.


10GB Virtual image:

These benchmarks were done on the quad core machine with lrzip v0.5.1.

Compression	Size		Percentage	Compress Time	Decompress Time
None		10737418240	100.0
gzip		2772899756	 25.8		05m47s		2m46s
bzip2		2704781700	 25.2		16m15s		6m19s
xz		2272322208	 21.2		50m58s		3m52s
7z		2242897134	 20.9		26m36s		5m41s
lrzip		1354237684	 12.6		29m13s		6m55s
lrzip -M	1079528708	 10.1		23m44s		4m05s
lrzip -l	1793312108	 16.7		05m13s		3m12s
lrzip -lM	1413268368	 13.2		04m18s		2m54s
lrzip -z	1299844906	 12.1		04h32m14s	04h33m
lrzip -zM	1066902006	  9.9		04h07m14s	04h08m


At this end of the spectrum things really start to heat up. The compression
advantage is massive, with the lzo backend even giving much better results than
7z, and in a ridiculously short time. The default lzma backend is slightly
slower than 7z, but provides a lot more compression. What appears to be a big
disappointment here is actually zpaq, which takes more than 8 times longer than
lzma for a measly improvement of half a percentage point or less. The reason is
that most of the advantage here is achieved by the rzip first stage, since
there's a lot of redundant space over huge distances on a virtual image. The -M
option, which works the memory subsystem rather hard and makes a noticeable
impact on the rest of the machine, also does further wonders for the
compression and times.
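
The zpaq versus lzma trade-off on this image can be checked from the unrounded
sizes in the table. A rough Python sketch follows; the compare helper is
hypothetical and the gain is expressed in percentage points of the original
10GB image.

```python
IMAGE_SIZE = 10737418240  # bytes

# (compressed bytes, compress time in seconds) copied from the table above
LZMA = (1354237684, 29 * 60 + 13)              # lrzip
LZMA_M = (1079528708, 23 * 60 + 44)            # lrzip -M
ZPAQ = (1299844906, 4 * 3600 + 32 * 60 + 14)   # lrzip -z
ZPAQ_M = (1066902006, 4 * 3600 + 7 * 60 + 14)  # lrzip -zM


def compare(zpaq, lzma_row, original):
    """Return (how many times slower zpaq is, percentage points gained)."""
    slowdown = zpaq[1] / lzma_row[1]
    gain_points = 100.0 * (lzma_row[0] - zpaq[0]) / original
    return round(slowdown, 1), round(gain_points, 1)


print("without -M:", compare(ZPAQ, LZMA, IMAGE_SIZE))
print("with -M:   ", compare(ZPAQ_M, LZMA_M, IMAGE_SIZE))
```

Working from the byte counts rather than the rounded Percentage column avoids
compounding the rounding error when two close figures are subtracted.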

This should help govern what compression you choose. Small files are nicely
compressed with zpaq. Intermediate files are nicely compressed with lzma.
Large files get excellent results even with lzo provided you have enough ram.
(Small being < 100MB, intermediate <1GB, large >1GB).
Or, to make things easier, just use the default settings all the time and be
happy as lzma gives good results. :D
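
If you'd rather script the rule of thumb, here is one possible reading of it as
a Python sketch. The function is hypothetical and not part of lrzip. It assumes
decimal megabytes, and it treats a file of exactly 1GB as large, which the
thresholds above leave open. Only the -z and -l flags described earlier are
used.

```python
MB = 1000 * 1000  # assumption: decimal megabytes; the text does not say which


def suggest_backend_flag(size_bytes):
    """Return the lrzip backend flag suggested by the rule of thumb.

    An empty string means no flag, i.e. the default lzma backend.
    """
    if size_bytes < 100 * MB:
        return "-z"  # small: zpaq
    if size_bytes < 1000 * MB:
        return ""    # intermediate: lzma (default)
    return "-l"      # large: lzo, provided you have enough ram


for size in (50 * MB, 365711360, 10737418240):
    flag = suggest_backend_flag(size)
    print(size, flag or "(default)")
```

Note that "large" files also do very well with the default lzma backend in the
tables above, so the lzo suggestion is about speed rather than final size.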

Con Kolivas
Tue, 7th Nov 2010