Initial import

This commit is contained in:
Con Kolivas 2010-03-29 10:07:08 +11:00
parent 725e478e19
commit 6dcceb0b1b
69 changed files with 26485 additions and 0 deletions

doc/README.Assembler Normal file

@ -0,0 +1,44 @@
README.Assembler
Notes about CRC Assembly Language Coding.
lrzip-0.21 makes use of an x86 assembly language file
that optimizes the CRC computation used in lrzip. It includes
a wrapper C file, 7zCrcT8.c, and the assembler code,
7zCrcT8U.s.
configure should detect your host system properly
and adjust the Makefile accordingly. If you don't
have the nasm assembler, or have a ppc or other
non-x86 system, the standard C CRC routines will be
compiled and linked in.
If for any reason configure does not properly
detect your system type, or you do not want the assembler
module to be compiled, you can run
ASM=no ./configure
which will exclude the asm module. Alternatively, change the line
ASM_OBJ=7zCrcT8.o 7zCrcT8U.o
to
ASM_OBJ=7zCrc.o
in the Makefile. This will change the dependency tree.
To force assembly module compilation and linking (if
configure does not detect your system type properly),
type
ASM=yes ./configure
or change the Makefile to include the ASM_OBJ files
as described above.
Type `make clean' and then re-run make.
Peter Hyman
pete@peterhyman.com

doc/README.benchmarks Normal file

@ -0,0 +1,120 @@
These are benchmarks performed on a 3GHz quad core Intel Core2 with 8GB of ram
using lrzip v0.42.
The first comparison is that of a linux kernel tarball (2.6.31). In all cases
the default options were used. 3 other common compression apps were used for
comparison: 7z, which is an excellent all-round lzma based compression app;
gzip, which is the benchmark fast standard that has good compression; and
bzip2, which is the most commonly used compression on linux.
In the following tables, lrzip means lrzip default options, lrzip(lzo) means
lrzip using the lzo backend, lrzip(gzip) means using the gzip backend,
lrzip(bzip2) means using the bzip2 backend and lrzip(zpaq) means using the zpaq
backend.
linux-2.6.31.tar
Compression Size Percentage Compress Decompress
None 365711360 100
7z 53315279 14.6 2m4.770s 0m5.360s
lrzip 52372722 14.3 2m48.477s 0m8.336s
lrzip(zpaq) 43455498 11.9 10m11.335s 10m14.296s
lrzip(lzo) 112151676 30.7 0m14.913s 0m5.063s
lrzip(gzip) 73476127 20.1 0m29.628s 0m5.591s
lrzip(bzip2) 60851152 16.6 0m43.539s 0m12.244s
bzip2 62416571 17.1 0m44.493s 0m9.819s
gzip 80563601 22.0 0m14.343s 0m2.781s
It is interesting to note in these results that the compression of lrzip by
default is only slightly better than lzma's, but at some cost in time at both
the compress and decompress ends of the spectrum. Clearly zpaq compression is
much better than any other compression algorithm by far, but the speed cost on
both compression and decompression is extreme. At this file size, lzo is
interesting because it's faster than simply copying the file, but it only
offers modest compression.
What lrzip offers at this end of the spectrum is extreme compression if
desired.
Let's take two kernel trees one version apart as a tarball, linux-2.6.31 and
linux-2.6.32-rc8. These will show lots of redundant information, but hundreds
of megabytes apart, which lrzip will be very good at compressing. For
simplicity, only 7z will be compared since that's by far the best general
purpose compressor at the moment:
Tarball of two kernel trees, one version apart.
Compression Size Percentage Compress Decompress
None 749066240 100
7z 108710624 14.5 4m4.260s 0m11.133s
lrzip 57943094 7.7 3m08.788s 0m10.747s
lrzip(lzo) 124029899 16.6 0m18.997s 0m7.107s
Things start getting very interesting now when lrzip is really starting to
shine. Note how it's not that much larger for 2 kernel trees than it was for
one. That's because all the similar data in both kernel trees is being
compressed as one copy and only the differences really make up the extra size.
All compression software does this, but not over such large distances. If you
copy the same data over multiple times, the resulting lrzip archive doesn't
get much larger at all.
Using the first example (linux-2.6.31.tar) and simply copying the data multiple
times over gives these results with lrzip(lzo):
Copies Size Compressed Compress Decompress
1 365711360 112151676 0m14.913s 0m5.063s
2 731422720 112151829 0m16.174s 0m6.543s
3 1097134080 112151832 0m17.466s 0m8.115s
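The effect in the table above can be sketched with stock Python compressors (a hypothetical stand-in, not lrzip's actual code): random data repeated beyond gzip's 32KB window defeats zlib, while lzma's much larger dictionary still catches the long-range duplicate, which is the same principle rzip exploits over far greater distances.

```python
import lzma
import os
import zlib

# Illustrative demonstration only, not lrzip itself.
data = os.urandom(100_000)   # one incompressible 100KB block
doubled = data * 2           # the same block copied twice, 100KB apart

zlib_one = len(zlib.compress(data))
zlib_two = len(zlib.compress(doubled))
lzma_one = len(lzma.compress(data))
lzma_two = len(lzma.compress(doubled))

# zlib's 32KB window cannot see the first copy, so it stores both
# almost in full; lzma's multi-megabyte dictionary stores roughly one
# copy plus a back-reference.
print(zlib_two / zlib_one)   # close to 2.0
print(lzma_two / lzma_one)   # close to 1.0
```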
I had the amusing thought that this compression software could be used as a
bullshit detector if you were to compress peoples' speeches because if their
talks were full of catchphrases and not much actual content, it would all be
compressed down. So the larger the final archive, the less bullshit =)
Now let's move on to the other special feature of lrzip, the ability to
compress massive amounts of data on huge ram machines by using massive
compression windows. This is a 10GB virtual image of an installed operating
system and some basic working software on it. The default options on the
8GB machine meant that it was using a 5 GB window.
10GB Virtual image:
Compression Size Percentage Compress Time Decompress Time
None 10737418240 100.0
gzip 2772899756 25.8 7m52.667s 4m8.661s
bzip2 2704781700 25.2 20m34.269s 7m51.362s
xz 2272322208 21.2 58m26.829s 4m46.154s
7z 2242897134 20.9 29m28.152s 6m35.952s
lrzip 1361276826 12.7 27m45.874s 9m20.046s
lrzip(lzo) 1837206675 17.1 4m48.167s 8m28.842s
lrzip(zpaq) 1341008779 12.5 4h11m14s
lrzip(zpaq)M 1270134391 11.8 4h30m14s
lrzip(zpaq)MW 1066902006 9.9
At this end of the spectrum things really start to heat up. The compression
advantage is massive, with the lzo backend even giving much better results
than 7z, and over a ridiculously short time. Note that it's not much longer
than it takes to just *read* a 10GB file. Unfortunately at these large
compression windows, the decompression time is significantly longer, but
it's a fair tradeoff I believe :) The big disappointment here is actually
zpaq, which takes more than 8 times longer than lzma for a measly 0.2%
improvement. The reason is that most of the advantage here is achieved by
the rzip first stage. The -M option was included here for completeness, to see
what the maximum possible compression was for this file on this machine, while
the MW run used the option -W 200 (to make the window larger than both the
file and the ram the machine has); it still completed, but induced a lot
of swapping in the interim.
This should help govern what compression you choose. Small files are nicely
compressed with zpaq. Intermediate files are nicely compressed with lzma.
Large files get excellent results even with lzo provided you have enough ram.
(Small being < 100MB, intermediate <1GB, large >1GB).
Or, to make things easier, just use the default settings all the time and be
happy as lzma gives good results. :D
Con Kolivas
Sat, 19 Dec 2009


@ -0,0 +1,118 @@
An explanation of the revised lzo_compresses function in stream.c.
The modifications to the lrzip program for 0.19 centered around an
attempt to catch data chunks that would cause lzma compression to either
take an inordinately long time or not complete at all. The files that
could cause problems for lzma are already-compressed files, multimedia
files, files that have compressed files in them, and files with
randomized data (such as an encrypted volume or file).
The lzo_compresses function is used to assess the data and return
TRUE or FALSE to the lzma_compress_buf function based on whether
or not it determined the data to be compressible. The
simple formula cdata < odata was used (c=compressed, o=original).
Some test cases were slipping through and caused the hangups. Beginning
with lrzip-0.19 a new option, -T, test compression threshold has been
introduced and sets configurable limits as to what is considered a
compressible data chunk and what is not.
In addition, with very large chunks of data, a small modification was
made to the initial test buffer size to make it more representative of
the entire sample.
To go along with this, increased verbosity was added to the function
so that the user/evaluator can better see what is going on. -v or -vv
can be used to increase informational output.
Functional Overview:
Data chunks are passed to the lzo_compresses function in two streams.
The first is the small data set in the primary hashing bucket which
can be seen when using the -v or -vv option. This is normally a small
sample. The second stream will be the rest. The sizes of the streams
depend on the long range analysis that is performed on
the entire file and on available memory.
After analysis of the data chunk, a value of TRUE or FALSE is returned
and lzma compression will either commence or be skipped. If skipped,
data written out to the .lrz file will simply be the rzip data which
is the reorganized data based on long range analysis.
The lzo_compresses function traverses through the data chunk comparing
larger and larger blocks. If suitable compression ratios are found,
the function ends and returns TRUE. If not, and the largest sample
block size has been reached, the function will traverse deeper into
the chunk and analyze that region. Anytime a compressible area is
found, the function returns TRUE. When the end of the data chunk has
been reached and no suitable compressible blocks found, the program
will return FALSE.
Under most circumstances, this logic was fine. However, if the test
found a chunk that could only achieve 2% compression, for example,
this type of result could adversely affect the lzma compression
routine. Hence, the concept of a limiting threshold.
The threshold option works as a limiter that forces the lzo_compresses
function to not just compare the estimated compressed size with the
original, but to apply a limiting threshold. This ranges from a very
lenient threshold, 1, to a very strict one, 10. A threshold of 1 means that
for the function to return TRUE, the estimated compressed data size for
the current data chunk merely needs to be below the original size, even if
only barely (anywhere in the 90-100% range will pass), so almost no
compressibility is required. A value of 2 means that the data MUST compress
to better than 90% of the original size; if the estimated compressed size of
the data chunk is over 90% of the original size, then lzo_compresses will fail.
Each additional threshold value will increase the strictness according
to the following formula:
CDS = Observed Compressed Data Size from LZO
ODS = Original Data chunk size
T = Threshold
To return TRUE, CDS < ODS * (1.1-T/10)
At T=1, just 0.01% compression would be OK.
At T=2, anything better than 10% would be OK, but under 10% compression would fail.
At T=3, anything better than 20% would be OK, but under 20% compression would fail.
...
T=10, I can't imagine a use for this. Anything better than 90% compression
would be OK. This would imply that LZO would need to get a 10x compression
ratio.
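A minimal sketch of the threshold test, using the formula above (function and variable names are illustrative, not lrzip's actual identifiers):

```python
def passes_threshold(cds: int, ods: int, t: int) -> bool:
    # CDS = observed compressed data size from LZO
    # ODS = original data chunk size
    # T   = threshold (1 = lenient, 10 = strict)
    # Return TRUE only if CDS beats the limit ODS * (1.1 - T/10).
    return cds < ods * (1.1 - t / 10)

# With -T 2, a 1000-byte chunk must compress below 900 bytes:
# passes_threshold(895, 1000, 2) is True,
# passes_threshold(950, 1000, 2) is False.
```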
The following actual output from the lzo_compresses function will help
explain.
22501 in primary bucket (0.805%)
lzo testing for incompressible data...OK for chunk 43408.
Compressed size = 52.58% of chunk, 1 Passes
Progress percentage pausing during lzma compression...
lzo testing for incompressible data...FAILED - below threshold for chunk 523245383.
Compressed size = 98.87% of chunk, 50 Passes
This was for a video .VOB file of 1GB. A compression threshold of 2 was used.
-T 2 means that the estimated compression size of the data chunk had to be
better than 90% of the original size.
There were 43,408 bytes in the primary hash bucket and this chunk was
evaluated by lzo_compresses. The function estimated that the compressed
data size would be 52.58% of the original 43,408 byte chunk. This resulted
in LZMA compression occurring.
The second data chunk which included the rest of the data in the current hash,
523,245,383 bytes, failed the test. The lzo_compresses function made 50 passes
through the data using progressively larger samples until it reached the end
of the data chunk. It could not find better than a 1.13% compression benefit
and therefore FAILED. The result was NO LZMA compression, and the data chunk
was written to the .lrz file in rzip format (no compression).
The higher the threshold option, the faster the LZMA compression will occur.
However, this could also cause some chunks that are compressible to be
omitted. After much testing, -T 2 seems to work very well in stopping data
which will cause LZMA to hang yet allow most compressible data to come
through.
Peter Hyman
pete@peterhyman.com
December 2007

doc/lrzip.conf.example Normal file

@ -0,0 +1,45 @@
# lrzip.conf example file
# anything beginning with a # or whitespace will be ignored
# valid parameters are separated with an = and a value
# parameters and values are not case sensitive
#
# lrzip 0.24, peter hyman, pete@peterhyman.com
# ignored by earlier versions.
# Compression Window size in 100MB. Normally selected by program.
WINDOW = 5
# Compression Level 1-9 (7 Default).
COMPRESSIONLEVEL = 7
# Compression Method, rzip, gzip, bzip2, lzo, or lzma (default).
# If specified here, command line options are not usable.
# COMPRESSIONMETHOD = lzo
# Test Threshold value 1-10 (2 Default).
TESTTHRESHOLD = 2
# Default output directory
# OUTPUTDIRECTORY = location
# Verbosity, true or 1, or max or 2
VERBOSITY = max
# Show Progress as file is parsed, true or 1, false or 0
SHOWPROGRESS = true
# Set Niceness. 19 is default. -20 to 19 is the allowable range
NICE = 19
# Delete source file after compression
# this parameter and value are case sensitive
# value must be YES to activate
# DELETEFILES = NO
# Replace existing lrzip file when compressing
# this parameter and value are case sensitive
# value must be YES to activate
# REPLACEFILE = YES
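A hypothetical reader following the rules stated at the top of this example (lines beginning with # or whitespace are ignored, parameters and values are separated by =, parameter names are case insensitive) might look like this sketch; it is not lrzip's actual parser. Values are left as written because DELETEFILES and REPLACEFILE values are case sensitive.

```python
def parse_lrzip_conf(text: str) -> dict:
    # Sketch of a reader for the format above, not lrzip's parser.
    params = {}
    for line in text.splitlines():
        if not line or line[0] == "#" or line[0].isspace():
            continue               # comments and indented lines ignored
        if "=" not in line:
            continue               # parameters need an = and a value
        key, _, value = line.partition("=")
        # Parameter names are case insensitive; values kept as written
        # because some (DELETEFILES, REPLACEFILE) are case sensitive.
        params[key.strip().lower()] = value.strip()
    return params

conf = parse_lrzip_conf("""\
# lrzip.conf example
WINDOW = 5
  indented lines are ignored
TESTTHRESHOLD = 2
VERBOSITY = max
""")
# conf["window"] == "5", conf["verbosity"] == "max"
```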

doc/magic.header.txt Normal file

@ -0,0 +1,41 @@
lrzip-0.40+ file header format
November 2009
Con Kolivas
Byte Content
0-3 LRZI
4 LRZIP Major Version Number
5 LRZIP Minor Version Number
6-14 Source File Size
16-20 LZMA Properties Encoded (lc,lp,pb,fb, and dictionary size)
21-22 not used
23-48 Stream 1 header data
49-74 Stream 2 header data
Block Data:
Byte:
0 Compressed data type
1-8 Compressed data length
9-16 Uncompressed data length
17-24 Next block head
25+ Data
End:
0-1 crc data
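The leading fields of the lrzip-0.40+ layout above could be decoded as in this sketch, assuming the source file size is stored as an unsigned 64-bit little-endian value starting at byte 6 (the table gives the byte range but not the encoding, so that is an assumption):

```python
import struct

def parse_lrzip_magic(header: bytes) -> dict:
    # Sketch of the lrzip-0.40+ header fields described above; only
    # the leading fields are decoded, and the little-endian 64-bit
    # size at offset 6 is an assumption, not taken from lrzip itself.
    if header[0:4] != b"LRZI":
        raise ValueError("not an lrzip file")
    major, minor = header[4], header[5]       # bytes 4 and 5: version
    (size,) = struct.unpack_from("<Q", header, 6)  # source file size
    return {"major": major, "minor": minor, "source_size": size}
```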
lrzip-0.24+ file header format
January 2009
Peter Hyman, pete@peterhyman.com
Byte Content
0-3 LRZI
4 LRZIP Major Version Number
5 LRZIP Minor Version Number
6-9 Source File Size (no HAVE_LARGE_FILES)
6-14 Source File Size
16-20 LZMA Properties Encoded (lc,lp,pb,fb, and dictionary size)
21-22 not used
23-36 Stream 1 header data
37-50 Stream 2 header data
51 Compressed data type