Mirror of https://github.com/ckolivas/lrzip.git, synced 2026-04-05 06:15:28 +00:00

Initial import

This commit is contained in:
parent 725e478e19
commit 6dcceb0b1b

69 changed files with 26485 additions and 0 deletions
doc/README.Assembler (new file, 44 lines)
README.Assembler

Notes about CRC Assembly Language Coding.

lrzip-0.21 makes use of an x86 assembly language file
that optimizes CRC computation used in lrzip. It includes
a wrapper C file, 7zCrcT8.c, and the assembler code,
7zCrcT8U.s.

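Both the assembler module and the C fallback compute the same standard
CRC-32 (reflected polynomial 0xEDB88320); the assembler version is simply
faster. As a rough illustration of the table-driven technique the C
routine uses, here is a sketch in Python (my own code, not lrzip's):

```python
# Sketch of a table-driven CRC-32 (polynomial 0xEDB88320, reflected),
# the same checksum the lrzip C and assembler routines compute.

def make_crc_table():
    """Precompute the 256-entry lookup table, one entry per byte value."""
    table = []
    for n in range(256):
        c = n
        for _ in range(8):
            c = (c >> 1) ^ 0xEDB88320 if c & 1 else c >> 1
        table.append(c)
    return table

CRC_TABLE = make_crc_table()

def crc32(data: bytes) -> int:
    """Compute CRC-32 one byte at a time via the lookup table."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = (crc >> 8) ^ CRC_TABLE[(crc ^ byte) & 0xFF]
    return crc ^ 0xFFFFFFFF
```

The real speedup in 7zCrcT8U.s comes from processing eight bytes per
iteration with eight tables, but the polynomial and result are identical.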
configure should detect your host system properly
and adjust the Makefile accordingly. If you don't
have the nasm assembler, or have a ppc or other
non-x86 system, the standard C CRC routines will be
compiled and linked in.

If for any reason configure does not properly
detect your system type, or you do not want the
assembler modules to be compiled, you can run

ASM=no ./configure

which will automatically exclude the asm module.
Alternatively, change the line

ASM_OBJ=7zCrcT8.o 7zCrcT8U.o

to

ASM_OBJ=7zCrc.o

in the Makefile. This will change the dependency tree.

To force assembly module compilation and linking (if
configure does not detect your system type properly),
type

ASM=yes ./configure

or change the Makefile to include the ASM_OBJ files
as described above.

Type `make clean' and then re-run make.

Peter Hyman
pete@peterhyman.com
doc/README.benchmarks (new file, 120 lines)
These are benchmarks performed on a 3GHz quad core Intel Core2 with 8GB ram
using lrzip v0.42.

The first comparison is that of a linux kernel tarball (2.6.31). In all cases
the default options were used. Three other common compression apps were used
for comparison: 7z, which is an excellent all-round lzma based compression
app; gzip, which is the fast standard that gives good compression; and bzip2,
which is the compressor most commonly used on linux.

In the following tables, lrzip means lrzip with default options, lrzip(lzo)
means lrzip using the lzo backend, lrzip(gzip) means using the gzip backend,
lrzip(bzip2) means using the bzip2 backend, and lrzip(zpaq) means using the
zpaq backend.


linux-2.6.31.tar

Compression     Size        Percentage  Compress    Decompress
None            365711360   100
7z              53315279    14.6        2m4.770s    0m5.360s
lrzip           52372722    14.3        2m48.477s   0m8.336s
lrzip(zpaq)     43455498    11.9        10m11.335s  10m14.296s
lrzip(lzo)      112151676   30.7        0m14.913s   0m5.063s
lrzip(gzip)     73476127    20.1        0m29.628s   0m5.591s
lrzip(bzip2)    60851152    16.6        0m43.539s   0m12.244s
bzip2           62416571    17.1        0m44.493s   0m9.819s
gzip            80563601    22.0        0m14.343s   0m2.781s


These results are interesting: the compression of lrzip by default is only
slightly better than lzma alone, but at some cost in time at both the
compress and decompress ends. Clearly zpaq compression is better than any
other compression algorithm by far, but the speed cost on both compression
and decompression is extreme. At this file size, lzo is interesting because
it's faster than simply copying the file, but only offers modest
compression. What lrzip offers at this end of the spectrum is extreme
compression if desired.


Let's take two kernel trees one version apart as a tarball, linux-2.6.31 and
linux-2.6.32-rc8. These will contain lots of redundant information, but
hundreds of megabytes apart, which lrzip will be very good at compressing.
For simplicity, only 7z will be compared, since that's by far the best
general purpose compressor at the moment:


Tarball of two kernel trees, one version apart.

Compression     Size        Percentage  Compress    Decompress
None            749066240   100
7z              108710624   14.5        4m4.260s    0m11.133s
lrzip           57943094    7.7         3m08.788s   0m10.747s
lrzip(lzo)      124029899   16.6        0m18.997s   0m7.107s

Things start getting very interesting now, when lrzip really starts to
shine. Note how the archive is not much larger for two kernel trees than it
was for one. That's because all the similar data in both kernel trees is
compressed as one copy, and only the differences really make up the extra
size. All compression software does this, but not over such large distances.
If you copy the same data over multiple times, the resulting lrzip archive
doesn't get much larger at all.

Using the first example (linux-2.6.31.tar) and simply copying the data
multiple times over gives these results with lrzip(lzo):

Copies  Size        Compressed  Compress    Decompress
1       365711360   112151676   0m14.913s   0m5.063s
2       731422720   112151829   0m16.174s   0m6.543s
3       1097134080  112151832   0m17.466s   0m8.115s


I had the amusing thought that this compression software could be used as a
bullshit detector if you were to compress peoples' speeches, because if
their talks were full of catchphrases and not much actual content, it would
all be compressed down. So the larger the final archive, the less
bullshit =)

Now let's move on to the other special feature of lrzip: the ability to
compress massive amounts of data on huge ram machines by using massive
compression windows. This is a 10GB virtual image of an installed operating
system with some basic working software on it. The default options on the
8GB machine meant that it was using a 5GB window.


10GB Virtual image:

Compression     Size        Percentage  Compress Time  Decompress Time
None            10737418240 100.0
gzip            2772899756  25.8        7m52.667s      4m8.661s
bzip2           2704781700  25.2        20m34.269s     7m51.362s
xz              2272322208  21.2        58m26.829s     4m46.154s
7z              2242897134  20.9        29m28.152s     6m35.952s
lrzip           1361276826  12.7        27m45.874s     9m20.046s
lrzip(lzo)      1837206675  17.1        4m48.167s      8m28.842s
lrzip(zpaq)     1341008779  12.5        4h11m14s
lrzip(zpaq)M    1270134391  11.8        4h30m14s
lrzip(zpaq)MW   1066902006  9.9

At this end of the spectrum things really start to heat up. The compression
advantage is massive, with the lzo backend giving much better results than
even 7z, and over a ridiculously short time. Note that it's not much longer
than it takes to just *read* a 10GB file. Unfortunately, at these large
compression windows the decompression time is significantly longer, but I
believe it's a fair tradeoff :) What appears to be a big disappointment
here is actually zpaq, which takes more than 8 times longer than lzma for a
measly .2% improvement. The reason is that most of the advantage here is
achieved by the rzip first stage. The -M option was included here for
completeness, to see what the maximum possible compression was for this
file on this machine, while the MW run used the option -W 200 (to make the
window larger than both the file and the ram the machine has); it still
completed, but induced a lot of swap in the interim.

This should help govern what compression you choose. Small files are nicely
compressed with zpaq. Intermediate files are nicely compressed with lzma.
Large files get excellent results even with lzo, provided you have enough
ram. (Small being <100MB, intermediate <1GB, large >1GB.)
Or, to make things easier, just use the default settings all the time and
be happy, as lzma gives good results. :D

Con Kolivas
Sat, 19 Dec 2009
doc/README.lzo_compresses.test.txt (new file, 118 lines)
An explanation of the revised lzo_compresses function in stream.c.

The modifications to the lrzip program for 0.19 centered around an
attempt to catch data chunks that would cause lzma compression to either
take an inordinately long time or not complete at all. The files that
could cause problems for lzma are already-compressed files, multimedia
files, files that have compressed files in them, and files with
randomized data (such as an encrypted volume or file).

The lzo_compresses function is used to assess the data and return
TRUE or FALSE to the lzma_compress_buf function based on whether or
not it determined the data to be compressible. The simple formula
cdata < odata was used (c=compressed, o=original).

Some test cases were slipping through and caused the hangups. Beginning
with lrzip-0.19 a new option, -T, the test compression threshold, has
been introduced. It sets configurable limits as to what is considered a
compressible data chunk and what is not.

In addition, with very large chunks of data, a small modification was
made to the initial test buffer size to make it more representative of
the entire sample.

To go along with this, increased verbosity was added to the function
so that the user/evaluator can better see what is going on. -v or -vv
can be used to increase informational output.

Functional Overview:

Data chunks are passed to the lzo_compresses function in two streams.
The first is the small data set in the primary hashing bucket, which
can be seen when using the -v or -vv option. This is normally a small
sample. The second stream will be the rest. The sizes of the streams
depend on the long range analysis performed on the entire file and on
available memory.

After analysis of the data chunk, a value of TRUE or FALSE is returned
and lzma compression will either commence or be skipped. If skipped,
the data written out to the .lrz file will simply be the rzip data,
which is the reorganized data based on long range analysis.

The lzo_compresses function traverses the data chunk comparing
larger and larger blocks. If suitable compression ratios are found,
the function ends and returns TRUE. If not, and the largest sample
block size has been reached, the function will traverse deeper into
the chunk and analyze that region. Any time a compressible area is
found, the function returns TRUE. When the end of the data chunk has
been reached and no suitable compressible blocks have been found, the
function returns FALSE.

Under most circumstances, this logic was fine. However, if the test
found a chunk that could only achieve 2% compression, for example,
this type of result could adversely affect the lzma compression
routine. Hence, the concept of a limiting threshold.

The threshold option works as a limiter that forces the lzo_compresses
function to not just compare the estimated compressed size with the
original, but to apply a limiting threshold. This ranges from a very
lax threshold, 1, to a very strict one, 10. A threshold of 1 means that
for the function to return TRUE, the estimated compressed data size for
the current data chunk merely has to be below 100% of the original
size, so almost no compressibility is required. A value of 2 means that
the data MUST compress to better than 90% of the original size; if the
observed compression of the data chunk is over 90% of the original
size, then lzo_compresses will fail.

Each additional threshold value increases the strictness according
to the following formula:

CDS = Observed Compressed Data Size from LZO
ODS = Original Data chunk Size
T = Threshold

To return TRUE, CDS < ODS * (1.1 - T/10)

At T=1, just 0.01% compression would be OK.
T=2, anything better than 10% would be OK, but under 10% compression would fail.
T=3, anything better than 20% would be OK, but under 20% compression would fail.
...
T=10, I can't imagine a use for this. Anything better than 90% compression
would be OK. This would imply that LZO would need to get a 10x compression
ratio.

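Expressed as code, the threshold test above is a one-liner. This is a
sketch (the variable names are mine; the real test lives in stream.c and
uses its own identifiers):

```python
def compresses_enough(cds: int, ods: int, t: int) -> bool:
    """Return TRUE when LZO's estimated compressed size (CDS) beats
    the threshold fraction of the original chunk size (ODS)."""
    return cds < ods * (1.1 - t / 10)
```

With t=1 any compression at all passes; with t=2 the estimate must come in
under 90% of the original size, matching the table of thresholds above.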
The following actual output from the lzo_compresses function will help
explain.

22501 in primary bucket (0.805%)
lzo testing for incompressible data...OK for chunk 43408.
Compressed size = 52.58% of chunk, 1 Passes
Progress percentage pausing during lzma compression...
lzo testing for incompressible data...FAILED - below threshold for chunk 523245383.
Compressed size = 98.87% of chunk, 50 Passes

This was for a video .VOB file of 1GB. A compression threshold of 2 was
used. -T 2 means that the estimated compressed size of the data chunk had
to be better than 90% of the original size.

There were 43,408 bytes in the primary hash bucket and this chunk was
evaluated by lzo_compresses. The function estimated that the compressed
data size would be 52.58% of the original 43,408 byte chunk. This resulted
in LZMA compression occurring.

The second data chunk, which included the rest of the data in the current
hash, 523,245,383 bytes, failed the test. The lzo_compresses function made
50 passes through the data using progressively larger samples until it
reached the end of the data chunk. It could not find better than a 1.2%
compression benefit and therefore FAILED. The result was NO LZMA
compression, and the data chunk was written to the .lrz file in rzip
format (no compression).

The higher the threshold option, the faster the LZMA compression will
occur. However, this could also cause some chunks that are compressible to
be omitted. After much testing, -T 2 seems to work very well in stopping
data which would cause LZMA to hang, yet allows most compressible data to
come through.

Peter Hyman
pete@peterhyman.com
December 2007
doc/lrzip.conf.example (new file, 45 lines)
# lrzip.conf example file
# anything beginning with a # or whitespace will be ignored
# valid parameters are separated from their value with an =
# parameters and values are not case sensitive
#
# lrzip 0.24, peter hyman, pete@peterhyman.com
# ignored by earlier versions.

# Compression window size in units of 100MB. Normally selected by program.
WINDOW = 5

# Compression level 1-9 (7 default).
COMPRESSIONLEVEL = 7

# Compression method: rzip, gzip, bzip2, lzo, or lzma (default).
# If specified here, the command line options are not usable.
# COMPRESSIONMETHOD = lzo

# Test threshold value 1-10 (2 default).
TESTTHRESHOLD = 2

# Default output directory
# OUTPUTDIRECTORY = location

# Verbosity: true or 1, or max or 2
VERBOSITY = max

# Show progress as file is parsed: true or 1, false or 0
SHOWPROGRESS = true

# Set niceness. 19 is default; -20 to 19 is the allowable range
NICE = 19

# Delete source file after compression
# this parameter and value are case sensitive
# value must be YES to activate
# DELETEFILES = NO

# Replace existing lrzip file when compressing
# this parameter and value are case sensitive
# value must be YES to activate
# REPLACEFILE = YES

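A minimal reader for the conventions the comments above describe (lines
beginning with # or whitespace ignored, KEY = value pairs, case
insensitive) might look like the sketch below. This is an illustration,
not lrzip's actual parser, and for simplicity it ignores the documented
exception that the DELETEFILES and REPLACEFILE values are case sensitive.

```python
def parse_lrzip_conf(text: str) -> dict:
    """Parse KEY = value lines; skip comments, blank and indented lines."""
    params = {}
    for line in text.splitlines():
        if not line or line[0] == '#' or line[0].isspace():
            continue  # comments and whitespace-led lines are ignored
        if '=' not in line:
            continue  # not a parameter line
        key, _, value = line.partition('=')
        params[key.strip().lower()] = value.strip().lower()
    return params
```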
doc/magic.header.txt (new file, 41 lines)
lrzip-0.40+ file header format
November 2009
Con Kolivas

Byte    Content
0-3     LRZI
4       LRZIP Major Version Number
5       LRZIP Minor Version Number
6-14    Source File Size
16-20   LZMA Properties Encoded (lc, lp, pb, fb, and dictionary size)
21-22   not used
23-48   Stream 1 header data
49-74   Stream 2 header data

Block Data:
Byte:
0       Compressed data type
1-8     Compressed data length
9-16    Uncompressed data length
17-24   Next block head
25+     Data

End:
0-1     crc data


lrzip-0.24+ file header format
January 2009
Peter Hyman, pete@peterhyman.com

Byte    Content
0-3     LRZI
4       LRZIP Major Version Number
5       LRZIP Minor Version Number
6-9     Source File Size (no HAVE_LARGE_FILES)
6-14    Source File Size
16-20   LZMA Properties Encoded (lc, lp, pb, fb, and dictionary size)
21-22   not used
23-36   Stream 1 header data
37-50   Stream 2 header data
51      Compressed data type
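The fields common to both layouts, the 4-byte "LRZI" magic followed by the
major and minor version bytes, can be read with a few lines of code. This
is an illustrative sketch of my own, not lrzip source:

```python
import struct

def parse_lrzip_magic(header: bytes):
    """Check the 4-byte magic at bytes 0-3 and return (major, minor)
    from bytes 4 and 5, per the layout documented above."""
    magic, major, minor = struct.unpack_from("<4sBB", header, 0)
    if magic != b"LRZI":
        raise ValueError("not an lrzip archive")
    return major, minor
```

Interpreting the remaining fields (file size, LZMA properties, stream
headers) would follow the byte offsets in the tables above.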