Initial import

This commit is contained in:
Con Kolivas 2010-03-29 10:07:08 +11:00
parent 725e478e19
commit 6dcceb0b1b
69 changed files with 26485 additions and 0 deletions

doc/README.Assembler Normal file

@ -0,0 +1,44 @@
README.Assembler
Notes about CRC Assembly Language Coding.
lrzip-0.21 makes use of an x86 assembly language file
that optimizes the CRC computation used in lrzip. It includes
a wrapper C file, 7zCrcT8.c, and the assembler code,
7zCrcT8U.s.
configure should detect your host system properly
and adjust the Makefile accordingly. If you don't
have the nasm assembler, or have a ppc or other
non-x86 system, the standard C CRC routines will be
compiled and linked in.
If for any reason configure does not properly
detect your system type, or you do not want the assembler
module to be compiled, you can run
ASM=no ./configure
which will exclude the asm module. Alternatively, change the line
ASM_OBJ=7zCrcT8.o 7zCrcT8U.o
to
ASM_OBJ=7zCrc.o
in the Makefile. This will change the dependency tree.
To force assembly module compilation and linking (if
configure does not detect your system type properly),
type
ASM=yes ./configure
or change the Makefile to include the ASM_OBJ files
as described above.
Type `make clean' and then re-run make.
Peter Hyman
pete@peterhyman.com

doc/README.benchmarks Normal file

@ -0,0 +1,120 @@
These are benchmarks performed on a 3GHz quad core Intel Core2 with 8GB of ram
using lrzip v0.42.
The first comparison is that of a linux kernel tarball (2.6.31). In all cases
the default options were used. 3 other common compression apps were used for
comparison: 7z, which is an excellent all-round lzma based compression app;
gzip, which is the benchmark fast standard that has good compression; and
bzip2, which is the most commonly used compression on linux.
In the following tables, lrzip means lrzip default options, lrzip(lzo) means
lrzip using the lzo backend, lrzip(gzip) means using the gzip backend,
lrzip(bzip2) means using the bzip2 backend and lrzip(zpaq) means using the zpaq
backend.
linux-2.6.31.tar
Compression Size Percentage Compress Decompress
None 365711360 100
7z 53315279 14.6 2m4.770s 0m5.360s
lrzip 52372722 14.3 2m48.477s 0m8.336s
lrzip(zpaq) 43455498 11.9 10m11.335s 10m14.296s
lrzip(lzo) 112151676 30.7 0m14.913s 0m5.063s
lrzip(gzip) 73476127 20.1 0m29.628s 0m5.591s
lrzip(bzip2) 60851152 16.6 0m43.539s 0m12.244s
bzip2 62416571 17.1 0m44.493s 0m9.819s
gzip 80563601 22.0 0m14.343s 0m2.781s
It is interesting to note in these results that the compression of lrzip by
default is only slightly better than lzma's, but at some cost in time at both
the compress and decompress ends of the spectrum. Clearly zpaq compression is
much better than any other compression algorithm by far, but the speed cost on
both compression and decompression is extreme. At this file size, lzo is
interesting because it's faster than simply copying the file, but it only
offers modest compression.
What lrzip offers at this end of the spectrum is extreme compression if
desired.
Let's take two kernel trees one version apart as a tarball, linux-2.6.31 and
linux-2.6.32-rc8. These will show lots of redundant information, but hundreds
of megabytes apart, which lrzip will be very good at compressing. For
simplicity, only 7z will be compared since that's by far the best general
purpose compressor at the moment:
Tarball of two kernel trees, one version apart.
Compression Size Percentage Compress Decompress
None 749066240 100
7z 108710624 14.5 4m4.260s 0m11.133s
lrzip 57943094 7.7 3m08.788s 0m10.747s
lrzip(lzo) 124029899 16.6 0m18.997s 0m7.107s
Things start getting very interesting now when lrzip is really starting to
shine. Note how it's not that much larger for 2 kernel trees than it was for
one. That's because all the similar data in both kernel trees is being
compressed as one copy and only the differences really make up the extra size.
All compression software does this, but not over such large distances. If you
copy the same data over multiple times, the resulting lrzip archive doesn't
get much larger at all.
Using the first example (linux-2.6.31.tar) and simply copying the data multiple
times over gives these results with lrzip(lzo):
Copies Size Compressed Compress Decompress
1 365711360 112151676 0m14.913s 0m5.063s
2 731422720 112151829 0m16.174s 0m6.543s
3 1097134080 112151832 0m17.466s 0m8.115s
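The effect in the table above can be sketched with stock Python compressors (a hypothetical stand-in, not lrzip's actual code): random data repeated beyond gzip's 32KB window defeats zlib, while lzma's much larger dictionary still catches the long-range duplicate, which is the same principle rzip exploits over far greater distances.

```python
import lzma
import os
import zlib

# Illustrative demonstration only, not lrzip itself.
data = os.urandom(100_000)   # one incompressible 100KB block
doubled = data * 2           # the same block copied twice, 100KB apart

zlib_one = len(zlib.compress(data))
zlib_two = len(zlib.compress(doubled))
lzma_one = len(lzma.compress(data))
lzma_two = len(lzma.compress(doubled))

# zlib's 32KB window cannot see the first copy, so it stores both
# almost in full; lzma's multi-megabyte dictionary stores roughly one
# copy plus a back-reference.
print(zlib_two / zlib_one)   # close to 2.0
print(lzma_two / lzma_one)   # close to 1.0
```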
I had the amusing thought that this compression software could be used as a
bullshit detector if you were to compress peoples' speeches because if their
talks were full of catchphrases and not much actual content, it would all be
compressed down. So the larger the final archive, the less bullshit =)
Now let's move on to the other special feature of lrzip, the ability to
compress massive amounts of data on huge ram machines by using massive
compression windows. This is a 10GB virtual image of an installed operating
system and some basic working software on it. The default options on the
8GB machine meant that it was using a 5 GB window.
10GB Virtual image:
Compression Size Percentage Compress Time Decompress Time
None 10737418240 100.0
gzip 2772899756 25.8 7m52.667s 4m8.661s
bzip2 2704781700 25.2 20m34.269s 7m51.362s
xz 2272322208 21.2 58m26.829s 4m46.154s
7z 2242897134 20.9 29m28.152s 6m35.952s
lrzip 1361276826 12.7 27m45.874s 9m20.046s
lrzip(lzo) 1837206675 17.1 4m48.167s 8m28.842s
lrzip(zpaq) 1341008779 12.5 4h11m14s
lrzip(zpaq)M 1270134391 11.8 4h30m14s
lrzip(zpaq)MW 1066902006 9.9
At this end of the spectrum things really start to heat up. The compression
advantage is massive, with the lzo backend even giving much better results
than 7z, and over a ridiculously short time. Note that it's not much longer
than it takes to just *read* a 10GB file. Unfortunately at these large
compression windows, the decompression time is significantly longer, but
it's a fair tradeoff I believe :) The big disappointment here is actually
zpaq, which takes more than 8 times longer than lzma for a measly 0.2%
improvement. The reason is that most of the advantage here is achieved by
the rzip first stage. The -M option was included here for completeness, to see
what the maximum possible compression was for this file on this machine, while
the MW run used the option -W 200 (to make the window larger than both the
file and the ram the machine has); it still completed, but induced a lot
of swapping in the interim.
This should help govern what compression you choose. Small files are nicely
compressed with zpaq. Intermediate files are nicely compressed with lzma.
Large files get excellent results even with lzo provided you have enough ram.
(Small being < 100MB, intermediate <1GB, large >1GB).
Or, to make things easier, just use the default settings all the time and be
happy as lzma gives good results. :D
Con Kolivas
Sat, 19 Dec 2009


@ -0,0 +1,118 @@
An explanation of the revised lzo_compresses function in stream.c.
The modifications to the lrzip program for 0.19 centered around an
attempt to catch data chunks that would cause lzma compression to either
take an inordinately long time or not complete at all. The files that
could cause problems for lzma are already-compressed files, multimedia
files, files that have compressed files in them, and files with
randomized data (such as an encrypted volume or file).
The lzo_compresses function is used to assess the data and return
TRUE or FALSE to the lzma_compress_buf function based on whether
or not it determined the data to be compressible. The
simple formula cdata < odata was used (c=compressed, o=original).
Some test cases were slipping through and caused the hangups. Beginning
with lrzip-0.19 a new option, -T, test compression threshold has been
introduced and sets configurable limits as to what is considered a
compressible data chunk and what is not.
In addition, with very large chunks of data, a small modification was
made to the initial test buffer size to make it more representative of
the entire sample.
To go along with this, increased verbosity was added to the function
so that the user/evaluator can better see what is going on. -v or -vv
can be used to increase informational output.
Functional Overview:
Data chunks are passed to the lzo_compresses function in two streams.
The first is the small data set in the primary hashing bucket which
can be seen when using the -v or -vv option. This is normally a small
sample. The second stream will be the rest. The sizes of the streams
depend on the long range analysis that is performed on
the entire file and on available memory.
After analysis of the data chunk, a value of TRUE or FALSE is returned
and lzma compression will either commence or be skipped. If skipped,
data written out to the .lrz file will simply be the rzip data which
is the reorganized data based on long range analysis.
The lzo_compresses function traverses through the data chunk comparing
larger and larger blocks. If suitable compression ratios are found,
the function ends and returns TRUE. If not, and the largest sample
block size has been reached, the function will traverse deeper into
the chunk and analyze that region. Anytime a compressible area is
found, the function returns TRUE. When the end of the data chunk has
been reached and no suitable compressible blocks found, the program
will return FALSE.
Under most circumstances, this logic was fine. However, if the test
found a chunk that could only achieve 2% compression, for example,
this type of result could adversely affect the lzma compression
routine. Hence, the concept of a limiting threshold.
The threshold option works as a limiter that forces the lzo_compresses
function to not just compare the estimated compressed size with the
original, but to apply a limiting threshold. This ranges from a very
lenient threshold, 1, to a very strict one, 10. A threshold of 1 means that
for the function to return TRUE, the estimated compressed data size for
the current data chunk merely needs to be below the original size, even if
only barely (anywhere in the 90-100% range will pass), so almost no
compressibility is required. A value of 2 means that the data MUST compress
to better than 90% of the original size; if the estimated compressed size of
the data chunk is over 90% of the original size, then lzo_compresses will fail.
Each additional threshold value will increase the strictness according
to the following formula:
CDS = Observed Compressed Data Size from LZO
ODS = Original Data chunk size
T = Threshold
To return TRUE, CDS < ODS * (1.1-T/10)
At T=1, just 0.01% compression would be OK.
At T=2, anything better than 10% would be OK, but under 10% compression would fail.
At T=3, anything better than 20% would be OK, but under 20% compression would fail.
...
T=10, I can't imagine a use for this. Anything better than 90% compression
would be OK. This would imply that LZO would need to get a 10x compression
ratio.
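A minimal sketch of the threshold test, using the formula above (function and variable names are illustrative, not lrzip's actual identifiers):

```python
def passes_threshold(cds: int, ods: int, t: int) -> bool:
    # CDS = observed compressed data size from LZO
    # ODS = original data chunk size
    # T   = threshold (1 = lenient, 10 = strict)
    # Return TRUE only if CDS beats the limit ODS * (1.1 - T/10).
    return cds < ods * (1.1 - t / 10)

# With -T 2, a 1000-byte chunk must compress below 900 bytes:
# passes_threshold(895, 1000, 2) is True,
# passes_threshold(950, 1000, 2) is False.
```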
The following actual output from the lzo_compresses function will help
explain.
22501 in primary bucket (0.805%)
lzo testing for incompressible data...OK for chunk 43408.
Compressed size = 52.58% of chunk, 1 Passes
Progress percentage pausing during lzma compression...
lzo testing for incompressible data...FAILED - below threshold for chunk 523245383.
Compressed size = 98.87% of chunk, 50 Passes
This was for a video .VOB file of 1GB. A compression threshold of 2 was used.
-T 2 means that the estimated compression size of the data chunk had to be
better than 90% of the original size.
There were 43,408 bytes in the primary hash bucket and this chunk was
evaluated by lzo_compresses. The function estimated that the compressed
data size would be 52.58% of the original 43,408 byte chunk. This resulted
in LZMA compression occurring.
The second data chunk which included the rest of the data in the current hash,
523,245,383 bytes, failed the test. The lzo_compresses function made 50 passes
through the data using progressively larger samples until it reached the end
of the data chunk. It could not find better than a 1.13% compression benefit
and therefore FAILED. The result was NO LZMA compression, and the data chunk
was written to the .lrz file in rzip format (no compression).
The higher the threshold option, the faster the LZMA compression will occur.
However, this could also cause some chunks that are compressible to be
omitted. After much testing, -T 2 seems to work very well in stopping data
which will cause LZMA to hang yet allow most compressible data to come
through.
Peter Hyman
pete@peterhyman.com
December 2007

doc/lrzip.conf.example Normal file

@ -0,0 +1,45 @@
# lrzip.conf example file
# anything beginning with a # or whitespace will be ignored
# valid parameters are separated with an = and a value
# parameters and values are not case sensitive
#
# lrzip 0.24, peter hyman, pete@peterhyman.com
# ignored by earlier versions.
# Compression Window size in 100MB. Normally selected by program.
WINDOW = 5
# Compression Level 1-9 (7 Default).
COMPRESSIONLEVEL = 7
# Compression Method, rzip, gzip, bzip2, lzo, or lzma (default).
# If specified here, command line options are not usable.
# COMPRESSIONMETHOD = lzo
# Test Threshold value 1-10 (2 Default).
TESTTHRESHOLD = 2
# Default output directory
# OUTPUTDIRECTORY = location
# Verbosity, true or 1, or max or 2
VERBOSITY = max
# Show Progress as file is parsed, true or 1, false or 0
SHOWPROGRESS = true
# Set Niceness. 19 is default. -20 to 19 is the allowable range
NICE = 19
# Delete source file after compression
# this parameter and value are case sensitive
# value must be YES to activate
# DELETEFILES = NO
# Replace existing lrzip file when compressing
# this parameter and value are case sensitive
# value must be YES to activate
# REPLACEFILE = YES
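A hypothetical reader following the rules stated at the top of this example (lines beginning with # or whitespace are ignored, parameters and values are separated by =, parameter names are case insensitive) might look like this sketch; it is not lrzip's actual parser. Values are left as written because DELETEFILES and REPLACEFILE values are case sensitive.

```python
def parse_lrzip_conf(text: str) -> dict:
    # Sketch of a reader for the format above, not lrzip's parser.
    params = {}
    for line in text.splitlines():
        if not line or line[0] == "#" or line[0].isspace():
            continue               # comments and indented lines ignored
        if "=" not in line:
            continue               # parameters need an = and a value
        key, _, value = line.partition("=")
        # Parameter names are case insensitive; values kept as written
        # because some (DELETEFILES, REPLACEFILE) are case sensitive.
        params[key.strip().lower()] = value.strip()
    return params

conf = parse_lrzip_conf("""\
# lrzip.conf example
WINDOW = 5
  indented lines are ignored
TESTTHRESHOLD = 2
VERBOSITY = max
""")
# conf["window"] == "5", conf["verbosity"] == "max"
```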

doc/magic.header.txt Normal file

@ -0,0 +1,41 @@
lrzip-0.40+ file header format
November 2009
Con Kolivas
Byte Content
0-3 LRZI
4 LRZIP Major Version Number
5 LRZIP Minor Version Number
6-14 Source File Size
16-20 LZMA Properties Encoded (lc,lp,pb,fb, and dictionary size)
21-22 not used
23-48 Stream 1 header data
49-74 Stream 2 header data
Block Data:
Byte:
0 Compressed data type
1-8 Compressed data length
9-16 Uncompressed data length
17-24 Next block head
25+ Data
End:
0-1 crc data
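The leading fields of the lrzip-0.40+ layout above could be decoded as in this sketch, assuming the source file size is stored as an unsigned 64-bit little-endian value starting at byte 6 (the table gives the byte range but not the encoding, so that is an assumption):

```python
import struct

def parse_lrzip_magic(header: bytes) -> dict:
    # Sketch of the lrzip-0.40+ header fields described above; only
    # the leading fields are decoded, and the little-endian 64-bit
    # size at offset 6 is an assumption, not taken from lrzip itself.
    if header[0:4] != b"LRZI":
        raise ValueError("not an lrzip file")
    major, minor = header[4], header[5]       # bytes 4 and 5: version
    (size,) = struct.unpack_from("<Q", header, 6)  # source file size
    return {"major": major, "minor": minor, "source_size": size}
```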
lrzip-0.24+ file header format
January 2009
Peter Hyman, pete@peterhyman.com
Byte Content
0-3 LRZI
4 LRZIP Major Version Number
5 LRZIP Minor Version Number
6-9 Source File Size (no HAVE_LARGE_FILES)
6-14 Source File Size
16-20 LZMA Properties Encoded (lc,lp,pb,fb, and dictionary size)
21-22 not used
23-36 Stream 1 header data
37-50 Stream 2 header data
51 Compressed data type