An explanation of the revised lzo_compresses function in stream.c.

The modifications to the lrzip program for 0.19 centered on an
attempt to catch data chunks that would cause lzma compression either
to take an inordinately long time or not to complete at all. The
files that can cause problems for lzma are already-compressed files,
multimedia files, files that contain compressed files, and files with
randomized data (such as an encrypted volume or file).

The lzo_compresses function is used to assess the data and return
TRUE or FALSE to the lzma_compress_buf function, depending on whether
it determined the data to be compressible. The simple formula
cdata < odata was used (c = compressed, o = original).

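In code form, that original test amounts to a single LZO pass
followed by a size comparison. The following is only a sketch of the
idea, not the actual stream.c source; the function name, buffer
handling, and error policy are assumptions. It uses the real LZO1X
API and assumes lzo_init() has been called once at startup.

#include <stdlib.h>
#include <lzo/lzo1x.h>

/* Sketch of the pre-0.19 test: one lzo1x pass over the data, then
 * the plain cdata < odata comparison. */
static int lzo_compresses_old(unsigned char *odata, lzo_uint odata_len)
{
    /* worst-case output size recommended by the LZO documentation */
    unsigned char *cdata = malloc(odata_len + odata_len / 16 + 64 + 3);
    static lzo_align_t wrkmem[(LZO1X_1_MEM_COMPRESS +
                               sizeof(lzo_align_t) - 1) /
                              sizeof(lzo_align_t)];
    lzo_uint cdata_len = 0;
    int ret = 1;      /* if the test cannot run, assume compressible */

    if (cdata && lzo1x_1_compress(odata, odata_len, cdata,
                                  &cdata_len, wrkmem) == LZO_E_OK)
        ret = cdata_len < odata_len;  /* TRUE on any shrinkage at all */
    free(cdata);
    return ret;
}
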
Some test cases were slipping through and causing the hangups.
Beginning with lrzip-0.19, a new option, -T (test compression
threshold), has been introduced. It sets configurable limits on what
is considered a compressible data chunk and what is not.

In addition, for very large chunks of data, a small modification was
made to the initial test buffer size to make it more representative
of the entire sample.

To go along with this, increased verbosity was added to the function
so that the user/evaluator can better see what is going on. -v or -vv
can be used to increase informational output.

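For example, to compress a file with a test compression threshold of
2 and extra informational output (the file name here is just a
placeholder):

lrzip -T 2 -vv somefile.iso
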
Functional Overview:

Data chunks are passed to the lzo_compresses function in two streams.
The first is the small data set in the primary hashing bucket, which
can be seen when using the -v or -vv option. This is normally a small
sample. The second stream will be the rest. The sizes of the streams
depend on the long range analysis performed on the entire file and on
available memory.

After analysis of the data chunk, a value of TRUE or FALSE is
returned, and lzma compression will either commence or be skipped. If
skipped, the data written out to the .lrz file will simply be the
rzip data, which is the reorganized data based on long range
analysis.

The lzo_compresses function traverses the data chunk comparing larger
and larger blocks. If a suitable compression ratio is found, the
function ends and returns TRUE. If not, and the largest sample block
size has been reached, the function will traverse deeper into the
chunk and analyze that region. Any time a compressible area is found,
the function returns TRUE. When the end of the data chunk has been
reached and no suitable compressible blocks have been found, the
function will return FALSE.

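A sketch of that sampling loop follows. It is illustrative only; the
function and helper names, the initial sample size, the block-size
progression, and the step sizes are all assumptions rather than the
actual stream.c code.

#include <stdlib.h>
#include <lzo/lzo1x.h>

/* Hypothetical helper: one lzo1x pass over a sample, returning
 * compressed_size / original_size (1.0 if the pass cannot be run). */
static double sample_ratio(unsigned char *buf, lzo_uint len)
{
    static lzo_align_t wrkmem[(LZO1X_1_MEM_COMPRESS +
                               sizeof(lzo_align_t) - 1) /
                              sizeof(lzo_align_t)];
    unsigned char *cbuf = malloc(len + len / 16 + 64 + 3);
    lzo_uint clen = 0;
    double ratio = 1.0;

    if (cbuf && lzo1x_1_compress(buf, len, cbuf, &clen, wrkmem) == LZO_E_OK)
        ratio = (double)clen / (double)len;
    free(cbuf);
    return ratio;
}

/* Sketch of the traversal: grow the sample at one position until the
 * largest sample size is reached, then move deeper into the chunk
 * and start again.  factor is the threshold factor, e.g. 1.1 - T/10. */
static int lzo_compresses_sketch(unsigned char *chunk, size_t chunk_len,
                                 double factor)
{
    size_t test_len = 4096;             /* assumed initial sample size */
    const size_t max_len = 1024 * 1024; /* assumed largest sample size */
    size_t offset = 0;

    while (offset < chunk_len) {
        size_t len = test_len;
        if (len > chunk_len - offset)
            len = chunk_len - offset;   /* clamp to the data we have */
        if (sample_ratio(chunk + offset, len) < factor)
            return 1;                   /* compressible area found: TRUE */
        if (test_len < max_len && len == test_len) {
            test_len *= 2;              /* compare a larger block here */
        } else {
            offset += len;              /* traverse deeper into the chunk */
            test_len = 4096;
        }
    }
    return 0;   /* end of chunk, no compressible block found: FALSE */
}
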
Under most circumstances this logic was fine. However, if the test
found a chunk that could only achieve 2% compression, for example,
that result could adversely affect the lzma compression routine.
Hence the concept of a limiting threshold.

The threshold option works as a limiter that forces the
lzo_compresses function not just to compare the estimated compressed
size with the original, but to apply a limiting threshold. It ranges
from a very low threshold, 1, to a very strict one, 10. A threshold
of 1 means that for the function to return TRUE, the estimated
compressed data size merely has to be smaller than the original; even
a result between 90-100% of the original size passes, so almost no
compressibility is required. A value of 2 means that the data MUST
compress to better than 90% of the original size; if the observed
compressed size of the data chunk is over 90% of the original size,
lzo_compresses will fail.

Each additional threshold value increases the strictness according to
the following formula:

CDS = Observed Compressed Data Size from LZO
ODS = Original Data chunk Size
T   = Threshold

To return TRUE: CDS < ODS * (1.1 - T/10)

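As code, the test is a single comparison. The function name and types
below are illustrative assumptions:

/* Threshold test from the formula above (a sketch; the name and
 * types are assumptions).  T = 1 gives a factor of 1.0 (any
 * shrinkage passes), T = 2 gives 0.9, ..., T = 10 gives 0.1. */
static int passes_threshold(double cds, double ods, int T)
{
    return cds < ods * (1.1 - (double)T / 10.0);
}
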
At T=1, just 0.01% compression would be OK.
At T=2, anything better than 10% compression would be OK, but under
10% compression would fail.
At T=3, anything better than 20% compression would be OK, but under
20% compression would fail.
...
At T=10 (I can't imagine a use for this), anything better than 90%
compression would be OK. This would imply that LZO would need to get
a 10x compression ratio.

The following actual output from the lzo_compresses function will
help explain.

22501 in primary bucket (0.805%)
lzo testing for incompressible data...OK for chunk 43408.
Compressed size = 52.58% of chunk, 1 Passes
Progress percentage pausing during lzma compression...
lzo testing for incompressible data...FAILED - below threshold for chunk 523245383.
Compressed size = 98.87% of chunk, 50 Passes

This was for a 1GB video .VOB file. A compression threshold of 2 was
used. -T 2 means that the estimated compressed size of a data chunk
had to be under 90% of the original size.

There were 43,408 bytes in the primary hash bucket, and this chunk
was evaluated by lzo_compresses. The function estimated that the
compressed data size would be 52.58% of the original 43,408 byte
chunk. This resulted in LZMA compression occurring.

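Plugging that first chunk into the formula above (T = 2):

ODS * (1.1 - T/10) = 43,408 * 0.9 = about 39,067
CDS = 52.58% of 43,408            = about 22,824

22,824 < 39,067, so lzo_compresses returned TRUE.
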
The second data chunk, which included the rest of the data in the
current hash, 523,245,383 bytes, failed the test. The lzo_compresses
function made 50 passes through the data using progressively larger
samples until it reached the end of the data chunk. It could not find
better than a 1.13% compression benefit (a compressed size of 98.87%
of the original) and therefore FAILED. The result was NO LZMA
compression, and the data chunk was written to the .lrz file in rzip
format (no compression).

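By the same formula, the second chunk could not pass: a compressed
size of 98.87% of the original is well over the 90% limit that -T 2
imposes (0.9887 * ODS > 0.9 * ODS), so lzo_compresses returned FALSE.
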
The higher the threshold option, the more chunks will skip LZMA and
the faster compression will run. However, this could also cause some
chunks that are compressible to be omitted. After much testing, -T 2
seems to work very well at stopping data that will cause LZMA to hang
while still allowing most compressible data through.

Peter Hyman
pete@peterhyman.com
December 2007