Instead of failing completely, detect when a failure has occurred and serialise that thread's work to decrease the memory required to complete compression/decompression.
This is done by waiting for the previous thread to complete before retrying work on that block.
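A minimal sketch of the serialisation idea in C; the struct and names below are illustrative, not lrzip's actual code:

    #include <pthread.h>
    #include <stdlib.h>

    struct block {
        pthread_t prev;    /* thread working on the previous block */
        int have_prev;     /* is a previous block still in flight? */
    };

    /* On allocation failure, wait for the previous thread to finish
     * and free its buffers, then retry the allocation once. */
    static void *alloc_or_serialise(struct block *b, size_t len)
    {
        void *buf = malloc(len);

        if (!buf && b->have_prev) {
            pthread_join(b->prev, NULL);
            b->have_prev = 0;
            buf = malloc(len);
        }
        return buf;    /* NULL here is a genuine failure */
    }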
Check internally when lzma compression has failed and retry at a lower compression level before falling back to bzip2 compression.
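A hedged sketch of that fallback path; lzma_compress_block and bzip2_compress_block are hypothetical wrappers (prototypes only), not the real backend entry points:

    #include <stddef.h>

    int lzma_compress_block(void *dst, size_t *dlen, const void *src,
                            size_t slen, int level);
    int bzip2_compress_block(void *dst, size_t *dlen, const void *src,
                             size_t slen);

    /* Retry lzma at decreasing levels; if it never succeeds (e.g. out
     * of memory on 32-bit), fall back to bzip2. Returns 0 on success. */
    static int compress_with_fallback(void *dst, size_t *dlen,
                                      const void *src, size_t slen,
                                      int level)
    {
        while (level >= 1) {
            if (lzma_compress_block(dst, dlen, src, slen, level) == 0)
                return 0;
            level--;
        }
        return bzip2_compress_block(dst, dlen, src, slen);
    }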
Overlapping the compression of streams across threads means that when files are large enough to be split into multiple blocks, all CPUs are used effectively throughout the compression, affording a nice speedup.
Move the writing of the chunk byte size and initial headers into the compthread to prevent any races occurring.
Fix a few dodgy callocs that may have been overflowing!
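For illustration, an overflow-checked allocation in the spirit of the fix; checked_calloc is a hypothetical helper, not a function in the tree:

    #include <stdlib.h>
    #include <stdint.h>

    /* calloc(nmemb, size) must reject products that wrap size_t;
     * multiplying by hand before malloc does not. This helper makes
     * the overflow check explicit. */
    static void *checked_calloc(size_t nmemb, size_t size)
    {
        if (size && nmemb > SIZE_MAX / size)
            return NULL;    /* nmemb * size would overflow */
        return calloc(nmemb, size);
    }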
The previous commits were reverted because the changes designed to speed things up actually slowed them down.
Scale the 9 levels down to the 7 that lzma has.
This makes the default lzma compression level 5, which is lzma's usual default; it uses a lot less RAM and is significantly faster than before, at the cost of slightly less compression.
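One plausible way to express that scaling; the exact arithmetic here is an assumption, though it does map the default user level of 7 to lzma level 5 as described:

    /* Map the 9 user-visible levels onto lzma's 7. */
    static int scale_lzma_level(int user_level)    /* 1..9 */
    {
        int level = user_level * 7 / 9;

        return level < 1 ? 1 : level;    /* clamp into 1..7 */
    }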
Change suggested maximum compression in README to disable threading with -p 1.
Use bzip2 as fallback compression when lzma fails due to internal memory errors, as can happen on 32-bit.
Limit lzma windows to 300MB in the right place, and on 32-bit only.
Make the main process less nice than the backend threads, since it tends to be the rate-limiting step.
Choose sane defaults for memory usage since Linux ludicrously overcommits.
Use sliding mmap for any compression windows greater than 2/3 of RAM.
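A rough sketch of the sliding mmap idea, assuming a two-window scheme (a large fixed mapping of the file's head plus a small window remapped on demand); all names here are illustrative:

    #include <sys/mman.h>
    #include <sys/types.h>
    #include <unistd.h>

    struct sliding_map {
        int fd;
        unsigned char *low;     /* big fixed mapping of the file's head */
        size_t low_len;
        unsigned char *high;    /* small window, remapped as needed */
        size_t high_len;
        off_t high_off;         /* file offset of the high window */
    };

    /* Remap the small high window over the requested offset so reads
     * beyond the low mapping work without holding the whole compression
     * window in RAM. mmap offsets must be page aligned. */
    static int slide_to(struct sliding_map *m, off_t offset)
    {
        long page = sysconf(_SC_PAGESIZE);
        off_t aligned = offset & ~((off_t)page - 1);

        if (m->high && munmap(m->high, m->high_len))
            return -1;
        m->high = mmap(NULL, m->high_len, PROT_READ, MAP_SHARED,
                       m->fd, aligned);
        if (m->high == MAP_FAILED) {
            m->high = NULL;
            return -1;
        }
        m->high_off = aligned;
        return 0;
    }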
Consolidate and simplify testing of allocatable ram.
Minor tweaks to output.
Round up the size of the high buffer in sliding mmap to one page.
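The rounding itself, assuming the usual power-of-two page size:

    #include <stddef.h>
    #include <unistd.h>

    /* Round a buffer length up to a whole page; mmap hands out memory
     * in page-sized units regardless. */
    static size_t round_to_page(size_t len)
    {
        size_t page = (size_t)sysconf(_SC_PAGESIZE);

        return (len + page - 1) & ~(page - 1);
    }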
Squeeze a little more out of 32-bit compression windows.
This is done by reading each stream of data into separate buffers, using up to as many threads as there are CPUs.
As each thread's data becomes available, it is fed into runzip once runzip requests more of the stream.
Provided there are enough chunks in the originally compressed data, this gives a massive speedup, potentially proportional to the number of CPUs. The slower the backend compression, the greater the speedup (e.g. zpaq benefits the most).
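A minimal sketch of the ordered hand-off between decompression workers and the consumer; the struct and the worker/consumer split are assumptions for illustration:

    #include <pthread.h>
    #include <stddef.h>

    struct dthread {
        pthread_t id;
        unsigned char *in, *out;    /* compressed / decompressed data */
        size_t in_len, out_len;
        int done;
        pthread_mutex_t lock;
        pthread_cond_t cond;
    };

    /* Each worker decompresses its own buffer, then flags completion. */
    static void *decompress_worker(void *arg)
    {
        struct dthread *t = arg;

        /* ... run the backend (lzma/bzip2/zpaq) on t->in ... */
        pthread_mutex_lock(&t->lock);
        t->done = 1;
        pthread_cond_signal(&t->cond);
        pthread_mutex_unlock(&t->lock);
        return NULL;
    }

    /* Consumer side: block until the next block in stream order is
     * ready, so data is handed to runzip strictly in sequence. */
    static void wait_for_block(struct dthread *t)
    {
        pthread_mutex_lock(&t->lock);
        while (!t->done)
            pthread_cond_wait(&t->cond, &t->lock);
        pthread_mutex_unlock(&t->lock);
    }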
Prevent the output of zpaq compress and decompress from trampling on itself, racing, and consuming a lot of CPU time printing to the console.
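The usual fix for this kind of trampling is to serialise console output behind a single mutex; a sketch, with locked_print as a hypothetical helper:

    #include <pthread.h>
    #include <stdarg.h>
    #include <stdio.h>

    static pthread_mutex_t output_lock = PTHREAD_MUTEX_INITIALIZER;

    /* All threads print through this, so progress lines from
     * concurrent zpaq workers cannot interleave on the console. */
    static void locked_print(const char *fmt, ...)
    {
        va_list ap;

        pthread_mutex_lock(&output_lock);
        va_start(ap, fmt);
        vfprintf(stderr, fmt, ap);
        va_end(ap);
        pthread_mutex_unlock(&output_lock);
    }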
When limiting cwindow to 6 on 32-bit, ensure that control.window is also set.
When testing for the maximum size of testmalloc, the multiplier used was out by one, so increase it.
Minor output tweaks.