lzip: Quality assurance
4 Design, development, and testing of lzip
******************************************
There are two ways of constructing a software design: One way is to make it
so simple that there are obviously no deficiencies and the other way is to
make it so complicated that there are no obvious deficiencies. The first
method is far more difficult.
-- C.A.R. Hoare
Lzip is developed by volunteers who lack the resources required for
extensive testing in all circumstances. It is up to you to test lzip before
using it in mission-critical applications. However, a compressor like lzip
is not a toy, and maintaining it is not a hobby. Many people's data depend
on it. Therefore the lzip file format has been reviewed carefully and is
believed to be free from negligent design errors.
Lzip has been designed, written, and tested with great care to replace
gzip and bzip2 as the standard general-purpose compressed format for
unix-like systems. This chapter describes the lessons learned from these
previous formats, and their application to the design of lzip.
4.1 Format design
=================
When gzip was designed in 1992, computers and operating systems were much
less capable than they are today. The designers of gzip tried to work around
some of those limitations, like 8.3 file names, with additional fields in
the file format.
Today those limitations have mostly disappeared, and the format of gzip
has proved to be unnecessarily complicated. It includes fields that were
never used, others that have lost their usefulness, and finally others that
have become too limited.
Bzip2 was designed 5 years later, and its format is simpler than the one
of gzip.
Probably the worst defect of the gzip format from the point of view of
data safety is the variable size of its header. If the byte at offset 3
(flags) of a gzip member gets corrupted, it may become difficult to recover
the data, even if the compressed blocks are intact, because it can't be
known with certainty where the compressed blocks begin.
By contrast, the header of a lzip member has a fixed length of 6. The
LZMA stream in a lzip member always starts at offset 6, making it trivial to
recover the data even if the whole header becomes corrupt.
Bzip2 also provides a header of fixed length and marks the begin and end
of each compressed block with six magic bytes, making it possible to find
the compressed blocks even in case of file damage. But bzip2 does not store
the size of each compressed block, as lzip does.
Lziprecover is able to provide unique data recovery capabilities because
the lzip format is extraordinarily safe. The simple and safe design of the
file format complements the embedded error detection provided by the LZMA
data stream. Any distance larger than the dictionary size acts as a
forbidden symbol, allowing the decompressor to detect the approximate
position of errors, and leaving very little work for the check sequence
(CRC and data sizes) in the detection of errors. Lzip is usually able to
detect all possible bit flips in the compressed data without resorting to
the check sequence. It would be difficult to write an automatic recovery
tool like lziprecover for the gzip format. And, as far as I know, it has
never been written.
Lzip, like gzip and bzip2, uses a CRC32 to check the integrity of the
decompressed data because it provides optimal accuracy in the detection of
errors up to a compressed size of about 16 GiB, a size larger than that of
most files. In the case of lzip, the additional detection capability of the
decompressor reduces the probability of undetected errors several million
times more, resulting in a combined integrity checking optimally accurate
for any member size produced by lzip. Preliminary results suggest that the
lzip format is safe enough to be used in critical safety avionics systems.
The lzip format is designed for long-term archiving. Therefore it
excludes any unneeded features that may interfere with the future
extraction of the decompressed data.
4.1.1 Gzip format (mis)features not present in lzip
---------------------------------------------------
'Multiple algorithms'
Gzip provides a CM (Compression Method) field that has never been used
because it is a bad idea to begin with. New compression methods may
require additional fields, making it impossible to implement new
methods and, at the same time, keep the same format. This field does
not solve the problem of format proliferation; it just makes the
problem less obvious.
'Optional fields in header'
Unless special precautions are taken, optional fields are generally a
bad idea because they produce a header of variable size. The gzip
header has 2 fields that, in addition to being optional, are
zero-terminated. This means that if any byte inside the field gets
zeroed, or if the terminating zero gets altered, gzip won't be able to
find neither the header CRC nor the compressed blocks.
'Optional CRC for the header'
Using an optional CRC for the header is not only a bad idea, it is an
error; it circumvents the Hamming distance (HD) of the CRC and may
prevent the extraction of perfectly good data. For example, if the CRC
is used and the bit enabling it is reset by a bit flip, the header
will appear to be intact (in spite of being corrupt) while the
compressed blocks will appear to be totally unrecoverable (in spite of
being intact). Very misleading indeed.
'Metadata'
The gzip format stores some metadata, like the modification time of the
original file or the operating system on which compression took place.
This complicates reproducible compression (obtaining identical
compressed output from identical input).
4.1.2 Lzip format improvements over gzip and bzip2
--------------------------------------------------
'64-bit size field'
Probably the most frequently reported shortcoming of the gzip format
is that it only stores the least significant 32 bits of the
uncompressed size. The size of any file larger than 4 GiB gets
truncated.
Bzip2 does not store the uncompressed size of the file.
The lzip format provides a 64-bit field for the uncompressed size.
Additionally, lzip produces multimember output automatically when the
size is too large for a single member, allowing for an unlimited
uncompressed size.
'Distributed index'
The lzip format provides a distributed index that, among other things,
helps plzip to decompress several times faster than pigz and helps
lziprecover do its job. Neither the gzip format nor the bzip2 format
do provide an index.
A distributed index is safer and more scalable than a monolithic
index. The monolithic index introduces a single point of failure in
the compressed file and may limit the number of members or the total
uncompressed size.
4.2 Quality of implementation
=============================
'Accurate and robust error detection'
The lzip format provides 3 factor integrity checking, and the
decompressors report mismatches in each factor separately. This method
detects most false positives for corruption. If just one byte in one
factor fails but the other two factors match the data, it probably
means that the data are intact and the corruption just affects the
mismatching factor (CRC, data size, or member size) in the member
trailer.
'Multiple implementations'
Just like the lzip format provides 3 factor protection against
undetected data corruption, the development methodology of the lzip
family of compressors provides 3 factor protection against undetected
programming errors.
Three related but independent compressor implementations, lzip, clzip,
and minilzip/lzlib, are developed concurrently. Every stable release
of any of them is tested to verify that it produces identical output
to the other two. This guarantees that all three implement the same
algorithm, and makes it unlikely that any of them may contain serious
undiscovered errors. In fact, no errors have been discovered in lzip
since 2009.
Additionally, the three implementations have been extensively tested
with unzcrash, valgrind, and 'american fuzzy lop' without finding a
single vulnerability or false negative. ⇒Unzcrash
(lziprecover)Unzcrash.
'Dictionary size'
Lzip automatically adapts the dictionary size to the size of each file.
In addition to reducing the amount of memory required for
decompression, this feature also minimizes the probability of being
affected by RAM errors during compression.
'Exit status'
Returning a warning status of 2 is a design flaw of compress that
leaked into the design of gzip. Both bzip2 and lzip are free from this
flaw.