Re: Digest diff compression

From: Alex Rousskov <rousskov@dont-contact.us>
Date: Sun, 5 Sep 1999 15:49:44 -0600 (MDT)

Henrik,

        Thanks for the implementation and detailed description! I am
glad the "real" numbers are at least as good as the simulated ones.
Since your approach offers a trade-off (you trade RAM and CPU time for
diff size), do you have any data illustrating the trade-off? For
example, how much memory (at peak) your algorithm needs to encode the
entire digest versus the portion of changed bits. Also, would that
memory depend on the digest size (the number of changed bits)?

        Since the baseline algorithm is very simple, requires just a few
bytes of RAM, and is as fast as it gets, I am trying to understand what
to do next. If performance advantages are marginal, perhaps we can stay
with a simple approach. If they are not, perhaps the compression
algorithm should be "configurable" so that somebody can add more
algorithms. For example, one might implement "gzip -9 < diff" to get 30%
more...

        Personally, I would hate having compression algorithm as a
protocol parameter because that would require every cache to implement
many decompression algorithms. Ideally, I would like to see a single
algorithm (with optimization parameters if needed). Or, to be precise, I
would prefer to have a single _decompression_ algorithm (with parameters
if needed). The compression algorithm does not really matter.

Thanks,

Alex.

On Sun, 5 Sep 1999, Henrik Nordstrom wrote:

> Hi fellow Squid hackers.
>
> I have now completed and verified the prototype for my digest diff
> compression.
>
> Attached is a "simulator" which implements both compression and
> decompression, and verifies that the decompressed output equals the
> input. Program output compatible with the NLANR cd_diff.c program as
> required by the task description. The program expects MIME input as
> before. Remove the call to skipHttpHeader if you have data files without
> a MIME header.
>
> This code is mostly a prof of concept and not at all suitable for
> inclusion in Squid. The code is about as poorly designed as the previous
> one, very inefficient with both CPU and memory, and has a lot of extra
> validations to verify that the operation produces correct output. The
> good news is that the algorithms behind it is very easy to divide up in
> small timeframes, and there is at least two different possible basic
> approaches with different CPU/memory priority.
>
> The encoding consists of three "independent" steps:
>
> 1. Calculate bit distances between changed bits, encoded as 16 bit
> integers
> 2. Huffman encoding of these 16 bit integers.
> 3. A compact way to represent the histogram used to build the huffman
> code.
>
> If memory is a premium then steps 1 & 2 can be combined in two passes,
> pass 1 to build the huffman code, pass 2 to produce the huffman output.
>
> This version produces slightly better results than the previously
> distributed simulator due to the more efficient coding of the histogram
> information.
>
> -------------------------------------------
> %bits | My | NLANR | % |
> changed | program | cd_diff | improvement |
> --------+---------+---------+-------------|
> 0.216% | 2.32% | 3.46% | 33.1% |
> 0.330% | 3.28% | 5.22% | 37.2% |
> 0.440% | 4.16% | 6.94% | 40.1% |
> 0.553% | 5.04% | 8.67% | 41.8% |
> 1.006% | 8.19% | 15.53% | 47.2% |
> 2.020% | 14.34% | 30.10% | 52.3% |
> 5.020% | 28.91% | 67.45% | 57.1% |
> 10.024% | 47.34% |114.11% | 58.5% |
> 20.950% | 74.63% |169.48% | 55.9% |
> 28.670% | 87.04% |186.52% | 53.3% |
> -------------------------------------------
>
> %bits changed is difference between two digests. Only first ones is
> relevant for Squid, the higher values are included to show what happens
> at extremes.
>
> Detailed description of each step:
>
> 1. Bit distances encoded as 16 bit integers
>
> The first step is to calculate the bit distance between each changed bit
> and encode this with a simple encoding resulting in 16 bit integer
> outputs. Each integer carries 15 bits of information and the 16'th bit
> is a roll-over bit to allow encoding of arbitrary distances (0-32767 is
> one word, 32768-1073741823 two words, ...). For the final implementation
> it needs to be decided on word ordering (most significant word first or
> least significant word first). Byte ordering dependencies within
> individual words is removed by the huffman encoding if the output is
> huffman encoded.
>
> 2. Huffman encoding of the 16 bit integers
>
> Not much to say. Plain huffman encoding of 16 bit integers.
>
> 3. A compact way to represent the count histogram used to build the
> huffman code.
>
> This is perhaps the most complex part of the format, and utilizes some
> "known facts" on the frequency distribution. It consists of counts for
> each bit distance, and encoded as array of {bit distance size offset,
> count offset} with a adaptive bit length encoding. The weights used in
> the adaptive bit encoding is not fully optimized and it may be possible
> to squeeze out a few bytes more here (perhaps in the range of 100bytes
> more). The current selection is more a result of trial and error than
> actual research.
>
> I will work on documenting this in more detail.
>
> For a implementation in Squid it would help if someone located (or
> implemented) an efficient block based huffman encoder/decoder for 16 bit
> integers with a known frequency table.
>
> I am afraid I won't have much time to spend on this the next few weeks.
> Good news it that I am now hired to do some Squid related work.
>
> /Henrik
>
Received on Tue Jul 29 2003 - 13:16:00 MDT

This archive was generated by hypermail pre-2.1.9 : Tue Dec 09 2003 - 16:12:17 MST