Re: thoughts on memory usage...

From: Andres Kroonmaa <andre@dont-contact.us>
Date: Wed, 20 Aug 1997 17:53:34 +0200 (EETDST)

On 20 Aug 97 at 20:08, squid-dev@nlanr.net wrote:
> From: David Luyer <luyer@ucs.uwa.edu.au>
>
> On Wed, 20 Aug 1997, Andres Kroonmaa wrote:
> > If hash.c is good enough, we'd need no entry->url altogether; a successful
> > hash lookup means we have a URL hit. The request URL would be kept in RAM
> > only as long as the request is being serviced.
>
> 1) no hash is perfect, or else it's a compression technique (even if it is
> extremely hard to reverse :). Even if the md5sums of two URL strings are
> the same, we'd still have to look up the URL in the on-disk file to be
> sure.

    I pretty much agree. Compressing URLs is a robust and straightforward
 approach, and it certainly works. Still, there are two distinct approaches:
 1) if we need to preserve the original URL, then compression is the only
    way to reduce RAM.
 2) if we do not need to preserve the URL, we must work out how to
    "one-way-compress" it so that we would "never" have a collision.

 In case 1) there is not much to talk about.
 In case 2) we can:
    use two different hash functions, and let their combined output define a
    match. The probability that two different URLs yield the same two hashes
    under two different algorithms is vanishingly small.

 See, if we need only a yes/no answer for a match, we do not need to keep
 all of the URL's data, and we can achieve much higher compression.
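 The combined-hash idea above can be sketched as follows. This is only an
 illustration, and the choice of md5 plus sha1 as the two algorithms is my
 assumption; the mail only argues for two independent hash functions:

```python
import hashlib

def url_key(url):
    """Combine two independent digests into one fixed-length key.
    A match on the combined key is treated as a URL match.
    (Sketch: md5 + sha1 as the two algorithms is an assumption.)"""
    data = url.encode("utf-8")
    return hashlib.md5(data).digest() + hashlib.sha1(data).digest()

# 16 + 20 = 36 bytes per URL, regardless of URL length
```

 The point is that the key length is constant: a 500-byte URL costs the
 same 36 bytes in RAM as a 20-byte one.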

 If we use md5 twice (on different parts of the URL, for example), we have a
 constant 32 bytes per URL no matter what the URL is: the first lookup finds
 a hash bucket, then inside this bucket we compare the 2nd digest with the
 one stored there. I guess this is how Squid currently works, with the 2nd
 check item being the URL itself. And if Squid already uses its own hash
 algorithm as part of the URL-match lookup, then we'd need only one md5
 digest to compare against.
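 A toy version of that two-stage lookup, assuming "different parts of the
 URL" means a simple front/back split (my interpretation, not something the
 mail specifies):

```python
import hashlib

def split_digests(url):
    # Hash the two halves of the URL separately. The front/back
    # split is an assumption made for illustration.
    data = url.encode("utf-8")
    mid = len(data) // 2
    return (hashlib.md5(data[:mid]).digest(),
            hashlib.md5(data[mid:]).digest())

class UrlIndex:
    """Toy index: the first digest selects the bucket, the second
    digest is the stored compare item (standing in for the URL)."""
    def __init__(self):
        self.table = {}          # 16-byte key -> 16-byte check value

    def add(self, url):
        key, check = split_digests(url)
        self.table[key] = check  # 32 bytes stored per URL

    def hit(self, url):
        key, check = split_digests(url)
        return self.table.get(key) == check
```

 A lookup only answers yes/no; the original URL string is never stored.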

 We get 50% compression (I see 66 bytes average per URL here) plus a bonus:
 our hash structures are fixed-length, so why not use a binary format for
 the swaplog file (dbm if you like)? If we can get by with a single md5,
 we'd have a compression ratio of 1:4 (and this can mean 75MB of pure RAM
 saved).
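 With fixed-length keys, a binary swaplog record becomes trivial to pack
 and seek. A minimal sketch; the extra fields (file number, object size)
 are an assumption, since the mail only argues for fixed-length records:

```python
import hashlib
import struct

# Hypothetical fixed-length swaplog record: a 16-byte md5 key plus a
# 4-byte file number and a 4-byte object size, little-endian.
RECORD = struct.Struct("<16sII")

def pack_entry(url, fileno, size):
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return RECORD.pack(digest, fileno, size)

def unpack_entry(buf):
    digest, fileno, size = RECORD.unpack(buf)
    return digest, fileno, size
```

 Every record is exactly RECORD.size bytes, so record n lives at byte
 offset n * RECORD.size — no line parsing needed.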

> 2) every ICP query would require in this case an md5 encoding, an open()
> of an on-disk file to verify the true URL, etc. Giving an ICP response
> for the wrong URL and then realising it later is really unacceptable in a
> peering hierarchy, esp. if you don't allow miss_access and/or you have a
> slow upstream link.

    The hash functions used should be good enough that collisions are rarer
 than age-control errors in HTTP headers. I bet that a single md5 has a much
 lower probability of a wrong match than our current problems with
 age-controls.
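 A rough birthday-bound estimate supports this, assuming md5 output behaves
 like a uniform 128-bit value (the cache size of ten million entries is an
 arbitrary example figure):

```python
# Expected number of colliding md5 pairs among n cached URLs,
# by the birthday bound: n*(n-1)/2 pairs, each colliding with
# probability 1/2**128.
n = 10_000_000
p = n * (n - 1) / 2 / 2**128
# p is on the order of 1e-25: astronomically below any plausible
# error rate from broken age-control headers.
```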

> Using a primitive compression (like the one I've described or better, or
> using slf's scheme where you're essentially "compressing" a URL into a
> 4-byte index which is really a key into a tree structure) gives you a
> guaranteed, reversible and one-to-one mapping for any URL. This means
> that you don't have to store the real URL anywhere (except in the on-disk
> log file if you plan on changing the compression technique between
> versions; of course this aspect could also be achieved with
> backwards-compatibility logfile-reading routines or a conversion script,
> with the encoded URLs stored on disk).
>
> Once I ironed all the bugs out of my encoding method it encodes 65.5M of
> URLs (with trailing NUL) into 44.3M of space (32.4% saving). This is with
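 The reversible 4-byte-index scheme quoted above can be sketched like this.
 A flat Python list stands in for slf's tree structure (an assumption for
 illustration; the real scheme would share common URL prefixes in the tree):

```python
class UrlInterner:
    """Sketch of "compressing" a URL into a 4-byte index that is a
    key into a lookup structure, giving a reversible one-to-one
    mapping. A flat list stands in for the tree described above."""
    def __init__(self):
        self.urls = []       # index -> URL (makes decoding possible)
        self.index = {}      # URL -> index

    def encode(self, url):
        if url not in self.index:
            self.index[url] = len(self.urls)
            self.urls.append(url)
        return self.index[url]   # fits in 4 bytes for < 2**32 URLs

    def decode(self, idx):
        return self.urls[idx]
```

 Unlike the one-way hash schemes, encode/decode round-trips exactly, so the
 original URL is always recoverable.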

    No doubt your compression idea is right, and maybe this hash stuff
 simply doesn't work. It's just my thought that if we do not actually need
 to keep decodable URLs in RAM, then we can go much further and do whatever
 we like to compress this extra data. There is no need to use reversible
 functions.

 In any case, one way or another, there is a need to reduce Squid's RAM
 usage, and URL compression is one way to go.

 regards,

 ----------------------------------------------------------------------
  Andres Kroonmaa        mail: andre@online.ee
  Network Manager
  Organization: MicroLink Online
  Tallinn, Sakala 19     Tel: +372 6308 909
  Estonia, EE0001        Fax: +372 6308 901
  http://www.online.ee
 ----------------------------------------------------------------------

Received on Tue Jul 29 2003 - 13:15:42 MDT
