Re: thoughts on memory usage...

From: Michael O'Reilly <michael@dont-contact.us>
Date: 21 Aug 1997 14:54:57 +0800

David Luyer <luyer@ucs.uwa.edu.au> writes:

> Agreed, MD5 or SHA (SHA is the default/preferred algorithm in the Linux
> kernel, but I'm not sure if that's just because of cryptographic security,
> patent restrictions, performance or whatever... and we'd be more worried
> about a combination of performance and collision probability) would be
> "good enough". I don't think it would be acceptable if we had to
> double-check the URL on disk for every ICP hit (an open() and read()
> before even returning the hit, then probably close the file and re-open()
> later), but MD5/SHA should handle URL hashing well enough to ignore the
> possibility of collision (until someone has a cache with more URLs in it
> than there are currently in the world, by many factors of 10).

Well, looking at the web page, it doesn't look like hash collision is
really something we'd need to worry about. We're safe by about 20
orders of magnitude... :) (with 16-byte hashes)
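
Back-of-the-envelope, using the birthday bound for a 128-bit hash
(the 10^9-URL cache is just an invented worst case, not a real figure):

    P(collision) ~= n^2 / 2^129
    n = 10^9  =>  P ~= 10^18 / 6.8x10^38 ~= 1.5x10^-21

i.e. safe by about 20 orders of magnitude, as claimed.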
 
> This leaves the question of what to do with the URL. Can you just throw
> it away? Well... it would certainly be nice to have a fixed-structure,
> fixed-record-length "log" file. One (obvious?) problem I can think of
> though is the removal of old items from the cache. Unless cache purging
> is to be done purely on an LRU or similar basis (hmmm, decay page
> usefulness every X hours by some constant (8? 50?), increase it by 1
> every hit... or some non-linear function? the way the page/buffer cache
> in Linux works...). When a new request for the URL is received, it can
> be decided if the object is out of date or not, since you now have the
> real URL.
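
That aging scheme would only be a few lines, something like this (very
rough sketch; the struct, the names and the constants are all invented
for illustration, not anything in the current code):

    #include <time.h>

    /* Sketch of the decay-every-X-hours, +1-per-hit scheme described
     * above.  Struct, names and constants are invented. */
    struct obj_meta {
        unsigned int usefulness;    /* decayed hit counter */
        time_t last_decay;          /* when we last aged this object */
    };

    #define DECAY_INTERVAL (8 * 3600)   /* "every X hours": X = 8 here */

    static void
    age_object(struct obj_meta *m, time_t now)
    {
        /* halving per interval gives the non-linear decay mentioned */
        while (now - m->last_decay >= DECAY_INTERVAL) {
            m->usefulness >>= 1;
            m->last_decay += DECAY_INTERVAL;
        }
    }

    static void
    object_hit(struct obj_meta *m, time_t now)
    {
        age_object(m, now);
        m->usefulness++;            /* "increase it by 1 every hit" */
    }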

Or you can tag the object with the appropriate rule when you build it?
i.e. you get a request for foobar.gif, you build the object, store it
on disk, and then run down the refresh_rules to see which one matches,
and then say "rule 6 is it for this one!", and voila!

Actually I think I can see all sorts of wild advantages to that....
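
Something like this at store time (sketch only; the rule table, its
fields and rule_matches() are all made-up stand-ins for the real
refresh_pattern machinery):

    #include <string.h>

    /* Invented stand-in for the real refresh rules. */
    struct refresh_rule { const char *prefix; int ttl; };

    static struct refresh_rule rules[] = {
        { "http://",  3600 },
        { "ftp://",  86400 },
    };
    static const int nrules = sizeof(rules) / sizeof(rules[0]);

    static int
    rule_matches(const struct refresh_rule *r, const char *url)
    {
        return strncmp(url, r->prefix, strlen(r->prefix)) == 0;
    }

    /* Run down the rules once, at object-build time, and store the
     * index of the match in the object's metadata. */
    static int
    tag_refresh_rule(const char *url)
    {
        int i;
        for (i = 0; i < nrules; i++)
            if (rule_matches(&rules[i], url))
                return i;   /* "rule 6 is it for this one!" */
        return -1;          /* no rule matched */
    }

Then an ICP hit never needs the URL again to answer the freshness
question.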

> The idea of storing extra metadata in the objects in the cache is
> interesting (putting the URL at the beginning). Allowing for a rebuild
> from data, although slow, after re-arranging the cache or whatever,
> would be nice; but if we check the on-disk URL all the time then it's
> not so nice, because of the performance of ICP queries (or maybe just
> give a false "yes", then return a TCP denied, and fix up squid so that
> it deals with TCP denied by retrying the request from a different
> peer/parent?). Basically, I like the idea of keeping the URL on disk
> unless it's actually used all the time (i.e., "it's there, why not use
> it, MD5/SHA _could_ be wrong you know" is not the right attitude for
> ICP queries).

In for a penny, in for a pound. If you trust MD5, why bother sending
the entire URL over in the ICP query? That's just a waste of bandwidth
and CPU time. Just send over the MD5 signature.... half a :)
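
A fixed-size query, something like this (sketch only; field names are
invented, not the real ICP header layout):

    /* Sketch of an ICP-style query carrying the 16-byte MD5 digest
     * where the variable-length URL used to go.  Field names are
     * invented; the real ICP header differs. */
    struct icp_md5_query {
        unsigned char  opcode;      /* e.g. a query opcode */
        unsigned char  version;
        unsigned short length;      /* fixed: sizeof(struct icp_md5_query) */
        unsigned int   reqnum;      /* matches reply to request */
        unsigned char  digest[16];  /* MD5 of the URL */
    };

Fixed length, so no string parsing on the hot path, and 16 bytes
instead of a URL that can run to hundreds.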

> * The URLs could be kept on disk as the first line of the cache swap
> file. It would nuke all existing caches, but this would only be in the
> upgrade to 1.2 for most users, and if they are patient enough they
> *could* run a script... This would then mean that people wouldn't lose
> their cache in a loss of "log"; maybe the format of the first line of a
> swap file should be the format of a full line of the current "log" file.
> It wouldn't increase real disk usage or I/O in most cases, since real
> hardware typically talks in 512-byte... 4k-or-larger blocks.

Still not sure why you need the URLs that much. Why not write them to
a separate log? If reindexing is an oddity, then you're not really
fussed about the speed....
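
i.e. one append per object stored, something like this (sketch; the
function name and the log format are invented for illustration):

    #include <stdio.h>

    /* Sketch: append "digest URL" to a separate reindex log when an
     * object is stored.  Only ever read back during a rebuild, so it
     * can be as slow as it likes. */
    static int
    log_url_for_reindex(FILE *fp, const unsigned char digest[16],
                        const char *url)
    {
        int i;
        for (i = 0; i < 16; i++)
            if (fprintf(fp, "%02x", digest[i]) < 0)
                return -1;
        if (fprintf(fp, " %s\n", url) < 0)
            return -1;
        return fflush(fp) == 0 ? 0 : -1;
    }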

Michael.