Re: thoughts on memory usage...

From: Oskar Pearson <oskar@dont-contact.us>
Date: Wed, 20 Aug 1997 10:53:31 +0200

Hi

> URL string is needed only for logging, and finding actual source, thus
> only while request is serviced. I don't seem to find any other uses for
> keeping actual URL string in RAM. Squid is request driven, and it doesn't
> care very much what is in its cache the rest of the time.
Agreed.

> For URL search all we need is a unique identifier that can be calculated
> from any given URL. Thus we'd need an algorithm that always gives a unique
What is 'hash.c' for, if this isn't already what it's doing?

Are we essentially keeping the whole URL in RAM so that we can eliminate
hash table collisions?

> (hash) id from any possible URL. It could be 64 bits, or whatever is
> possible to make it unique enough. This algorithm could well be
> non-reversible.

Possible problems:

The hash table would have to be sized dynamically, so that a 1 GB
cache doesn't use as much RAM as a 10 GB cache: you essentially need
one slot in the hash table for each object. Something like the sketch
below.
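
Say (purely as an assumption for illustration) the mean object size is
~13 KB; then the bucket count can be derived from the configured cache
size. Names here are made up, this isn't existing Squid code:

    #include <stddef.h>

    #define AVG_OBJECT_SIZE (13 * 1024)   /* assumed mean object size */

    static size_t
    store_hash_buckets(size_t cache_bytes)
    {
        /* expected number of objects in a full cache */
        size_t n = cache_bytes / AVG_OBJECT_SIZE;

        /* round up to the next power of two so lookups can mask
         * instead of dividing */
        size_t buckets = 1;
        while (buckets < n)
            buckets <<= 1;
        return buckets;
    }

For a 1 GB cache that works out to about 128K buckets, and for 10 GB
about 1M, so the table grows with the disk instead of being a
compile-time constant.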

> Upon request we'd calc a unique hash id from URL and make a lookup.
How unique... shall we use MD5?

> I don't know if it is possible to calc a unique id from any url in such a
> way that no two different URLs would yield the same id, but I believe
> that "collisions" could be made extremely rare.
But potentially disastrous. Quite often yesterday's share price page
differs from today's by only one digit. If the hash is cruddy it could
mean that they collide, and we get fired :)
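
Back-of-the-envelope: with a hash that behaves randomly, the birthday
bound puts the chance of *any* collision among a million objects at
roughly (n^2/2) / 2^bits. That's about 5x10^11 / 3.4x10^38 ~= 10^-27
for a 128-bit digest, and about 3x10^-8 for a 64-bit one - still
small, but it adds up across thousands of caches. A cruddy
(non-random) hash is exactly where this estimate breaks down.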

> ICP could use these cryptic IDs to ask for hits from peering caches,
> (if they have negotiated to use the same algorithm), reducing ICP traffic
> and remote CPU usage.
Agreed - the ICP stuff is pretty inefficient for long URLs. If you
already had a precomputed hash (from the arriving request), simply
passing that to the other hosts would be a lot more efficient. If you
also folded the other headers into the hash (excluding silly ones like
'User-Agent'), you could make an ICP request that depends on certain
headers and still get the right page... I think this was mentioned as
a problem with the current ICP implementation in some of the
documentation. Roughly, the query could look like the struct below.
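
A rough sketch - the opcode value and struct name are invented, and a
real implementation would serialize the fields explicitly in network
byte order rather than writing the struct raw:

    #include <stdint.h>

    struct icp_hash_query {
        uint8_t  opcode;       /* a new ICP_OP_QUERY_HASH, say       */
        uint8_t  version;
        uint16_t length;       /* always sizeof(struct): 36 bytes    */
        uint32_t reqnum;
        uint32_t options;
        uint32_t option_data;
        uint32_t sender_addr;
        uint8_t  digest[16];   /* MD5 of the URL (plus whatever
                                * headers the reply depends on)      */
    };

Every query becomes a fixed 36 bytes instead of 20-odd bytes of header
plus the full URL, and the peer can look the digest up directly
without hashing anything itself.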

> As no place on disk would contain actual URL for which the id was
> made, it could be very difficult to change algorithm if the need arises.
> Also it would be hard to detect when collisions occur. To double-check,
> I'd suggest prepending URL to any object on disk. Then, when servicing
Agreed. I think it's perhaps time we started keeping extra info in the
on-disk cache... for example 'URL', 'headers' (e.g. cookies), 'time
retrieved'. Something like the layout sketched below.
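
As a sketch only - the magic value, names and layout are invented, and
a real format would want fixed-width fields in a defined byte order:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>

    struct swap_meta {
        char     magic[4];     /* e.g. "SQM1", to spot garbage files */
        uint32_t url_len;
        uint32_t retrieved;    /* 'time retrieved', as a timestamp   */
        /* followed by url_len bytes of URL, then headers, then body */
    };

    static int
    write_object(FILE *fp, const char *url, const void *body, size_t len)
    {
        struct swap_meta m;
        memcpy(m.magic, "SQM1", 4);
        m.url_len   = (uint32_t)strlen(url);
        m.retrieved = (uint32_t)time(NULL);

        if (fwrite(&m, sizeof m, 1, fp) != 1) return -1;
        if (fwrite(url, 1, m.url_len, fp) != m.url_len) return -1;
        if (fwrite(body, 1, len, fp) != len) return -1;
        return 0;
    }

With that prefix in place a rebuild can verify both the key and the
URL itself, which also catches the collision case above.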

> In addition, saving URLs with objects gives a way to rebuild all store
> data from files spread on disks in case swaplog gets trashed or corrupted.
Doing half a million open/read/close cycles, one per object in the
cache, will be painfully slow, though.

> In conclusion, if this idea is worth anything, Squid RAM usage could
> drop from an average 100 bytes per URL to 6-10, giving more RAM and
> speeding up lookups.
I think MD5 is the way to go. Using MD5 would mean we essentially
don't have to worry about collisions... but it's CPU-intensive.
Deriving the key would be something like the snippet below.
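
A minimal sketch, using OpenSSL's MD5() purely for illustration (any
MD5 implementation would do; link with -lcrypto):

    #include <string.h>
    #include <openssl/md5.h>

    typedef struct { unsigned char bytes[16]; } store_key;

    static store_key
    url_to_key(const char *url)
    {
        store_key k;
        MD5((const unsigned char *)url, strlen(url), k.bytes);
        return k;
    }

Keys compare with memcmp(), and the table index can just be the low
bits of the first word of the digest - it's already uniformly
distributed, so no extra mixing is needed.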

Oskar
