Re: conflict of interest from David Luyer on 1997-11-04 (squid-dev)

From: David Luyer <luyer@dont-contact.us>
Date: Wed, 05 Nov 1997 11:26:32 +0800

Henrik Nordström wrote:
> Duane Wessels wrote:
> > The past couple of days I have been munging squid 1.2 to support using
> > SHA digests as cache keys. This is of course a result of the
> > discussion some months ago about compressing URLs which then became
> > 'why not just use MD5 hashes?' FWIW, I started with SHA instead of MD5
> > because some people hinted it might be better and has no use
> > restrictions. However, it will be easy to plug MD5 in (or any other
> > scheme) should anyone desire to do so.
>
> Here are some other reasons why to use a hash, and not the URL as cache
> keys:
> 1. It is extensible. What we use to build the hash is easily extended to
> include attributes other than the actual URL (for example, to support
> Vary:)

Hmmm. If we're using a 128-bit hash, storing it in 128 bits would be
optimal, but it would be nice if we could also support a 'version number'
in case we later want to support further headers on the object.

i.e., store that it is a v1 hash with value XXXXXX; then, once we use
v2 hashes, if we find a v1 hash we know to hash only the URL for the
comparison (and then update the entry to a v2 hash).

We could store it (in RAM and on disk) as a separate value, but it may
be much more compact/convenient to take a few bits (a nibble? a whole
byte?) out of the hash to store the version number.
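
A rough sketch of the nibble variant, to make the trade-off concrete.
All names here (cache_key, key_set_version, and so on) are made up for
illustration and are not Squid code; it assumes a 16-byte digest and
gives up the top 4 bits of the first byte, leaving 124 bits of hash:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define KEY_LEN 16 /* 128-bit key, as discussed above */

/* Hypothetical fixed-size cache key: digest with a version tag folded in. */
typedef struct {
    uint8_t bytes[KEY_LEN];
} cache_key;

/* Overwrite the top nibble of byte 0 with the key-format version.
 * Only versions 0..15 fit in a nibble. */
static void key_set_version(cache_key *key, unsigned version)
{
    assert(version <= 0x0F);
    key->bytes[0] = (uint8_t)((key->bytes[0] & 0x0F) | (version << 4));
}

/* Read the version back out of the top nibble of byte 0. */
static unsigned key_get_version(const cache_key *key)
{
    return (unsigned)(key->bytes[0] >> 4);
}

/* Build a key from a raw digest (computed elsewhere) and tag it. */
static cache_key key_from_digest(const uint8_t digest[KEY_LEN],
                                 unsigned version)
{
    cache_key key;
    memcpy(key.bytes, digest, KEY_LEN);
    key_set_version(&key, version);
    return key;
}

/* Keys of different versions never compare equal, because the
 * version nibble is part of the compared bytes. */
static int key_equal(const cache_key *a, const cache_key *b)
{
    return memcmp(a->bytes, b->bytes, KEY_LEN) == 0;
}
```

On lookup, a stored key whose key_get_version() is older than the
current one would be re-derived from the URL alone, compared, and then
rewritten in the newer format.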

> 2. Fixed size == much easier memory management for the cache index.

That is the big advantage, I believe. Also, the cache log could be
transformed into a (much more compact) binary file, since it would
consist entirely of fixed-length records. The memory needed to store
URL strings is getting very significant in some (very large) caches,
which is the reason URL compression was originally mentioned.
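
To illustrate what fixed-length records buy us, here is a minimal
sketch. The record layout (key, file number, timestamp, size) and the
function names are assumptions for the example, not the actual Squid
log format. Fields are packed byte by byte so the file does not depend
on struct padding or host byte order:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define KEY_LEN 16
#define RECORD_SIZE (KEY_LEN + 12) /* key + three 32-bit fields */

/* Hypothetical index entry; the fields are illustrative only. */
typedef struct {
    uint8_t  key[KEY_LEN];
    uint32_t file_number;
    uint32_t timestamp;
    uint32_t object_size;
} log_record;

static void put_u32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

static uint32_t get_u32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Append one record at the current file position. */
static int log_append(FILE *fp, const log_record *r)
{
    uint8_t buf[RECORD_SIZE];
    memcpy(buf, r->key, KEY_LEN);
    put_u32(buf + KEY_LEN, r->file_number);
    put_u32(buf + KEY_LEN + 4, r->timestamp);
    put_u32(buf + KEY_LEN + 8, r->object_size);
    return fwrite(buf, 1, RECORD_SIZE, fp) == RECORD_SIZE ? 0 : -1;
}

/* Record n always lives at byte offset n * RECORD_SIZE, so it can be
 * fetched with a single seek and no parsing. */
static int log_read_nth(FILE *fp, long n, log_record *r)
{
    uint8_t buf[RECORD_SIZE];
    assert(n >= 0);
    if (fseek(fp, n * (long)RECORD_SIZE, SEEK_SET) != 0)
        return -1;
    if (fread(buf, 1, RECORD_SIZE, fp) != RECORD_SIZE)
        return -1;
    memcpy(r->key, buf, KEY_LEN);
    r->file_number = get_u32(buf + KEY_LEN);
    r->timestamp   = get_u32(buf + KEY_LEN + 4);
    r->object_size = get_u32(buf + KEY_LEN + 8);
    return 0;
}
```

With URL strings in the log, none of this works: records have to be
scanned line by line and each entry needs its own allocation.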

To answer someone else's suggestion that SHA is slow: isn't SHA fast
enough for the Linux kernel to use for syn/recv cookies, etc.?
(drivers/char/random.c has USE_SHA by default)

> > Anyway, Kostas is working on adding hit metering into Squid and today
> > we realized that hit metering will become very difficult when the
> > cache key is a one-way hash of the URL. When we need to purge
> > an object, we're supposed to make a HEAD request for the URL to
> > report the hits. But we won't have the URL any more.
>
> Then store the URL on disk somehow (at the beginning of the disk object
> for example). It is not that big a problem to fetch it (and possibly
> other needed variables) from the disk at purge-time. I believe that
> anything that is not needed to support ICP should as much as possible
> be kept on disk.

I think storing it as the first line of the file it's in would be a
good idea. Then we can check it when we open the file (for TCP requests
ONLY!), which fixes any possible 'wrong swapfile' bugs and/or provides
stats on the theoretically improbable hash collision.
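
Something along these lines, as a sketch only. The function names and
the one-URL-per-first-line layout are assumptions for the example, not
an existing Squid swap file format:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_URL_LEN 4096 /* assumed upper bound on a stored URL line */

/* Write the URL as the first line of a freshly created swap file.
 * The object data would follow immediately after the newline. */
static int swap_write_header(FILE *fp, const char *url)
{
    assert(strchr(url, '\n') == NULL); /* URL must fit on one line */
    return fprintf(fp, "%s\n", url) < 0 ? -1 : 0;
}

/* Read the stored URL back from the start of the file.
 * Returns 0 on success, -1 if the line is missing or too long. */
static int swap_read_url(FILE *fp, char *buf, size_t buflen)
{
    size_t len;
    if (fseek(fp, 0L, SEEK_SET) != 0)
        return -1;
    if (fgets(buf, (int)buflen, fp) == NULL)
        return -1;
    len = strlen(buf);
    if (len == 0 || buf[len - 1] != '\n')
        return -1; /* truncated header or line longer than buffer */
    buf[len - 1] = '\0';
    return 0;
}

/* Returns 1 if the file holds the expected URL, 0 on a mismatch
 * (wrong swap file, or a hash collision), -1 on a read error. */
static int swap_verify_url(FILE *fp, const char *expected_url)
{
    char stored[MAX_URL_LEN];
    if (swap_read_url(fp, stored, sizeof stored) != 0)
        return -1;
    return strcmp(stored, expected_url) == 0;
}
```

A 0 return from swap_verify_url() is the case worth counting: it is
either a 'wrong swapfile' bug or a genuine collision. At purge time,
swap_read_url() alone recovers the URL before the file is unlinked.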

Accessing this on a purge isn't a big problem; you have to bring the
inode data into RAM to unlink the file to begin with, so it's just one
more block to read off disk for the first block of the file and a few
more syscalls to access it.

David.

Received on Tue Jul 29 2003 - 13:15:44 MDT