Re: MD5 hash query

From: Joe Cooper <joe@dont-contact.us>
Date: Fri, 28 Dec 2001 17:45:14 -0600

Henrik Nordstrom wrote:

> On Friday 28 December 2001 21.17, Joe Cooper wrote:
>
>
>>Henrik, you're my hero!
>>
>>That did the trick right nicely. Now I can get onto the fun stuff
>>(database construction). Robert you were right as well, I was just too
>>tired to notice (you suggested '\1'...I was thinking I had to translate
>>that into a 'real' binary representation--see what sleep deprivation can
>>do to you?).
>>
>
> Translation is recommended if extending to support the whole method list..
> but on the other hand only GET and HEAD are interesting from a cache
> perspective.

That's kinda what I thought. I already added HEAD support, 'just in case'.

 
> Also, if you are building a separate database, why do you need the MD5 hash
> algorithm at all? Isn't it sufficient to simply know what the MD5 is for the
> URL?

Because I need to be able to retrieve the object entry based on its
hashed key (otherwise I would need a separate record for each object to
point from the URL to the hashed entry).
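For reference, the key computation itself is only a few lines once you
know the trick. A rough Python sketch of the kind of thing the indexer
does (the '\1' GET code is the one Robert pointed out; the '\4' for
HEAD is my reading of Squid's method enum, so check enums.h before
trusting it):

    import hashlib

    # One-byte method code prepended to the URL before hashing.
    # GET = 1 is confirmed; HEAD = 4 is assumed from Squid's method
    # enum -- verify against enums.h.
    METHOD_CODES = {'GET': b'\x01', 'HEAD': b'\x04'}

    def store_key(method, url):
        """MD5 store key for a method/URL pair, uppercase hex as in store.log."""
        data = METHOD_CODES[method] + url.encode('latin-1')
        return hashlib.md5(data).hexdigest().upper()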

I will describe what I'm trying to accomplish, which might make it
clearer why I need to be able to match the hash:

I'm making a pair of utilities that will provide invalidation services
for Squid. The first is simply an indexer that tails the store.log, watching what
hits the disk. The second is an invalidation server that accepts a
request of a specific form (eventually will follow whatever invalidation
standard comes into existence, but for now will simply be a tweaked HTTP
server) and sends an appropriate PURGE request or batch of PURGE
requests to Squid to invalidate the objects that need to be invalidated.
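The PURGE half is plain HTTP. Something along these lines is all the
invalidation server has to send per object (assuming Squid listens on
localhost:3128 and squid.conf allows the PURGE method from this host):

    import http.client

    def purge(url, squid_host='localhost', squid_port=3128):
        """Ask Squid to drop one object.  Needs squid.conf to allow it, e.g.:
             acl purge method PURGE
             http_access allow purge localhost
        """
        conn = http.client.HTTPConnection(squid_host, squid_port)
        conn.request('PURGE', url)
        status = conn.getresponse().status   # 200 = purged, 404 = not cached
        conn.close()
        return status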

The reason I need the hash is that in order to provide "full tree"
deletion (i.e. PURGE sub1.domain.com and all of its subdirectories), I
have to keep a linked list of items in the tree. So my database will be
a key=value Berkeley DB that looks a little like this:

MD5-Hash=URL exists|empty-parent children hashes ...

Where URL is the full URL of the object, exists|empty-parent is a BOOL
indicating whether the object exists and can be purged or is simply a
parent to other child objects, and children hashes is a list of all
immediate children in hashed form. When purging an 'empty-parent' with
an empty children hashes list, the entry will be removed from the DB
and its hash will be removed from its parent's child list.
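In code terms each value is just a whitespace-separated string, so
packing and unpacking it is trivial. A sketch (the names are mine and
not final):

    def pack_entry(url, exists, child_hashes):
        """Value stored under an MD5 key: URL, exists|empty-parent flag, children."""
        return ' '.join([url, '1' if exists else '0'] + list(child_hashes))

    def unpack_entry(value):
        fields = value.split()
        return fields[0], fields[1] == '1', fields[2:]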

Thus when the invalidation server receives a request for
http://www1.domain.com/subdirectory1 with a query string of "recursive"
or even "recursive=5" or similar, the daemon can walk down the
database, beginning with the first object (reached by its hashed key)
and continuing through its children. The parent-child traversal can be
performed without any MD5 hashing (because the child list is already
hashed), and for the rarer request to present a text representation of
the tree for human viewing or outside processing, it can walk the tree
and collect the URL names.
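So the recursive purge is nothing more than a depth-limited walk over
the child lists. Roughly (db stands in for the Berkeley DB handle;
unpack_entry() and purge() are the sketches above):

    def purge_tree(db, key, depth=None):
        """PURGE an entry and its descendants, depth levels deep (None = no limit).
        No hashing happens here -- the child lists are already MD5 keys."""
        value = db.get(key)
        if value is None:
            return
        url, exists, children = unpack_entry(value)
        if exists:
            purge(url)
        if depth is None or depth > 0:
            for child in children:
                purge_tree(db, child, None if depth is None else depth - 1)

A "recursive=5" request then just maps to
purge_tree(db, store_key('GET', url), 5).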

Adding an object updates its parent to include it in the children hash
list. Thus adding any object will rarely require more than two database
writes: one for the object and one for the parent. If the parent(s) do
not exist, it only costs one extra DB write per missing parent, up to
the 'base' or until we reach an already existing parent. Doing this
with a different key (like the URL) would be more difficult, and
probably less efficient in both space and time.
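The add path, roughly (again only a sketch; parent_url_of() is a
hypothetical helper that just strips the last path component, and the
real thing has to be more careful about query strings and trailing
slashes):

    def parent_url_of(url):
        """Hypothetical helper: http://host/sub1/page.html -> http://host/sub1."""
        return url.rstrip('/').rsplit('/', 1)[0]

    def add_object(db, url, method='GET'):
        key = store_key(method, url)
        old = db.get(key)
        children = unpack_entry(old)[2] if old else []
        db[key] = pack_entry(url, True, children)       # write #1: the object

        # Walk upward, creating empty-parent entries until one already exists.
        child_key, cur = key, url
        while '/' in cur.split('://', 1)[-1]:
            cur = parent_url_of(cur)
            pkey = store_key(method, cur)
            entry = db.get(pkey)
            if entry:
                purl, pexists, pchildren = unpack_entry(entry)
                if child_key not in pchildren:
                    db[pkey] = pack_entry(purl, pexists, pchildren + [child_key])
                break                                   # usual case: two writes
            db[pkey] = pack_entry(cur, False, [child_key])  # new empty-parent
            child_key = pkey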

Clearer? I /thought/ when I first started working on the DB that I
could do it with just the URL and the MD5 from the store.log, but then
when I started thinking on the problem of retrieval of objects, it
became clear that it wouldn't work. I could do it with a relational
database without needing to generate a matching key, but the
performance of PostgreSQL or MySQL isn't really up to the task--in an
accelerator where this is useful, Squid may be pushing 250 reqs/sec or
more. And even ruling out half or three-quarters of those as being
neither SWAPOUTs nor RELEASEs, it still might present performance
problems as we scale up the database. Regardless, it needs to be pretty
light-weight since URL indexing is not the primary function of the proxy
server on which it runs. Since we don't /really/ need full relational
behavior, there is little reason not to stick with key=value.

BTW: The index will first be built from scratch by polling through all
existing objects (much the way Squid rebuilds its swap.state when
started without one). So theoretically, once we've reached the
'steady-state' of a full cache, most objects will already have parents,
and so the two-write rule should apply most of the time.

Anyway, the neat thing about all this is that there will always be a
quick way to find out what objects are in the cache and what their
children are. It will also be easy to remove any number of objects via
standard HTTP requests, without having to make individual requests for
every object (I've seen a lot of requests for this on the Squid-users
list--I'm doing it for the selfish reason that my clients want it, of
course ;-).

I think it's a pretty solid|simple design, but I always welcome
comments, of course. If you do have comments, though, be quick about
it...I plan to finish up the beta version by tomorrow. ;-)

-- 
Joe Cooper <joe@swelltech.com>
http://www.swelltech.com
Web Caching Appliances and Support