Re: MD5 hash query

From: Joe Cooper <joe@dont-contact.us>
Date: Fri, 28 Dec 2001 21:04:28 -0600

Henrik Nordstrom wrote:

> On Saturday 29 December 2001 00.45, Joe Cooper wrote:
>
>
>>Because I need to be able to retrieve the object entry based on it's
>>hashed key (otherwise I would need a separate record for each object to
>>point from the URL to the hashed entry).
>>
>
> And the point being?
>
> store.log has the hash, why compute it again?

Because the client making invalidation requests doesn't have it. I'm
not computing it again in the general case in the indexer, I'm using the
one store.log gives me. I am computing it for the client that is
requesting it, though.

As you point out below, there are two ways I could achieve the lookup
for the client: Hash the URL they request to an MD5 and grab it, as I'm
working on. Or lookup the MD5 in a URL->MD5 table. I prefer generating
it for the client, assuming I don't run into problems with that design.
  I don't really know which is more efficient, but intuitively I tend to
think limiting the need for lookups and additional entry writes is better.

The children/parent issue is also a problem that needs MD5 generation.
Because I know the parent by URL (I simply regex it out of the current
URL), not MD5. I have to look up that parent, but can't without an MD5.
  More importantly I have to /create/ an empty parent reference if it
doesn't already exist--we have no store.log entry or existing parent in
the MD5->URL table to look up in this case.

>>I will describe what I'm trying to accomplish which might make it more
>>clean why I need to be able to match the hash:
>>
>>I'm making a utility that will provide invalidation services for Squid.
>> The first is simply an indexer that tails the store.log, watching what
>>hits the disk. The second is an invalidation server that accepts a
>>request of a specific form (eventually will follow whatever invalidation
>>standard comes into existence, but for now will simply be a tweaked HTTP
>>server) and sends an appropriate PURGE request or batch of PURGE
>>requests to Squid to invalidate the objects that need to be invalidated.
>>
>>The reason I need the hash is because in order to provide "full tree"
>>deletion (i.e. PURGE sub1.domain.com and all of its subdirectories) I
>>have to keep a linked list of items in the tree. So my database will be
>>a key=value Berkeley DB that looks a little like this:
>>
>
> Ah.. now you are providing some interesting details. Limited indexing
> capabilities in the selected database format.

Exactly. All of the parent->child connections, and even client lookups,
could be handled without too much hoop jumping by using a relational
database. But I'm betting the performance would be much worse. Not to
mention setup, maintenence, and all the other headaches that come with
an SQL database (like learning how to use SQL and the DBI, in my case!).

>>MD5-Hash=URL exists|empty-parent children hashes ...
>>
>
> I would probably use at least two tables.
>
> 1: Cached URL tree
>
> 2: MD5 -> URL table
>
> with a garbage collector on the MD5 -> URL table

I did consider this, but it seems like it would be more complicated to
maintain (and I'm not a database genius--so I like that I can think
entirely in hashes when using the DB_File perl module, and I have one
hash to think about in every function).

 
> But your design should also work I think.

I hope so.

It seems to be coming together now. I've got the database building
correctly it seems. Still have to implement the build function (to
traverse the cache_dirs to build the index for the first time) and the
invalidation server.

-- 
Joe Cooper <joe@swelltech.com>
http://www.swelltech.com
Web Caching Appliances and Support
Received on Fri Dec 28 2001 - 20:02:59 MST

This archive was generated by hypermail pre-2.1.9 : Tue Dec 09 2003 - 16:14:42 MST