Re: [squid-users] Squid and Search Engines

From: OTR Comm <otrcomm@dont-contact.us>
Date: Sun, 08 Feb 2004 11:36:15 -0700

> Technically it should be possible, but you need to write another
> retriever spider for the engine that knows how to read the Squid cache
> files instead of fetching from the web or indexing local files.
>
> The format of the cache files is described in the programmers guide and
> IIRC there is even a Perl module on CPAN for reading these files.

That was my next question; i.e. how do I read the cache?
Do you by any chance know the name of the CPAN module?

I looked at CPAN and found the Cache-2.01 module, is this the one?

> The developer list for the preferred search engine is a better place to
> ask I think. There are no modifications required to Squid but the search
> engine needs to be slightly modified to know how to read the Squid cache
> data.
>
> Each file in the cache contains
>
> a) Meta data like the URL of the file, size, time cached etc. Of this the
> search engine needs to use the URL as "name" of the indexed object.
>
> b) The object HTTP headers.
>
> c) The object contents. This is what needs to be indexed.
>
> b+c is the HTTP reply as received by Squid.
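
To make the b+c part concrete, here is a minimal Python sketch of splitting
a raw HTTP reply into its headers (b) and contents (c). It assumes the
metadata (a) has already been skipped, so the buffer starts at the HTTP
status line. The name split_http_reply is made up for illustration and is
not part of Squid or of any search engine.

```python
def split_http_reply(reply: bytes):
    """Split a raw HTTP reply into (status_line, headers, body).

    Assumes the buffer starts at the status line and that headers and
    body are separated by an empty line (CRLF CRLF).
    """
    head, sep, body = reply.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no blank line separating headers from body")
    lines = head.decode("iso-8859-1").split("\r\n")
    status_line = lines[0]
    headers = {}
    for line in lines[1:]:
        # Header names are case-insensitive, so store them lowercased.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, body
```

An indexer would hand only the returned body to its parser and could use
the content-type header to decide whether the object is worth indexing.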

When I run 'file' on a particular cache file, it reports DBase 3 format.
Is this correct, or is it just the closest guess Linux can make at the
file type? The real question is, how do I put the cached file back into
its original format, with its original title, for presentation to the
server?

I looked at the 'purge' utility written by Jens-S. Vöckler since it can
decipher the squid cache, but I don't understand how it is working.

For example, I have a cache file:

/usr/local/squid/var/cache/00/09/0000092D

with header information:

^Co
Content-Length: 2173
Content-Type: image/gif
Last-Modified: Sun, 11 Jan 2004 05:20:46 GMT
Accept-Ranges: bytes
ETag: "5db8d2aa2d8c31:627d33"
Server: Microsoft-IIS/6.0
Date: Thu, 22 Jan 2004 03:02:01 GMT
Connection: close
<snip>

and from that, the 'purge' utility returns the URL of:

http://www.whitehouse.org/kids/images/tn-palm.gif

How is the URL deciphered? For the life of me, I can't figure it out.

I read in the Programming Guide that "A cache swap file consists of two
parts: the cache metadata, and the object data."
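
A rough Python sketch of reading such a metadata block follows. The layout
it uses is an assumption, not something taken from the Programming Guide
text quoted here: one marker byte with value 0x03 (which would explain the
^C at the start of the dump above), a native-order 4-byte integer giving
the total metadata length including those first five bytes, and then a run
of type/length/value records (1-byte type, 4-byte native-order length,
value bytes). The type code 4 for the URL record is likewise an
assumption, and the names read_swap_meta and find_url are made up for
illustration.

```python
import struct

META_OK = 0x03   # assumed marker byte at the start of the file
META_URL = 4     # assumed type code of the record that holds the URL
PREFIX = 1 + 4   # marker byte + 4-byte total length


def read_swap_meta(data: bytes):
    """Return (records, meta_len); records is a list of (type, value)."""
    if len(data) < PREFIX or data[0] != META_OK:
        raise ValueError("buffer does not start with the expected marker")
    # "=i" means native byte order, 4 bytes, no padding.
    (meta_len,) = struct.unpack_from("=i", data, 1)
    if meta_len < PREFIX or meta_len > len(data):
        raise ValueError("implausible metadata length")
    records = []
    pos = PREFIX
    while pos < meta_len:
        if pos + 5 > meta_len:
            raise ValueError("truncated record header")
        rec_type = data[pos]
        (length,) = struct.unpack_from("=i", data, pos + 1)
        start = pos + 5
        if length < 0 or start + length > meta_len:
            raise ValueError("record runs past the metadata block")
        records.append((rec_type, data[start:start + length]))
        pos = start + length
    return records, meta_len


def find_url(records):
    """Return the URL record as text, or None if there is none."""
    for rec_type, value in records:
        if rec_type == META_URL:
            # Drop a trailing NUL terminator if one was stored.
            return value.rstrip(b"\x00").decode("iso-8859-1")
    return None
```

If the assumed layout holds, data[meta_len:] is where the object data
(the stored HTTP reply) begins. The byte order and integer size are those
of the machine that wrote the cache, so a file copied to a different
architecture may not parse with "=i".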

Could you please point me to the code in squid that will show me how to
get at and decipher the metadata?

I am sorry to be such a bother, but I get totally lost in the squid code,
so pointers to the correct modules to look in would be very much
appreciated.

Thanks,
Murrah Boswell
Received on Sun Feb 08 2004 - 11:36:39 MST
