Re: [squid-users] Squid URL list -- as search engine helper?

From: Joe Cooper <joe@dont-contact.us>
Date: Sat, 02 Nov 2002 06:13:32 -0600

eichin-squid@thok.org wrote:

> I have a slight variation on the "what URLs does squid know about"
> question... It would be useful to be able to use squid to reduce web
> crawling overhead for a search engine, not merely by direct caching,
> but by having secondary indexers get an explicit list of what pages
> are "free" to fetch.
>
> I've come up with a few ways of doing this, all of which have flaws:
>
> 1) just export the squid logs.
>    * not incremental
>    * not accurate - they show what has been seen, but not what is
>      still around
> 2) use a redirect_program
>    * performance risk
>    * *only* incremental
>    * requires new programs on the cache box
> 3) squidclient cachemgr:objects
>    * not incremental (but fast enough to make this less of a problem)
>    * only has MD5 hashes for objects that aren't still in memory
>
> The last option was the most interesting until I discovered the
> in-memory distinction -- I've learned a bunch more about squid
> internals in the process, though :-) Basically, vm_objects gives
> everything for which URLs are still around; objects gives everything,
> but only reports hash keys for the on-disk ones (since that's all it
> knows.)
>
> This approach might be salvageable, for example, if there were a way
> [which I haven't found] to retrieve a document by cache-key instead of
> filename -- the cache knows what URL the cache-key maps to once it
> opens the file, after all.
>
> Any thoughts? Suggestions for other approaches? Of course I have a
> preference for using builtin features, since it is easier to get
> administrative cooperation about config file changes than installing
> new programs.

Squid tells you where to find the object in the form "Swap Dir 0, File
0X0004B2", which maps to the file "00/04/000004B2" under the first
cache_dir. The mapping isn't obvious, but it is consistent, so you can
locate objects by file number and look up their contents.
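
As a rough sketch of that mapping in Python: this assumes the cache_dir
was created with 16 first-level and 256 second-level directories (the
usual L1 and L2 values on the cache_dir line) and the ufs-style naming
of two hex directory levels followed by an eight-digit hex file name.
The names swap_path and parse_location are made up for illustration and
are not part of Squid.

```python
import re


def swap_path(file_number, l1=16, l2=256):
    """Return a swap file's path relative to its cache_dir.

    Assumption: ufs-style layout <L1 dir>/<L2 dir>/<file number>, all in
    upper-case hex. l1 and l2 must match the cache_dir line in squid.conf.
    """
    d1 = (file_number // l2 // l2) % l1
    d2 = (file_number // l2) % l2
    return "%02X/%02X/%08X" % (d1, d2, file_number)


# Matches the location text quoted above, e.g. "Swap Dir 0, File 0X0004B2".
LOCATION = re.compile(r"Swap Dir (\d+), File\s+0[xX]([0-9A-Fa-f]+)")


def parse_location(line):
    """Return (cache_dir index, relative path), or None if nothing matches."""
    m = LOCATION.search(line)
    if m is None:
        return None
    return int(m.group(1)), swap_path(int(m.group(2), 16))


print(parse_location("Swap Dir 0, File 0X0004B2"))
```

If your cache_dir uses different L1/L2 values, pass them to swap_path;
otherwise the directory part of the result will be wrong.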

The purge utility might be of assistance to you, as it already has some
pretty solid code to deal with this stuff. And in fact, it can provide
a list of all objects in the cache by URL, by trawling the filesystems
of your cache.
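
To give an idea of what such a filesystem walk involves, here is a
minimal Python sketch. It is not the purge utility's code. It assumes
the swap metadata layout I understand Squid 2.x to use: one marker byte
(0x03), a native-order int giving the total header length, then
type/length/value records in which type 4 holds the URL. Treat those
constants and the function names as assumptions, and check them against
the source of your Squid version before relying on the output.

```python
import os
import struct

META_OK = 0x03             # assumed marker byte at the start of a swap file
META_URL = 4               # assumed TLV type that carries the object's URL
INT = struct.Struct("=i")  # assumed: native byte order, 4-byte int


def url_from_swap_file(path):
    """Return the URL found in a swap file's metadata header, or None."""
    try:
        with open(path, "rb") as f:
            head = f.read(1 + INT.size)
            if len(head) < 1 + INT.size or head[0] != META_OK:
                return None
            (total,) = INT.unpack_from(head, 1)
            if total <= len(head):
                return None
            # The total length includes the marker byte and the int itself.
            meta = f.read(total - len(head))
    except OSError:
        # A live cache may remove files while we are walking it.
        return None
    pos = 0
    while pos + 1 + INT.size <= len(meta):
        kind = meta[pos]
        (length,) = INT.unpack_from(meta, pos + 1)
        pos += 1 + INT.size
        if length < 0 or pos + length > len(meta):
            break
        if kind == META_URL:
            # The URL is assumed to be stored with a trailing NUL.
            return meta[pos:pos + length].rstrip(b"\0").decode("latin-1")
        pos += length
    return None


def list_urls(cache_dir):
    """Yield (path, url) for each file under cache_dir that has a URL record."""
    for root, _dirs, files in os.walk(cache_dir):
        for name in sorted(files):
            path = os.path.join(root, name)
            url = url_from_swap_file(path)
            if url is not None:
                yield path, url
```

Calling list_urls on a cache_dir path yields (file, URL) pairs. Files
that do not start with the expected marker byte are skipped without
raising an error.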

-- 
Joe Cooper <joe@swelltech.com>
Web caching appliances and support.
http://www.swelltech.com
Received on Sat Nov 02 2002 - 05:13:35 MST
