Archive an extra copy of documents

From: Brad Tofel <brad@dont-contact.us>
Date: Thu, 22 Aug 2002 21:11:07 -0700

Hi all,

I'm interested in using squid within the Wayback Machine (www.archive.org)
to cache often-requested documents. (If users request documents that aren't in
the archive, we grab a copy from the live web and return it to them.) Right
now, each of the CGI servers in our farm independently maintains its own
fairly dumb cache of recently retrieved documents, so each machine in
the farm ends up retrieving a document on its own before it's really
"cached."

My thinking is to have all the requests to the live web from the CGI
machines go through squid, and then to modify squid so that, in addition to
its standard cache, documents are also written to ARC files (our very
simple archival format: a metadata line with the URL, the remote IP, a
timestamp, the mime-type and the document-length, followed by the document
itself [which includes the HTTP headers]. Append more metadata-document chunks
until your file is 100MB, and that's an ARC file). As ARC files are
generated, we'd pull them off the squid host and move them into the main
archive.
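To make that concrete, here's a rough C sketch of what the append path looks
like on our side. The struct name, field separators, and timestamp format are
simplified placeholders here, not our exact spec, but the shape is right: one
metadata line, then the raw reply bytes, rolling to a new file at 100MB.

    /*
     * Minimal sketch of appending one record to an ARC file.
     * Separators and timestamp format are simplified for illustration.
     */
    #include <stdio.h>
    #include <time.h>

    #define ARC_MAX_BYTES (100L * 1024 * 1024)   /* roll over at ~100MB */

    struct arc_writer {
        FILE *fp;          /* current ARC file */
        long  bytes;       /* bytes written to it so far */
    };

    /* Append one metadata line plus the document (HTTP headers + body). */
    static int
    arc_append(struct arc_writer *w, const char *url, const char *remote_ip,
               const char *mime_type, const char *doc, size_t doc_len)
    {
        char ts[20];
        time_t now = time(NULL);

        /* placeholder timestamp format: YYYYMMDDhhmmss */
        strftime(ts, sizeof(ts), "%Y%m%d%H%M%S", gmtime(&now));

        int n = fprintf(w->fp, "%s %s %s %s %zu\n",
                        url, remote_ip, ts, mime_type, doc_len);
        if (n < 0 || fwrite(doc, 1, doc_len, w->fp) != doc_len)
            return -1;

        w->bytes += n + (long)doc_len;
        if (w->bytes >= ARC_MAX_BYTES) {
            /* here we'd close this file, hand it off to the main archive,
             * and open a fresh ARC file */
        }
        return 0;
    }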

We only want to append documents to our ARC file when they are actually
downloaded from the live web (the cache missed, or the cached version was too
old).

We'll only be using squid for HTTP traffic, so at first blush it seems that
we want to put our code into http.c, where there is the smarts to know that
an HTTP connection has just completed successfully, but of course this is a
simplistic view of some complex code. Another good candidate seems to be
store.c or store_io.c, but I'm confused about which functions are used
when a cache retrieval completes versus when a download-to-cache completes.
I've looked through the online docs, but haven't found anything yet that
gives me much traction.
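Just to show the shape of what I'm trying to bolt on (none of these names are
real squid functions or structures, and it reuses the arc_writer sketch above),
the key point is that the ARC append should only fire when the reply actually
came from the origin server, never on a cache hit:

    /* Hypothetical summary of a completed fetch; not a real squid structure. */
    struct reply_info {
        const char *url;
        const char *remote_ip;
        const char *mime_type;
        const char *raw_reply;            /* HTTP headers + body */
        size_t      raw_reply_len;
        int         fetched_from_origin;  /* 0 on a cache hit */
    };

    /* Imagined hook, called once a server-side fetch finishes successfully
     * (somewhere around the end of the reply handling in http.c, perhaps). */
    void
    on_fetch_complete(const struct reply_info *r, struct arc_writer *w)
    {
        if (!r->fetched_from_origin)   /* cache hits never touch the ARC file */
            return;
        arc_append(w, r->url, r->remote_ip, r->mime_type,
                   r->raw_reply, r->raw_reply_len);
    }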

Can someone with a good understanding of the code give me a head-start on a
good approach to implementing this, or a pointer to the right documentation
that I've missed? Is squid overkill for what we're trying to do?

If there's anyone interested in helping with this customization, that would
be fantastic, too! My guess is that the right person could whip this out in
a matter of hours.

Great job on building such a useful open-source tool, and thanks!

Brad Tofel
brad@archive.org