[squid-users] caching data for thousands of nodes in a compute cluster

From: Dave Dykstra <dwd@dont-contact.us>
Date: Mon, 11 Jun 2007 15:17:57 -0500

Hi,

I have been thinking about the problem of quickly distributing objects
to thousands of jobs in a compute cluster (for high energy physics).
We have multiple applications that need to distribute the same data to
lots of different jobs: some distribute hundreds of megabytes to
thousands of jobs, and some distribute gigabytes to hundreds of jobs.
It quickly becomes impractical to serve all of that data from just a
few nodes running squid, so I am thinking about running squid on every
node, especially as the number of CPU cores per node increases.

The problem then is how to determine which peer to get the data from.
As far as I can tell, none of the peer-selection methods currently
supported by squid would work very well with thousands of squids,
especially since a small number of them would usually be out of
service, which makes a static configuration hard to maintain. Am I
right about that?

It seems to me that it would work better if a couple of nodes could
dynamically keep track of which nodes hold which objects (over a
certain size) and direct requests to nodes that already have the
objects or are in the process of fetching them. That is much like the
approach peer-to-peer systems such as BitTorrent use, although I
haven't found an existing implementation that would be appropriate for
this application, and I think it is probably more appropriate to
extend squid.
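
To make the idea more concrete, something along these lines is what I
have in mind: a tiny directory ("tracker") service, run on a couple of
well-known hosts, that remembers which nodes hold (or are fetching)
which large objects and answers lookups. This is only a rough sketch
in Python; the URL paths, port, and registration protocol are made up
for illustration, and squid itself would need a helper or new
peer-selection code to consult such a service.

#!/usr/bin/env python3
# Hypothetical sketch of the "tracker" idea described above: a small
# directory service that records which nodes hold which large objects
# and answers lookups so a requesting node can be pointed at a peer
# instead of the origin server.  Paths, port, and wire format are
# invented for illustration; nothing like this exists in squid today.

import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.parse import urlparse, parse_qs

holders = {}             # object URL -> set of node names that have it
lock = threading.Lock()  # registry is shared across request threads


class TrackerHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # GET /lookup?url=<object-url>  ->  JSON list of holding nodes
        parsed = urlparse(self.path)
        if parsed.path != "/lookup":
            self.send_error(404)
            return
        url = parse_qs(parsed.query).get("url", [""])[0]
        with lock:
            nodes = sorted(holders.get(url, ()))
        body = json.dumps(nodes).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_POST(self):
        # POST /register?url=<object-url>&node=<node-name>
        # A node calls this when it starts or finishes fetching an
        # object over the size threshold, so later requests for the
        # same object can be directed to it.
        parsed = urlparse(self.path)
        if parsed.path != "/register":
            self.send_error(404)
            return
        args = parse_qs(parsed.query)
        url = args.get("url", [""])[0]
        node = args.get("node", [""])[0]
        with lock:
            holders.setdefault(url, set()).add(node)
        self.send_response(204)
        self.end_headers()


if __name__ == "__main__":
    # Run a couple of these on well-known hosts; squid on each worker
    # node would consult one before falling back to the origin server.
    ThreadingHTTPServer(("", 8650), TrackerHandler).serve_forever()

Each node's squid could consult the tracker (for example through a
redirector-style helper, or new peer-selection code) before deciding
whether to fetch from a peer or from the origin. The hard parts, of
course, would be expiring stale entries and coping with nodes that
drop out of service.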

- Dave Dykstra