Re: [squid-users] caching data for thousands of nodes in a compute cluster

From: Dave Dykstra <dwd@dont-contact.us>
Date: Tue, 12 Jun 2007 11:42:42 -0500

On Tue, Jun 12, 2007 at 12:19:26AM +0200, Henrik Nordstrom wrote:
> Mon 2007-06-11 at 15:17 -0500, Dave Dykstra wrote:
>
> > of jobs. It quickly becomes impractical to distribute all the data from
> > just a few nodes running squid, so I am thinking about running squid on
> > every node, especially as the number of CPU cores per node increases.
> > The problem then is how to determine which peer to get data from.
>
> Multicast ICP sounds like it could be a reasonable option there.
>
> Regards
> Henrik

I considered that, but wouldn't multicast ICP queries tend to get many
hundreds of replies (on average, half the total number of squids)? The
querying squid would only use the first response it got back, so it
doesn't seem like an efficient use of network or compute resources to
throw all the others away.
Do you know of other people who have used multicast ICP for this type of
application?
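
For concreteness, the sender side of what I'm describing would be
configured something like this (the group address, ports, TTL and
timeout below are only placeholders, not a tested setup):

    # send ICP queries to a multicast group instead of individual peers
    cache_peer 224.9.9.9 multicast 3128 3130 ttl=16
    # how long to wait for multicast ICP replies, in msec
    mcast_icp_query_timeout 2000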

The multicast TTL could help a little, but probably not much. I expect
the servers are usually organized in smaller groups, with better network
connectivity within each group, but it isn't practical to ask the system
administrators to tell us which servers are in which group, so everything
has to be automatic. They're very likely all on the same large subnet
with the switches sorting out the routing, so it isn't clear that
anything at squid's level would be able to tell how far away servers are,
other than by small differences in response time or, more likely, by the
throughput of large transfers. I also don't think we can really expect
to know the names of all the peers in order to list them with
"multicast-responder".

- Dave