Re: [squid-users] How does squid behave when caching really large files (GBs) from Amos Jeffries on 2011-08-18 (squid-users)

From: Amos Jeffries <squid3_at_treenet.co.nz>
Date: Fri, 19 Aug 2011 02:47:25 +1200

On 16/08/11 20:33, Thiago Moraes wrote:
> Hello everyone,
>
> I currently have a server which stores many terabytes of rather static
> files, each one having tenths of gigabytes. Right now, these files are
> only accessed through a local connection, but in some time this is
> going to change. One option to make the access acceptable is to deploy
> new servers on the places that will most access these files. The new
> server would keep a copy of the most accessed ones so that only a LAN
> connection is needed, instead of wasting bandwidth to external access.
>
> I'm considering almost any solution to these new hosts and one of them
> is just using a cache tool like squid to make the downloads faster,
> but as I didn't see someone caching files this big, I would like to
> know which problems I may find if I adopt this kind of solution.

You did mean "tenths" right, as in 100-900 MB files? That seems slightly
larger than most traffic, but not huge. Even old Squid installs limited
to 32-bit files should have no problem handling that as traffic.
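
As a rough check of the sizes involved, here is a small sketch. It
assumes the "32-bit files" limit means a signed 32-bit byte offset (a
ceiling of 2^31 - 1 bytes); the helper name and the sample sizes are
illustrative only, and the 20 GB case is just there for contrast.

```python
# Assumption: "32-bit files" means byte offsets held in a signed 32-bit integer.
MAX_32BIT_FILE = 2**31 - 1  # largest representable size, just under 2 GiB

MB = 1000**2
GB = 1000**3


def fits_32bit(size_bytes: int) -> bool:
    """Return True if an object of this size fits under the assumed limit."""
    return size_bytes <= MAX_32BIT_FILE


# Sample sizes: the 100-900 MB range discussed above, plus a larger file.
for label, size in [("100 MB", 100 * MB), ("900 MB", 900 * MB), ("20 GB", 20 * GB)]:
    print(f"{label}: {'fits' if fits_32bit(size) else 'too large'}")
```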

Most Squid installs won't store them locally to the clients though. The
default object size limit is 4MB, chosen to cache the bulk of web page
traffic and keep rarer large objects like yours from pushing much out
of cache. Most of the bumping up mentioned around here is for YouTube
and similar video media content. That only increases it to tens or
hundreds of MB, then stops there for the same caching reasons as the
4MB limit.
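
To illustrate why the limit matters, here is a toy sketch of a
byte-budgeted LRU cache with a maximum object size. It is not Squid's
code: the class, the 1000 MB capacity and the object sizes are all
assumptions picked to make the effect visible.

```python
from collections import OrderedDict

MB = 1024 * 1024


class LruCache:
    """Toy byte-budgeted LRU cache; an illustration, not Squid's store."""

    def __init__(self, capacity: int, max_object_size: int):
        self.capacity = capacity
        self.max_object_size = max_object_size
        self.used = 0
        self.items = OrderedDict()  # key -> size, least recently used first

    def put(self, key: str, size: int) -> bool:
        """Store an object; return False if it exceeds the size limit."""
        if size > self.max_object_size or size > self.capacity:
            return False
        if key in self.items:
            self.used -= self.items.pop(key)
        # Evict least recently used objects until the new one fits.
        while self.used + size > self.capacity:
            _, evicted_size = self.items.popitem(last=False)
            self.used -= evicted_size
        self.items[key] = size
        self.used += size
        return True


def pages_left_after_big_object(max_object_size: int) -> int:
    """Fill the cache with small pages, offer one big file, count survivors."""
    cache = LruCache(capacity=1000 * MB, max_object_size=max_object_size)
    for i in range(1000):
        cache.put(f"page-{i}", 1 * MB)
    cache.put("big-file", 900 * MB)
    return sum(1 for key in cache.items if key.startswith("page-"))


print(pages_left_after_big_object(4 * MB))     # big file refused: 1000 pages remain
print(pages_left_after_big_object(1024 * MB))  # big file admitted: 100 pages remain
```

With the small limit the 900 MB object is simply not stored; with the
raised limit a single request for it evicts 900 of the small objects.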

Occasionally we hear from ISPs or CDNs bumping it enough to cache CDs or
DVDs, and from OS distribution mirrors, although those also tend to have
smaller package caches, mostly of tens-of-MB objects.

The CERN Frontier network admins are pushing multiple-TB around via
Squids. It sounds like they are a scale above what you want to do, but
if you want operational experience with big data they could be the best
people to talk to.

>
> The alternatives I've considered so far include using a distributed
> file system such as Hadoop, deploying a private cloud storage system
> to communicate between the servers or even using bittorrent to share
> the files among servers. Any comments on these alternatives too?

No opinion on them as such. AFAIK these aren't really in the same type
of service area as Squid.

If you are after distributed _storage_, Squid is definitely not the
right solution.

Squid's design is more about fast delivery of the data than storage.
Caches being distributed stores is a side effect of that model being
very efficient for delivery, rather than of any effort to spread the
locations of things. Cache storage is fundamentally a giant /tmp
directory: persistent, but liable to erasure at any given second. A
chunk of it is often found only in volatile RAM too.
BitTorrent is perhaps the closest, in being delivery-oriented rather
than storage-oriented, with one authority source and a hierarchy of
intermediaries doing the delivery. That's where the similarities end
as well.

If what you are after is a scalable delivery mechanism that can minimize
bandwidth consumption, Squid is definitely an option there.

You can layer a whole distributed background set of storage servers
behind a gateway layer of Squid, using the various peering algorithms
and ACL rules for source selection.
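
As a rough idea of what hash-based source selection can look like, here
is a sketch using rendezvous (highest-random-weight) hashing. This is a
generic technique shown for illustration, not Squid's own peering code;
the function name, hostnames and URL are all made up.

```python
import hashlib


def pick_backend(url: str, backends: list[str]) -> str:
    """Pick the backend with the highest hash score for this URL."""

    def score(backend: str) -> int:
        # Hash the (backend, URL) pair; the top-scoring backend wins.
        digest = hashlib.sha256(f"{backend}|{url}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return max(backends, key=score)


# Hypothetical storage servers sitting behind the gateway layer.
backends = ["store-a.example", "store-b.example", "store-c.example"]
url = "http://files.example/dataset-001.bin"
print(pick_backend(url, backends))
```

Every gateway computes the same answer for the same URL without shared
state, and removing a backend only moves the URLs that were mapped to it.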

Those background layer servers can in turn use any of the
storage-oriented methods you mention to actually store the content, if
they still need scale, with web services providing the files as HTTP
objects from each location to the Squid layer.
Wikimedia have some nice CDN network diagrams published if you want to
see what I mean: http://meta.wikimedia.org/wiki/Wikimedia_servers

Sorry, talked you round in a circle there. But I hope it's of some help,
at least on where and whether Squid can fit into things for you.

Amos

-- 
Please be using
   Current Stable Squid 2.7.STABLE9 or 3.1.14
   Beta testers wanted for 3.2.0.10

Received on Thu Aug 18 2011 - 14:47:31 MDT
