[squid-users] How does squid behave when caching really large files (GBs)

From: Thiago Moraes <thiago_at_cmoraes.com>
Date: Fri, 19 Aug 2011 12:59:06 -0300

I meant files ranging from 100 MB to 30 GB, but mostly above the
10 GB mark, so that's the size of my problem. I saw the CERN case
on Squid's homepage, but their files were at most 150 MB, as stated
in the paper. I'll try to learn a little more from their case, though.

They really are not in the same service area as Squid. The point is
that I have to make downloading huge files less painful and try to
avoid using the WAN. Having a server inside the LAN makes more sense
to me, and I have no constraints since the project is fresh and
entirely in my hands. I can develop something in a layer above my
system (which would run on my "main" server), such as Squid, or I
can deploy my own system at every site. In the latter case, I would
need a way to share files between multiple running instances of the
same program, and a distributed file system made more sense to me.
(I don't know if I made myself clear here; English is not my first
language, so if this is a little messy, don't hesitate to ask me again.)
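For the Squid-in-front option, the front cache at each site would point back at the main server as a reverse proxy. A minimal squid.conf sketch of that idea; the hostnames, port, and ACL here are hypothetical placeholders, not from my actual setup:

```
# Sketch: a site-local Squid acting as a reverse proxy ("accelerator")
# for the central file server. All names and values are illustrative.
http_port 3128 accel defaultsite=files.example.lan
cache_peer files.example.lan parent 80 0 no-query originserver name=mainserver
acl our_files dstdomain files.example.lan
http_access allow our_files
cache_peer_access mainserver allow our_files
```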

The problem with the architecture of multiple instances of my system
sharing files (which could even be done via rsync or similar) is that
the main database holds more than 40 TB of data. Its copies may not
have that much space available, so I would need a way to choose which
files reside on each server (and how that set changes over time). To
me, this seems like the kind of problem a cache server is capable of
solving, and it would save a lot of effort. Is this viable?
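That "choose which files fit" problem is essentially what a cache's replacement policy does automatically over a bounded disk, though Squid's defaults would need raising for multi-GB objects. A hedged sketch; all sizes and the policy choice are illustrative assumptions, not tuning advice:

```
# Sketch: squid.conf cache sizing for very large objects.
# All values are illustrative assumptions.
maximum_object_size 32 GB                 # default is only a few MB
cache_dir aufs /var/spool/squid 2000000 64 256   # ~2 TB of disk cache
cache_replacement_policy heap LFUDA       # tends to keep large popular objects
maximum_object_size_in_memory 512 KB      # reserve RAM for small hot objects
```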

I hope I have made my problem a little clearer now. Do you have any
more thoughts to share? Thanks for your time, Amos; your help is
much appreciated.

Thiago Moraes - EnC 07 - UFSCar

2011/8/18 Amos Jeffries <squid3_at_treenet.co.nz>
>
> On 16/08/11 20:33, Thiago Moraes wrote:
>>
>> Hello everyone,
>>
>> I currently have a server which stores many terabytes of rather static
>> files, each one having tenths of gigabytes. Right now, these files are
>> only accessed through a local connection, but in some time this is
>> going to change. One option to make the access acceptable is to deploy
>> new servers on the places that will most access these files. The new
>> server would keep a copy of the most accessed ones so that only a LAN
>> connection is needed, instead of wasting bandwidth to external access.
>>
>> I'm considering almost any solution to these new hosts and one of then
>> is just using a cache tool like squid to make the downloads faster,
>> but as I didn't see someone caching files this big, I would like to
>> know which problems I may find if I adopt this kind of solution.
>
>
> You did mean "tenths" right, as in 100-900 MB files? Seems slightly larger than most traffic, but not huge. Even old Squid installs limited to 32-bit file sizes should have no problem handling that as traffic.
>
>
> Most Squid installs won't store them locally to the clients though. The default limit is 4MB, to cache the bulk of web page traffic and prevent rarer large objects like yours from pushing much of it out of cache.
>  Most of the bumping up mentioned around here is for YouTube and similar video media content. It only raises the limit to tens or hundreds of MB, then stops there for the same caching reasons as the 4MB limit.
>
>  Occasionally we hear from an ISP or CDN bumping it enough to cache CDs or DVDs, and from OS distribution mirrors, although those also tend to have smaller package caches; mostly tens-of-MB objects.
>
>  The CERN Frontier network admins are pushing multiple-TB around via Squids. It sounds like they are a scale above what you want to do, but if you want operational experience with big data they could be the best people to talk to.
>
>
>>
>> The alternatives I've considered so far include using a distributed
>> file system such as Hadoop, deploying a private cloud storage system
>> to communicate between the servers or even using bittorrent to share
>> the files among servers. Any comments on these alternatives too?
>
> No opinion on them as such. AFAIK they aren't really in the same type of service area as Squid.
>
> If you are after distributed _storage_, Squid is definitely not the right solution.
>
>  Squid's design is more about fast delivery of the data than storage. Caches acting as distributed stores is a side effect of that model being very efficient for delivery, rather than any effort to spread the locations of things. Cache storage is fundamentally a giant /tmp directory: persistent, but liable to erasure at any given second. A chunk of it is often found only in volatile RAM too.
>  Bittorrent is perhaps the closest, in that it is delivery-oriented rather than storage-oriented, with one authority source and a hierarchy of intermediaries doing the delivery. That's where the similarities end as well.
>
>
> If what you are after is a scalable delivery mechanism that can minimize bandwidth consumption, Squid is definitely an option there.
>
>  You can layer a whole distributed background set of storage servers behind a gateway layer of Squid, using the various peering algorithms and ACL rules for source selection.
>
>  Those background-layer servers can in turn use any of the actual storage-oriented methods you mention to store the content, if they still need to scale, with web services providing the files as HTTP objects from each location to the Squid layer.
>  WikiMedia have some nice CDN network diagrams published if you want to see what I mean: http://meta.wikimedia.org/wiki/Wikimedia_servers
>
> Sorry, talked you around in a circle there. But I hope it's of some help, at least about where and whether Squid can fit into things for you.
>
> Amos
> --
> Please be using
>  Current Stable Squid 2.7.STABLE9 or 3.1.14
>  Beta testers wanted for 3.2.0.10
Received on Fri Aug 19 2011 - 15:59:32 MDT

This archive was generated by hypermail 2.2.0 : Sat Aug 20 2011 - 12:00:02 MDT