Re: [squid-users] Duplicate files, content distribution networks

From: Jack Bates <nwv96b_at_nottheoilrig.com>
Date: Thu, 14 Jun 2012 01:53:56 -0700

On 18/05/12 05:55 AM, Eliezer Croitoru wrote:
> On 18/05/2012 10:33, Jack Bates wrote:
>> Are there any resources in Squid core or in the Squid community to help
>> cache duplicate files? Squid is very useful for building content
>> distribution networks, but how does Squid handle duplicate files from
>> content distribution networks when it is used as a forward proxy?
>>
>> This is important to us because many download sites present users with a
>> simple download button that doesn't always send them to the same mirror.
>> Some users are redirected to mirrors that are already cached while other
>> users are redirected to mirrors that aren't. We use a caching proxy in a
>> rural village here in Rwanda to improve internet access, but users often
>> can't predict whether a download will take seconds or hours, which is
>> frustrating.
>>
>> How does Squid handle files distributed from mirrors? Do you know of any
>> resources concerning forward proxies and download mirrors?
> Squid 2.7 has the store_url_rewrite option, which does what you need.
> SourceForge is one nice example of a CDN that serves file downloads
> from mirrors. You can also use the cache_peer option to run a more
> up-to-date Squid as the main proxy and send only what you need, such
> as specific domains, to the older version.

Thanks very much for pointing out the store_url_rewrite option, Eliezer.
Does it require the proxy administrator to manually configure the list
of download mirrors?
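
For comparison, here is a minimal sketch of how I understand a Squid
2.7 store_url_rewrite setup would look. The directive names are from
the 2.7 documentation as I read it; the SourceForge host pattern and
the helper script are my own illustration, not a tested configuration:

    #!/usr/bin/env python
    # Hypothetical store_url_rewrite helper (illustration only).
    # Matching squid.conf sketch:
    #   storeurl_rewrite_program /usr/local/bin/store-url.py
    #   acl sf_mirrors url_regex -i ^http://[a-z0-9]+\.dl\.sourceforge\.net/
    #   storeurl_access allow sf_mirrors
    #   storeurl_access deny all
    import re
    import sys

    # Collapse every SourceForge download mirror onto one canonical
    # store URL so the same file fetched from different mirrors shares
    # one cache entry. The ".squid.internal" fake domain keeps the
    # store key from colliding with a real origin.
    MIRROR = re.compile(r'^http://[a-z0-9]+\.dl\.sourceforge\.net/(.*)$',
                        re.I)

    for line in sys.stdin:
        # Helper input is "URL client_ip/fqdn user method ..."; only
        # the URL matters here.
        url = line.split(' ', 1)[0]
        m = MIRROR.match(url)
        # Print the canonical URL, or an empty line to store under the
        # original URL unchanged.
        sys.stdout.write(
            'http://dl.sourceforge.net.squid.internal/%s\n' % m.group(1)
            if m else '\n')
        sys.stdout.flush()

If that is the shape of it, then every mirror network needs its own
regex, maintained by hand, which is what I would like to avoid.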

Does anyone in the Squid community have thoughts on exploiting Metalink
[1] to address caching duplicate files from content distribution networks?

The approach I am pursuing is to exploit RFC 6249, Metalink/HTTP:
Mirrors and Hashes. Given a response with a "Location: ..." header and
at least one "Link: <...>; rel=duplicate" header, the proxy looks up the
URLs in these headers in the cache. If the "Location: ..." URL isn't
already cached but a "Link: <...>; rel=duplicate" URL is, then the proxy
rewrites the "Location: ..." header with the cached URL. This should
redirect clients to a mirror that is already cached.
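
As a sketch of that logic, not tied to any particular proxy's API (the
header dict and the cached set below stand in for real header access
and a real cache lookup, and the host names are made up):

    import re

    def link_duplicates(link_values):
        # Pull URLs out of 'Link: <url>; rel=duplicate' header values
        # (RFC 6249). A real parser should handle rel lists and
        # quoting more carefully.
        for value in link_values:
            m = re.match(r'\s*<([^>]*)>(.*)', value)
            if m and re.search(r';\s*rel\s*=\s*"?duplicate"?', m.group(2)):
                yield m.group(1)

    def rewrite_location(headers, cached):
        # headers: dict mapping header name -> list of values.
        # cached: set of URLs already in the cache.
        location = headers.get('Location', [None])[0]
        if location is None or location in cached:
            return  # no redirect, or the target is already cached
        for url in link_duplicates(headers.get('Link', [])):
            if url in cached:
                # Send the client to the mirror we already have.
                headers['Location'] = [url]
                return

    # Example:
    headers = {
        'Location': ['http://mirror-a.example/pub/foo.iso'],
        'Link': ['<http://mirror-b.example/pub/foo.iso>; rel=duplicate'],
    }
    rewrite_location(headers, {'http://mirror-b.example/pub/foo.iso'})
    assert headers['Location'] == ['http://mirror-b.example/pub/foo.iso']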

Thoughts?

Another idea is to exploit RFC 3230, Instance Digests in HTTP. Given a
response with a "Location: ..." header and a "Digest: ..." header, if
the "Location: ..." URL isn't already cached then the proxy checks the
cache for content with a matching digest and rewrites the "Location:
..." header with the cached URL if found

I am working on a proof of concept plugin for Apache Traffic Server as
part of the Google Summer of Code. The code is up on GitHub [2].

If this is a reasonable approach, would it be difficult to build
something similar for Squid?

[1] http://metalinker.org/
[2] https://github.com/jablko/dedup