"MSI"
That is how it's done!
Some history first; you can skip this part if you want:
For a long time I was looking for a good way to cache dynamic objects
using Squid 3.2, but it always came down to using outside software such
as Apache, nginx, or other options.
So I was looking for a better way than store_url_rewrite, because it was
kind of a "hack" around the whole problem of dynamic content.
I have found that Squid is very good at what it does... forward proxying!
One problem I encountered with programs outside of Squid is that you
always need to manage your cache size and objects yourself.
On top of that, I have seen how servers redirect using a 302 code and
then serve the object from the same address.
I don't know what happens internally in these cache servers, but it works.
So instead of using some "web server" to act as a proxy with PHP or
another trick, let the proxy hierarchy do the job for us!
The idea is to use "MSI", and it means: MySQL + Squid (x2) + ICAP!
The problem with dynamic content is that it has a lot of dynamic parts,
and that means the "headers" too.
My solution is more than just for YouTube... it's a solution for the
dynamic content caching problem!
(It can also be used for 206 partial content manipulation.)
Some history on how it was done before, and how it is implemented now
with better options!
Squid caches HTTP objects based on a couple of things.
The main part is the object URL, as the identification of an object in
the cache.
The second level is the object's cache headers and structure.
Third and last are forced refresh_pattern rules per HTTP object/URL.
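For example, a forced rule in squid.conf might look like this (a
hypothetical pattern; tune the ages to your own traffic):

# keep FLV objects fresh for at least 7 days and at most 30 days,
# overriding the origin's Expires header and client reloads
refresh_pattern -i \.flv$ 10080 90% 43200 override-expire ignore-reload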
store_url_rewrite attacked the problem at the URL level: it takes one
object and refers to it as another.
The problems that came with that were that the refresh patterns referred
to the original URLs and not to the real objects that are in the cache.
Also, the logs record the dynamic URLs, and therefore you can't really
tell from the logs how effective this method is for caching.
You can't purge cached objects, and you also can't verify whether an
object is cached on the server, or clear an object from the cache using
the HTCP protocol.
Some people were using Apache/nginx with a PHP script that fetches the
dynamic content and caches it on the web server's storage.
This slows the very, very fast proxy software to a crawl, and does the
work at the interpreted level of PHP instead of in a very fast,
compiled, robust proxy server.
So these solutions are nice, but they must be maintained and monitored
manually for space, performance, and availability.
Another problem with these caches is that these servers are not really
caching the whole object, but only reserving it.
After using a cache proxy hierarchy for quite some time with Squid 2.7
and store_url_rewrite, I took my idea to use ICAP and made the
impossible possible!
A review of what we will do:
We will take one proxy server with at least two Squid instances: one for
cache, and the other with no cache at all (or minimal).
One of the instances (memory only) is bound only to the lo interface,
and the other intercepts/forwards requests.
We will also install on this proxy server a MySQL DB server and whatever
ICAP server you desire/have.
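As a minimal sketch, the listening setup of the two instances could look
like this (the ports are examples):

# squid1.conf - proxy1, the caching instance, facing the clients
http_port 3128
http_port 3129 intercept

# squid2.conf - proxy2, minimal/no cache, bound only to the lo interface
http_port 127.0.0.1:13128
cache deny all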
I have used GreasySpoon, at http://greasyspoon.sourceforge.net/.
It's based on Java and really fast. For a basic setup we will use only
REQMOD (request modification); in a more advanced setup we can also use
response header manipulation to make the object "cache friendly".
This is the software.
Now the idea:
We can use the ICAP server to rewrite requests transparently to the
client (and also to a server that is a client of our server).
So we set up two instances of the Squid proxy based on two different
conf files (this can also be done with one compiled Squid).
The first one is the main cache, and we will send every request we want
to manipulate to the ICAP server based on ACLs (it is very, very
important to plan them!!!).
On this instance we will configure the other instance, on the lo
interface, as a cache_peer of type parent that is *NOT* proxy-only, so
that the objects it feeds us get cached here.
We will select an internal domain such as "squid.internal" to use for
the object storage schema.
For this domain we will define a never_direct policy, and we will peer
all requests for this spoofed domain to the second instance.
On the second instance we will limit request modification (REQMOD) to
this spoofed domain only.
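Here is a minimal sketch of the relevant directives in both conf files.
The ACL values, ports, peer name, and ICAP service URIs are examples;
adjust them to your own GreasySpoon setup:

# squid1.conf (proxy1, the cache)
icap_enable on
# send only the domains we want to manipulate to REQMOD
icap_service svc_rewrite reqmod_precache icap://127.0.0.1:1344/rewrite
acl rewrite_doms dstdomain .dl.sourceforge.net
adaptation_access svc_rewrite allow rewrite_doms
adaptation_access svc_rewrite deny all

# the second instance on lo as a parent, *NOT* proxy-only,
# so the objects it feeds us get stored in this instance's cache
cache_peer 127.0.0.1 parent 13128 0 no-query no-digest name=proxy2
acl spoofed dstdomain .squid.internal
never_direct allow spoofed
cache_peer_access proxy2 allow spoofed
cache_peer_access proxy2 deny all

# squid2.conf (proxy2, lo only)
icap_enable on
icap_service svc_unrewrite reqmod_precache icap://127.0.0.1:1344/unrewrite
# only the spoofed domain goes to REQMOD - this prevents the endless loop
acl spoofed dstdomain .squid.internal
adaptation_access svc_unrewrite allow spoofed
adaptation_access svc_unrewrite deny all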
Now the fun begins!
It's time to combine MySQL (memory DB) + Squid + ICAP.
First we will analyze what we want to do.
An example is to cache all SourceForge CDN downloads as one object.
These are two download links for the same file:
http://dfn.dl.sourceforge.net/project/npp-compare/1.5.6/compare-1.5.6-unicode.zip
http://iweb.dl.sourceforge.net/project/npp-compare/1.5.6/compare-1.5.6-unicode.zip
You will notice that the only difference is in the lowest-level (mirror)
subdomain, and all other parameters are the same, so to serve this
object from cache for both CDNs we only need one simple URL schema.
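As a sketch of that normalization in plain Java (not tied to any ICAP
API; the spoofed host name "dl.df.squid.internal" is just an arbitrary
label under the internal domain, as used in the example further down):

public class KeyMaker {
    // collapse any SourceForge mirror host into one internal key URL
    static String toKey(String url) {
        return url.replaceFirst(
                "^http://[a-z0-9]+\\.dl\\.sourceforge\\.net/",
                "http://dl.df.squid.internal//");
    }

    public static void main(String[] args) {
        // both mirrors map to the same key, so both hit the same cache object
        System.out.println(toKey("http://dfn.dl.sourceforge.net/project/npp-compare/1.5.6/compare-1.5.6-unicode.zip"));
        System.out.println(toKey("http://iweb.dl.sourceforge.net/project/npp-compare/1.5.6/compare-1.5.6-unicode.zip"));
    }
}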
What we will do is a bit complicated to understand, and I hope it will
be simplified now.
We know that the proxy servers do not reveal the request modification
that is done by the ICAP server, and this specific ICAP server
(GreasySpoon) has very powerful capabilities: external and custom libs,
classes, and programming languages.
We will create a database with a couple of fields for temporary data,
and if we want we can also build some statistics tables in the DB.
The purpose of the database is to store the destination URL together
with a compatible key; it will be managed by the key and not by the URL,
because the URL is dynamic.
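A minimal sketch of such a pairing table (the table and column names are
illustrative; ENGINE=MEMORY keeps it in RAM, matching the "memory db"
idea below):

CREATE TABLE url_pairs (
  obj_key    VARCHAR(255)  NOT NULL,  -- the spoofed key URL
  real_url   VARCHAR(1024) NOT NULL,  -- the original dynamic URL
  created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (obj_key)               -- managed by the key, not the URL
) ENGINE=MEMORY;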
We will do a double request manipulation on each request: one on the
intercept/forward proxy, and the second on the cache_peer/second-instance
proxy.
The flow is like that:
request from client ------------------------------------->proxy1
proxy1--------------------------------------------------->ICAP server
(proxy1 has an ACL that sends the real domain to REQMOD on the ICAP server)
ICAP server (extracts the identifying data of the object from the URL,
pairs it in the DB with the URL, then rewrites the request to a spoofed
domain with the key in the URI)----->proxy1
example:
http://dfn.dl.sourceforge.net/project/npp-compare/1.5.6/compare-1.5.6-unicode.zip
becomes:
http://dl.df.squid.internal//project/npp-compare/1.5.6/compare-1.5.6-unicode.zip
and the pair is stored in the DB as the key and the original URL, with a
timestamp.
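To make that concrete, here is a plain-Java model of the proxy1-side
REQMOD logic (this is not GreasySpoon's exact script API; the DB name
and credentials are made up, and MySQL Connector/J is assumed to be on
the classpath):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class Proxy1Reqmod {
    // pair the key with the original URL in the DB,
    // then return the rewritten (spoofed) request URL
    static String rewrite(String url) throws Exception {
        if (!url.matches("^http://[a-z0-9]+\\.dl\\.sourceforge\\.net/.*")) {
            return url; // not a URL we manipulate - leave it untouched
        }
        String key = url.replaceFirst(
                "^http://[a-z0-9]+\\.dl\\.sourceforge\\.net/",
                "http://dl.df.squid.internal//");
        Connection c = DriverManager.getConnection(
                "jdbc:mysql://127.0.0.1/msi", "msi", "secret");
        try {
            PreparedStatement st = c.prepareStatement(
                    "REPLACE INTO url_pairs (obj_key, real_url) VALUES (?, ?)");
            st.setString(1, key);
            st.setString(2, url);
            st.executeUpdate();
        } finally {
            c.close();
        }
        return key;
    }
}

REPLACE is used instead of INSERT so that a repeated request simply
refreshes the pairing (and its timestamp) instead of failing on the key.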
proxy1 requests, as a client, the spoofed object -------------->proxy2
(proxy1's ACL for the "squid.internal" dstdomain peers it to proxy2)
proxy2--------------------------------------------------------->ICAP
(proxy2 has ACLs that allow only the spoofed domain ".squid.internal" to
be sent to REQMOD on the ICAP server, to prevent an endless loop)
ICAP server------------------------------------------------>proxy2
The ICAP server rewrites the request back to the paired URL instead of
the key.
This is because we want to fetch the real object, recursively, into
proxy1's cache.
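A matching plain-Java model of the proxy2-side REQMOD logic (again, not
the exact GreasySpoon API; the same assumed table and credentials):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class Proxy2Reqmod {
    // look up the original URL that was paired with the spoofed key
    // and rewrite the request back to it
    static String unrewrite(String keyUrl) throws Exception {
        if (!keyUrl.contains(".squid.internal")) {
            return keyUrl; // safety net: only spoofed requests get here
        }
        Connection c = DriverManager.getConnection(
                "jdbc:mysql://127.0.0.1/msi", "msi", "secret");
        try {
            PreparedStatement st = c.prepareStatement(
                    "SELECT real_url FROM url_pairs WHERE obj_key = ?");
            st.setString(1, keyUrl);
            ResultSet rs = st.executeQuery();
            return rs.next() ? rs.getString(1) : keyUrl; // no pair: pass as-is
        } finally {
            c.close();
        }
    }
}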
In this state, proxy1 thinks it is fetching the spoofed key, aka:
http://dl.df.squid.internal//project/npp-compare/1.5.6/compare-1.5.6-unicode.zip
but proxy2 is feeding it:
http://dfn.dl.sourceforge.net/project/npp-compare/1.5.6/compare-1.5.6-unicode.zip
proxy2------------>proxy1---------->client
So this specific state is logically like this:
the client thinks he is fetching the real file;
proxy1 fetches a spoofed file/URL from proxy2;
proxy2 fetches the real file/URL from the real server and feeds it to proxy1.
But the next time a client tries to get one of these objects:
http://dfn.dl.sourceforge.net/project/npp-compare/1.5.6/compare-1.5.6-unicode.zip
http://X.dl.sourceforge.net/project/npp-compare/1.5.6/compare-1.5.6-unicode.zip
http://yyy.dl.sourceforge.net/project/npp-compare/1.5.6/compare-1.5.6-unicode.zip
if proxy1 has the spoofed object
http://dl.df.squid.internal//project/npp-compare/1.5.6/compare-1.5.6-unicode.zip
in its cache, it will serve it from there; otherwise it will be fetched
from the Internet through proxy2.
*This is the main concept!*
I have a working setup for:
YouTube
ytimg
IMDb MP4/FLV
SourceForge
some Facebook content
blip.tv
Vimeo
Dailymotion
Metacafe
AV updates
FileHippo
Linux distro repos (needs a change in the DB/key structure/match rules)
If you have more features that could be useful, I will be happy to try them.
(There is an access.log file with some nice data.)
Regards,
Eliezer
--
Eliezer Croitoru
https://www1.ngtech.co.il
IT consulting for Nonprofit organizations
eliezer <at> ngtech.co.il