Re: I started writing some document for squid wiki and i need a review on it.

From: Amos Jeffries <squid3_at_treenet.co.nz>
Date: Tue, 10 Jul 2012 20:20:53 +1200

Hi Eliezer,
   Thank you for the write up. Some of this background info seems a bit
wrong to me. Comments inline below.

Amos

On 10/07/2012 5:55 p.m., Eliezer Croitoru wrote:
> I started writing a document for the Squid wiki and I need a review of it.
> (moinmoin file format attached)
> It's only a draft, written with an open mind and little knowledge.
>
> The document has a couple of topics about caching and dynamic content.
> It can be split into a couple of documents later.
> The lead item in it is my implementation of youtube\dynamic content
> caching using an ICAP server.
> There is some very simple Ruby code in it.
>
> I have just heard about ESI caching and have never really been given\seen
> a simple "this is how ESI works\is implemented", and I will be more than
> glad to hear about it (I don't want to read pages upon pages, just a
> couple of examples that will show the concept).
>
> Another issue:
> I also integrated into MY ICAP server a basic content filtering feature
> based on squidGuard\DansGuardian domain blacklists.
>
> I ran tests on my ICAP server and the Squid connected to it.
> Squid + ICAP on an Intel Atom D510 sustains 800+ requests per second.
> The ICAP server by itself can handle 1150+ requests per second with a
> maximum response delay of 1.02 seconds in the case of a 30-second
> load build-up to 3735 concurrent connections (4 workers\forks).
>
> In the next couple of weeks I will have much stronger hardware (24
> CPUs, 32GB RAM) just for these tests.
>
> I am looking for Linux kernel + MySQL tweak recommendations for this
> machine.
>
> I will run with no logs to reduce disk access.
> I have already enabled the MySQL query and table caches to make it all faster.
>
> Thanks,
> Eliezer
>
> --
> Eliezer Croitoru
> https://www1.ngtech.co.il
> IT consulting for Nonprofit organizations
> eliezer <at> ngtech.co.il
>
>
> cache_moinmoin.txt
>
>
> ##master-page:CategoryTemplate
> #format wiki
> #language en
>
> = Caching Dynamic Content with Icap =
> <<Include(ConfigExamples, , from="^## warning begin", to="^## warning end")>>
> This page is an ongoing development. Not least because it must keep up with youtube.com alterations. If you start to experience problems with any of these configs please first check back here for updated config.
>
> <<TableOfContents>>
>
> == Problem Outline ==
> Since its early days and up to now (3.2), Squid has mainly used the URL of an HTTP request as the "object key" in the cache.
  Squid versions are referenced by wiki-link - [[Squid-3.2]] - but there is
nothing special about 3.2 in this regard. The URL *is* the resource key.
It has been and remains the fundamental design property of HTTP.
squid-3.2 is not changing that.
> This approach is based on the assumption that each GET request for a URL should identify one and only one object.
s/object/resource/ - and this difference in definition appears to be the
source of the wrong statements below ...
> Dynamic content should be sent based on user data in a POST request,
> as defined in [[http://www.ietf.org/rfc/rfc1945.txt|RFC 1945]] section 8.1 for GET and 8.3 for POST.
>
> 8.1 "The GET method means retrieve whatever information (in the form of an entity) is identified by the Request-URI."
>
> 8.3 "The POST method is used to request that the destination server accept the entity enclosed in the request as a new subordinate of the resource identified by the Request-URI in the Request-Line."

RFC 2616 is a better reference for this. Squid has obeyed RFC 2068 since
about version 2.5, and seeks to obey RFC 2616 since about version 2.6.
Finally reaching 2616 compliance in 3.2.

>
> The RFC serves as a common "language"; users\developers\systems that are not RFC-obligated entities cannot be forced to comply with it.

What is this about?

"entity" in the RFC texts is about the data bodies of requests and
replies. "users/developers/systems" are irrelevant at that level.

>
> == What is Dynamic Content ==
> Dynamic content basically can vary in two ways:
> 1. one URL that can result in more than one object. (one to many)
> 2. two URLs that result in the same identical object. (many to one)

(2) is not dynamic content. I can point a hundred URLs at the same image
and it does not become dynamic.

Dynamic content is about the entity resource being generated on request.
The usual result of that is an entity which varies with each request and
contains request-specific information. eg web pages which contain the
name of the user logged in and requesting it. So (1) is correct, but not
for the reasons you seem to be describing.

>
> There are reasons for each and every one of them:
> * the result of a live content feed, based (or not) on arguments supplied by the end user.
> * a CMS (Content Management System) script design.
> * a temporary URL for content access based on credentials.
This one is only dynamic if the page generates a "200 OK" status
response which is different when logged out, or for each user when
logged in.
> * bad programming or fear of caching
> * privacy policies

These are related to your definitions, not to dynamic content per-se.

>
> == Marks of dynamic content in URL ==
> By default Squid applies a refresh_pattern to [[ConfigExamples/DynamicContent|dynamic content]] marks in the URL such as "?" and "cgi-bin" to prevent caching them.
> {{{
> refresh_pattern -i (/cgi-bin/|\?) 0 0% 0
> }}}

In text we prefix squid.conf directive names with "SquidConf:" as in
SquidConf:refresh_pattern to wiki-link to the documentation of that
directive.

>
> === "?" ===
> A question mark appended to the URL is used to pass arguments to a script and can indicate a "dynamic content" page that will vary with the arguments.
> The URL "http://wiki.squid-cache.org/index.html?action=login" will pass the argument "action=login" to the wiki server and result in a login page.
> If you send an argument to a static HTML file, such as "http://www.squid-cache.org/index.html?action=login", the result is just a longer URL.
> Many CMSs like WordPress use the question mark to identify a specific page\article stored in the system. ("/wordpress/?p=941")
>
>
> === CGI-BIN ===
> Many systems use CGI to run a script on a server that may or may not produce HTML output.
> I wrote a simple CGI script that shows the public IP address used to contact my server:
> http://www1.ngtech.co.il/cgi-bin/myip.cgi
> This script's result will vary per user and shouldn't be cached (a rough sketch of such a script follows below).
NOTE: the only reason these should not be cached is that old CGI systems
commonly do not send cache-control or expiry information to permit safe
caching. Same goes for the "?" query string scripts. The refresh_pattern
directive is specifically used so that dynamic content responses which
*do* contain sufficient cache control headers *are* cached.

> There is a convention that CGI scripts run under a "cgi-bin" directory as a mark of live content.
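> A rough sketch of what such a CGI script can look like (a hedged example, not the actual myip.cgi source):
> {{{#!highlight ruby
> #!/usr/bin/env ruby
> # Echo back the client's public IP address; the body therefore differs
> # for every client.
> print "Content-Type: text/plain\r\n\r\n"
> print "Your public IP address is: #{ENV['REMOTE_ADDR']}\n"
> }}}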
>
> == HTTP and caching ==
> Mark Nottingham wrote a very detailed document [[http://www.mnot.net/cache_docs/|"Caching Tutorial for Web Authors and Webmasters"]] about caching
> that I recommend reading.
> He also wrote a great tool for analyzing the cache headers of sites: [[http://redbot.org/|RedBot]]{{http://redbot.org/favicon.ico}}
> === HTTP headers ===
> Besides the URL itself, there are a couple of HTTP headers that can affect the result of a request and therefore the cache.
> The HTTP response can vary between clients based on request headers like "User-Agent", "Cookie" or others.
>
> It is very common for "User-Agent" to be used to identify the client software and respond differently.
> It can distinguish a mobile phone from a desktop, or the HTML format compatibility of a client.
> These headers can affect the response language, content and compression.
>
> Cache-specific headers can be used by a client to check the validity of currently cached objects.
> The "Expires" and "ETag" headers can show signs of an expired cache object.
>
> HTTP headers and status codes exist to help cache efficiency.
> A cache can send a request with an "If-Modified-Since" header and the server can confirm to the client that the file hasn't changed "since" with a "304" response code, as shown in the example below.
> A variety of headers can assist in this situation.
>
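> A small hedged example of such a conditional request (any cacheable URL works; the date below is arbitrary):
> {{{#!highlight ruby
> require 'net/http'
>
> uri = URI.parse('http://www.squid-cache.org/')
> req = Net::HTTP::Get.new(uri.request_uri)
> req['If-Modified-Since'] = 'Tue, 10 Jul 2012 00:00:00 GMT'
>
> res = Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }
> puts res.code  # "304" => cached copy is still fresh, "200" => full new body
> }}}
>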
> Common request headers are:
> {{{
> User-Agent:
> Accept-Language:
> Accept-Encoding:
> Cookie:
> If-Modified-Since:
> If-None-Match:
> }}}
>
> Common response headers are:
> {{{
> Cache-Control:
> Expires:
> Accept-Ranges:
> Transfer-Encoding:
> Vary:
> ETag:
> Pragma:
> }}}
>
> === HTTP 206\partial content ===
> Some problems arise when we want to cache HTTP/1.1 206 responses.
> * how to cache if at all?
> * how to handle chunks?
As individual entities, there is no reason to do anything else.

> * if we want to merge the chunks into one object, how?
Only if ETag is present and identical on both chunks.

> * how to handle overlapping ranges?
> * how to handle different types of ranges?

There is only one type permitted: bytes.

These are fixed algorithms outlined in the RFCs for caching range
responses. The fact that Squid has not been caching range responses does
not make these into problems.

>
> == Dynamic-Content bandwidth consumers ==
> The majority of the web will not be a kill while not being cached.

"a kill" ??

> If you look at some ISP\office graphs you will see that there is a pattern shaping them.
> Software updates and video content are well-known "dynamic content" bandwidth consumers.
Um. These are simple cache-unfriendly contents rather than true dynamic.
YouTube content is dynamic only in that the HTML pages and URLs are
generated dynamically. The videos themselves are static content, which
is why the old re-writer/redirector methods from before store_url worked.
Software updates are usually cacheable, WindowsUpdate being the
exception because it does range requests randomly into the interior of
archive files and Squid in particular does not cache range requests.

>
> Squid developers have previously tried to persuade YouTube to be more cache friendly, but it reached a dead end on YouTube's side.
>
> == Specific Cases analysis ==
> * Microsoft updates
Not dynamic. Very static, with range requests into the objects. The
problem is just Squid missing HTTP features.

> * Youtube video\img
Not really dynamic. Pages are, and URLs are dynamically created, but
they de-duplicate down to static video locations.

> * CDN\DNS load balancing
Not dynamic. Simple URL de-duplication.

> * Facebook

Here is the real dynamic content. With each page being constructed of
numerous database snippets, sometimes in live streams.

> === Microsoft Updates Caching ===
> The main problem with Microsoft updates is that they use 206 partial content responses, which cannot be cached by Squid.
> Sometimes the update file size is tens of MB, which leads to heavy load.
> A solution for that was proposed by Amos Jeffries at [[http://wiki.squid-cache.org/SquidFaq/WindowsUpdate|SquidFaq/WindowsUpdate]].
> In order to save maximum bandwidth, force Squid into downloading the whole file instead of partial content using:
> {{{
> range_offset_limit -1
> quick_abort_min -1
> }}}
> [[http://www.squid-cache.org/Doc/config/range_offset_limit/|range_offset_limit]] [[http://www.squid-cache.org/Doc/config/quick_abort_min/|quick_abort_min]]
>
> The problem is that these directives apply to the whole server and can cause some software to behave badly while it expects a partial response.
> Other than that, a request for a 1KB chunk of a 90MB file will result in 90MB of wasted bandwidth.
> So it's up to the proxy admin to configure the cache properly.
>
> === Youtube video\img ===
> YouTube serves video content according to the requesting user, in order to apply policies like "allow only a specific user\group\friends" etc.
> A video will be served to the same client differently within a matter of seconds.
> Most of the video URLs have some common identifying arguments, so in a way they can be cached.
> Since Squid mainly uses the URL to identify the cached object, this makes cache admins' lives harder,
> and it is compounded by the random patterns of the video URLs.
>
> In the past there were a couple of attempts to cache them using the old [[http://www.squid-cache.org/Doc/config/storeurl_rewrite_program|"store_url_rewrite"]] interface in Squid 2.X.
> Another solution was using "url_rewrite" combined with a web server, mentioned at [[http://wiki.squid-cache.org/ConfigExamples/DynamicContent/YouTube|ConfigExamples/DynamicContent/YouTube]]
>
> === CDN\DNS load balancing ===
> Many websites use a CDN (Content Delivery Network) to scale their website.
> Some of these use the same URL path on a different domain.
> One of the major open-source players I can demonstrate with is SourceForge.
> They have mirrors all over the world, and they use a domain prefix to select the mirror, as in:
> {{{
> http://iweb.dl.sourceforge.net/project/assp/ASSP%20Installation/README.txt
> http://cdnetworks-kr-2.dl.sourceforge.net/project/assp/ASSP%20Installation/README.txt
> }}}
> So this is a "two URLs point to one file" scenario that can be resolved fairly easily by storing all the sub-domains under one key.
> A kind of pseudo-rule for this:
> every subdomain of "dl.sourceforge.net" should be sotred as: "dl.sourceforge.net.some_internal_key".
"sotred" ?

> And a Ruby example to demonstrate code for that:
> {{{
> #!highlight ruby
> url = "http://iweb.dl.sourceforge.net/project/assp/ASSP%20Installation/README.txt"
> key = "http://dl.sourceforge.net.squid.internal/" + url.match(/.*\.dl\.sourceforge\.net\/(.*)/)[1]
> }}}
>
> A similar scenario occurs with AV updates that use more than one domain, or IP addresses directly, for redundancy.

NP: these are not dynamic due to these things. It is just an
ill-informed old way to perform load balancing across a CDN, reducing
the load on particular origin servers at the expense of increasing the
total load they have to face. i.e. it destroys network efficiency
rather than adding to it.

>
> === Facebook ===
> ''Facebook'' is another subject of bandwidth abuse, but it requires a moment of thought.
> As a cache admin you can see that Facebook is one of the top URLs in the logs and reports.
> Seeing a lot of URLs on one domain doesn't necessarily mean it consumes a lot of bandwidth.
> Facebook has a "history of violence" like all social networks, and not only in the sense of bandwidth.
>
> One of the problems with social networks is "privacy".
> These networks stores a huge amount of "private" data that cache can lead to "Invasion of privacy".

s/that cache/that when cached by ISP/.

Also, it is not the FB network storage which is a problem here. It is
the dynamic page content.

I'd put:

   These networks produce a large volume of responses containing private
data that when cached by an ISP can lead to "Invasion of privacy".

>
> * a case I have seen: due to a misconfiguration of a cache, people started getting the Facebook and Gmail pages of other users
>
> Privacy is an issue that a cache operator should consider very carefully while configuring the server ACLs (refresh_pattern).
> Since Facebook went worldwide they have indeed made a lot of effort to be cache friendly, using "Cache-Control" and similar headers.
> They use XML for updates, with headers such as:
> {{{
> Cache-Control: private, no-store, no-cache, must-revalidate
> }}}
> They use one CDN for video and IMG content at:
> {{{
> http://video.ak.fbcdn.net/...
> }}}
> *** > add here code snip for video url rewriting
> but you must have the key arguments to access the video.
> For IMG they use a "many to one" CDN, as in:
> {{{
> http://a6.sphotos.ak.fbcdn.net/...
> }}}
> and you can replace the "a6" with a many-to-one "key", as sketched below.
> *** > add here code snip for img URL rewriting
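> A rough Ruby sketch for the IMG case (the example path is made up; it mirrors the SourceForge de-duplication above):
> {{{#!highlight ruby
> # Collapse the per-request "aN" host prefix into one internal store key.
> url = "http://a6.sphotos.ak.fbcdn.net/hphotos-ak/example_photo.jpg"
> key = url.sub(/^http:\/\/a\d+\.sphotos\.ak\.fbcdn\.net\//,
>               "http://sphotos.ak.fbcdn.net.squid.internal/")
> }}}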

This is FB performing the "CDN\DNS load balancing" on their static
content. Nothing special or dynamic about the content. Only dynamic
URLs/domain names.

> == Caching Dynamic Content ==
> As I described the problems earlier, for each of the scenarios we can offer a solution.
>
>
> === Old methods ===
> The methods to cache dynamic content were mainly directed at YouTube videos and CDNs.

Specifically at URL de-duplication. The above mentioned sites were just
the drivers to make it a problem that needed solving quicker.
That effort continues in a number of ways (headers Content-MD5, Digest:,
Link:, etc).

There is also the problem of content copying around the web. For
example; how many sites contain their own copy of "jQuery.js" ? images,
icons, scripts, templates, stylesheets, widgets. All these things have
much duplication that reduces cache efficiency.

> ==== Store URL Rewrite ====
> Somewhere in the Squid 2.X tree the "store_url_rewrite" interface was integrated to solve cases such as "many to one" URLs.

"In [[Squid-2.7]] the SquidConf:store_url_rewrite interface "...

> An example is SourceForge, and it can be implemented for YouTube and others.
> {{{#!highlight ruby
> # Minimal store_url_rewrite helper: collapse all *.dl.sourceforge.net
> # mirror URLs into a single internal store key.
> def main
>   STDOUT.sync = true   # squid expects unbuffered, line-based replies
>   while (line = gets)  # stop cleanly on EOF instead of crashing on nil
>     request = line.split
>     case request[0]
>     when /^http:\/\/.*\.dl\.sourceforge\.net\//
>       puts "http://dl.sourceforge.net.squid.internal/" +
>            request[0].match(/.*\.dl\.sourceforge\.net\/(.*)/)[1]
>     else
>       puts ""
>     end
>   end
> end
> main
> }}}
>
> Pros:
> *simple to implement.
> Cons:
> *works only with squid2 tree
> *The check is based only on the requested URL. In the case of a 300 status code response the URL will be cached and can cause an endless loop.
> *A very old and unmaintained proxy version.
duplicate con with the first one.

> *There is no way to interact with the cached key in any of squid cache interfaces such as ICP\HTCP\CAHCEMGR, the object is a GHOST.
s/CAHCEMGR/[[Features/CacheManager|Cache Manager]]/.

>
> *To solve the 300 status code problem a specific patch was proposed but wasn't integrated into squid.
> *The 300 status code problem can be solved by ICAP RESPMOD rewriting.
Indent this by one more space so it shows up as a part of the * entry
above it.

> ==== Web-server and URL Rewrite ====
> In brief, the idea is to use the url_rewrite interface to silently redirect the request to a local web server script.
> In turn, the script will either fetch the URL for Squid and store the file on disk, or serve the already-cached file from disk.
> [[http://wiki.squid-cache.org/ConfigExamples/DynamicContent/YouTube#Partial_Solution_1:_Local_Web_Server|the proposed solution in more detail]]
>
> Another solution in the same style was used by [[http://code.google.com/p/youtube-cache/|youtube-cache]] and later extended in [[http://code.google.com/p/yt-cache/|yt-cache]]
>
> Pros:
> *works with any Squid version
> *easily adaptable for other CDN
> Cons:
> *no keep-alive support, and as a result it cannot cache YouTube "range" argument requests (the YouTube player will stall all the time)
> *There is no support for POST requests at all; they will be treated as GET (this can be changed with some coding).
> *If two people watch an uncached video at the same time, it will be downloaded by both.
> *It requires a web server running at all times
> *The cache dir will be managed manually by the administrator and not by Squid's smart replacement algorithms.
> *cannot be used with tproxy.
Huh? it can be used with TPROXY. The "local" web server just needs to be
a cache_peer.

>
> ==== NGINX as a Cache Peer ====
> In [[http://code.google.com/p/youtube-cache/|youtube-cache]] the author used the NGINX web server as a cache_peer and reverse proxy.
> The idea was to take advantage of NGINX's "proxy_store" ability as a cache store and its "resolver" option to make NGINX able to act as a forward proxy.
> NGINX has some nice features that allow it to easily use request arguments as part of the cache store key.
>
> For YouTube this can be used:
> {{{
> proxy_store "/usr/local/www/nginx_cache/files/id=$arg_id.itag=$arg_itag.range=$arg_range";
> }}}
>
> Pros:
> *works with any Squid version
> *easily adaptable for other CDN
> Cons:
> *no keep-alive support, and as a result it cannot cache YouTube "range" argument requests (the YouTube player will stall all the time)
> *A request will lead to a full file download and can cause DDoS or massive bandwidth consumption by the cache web server.
> *It requires a web server running at all times
> *The cache dir will be managed manually by the administrator and not by Squid's smart replacement algorithms.
> *cannot be used with tproxy.
>
> === Summary of the ICAP solution ===
> The "problem" with Squid versions newer than the 2.x tree is that the store_url_rewrite interface wasn't integrated, and as a result most users stayed on the old Squid version.
> Others have used the url_rewrite and web-server approach.
> Many have used [[http://cachevideos.com/|videocache]], which is based on the same idea, because it has updates, support and other features.
>
> This resulted in Squid servers serving files from a local NGINX\APACHE\LIGHTHTTPD, which created a very nasty cache maintainability problem.
>
> Many cache admins gained a YouTube video cache but lost most of Squid's advantages.
>
> The idea is to let Squid (2 instances) do all the caching, fetching etc. instead of using third-party cache solutions and web servers.
> With a long history of dynamic content analysis at work, I had this idea in mind for a long time but only recently tested and implemented it.
>
> The solution I implemented is meant for newer Squid versions (3+) and can be implemented using either of two options, an ICAP server or url_rewrite, while ICAP has many advantages.
> It requires:
> * 2 Squid instances
> * an ICAP server\url_rewrite script
> * a very fast DB engine (MySQL\PgSQL\Redis\others)
>
> What will it do? ''Cheat everyone in the system!!''
> ICAP and url_rewrite have the capability to rewrite the URL transparently to the client, so when a client requests a file from Squid,
> squid 1 will (based on ACLs) request ICAP REQMOD (request modification) from the ICAP server.
> Pseudo-code for the ICAP side:
> ##start
> analyze request.
> if request fits criteria:
>     extract the needed data from the request (from the URL and other headers)
>     create an internal "address" like "http://ytvideo.squid.internal/somekey"
>     store a key pair of the original URL and the modified URL in the DB.
> send the modified request to squid.
> ##end
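> A rough Ruby sketch of that mapping step only (ICAP framing omitted; the key format and the id\itag arguments are illustrative, the real echelon server differs):
> {{{#!highlight ruby
> require 'uri'
>
> # Map a matching video URL to a stable internal key and remember the
> # original URL so it can be restored later on the squid 2 side.
> def map_to_internal(url, db)
>   return url unless url =~ %r{^http://[^/]+\.c\.youtube\.com/videoplayback\?(.+)}
>   args = Hash[URI.decode_www_form($1)]
>   key  = "http://ytvideo.squid.internal/#{args['id']}-#{args['itag']}"
>   db[key] = url
>   key
> end
>
> # db can be any fast key\value store; a plain Hash stands in here.
> db = {}
> puts map_to_internal("http://r3.c.youtube.com/videoplayback?id=abc&itag=34", db)
> }}}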
> On squid 1 we pre-configured a cache_peer for all dstdomains under .internal, so the rewritten URL must be fetched through squid 2.
>
> squid 2 then gets the request for "http://ytvideo.squid.internal/somekey" and passes the request to the ICAP server.
> The ICAP server in turn fetches the original URL from the DB and rewrites the request back to the original origin server.
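> The reverse step is just a lookup; a hedged sketch (again without the ICAP framing):
> {{{#!highlight ruby
> # squid 2 side: restore the original origin URL for an internal key.
> # Unknown URLs pass through unchanged.
> def map_to_original(url, db)
>   db.fetch(url, url)
> end
> }}}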
>
> The status now is:
> client thinks it's fetching the original file.
> squid 1 thinks it's fetching the "http://ytvideo.squid.internal/somekey" file
> squid 2 feeds the whole network one big lie but with the original video.
>
> The Result is:
> squid 1 will store the video with a unique key that can be verified using ICP\HTCP\CACHEMGR\LOGS etc.
> squid 2 is just a simple proxy(no-cache)
> ICAP server coordinates the work flow.
>
> Pros:
> *cache managed by Squid's algorithms.
> *should work on any Squid version supporting ICAP\url_rewrite (tested on Squid 3.1.19).
> *can build the key based on the URL and all request headers.
> Cons:
> *depends on a DB and an ICAP server.
>
> === Implementing the ICAP solution ===
> Requires:
> *Squid with ICAP support
> *MySQL DB
> *ICAP server (I wrote [[https://github.com/elico/echelon|echelon-mod]] specifically for the project requirements)
> I also implemented this using the GreasySpoon ICAP server; it [[https://github.com/elico/squid-helpers/tree/master/squid_helpers/youtubetwist|can be found at github]]
> squid 1:
> {{{
> acl ytcdoms dstdomain .c.youtube.com
> acl internaldoms dstdomain .squid.internal
> acl ytcblock urlpath_regex (begin\=)
> acl ytcblockdoms dstdomain redirector.c.youtube.com
> acl ytimg dstdomain .ytimg.com
>
> refresh_pattern ^http://(youtube|ytimg)\.squid\.internal/.* 10080 80% 28800 override-lastmod override-expire ignore-no-cache ignore-private ignore-reload

Passing the response through ICAP you can just have ICAP set the expiry
information explicitly on the responses. Nothing to override, no
refresh_pattern at all. The ICAP 206 "continue" status makes this efficient.
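For example, a rough sketch (not the echelon code) of having RESPMOD stamp
explicit freshness headers onto the raw response head:

  # Strip any existing freshness headers and add an explicit Cache-Control,
  # so squid.conf needs no overrides for these responses.
  def add_freshness(head, max_age = 86400)
    head = head.gsub(/^(Cache-Control|Expires|Pragma):.*\r\n/i, '')
    head.sub(/\r\n\r\n\z/, "\r\nCache-Control: public, max-age=#{max_age}\r\n\r\n")
  end

  puts add_freshness("HTTP/1.1 200 OK\r\nContent-Type: video/mp4\r\n\r\n")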

>
> maximum_object_size_in_memory 4 MB
>
> #cache_peers section
> cache_peer 127.0.0.1 parent 13128 0 no-query no-digest no-tproxy default name=internal
> cache_peer_access internal allow internaldoms
> cache_peer_access internal deny all
>
> never_direct allow internaldoms
> never_direct deny all
>
> cache deny ytcblockdoms
> cache deny ytcdoms ytcblock
> cache allow all
>
> icap_enable on
> icap_service_revival_delay 30
>
> icap_service service_req reqmod_precache bypass=1 icap://127.0.0.2:1344/reqmod?ytvideoexternal
> adaptation_access service_req deny internaldoms
> adaptation_access service_req deny ytcblockdoms
> adaptation_access service_req allow ytcdoms
> adaptation_access service_req deny all
>
> icap_service service_ytimg reqmod_precache bypass=1 icap://127.0.0.2:1344/reqmod?ytimgexternal
> adaptation_access service_ytimg allow ytimg img
> adaptation_access service_ytimg deny all
> }}}
>
> squid 2
> {{{
> acl internalyt dstdomain youtube.squid.internal
> acl intytimg dstdomain ytimg.squid.internal
> cache deny all
>
> icap_enable on
> icap_service_revival_delay 30
>
> icap_service service_req reqmod_precache bypass=0 icap://127.0.0.2:1344/reqmod?ytvideointernal
> adaptation_access service_req allow internalyt
> adaptation_access service_req deny all
>
> icap_service service_ytimg reqmod_precache bypass=0 icap://127.0.0.2:1344/reqmod?ytimginternal
> adaptation_access service_ytimg allow intytimg
> adaptation_access service_ytimg deny all
> }}}
>
> MYSQL db
>
> {{{
> # i have used mysql db 'ytcache', table 'temp', with user and password 'ytcache', with full rights for localhost and IP 127.0.0.1
> create a MEMORY table in the DB with two very long varchar(2000) fields.
> create a user and give it full rights to the DB.
> # it's recommended to truncate the temp memory table at least once a day because it has a limited size.
> }}}
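> A hedged ruby-dbi sketch of that setup, using the 'ytcache' names from the comment above (the column names are illustrative only):
> {{{#!highlight ruby
> require 'dbi'
>
> dbh = DBI.connect('DBI:Mysql:ytcache:127.0.0.1', 'ytcache', 'ytcache')
> dbh.do <<SQL
> CREATE TABLE IF NOT EXISTS temp (
>   internal_url VARCHAR(2000),  -- the http://...squid.internal/ key
>   original_url VARCHAR(2000)   -- the real origin URL to fetch
> ) ENGINE=MEMORY
> SQL
> dbh.disconnect
> }}}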
>
> ICAP SERVER
>
> My ICAP server can be downloaded from [[https://github.com/elico/echelon|my github]].
> The server is written in Ruby and tested on version 1.9.
> Required for the server:
> {{{
> "rubygems"
> gem "bundler"
> gem "eventmachine"
> gem "settingslogic"
> gem "mysql"
> gem "dbi"
> }}}
> There is a settings file at config/settings.yml
>
> Note: set the server's local IP address in the config file.
>
> I have used IP 127.0.0.2 to allow very intense stress tests with a lot of open ports.