Re: [squid-users] Accelerating proxy not matching cgi files

From: Amos Jeffries <squid3_at_treenet.co.nz>
Date: Tue, 23 Aug 2011 18:53:17 +1200

On 23/08/11 07:43, Mateusz Buc wrote:
> Hello,
>
> at the beginning I would like to mention that I've already searched
> for the answer to my question and found similar topics, but none of
> them helped me completely solve my problem.
>
> The thing is I have a monitoring server with a CGI-scripted site on
> it. The site fetches various data and generates charts on the fly.
> Currently it is only available via HTTPS with htaccess-type authorization.
>
> The bad thing is that it is browsed quite often, and every time it
> gets HTTP requests it has to generate all of the charts (quite a lot
> of them) on the fly, which not only makes loading the page slow, but
> also affects the server's performance.
>
> The 4 most important things about the site:
> * index.cgi - checks current timestamps and generates proper GET
> requests to generate images via gen.cgi

You have an internal part of your site performing GET requests?

Or did you mean it generates an index page containing a set of volatile
URLs for IMG or A tags?

> * gen.cgi - it receives parameters via GET from index.cgi and draws charts
> * images ARE NOT files placed on server, but form of gen.cgi links
> (e.g. "gen.cgi?icon,moni_sys_procs,1314022200,1,161.6,166.4,FFFFFF...")

It does not matter to Squid where the files are, or even whether they are files.

> * images generation links contain most up-to-date timestamp for every
> certain image

Ouch. Bad, bad, bad for caching. Caches only work when the URLs are
stable across repeated requests for the same object.

  To get good caching, the URL should only contain parameters relevant
to where the data is sourced or how its content is structured; that is,
only details which could produce two different objects in two
_simultaneous_ parallel requests. Everything else is a potential problem.
For example, "gen.cgi?icon,moni_sys_procs,..." can be cached and
revalidated, while "gen.cgi?icon,moni_sys_procs,1314022200,..." becomes a
brand new object every time the embedded timestamp ticks over.

The one reason you might want timestamps in the URL is a wayback-machine
style archive, where time is an important static coordinate in the
location of something.

Your fix, part 1: sending cache-friendly metadata.

** Send "Cache-Control: must-revalidate" to require all browsers and
caches to double-check their content instead of making guesses about
storage times.

** Send "Last-Modified: " with HTTP formated timestamp. Correct format
is important.
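
A minimal sketch of part 1, assuming for illustration that gen.cgi were a
Python CGI script (emit_headers, graph_mtime and body_length are
hypothetical names, not anything from your site):

  # Hypothetical sketch: emit cache-friendly headers before the PNG body.
  # graph_mtime is assumed to be the Unix mtime of the graph's source data.
  from wsgiref.handlers import format_date_time   # RFC 1123 HTTP dates

  def emit_headers(graph_mtime, body_length):
      print("Content-Type: image/png")
      # Require browsers and caches to revalidate instead of guessing.
      print("Cache-Control: must-revalidate")
      # Correctly formatted HTTP date, e.g. "Tue, 23 Aug 2011 06:53:17 GMT".
      print("Last-Modified: " + format_date_time(graph_mtime))
      print("Content-Length: %d" % body_length)
      print()   # blank line terminates the CGI header block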

At this point incoming requests will either be requesting brand new
content or have an If-Modified-Since: header containing the cached
object's Last-Modified: timestamp.

  NOTE: You will not _yet_ see any reduction in the 200 requests. You
might even see an increase, as "must-revalidate" causes middleware
caches to start working better.

Your fix, part 2: reducing CPU-intensive 200 responses.

  It is up to your gen.cgi whether it responds quickly with a simple 304
no-change, or creates a whole new object for a 200 reply.
  This decision can now be based on the If-* information the clients are
sending as well as on the URL.

  ** Pull the timestamp from the If-Modified-Since header instead of the URL.
   - If there is no such header, the client requires a new graph.
   - If the timestamp matches or is newer than the graph the URL
describes, send a 304 instead (see the sketch after this list).

  ** Remove the timestamp completely from your URLs, unless you want that
wayback ability; in which case you may as well make it visible and easy
for people to type in URLs manually for particular fetches.
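
A minimal sketch of that decision, again assuming a hypothetical Python
CGI script; maybe_send_304 and graph_mtime are made-up names:

  # Hypothetical sketch: answer 304 when the client's copy is still current.
  import os
  from email.utils import parsedate_to_datetime

  def maybe_send_304(graph_mtime):
      """Emit a 304 and return True if the client's cached graph is fresh."""
      ims = os.environ.get("HTTP_IF_MODIFIED_SINCE")
      if not ims:
          return False                   # no header: client needs a full 200
      try:
          client_time = parsedate_to_datetime(ims).timestamp()
      except (TypeError, ValueError):
          return False                   # unparseable date: play safe, send 200
      if client_time >= graph_mtime:     # client copy is as new as the data
          print("Status: 304 Not Modified")
          print("Cache-Control: must-revalidate")
          print()                        # headers only, no body on a 304
          return True
      return False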

At this point your gen.cgi script starts producing a mix of fast 304
responses amidst the slow 200 ones, and both your bandwidth and CPU
graphs should drop.

Your fix, part 3: KISS simplicity.

Your URLs should now be changing far less, possibly even to the point
that they are completely static. The less your URLs change, the better
caching efficiency you get.

Your index.cgi can now be made simpler, or possibly replaced with a
static page. It only needs to change when the type or location of a
graph changes in a way that affects the graph URLs.

As a follow-up you can experiment with the other cache control headers,
such as max-age, to find values (per graph URL) that avoid gen.cgi calls
completely for a suitable period.
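
For instance (a sketch only; the 60 seconds matches your stated "max 1
minute" goal for index.cgi and is just an example value):

  # Hypothetical sketch: let caches reuse the object for 60 seconds,
  # then force a revalidation. Tune the number per graph URL.
  print("Cache-Control: max-age=60, must-revalidate")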

If you are able to generate and validate an ETag value without much work
(for example, an ETag formed from the MD5 hash of the raw un-graphed data
file, or from a hash of the URL plus timestamp), then you should also add
ETag and other If-* header support to the scripts.
  That would allow Squid to use several more powerful caching features on
top of the simple 304/200 savings, such as partial ranges and
compression.
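
A minimal sketch of that, assuming the same hypothetical Python CGI setup
and a raw data file on disk (make_etag and data_path are made-up names):

  # Hypothetical sketch: strong ETag from the raw un-graphed data file,
  # plus an If-None-Match check that short-circuits to a 304.
  import hashlib
  import os

  def make_etag(data_path):
      with open(data_path, "rb") as f:
          return '"%s"' % hashlib.md5(f.read()).hexdigest()

  def maybe_send_304_for_etag(etag):
      """Emit a 304 and return True if the client already holds this entity."""
      inm = os.environ.get("HTTP_IF_NONE_MATCH", "")
      if etag in (tag.strip() for tag in inm.split(",")):
          print("Status: 304 Not Modified")
          print("ETag: " + etag)
          print()
          return True
      return False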

>
> What I want to do is to set another server in the middle, which would
> run squid and act as a transparent, accelerating proxy. My main
> problem is that squid doesn't want to cache anything at all. My goal
> is to:
>
> * cache index.cgi for a max of 1 minute - since it provides important
> data to generate charts
> * somehow cache images generated on the fly for as long as there aren't
> new ones in index.cgi (only possible if the timestamp has changed)
>
> To make it simpler to develop, I've temporarily disabled authorization,
> so my config looks like:
> #################################################################
> http_port 5080 accel defaultsite=xxxx.pl ignore-cc
>
> # HTTP peer
> cache_peer 11.11.11.11 parent 5080 0 no-query originserver name=xxxx.pl
>
> hierarchy_stoplist cgi-bin cgi ?

The above config line prevents the cache_peer source from being used for
URLs containing those strings. You can safely drop the line.

>
> refresh_pattern (\.cgi|\?) 0 0% 0

Okay. Check the case sensitivity of your web server; if it is not case
sensitive you will need to re-add the -i flag (i.e.
"refresh_pattern -i (\.cgi|\?) 0 0% 0") to prevent XSS problems.

> refresh_pattern . 0 20% 4320
>
> acl our_sites dstdomain xxxx.pl
> http_access allow our_sites
> cache_peer_access xxxx.pl allow our_sites
> cache_peer_access xxxx.pl deny all
> ##################################################################
>
> Unfortunately, access.log looks like this:
>
> 1314022248.996 66 127.0.0.1 TCP_MISS/200 432 GET
> http://xxxx.pl/gen.cgi? - FIRST_UP_PARENT/xxxx.pl image/png
> 1314022249.041 65 127.0.0.1 TCP_MISS/200 491 GET
> http://xxxx.pl/gen.cgi? - FIRST_UP_PARENT/xxxx.pl image/png
> 1314022249.057 65 127.0.0.1 TCP_MISS/200 406 GET
> http://xxxx.pl/gen.cgi? - FIRST_UP_PARENT/xxxx.pl image/png

NP: every unique URL is a different object in HTTP. Cache revalidation
cannot compare the object at URL A against the object at URL B. Only the
origin can do that sort of thing, and yours always produces a 200 when asked.

> Could someone tell me how to configure squid to meet my expectations?

Squid is configured by default to meet your expectations about caching.
It just requires sensible cache-friendly output from the server scripts.
See above.

Some great tutorials on URL design and working with caching can be found at:
   http://warpspire.com/posts/url-design/
   http://www.mnot.net/cache_docs/

Amos

-- 
Please be using
   Current Stable Squid 2.7.STABLE9 or 3.1.14
   Beta testers wanted for 3.2.0.10