Re: [squid-users] squid cache prob: won't cache a 'pdf'

From: Amos Jeffries <squid3_at_treenet.co.nz>
Date: Thu, 07 Apr 2011 13:31:43 +1200

 On Wed, 06 Apr 2011 12:34:26 -0700, Linda Walsh wrote:
> I was downloading some product documentation from the
> documentation section on:
>
> http://www.lsi.com/channel/products/jbods/sata_sas_jbods/630j/index.html
>
> Specifically, I tried:
>
> http://www.lsi.com/DistributionSystem/User/AssetMgr.aspx?asset=54432
> http://www.lsi.com/DistributionSystem/User/AssetMgr.aspx?asset=54841
> http://www.lsi.com/DistributionSystem/User/AssetMgr.aspx?asset=54435
>
> They all load smallish pdf's:
> (from log monitor:)
> +63.50 346ms; ln=473 (1.3K/7.4) TCP_MISS/200 <Athenae2 [HEAD
> http://www.lsi.com/DistributionSystem/User/AssetMgr.aspx?asset=54841
> -
> HIER_DIRECT/www.lsi.com application/pdf ]
> +7.01 220ms; ln=462 (2.1K/65.9) TCP_MISS/200 <Athenae2 [HEAD
> http://www.lsi.com/DistributionSystem/User/AssetMgr.aspx?asset=54435
> -
> HIER_DIRECT/www.lsi.com application/pdf ]
> +6.21 23914ms; ln=5051477(206.3K/795.4K) TCP_MISS/200 <Athenae2
> [GET
> http://www.lsi.com/DistributionSystem/User/AssetMgr.aspx?asset=54432
> -
> HIER_DIRECT/www.lsi.com application/pdf ]
>
> ----

 The first two requests here are HTTP HEAD requests, which fetch only the
 response headers. There is no body to cache, although the headers that
 come back could be cached. The third is a GET, which might be cached, but
 it will be a MISS if the existing cache entry only holds HEAD details.
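
 As a rough illustration of the HEAD/GET point, the method can be pulled
 out of each access.log line to see which entries were HEAD and which were
 GET. This is only a sketch: the helper name parse_access_line is made up
 for the example, and it assumes the default native log layout (time,
 elapsed, client, result/status, bytes, method, URL, ...) used by the
 access.log entry quoted in this thread.

```python
def parse_access_line(line):
    """Split one native-format access.log line into its leading fields.

    Hypothetical helper for illustration; assumes the default "squid"
    logformat: time elapsed client code/status bytes method URL ...
    """
    fields = line.split()
    if len(fields) < 7:
        raise ValueError("not a native access.log line")
    result, _, status = fields[3].partition("/")
    return {
        "result": result,        # e.g. TCP_MISS or TCP_HIT
        "status": int(status),   # HTTP status sent to the client
        "bytes": int(fields[4]),
        "method": fields[5],
        "url": fields[6],
    }


line = ("1302116600.765 108 192.168.3.140 TCP_MISS/200 468 HEAD "
        "http://www.lsi.com/DistributionSystem/User/AssetMgr.aspx?asset=54432 "
        "- HIER_DIRECT/www.lsi.com application/pdf")
entry = parse_access_line(line)
# A HEAD reply carries headers only, so it gives Squid no body to store.
print(entry["method"], entry["result"], entry["bytes"])
```

 Filtering on the method field this way separates the HEAD probes from the
 GET that actually transferred the PDF.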

>
> Now I've tried several mods in my squid.conf file (how do you get
> squid to display its version? I tried --version, but
> no go) -- am running something like Squid 3.2.0.4 (at least
> it's the last entry in the 'Changelog' on disk; it signs on
> as "Head-BZR").

 Thanks, that is about as good as you are likely to get. ('3.2.0.4 plus
 some patches').

 FYI: "squid -v" for the version and build info. But in your case that
 would show the same 3.HEAD-BZR for version.

 NP: It gets a bit difficult to track in the HEAD code. The daily
 snapshots have a 3.HEAD-$date, but anything more live than that requires
 the --build-info parameter to be added with something to identify it.

>
> Things I have tried:
> 1) commenting out:
> 'acl QUERY urlpath_regex cgi-bin \?'
> 'cache deny QUERY'

 Good. Using those lines would absolutely prevent caching of those
 requests, regardless of any other problems.

 This change has moved the cacheability from NO to MAYBE. Other factors
 (like the HEAD/GET difference) will still turn that MAYBE into a definite
 decision one way or the other.

> 2) adding back:
> 'acl QUERY urlpath_regex cgi-bin \?'
> 'cache allow QUERY' ## Note changed it to 'allow'

 Should not have any effect.

> 3) commenting out:
> 'hierarchy_stoplist cgi-bin ?'
> Note -- didn't think I needed this, as I had no other
> caches I was querying from, but a comment further on down
> under 'nonhierarchical_direct', said,
>
> "By default, squid will send any non-hierarchical
> requests (matching hierarchy_stoplist or not cachable
> request type) direct to origin servers. If you
> set this to off, Squid will prefer to send these request
> to parents."
>
> I took the comment to indicate that if something was in the
> hierarchy_stoplist, it would also prevent caching, thus my try
> in disabling it

 These only come into effect when fetching from a peer. Removing
 hierarchy_stoplist allows matching peer-sourced replies to possibly be
 cached here, and possibly in the peer as well.

> 4) In my refresh patterns, I have entries for ftp and gopher
> and one for ".": (which presumably would match everything else):
>
> refresh_pattern . 0 20% 4320
>
> To that line I have tried adding a bunch of keywords
> (note, it's all 1 line in the squid.conf file, no backslashes):
>
> refresh_pattern . 0 20% 4320 ignore-no-store \
> ignore-no-cache ignore-private ignore-auth override-expire \
> reload-into-ims
>
> The only ones I haven't tried yet are 'refresh-ims',
> 'override-expire' and 'override-lastmod', but those shouldn't
> be needed and might cause more headaches than it is worth.
>
> Is there something I'm missing? This seems like it should be
> 'simple'.

 You need to look at the actual PDF request and reply headers. That will
 tell you what is going on and which (if any) of the overrides are
 useful.

 Your log below has those Cache-Control details.

>
> Relevant log file entries are below (access, cache, store...)
>
>
>
> The full entry (from access.log) from one of the above shows:
> ------------------------------------------------------------
> 1302116600.765 108 192.168.3.140 TCP_MISS/200 468 HEAD
> http://www.lsi.com/DistributionSystem/User/AssetMgr.aspx?asset=54432
> -
> HIER_DIRECT/www.lsi.com application/pdf [Host:
> www.lsi.com\r\nUser-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0;
> en-US; rv:1.9.2.16) Gecko/20110319 Firefox/3.6.16\r\nAccept:
>
> text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8,application/json\r\nAccept-Language:
> en,en-us;q=0.5\r\nAccept-Encoding: gzip,deflate\r\nAccept-Charset:
> UTF-8,*\r\nKeep-Alive: 1800\r\nProxy-Connection: keep-alive\r\n]

 Okay. No reason not to cache.

> [HTTP/1.1 200 OK\r\nDate: Wed, 06 Apr 2011 19:03:16 GMT\r\nServer:
> Microsoft-IIS/6.0\r\nX-Powered-By: ASP.NET\r\nX-AspNet-Version:
> 2.0.50727\r\nContent-Disposition: attachment;
> filename=JBOD_Enclosures_Guide_080310.pdf\r\nSet-Cookie:
> ASP.NET_SessionId=vgzglkahj1njarzzn4yooun3; path=/;
> HttpOnly\r\nCache-Control: private\r\nContent-Type:
> application/pdf\r\nContent-Length: 5051083\r\n\r]

 Marked explicitly as "private", i.e. it cannot be cached by any
 middleware proxy (such as Squid) which may send it to other users. It may
 be cached by a personal cache such as the browser's storage.

 To me it looks like the website is sending an incorrect Cache-Control:
 header. Although if you require a login to fetch that doc, then it could
 be right.
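
 To make the header check concrete, here is a minimal sketch of how a
 shared cache might read the Cache-Control reply header. It is not Squid's
 actual code: the function names are made up for the example, and it only
 looks at the bare "private" and "no-store" directives. Everything else
 (Expires, Vary, authentication, quoted directive arguments, any
 refresh_pattern overrides) is ignored.

```python
def cache_control_directives(value):
    """Return the lower-cased directive names in a Cache-Control value."""
    names = set()
    for part in value.split(","):
        name = part.strip().split("=", 1)[0].strip().lower()
        if name:
            names.add(name)
    return names


def shared_cache_may_store(headers):
    """Simplified check: False if the reply says private or no-store.

    Illustrative only; a real proxy weighs many more factors.
    """
    directives = cache_control_directives(headers.get("Cache-Control", ""))
    return not (directives & {"private", "no-store"})


# Reply headers taken from the access.log entry quoted above.
reply = {
    "Content-Type": "application/pdf",
    "Content-Length": "5051083",
    "Cache-Control": "private",
}
print(shared_cache_may_store(reply))  # False
```

 Of the refresh_pattern options listed earlier in the thread,
 ignore-private is the one aimed at this directive.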

 <snip other logs>

 Amos
Received on Thu Apr 07 2011 - 01:31:48 MDT
