Re: [squid-users] refresh pattern questions

From: Amos Jeffries <squid3_at_treenet.co.nz>
Date: Mon, 15 Jul 2013 15:57:07 +1200

On 15/07/2013 6:31 a.m., Joshua B. wrote:
> I have some questions related to refresh pattern options
>
> First, since “no-cache” now seems ineffective with HTTP/1.1, what would be a possible way to force an object to cache under both HTTP/1.0 and HTTP/1.1? If it’s not possible, then are there any plans to implement it in a future version of Squid?

You are talking about "ignore-no-cache"? I'm not sure you understand
exactly what it did and what the newer Squid versions do instead.

Simply put:
   There is *no* HTTP/1.0 equivalent for "no-cache" on responses. The
best one can do is set an Expires header.
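
For illustration (the header values here are only an example), a response
aimed at both generations of caches might carry:

   Cache-Control: no-cache
   Expires: Thu, 01 Jan 1970 00:00:00 GMT

HTTP/1.1 caches act on the Cache-Control, while HTTP/1.0 caches only
understand the already-past Expires date and so treat any stored copy as
stale.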

Squid-2.6 to 3.1 had some small HTTP/1.1 support but were unable to
perform the tricky revalidation required for handling "no-cache"
responses properly, so they treated no-cache as if it were "no-store"
and prevented caching of those responses.

  ==> "ignore-no-cache" used to flip that behaviour and cause them to be
stored. This resulted in a great many objects being cached for long
periods and re-sent to clients from cached copies which were outdated
and might cause big UX problems (thus the warnin when it was used).

Squid-3.2 and later have far better HTTP/1.1 support *including* the
ability to revalidate "no-cache" responses properly. So these versions
of Squid *do* store responses with "no-cache" by default. They then
send an IMS (If-Modified-Since) request to the server to verify the HIT
is up-to-date - resolving all those UX problems.
   ==> the useful effect of "ignore-no-cache" does not need any config
option now, and the bad side-effects ... do you really want them?

** If you have a server and want the "old" behaviour for no-cache
responses, you should already have been sending "no-store" instead.

** If you have a server and want the "old" behaviour from when
"ignore-no-cache" was used, you should not have been sending "no-cache"
on responses to begin with.
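
As response headers, those two cases look roughly like this (example
values only):

   Cache-Control: no-store
       (never cached - the effect the old default gave you)

   Cache-Control: public, max-age=3600
       (cached and reusable for an hour without forced revalidation -
        roughly what "ignore-no-cache" tried to fake)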

> Secondly, why is there a limit of 1 year on an “override” method? A lot of websites make it such a pain to cache, and even go as far as (literally) setting the date of their files back to the early 1900s. Doing this makes it feel impossible to cache the object, especially with Squid’s own limitation.

To prevent 32-bit overflow on the numerics inside Squid. Going much
further out, the number wraps around (goes negative) and you end up with
objects being evicted from the cache instead of stored.
The refresh_pattern calculations need to be upgraded to 64-bit, and the
override-* and ignore-* options reviewed as to what they do versus what
the HTTP/1.1 spec already allows by default (as was just done for no-cache).
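
In squid.conf terms the ceiling is 525600 minutes (one year) in the max
field of refresh_pattern. A purely illustrative pattern pinned at that
limit:

   # min and max are in minutes; one year is the ceiling until the 64-bit work is done
   refresh_pattern -i \.(jpg|png|gif)$  0  20%  525600  override-expire override-lastmod

The regex and options here are only an example - adjust them to the
traffic you actually intend to override.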

You ever wonder why those websites go to such extreme lengths? Why they
care so much about their clients getting recently updated content?

> With all this said, IS there an effective way to cache content when the server doesn’t want you to? So there would be, like, a GUARANTEED “tcp_hit” in the log. Even with a ? in the URL of the image, so Squid would consider anything with a ? after it the same image. For example: website.com/image.jpg?1234567890
> It's the exact same image (I've examined all the entries in the logs that look like this), but they're making it hard to cache with the ? in the URL, so I'd like to know if there's a way around this?

1) Remove any squid.conf "QUERY" ACL and related "cache deny" settings
which Squid-2.6 and earlier required.
  That includes the hierarchy_stoplist patterns. These are the usual
cause of dynamic content not being cached in Squid-2.7+.
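
For reference, the legacy lines to look for (and delete) usually read
like this in old default configs:

   hierarchy_stoplist cgi-bin ?
   acl QUERY urlpath_regex cgi-bin \?
   cache deny QUERY

(older configs may spell the last one "no_cache deny QUERY")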

2) Try out the upcoming 3.4 (3.HEAD right now) Store-ID feature for
de-duplicating cache content; a sketch follows below.
In older versions you can also rewrite the URL to strip the numeric
suffix. In some ways this is safer, as the backend then becomes aware of
the alteration and smart ones can take special action to prevent any
massive problems if you accidentally collide with a security system (see
the questions below).
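
To give a feel for the Store-ID option, a rough sketch (the URL pattern,
helper path and file names are purely illustrative - check the 3.4
documentation for the exact helper protocol and directive details):

   # squid.conf (Squid 3.4 / 3.HEAD)
   acl dedup_img url_regex -i ^http://website\.com/image\.jpg\?[0-9]+$
   store_id_program /usr/local/libexec/storeid_dedup.py
   store_id_children 5
   store_id_access allow dedup_img
   store_id_access deny all

The helper reads URLs on stdin and answers with a normalised cache key,
e.g. something along these lines in Python:

   #!/usr/bin/env python
   # Sketch of a Store-ID helper: collapse image.jpg?<digits> onto one cache key.
   import re, sys

   pattern = re.compile(r'^(http://website\.com/image\.jpg)\?[0-9]+$')

   for line in sys.stdin:
       fields = line.split()
       if not fields:
           continue
       m = pattern.match(fields[0])
       if m:
           sys.stdout.write('OK store-id=%s\n' % m.group(1))
       else:
           sys.stdout.write('ERR\n')
       sys.stdout.flush()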

How do you know that "website.com/image.jpg?1234567890" is not ...
  ... a part of some captcha-style security system?
  ... the background image for a login button which contains the user's name?
  ... an image-written bank account number?
  ... an image containing some other private details?
  ... a script with dynamic references to other URLs?

To be sure that you don't make that type of mistake with all the many,
many ways of using URLs, you have to audit *every single link on every
single website which your regex pattern matches* ... or do the easy
thing and let HTTP caching controls work as they are supposed to work.
Send an annoyed email to the site in question requesting that they fix
their URL scheme, highlighting that they get *free* bandwidth in
exchange for the fix. Sites do change - Facebook is a good case study to
point at: as they scaled up they had to fix their cacheability and
HTTP/1.1 compliance to keep costs from exploding.

Amos
Received on Mon Jul 15 2013 - 03:57:18 MDT

This archive was generated by hypermail 2.2.0 : Mon Jul 15 2013 - 12:00:26 MDT