Re: [squid-users] caching for 60 minutes, ignoring any header

From: Eliezer Croitoru <eliezer_at_ngtech.co.il>
Date: Tue, 24 Sep 2013 08:09:34 +0300

Hey Ron,

I added notes near the quotes.

On 09/23/2013 12:13 PM, Ron Klein wrote:
> I'll describe the real scenario in a more detailed way, but I can't
> disclose all of it.
>
It's OK since it's a public list.

> There are a few machines, let's name them M1 to M9, that are processing
> data.
OK
> From time to time, those machines should make HTTP requests to external
> servers, that are business partners. All of these HTTP requests are in
> the same format and have the following request headers:
> * User-Agent: undisclosed_user_agent
> * Accept-Encoding: gzip, deflate
> * Host: the_hostname_of_the_external_server
> * Expect: [nothing]
> * Pragma: [nothing]
> That's it, nothing more, nothing less.
OK, and what are the server response headers?
These matter from many angles of the problem...

> On those servers, as we agreed, there should be an xml file in a
> specific path. For instance:
> http://foo.com/bar/daily-orders.xml
Let's say we describe an example scenario for that, using mighty
Google, for instance?
> (I can't disclose the exact path here)
NP, CIA-level secrecy is important!
> These files are re-generated from time to time. How often? I can't tell,
> and it's not up to me.
OK
> Now, since there are a few thousands of business partners that generate
> these xml files for my business, I thought that caching these xml files
> in a single machine would be a good idea, since it should reduce
> external traffic.
That is what's called a forward proxy.
> Therefore, I installed Squid3 on a specific machine, and updated M1-M9
> HTTP clients to use the proxy server instead of directly fetching the
> xml files.
Or intercept them... it's all up to you.
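As a rough sketch of those two options in squid.conf (the port
numbers here are assumptions, not taken from your setup):

```
# Explicit forward proxy: M1-M9 point their HTTP clients at this port.
http_port 3128

# Or intercept the traffic transparently instead; this also requires a
# REDIRECT/TPROXY rule on the gateway in front of the Squid box.
#http_port 3129 intercept
```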

> For business considerations, when an xml file is cached, I don't need it
> to be as fresh as possible. I want to reduce outgoing traffic as much as
> I possible.
Squid 3.1 and 3.3 work a bit differently here, and their logs also
differ somewhat on this point.
> My business partners don't care about it, too. They also don't want to
> change anything at all in their web servers. That's a fact I can't
> change what so ever.
Which is a major bad habit of many...
>
> All I want is to have a local copy of the xml file for every external
> server, that would be considered as "fresh" from T0 to T0+60minutes. For
> my business needs, that's what I need. And if some of the xml files are
> cached somewhere else, which is a rare scenario for this case, then I
> can ignore that (business-wise)
This is what is called a "mirror" site, and in the proxy world it
would be considered a "stale" cached object or an "offline copy".
>
> I initially thought that the favicons example would simplify things
> (since a lot of web sites have favicons, and it's a common knowledge),
> but I wasn't aware of the special case of favicons. I apologize for the
> time wasted about my simplified example.
It's fine to simplify, since not everyone can see the whole picture
from one favicon...
>
> I hope I shed more light about the subject.
>
Yes indeed.
I'll give an example case that can help you understand the complexity
of the issue.
Here are two responses that can be inspected using redbot:
http://redbot.org/?uri=http%3A%2F%2Fwww.google.co.il%2Ffavicon.ico
http://redbot.org/?uri=http%3A%2F%2Fwww.google.co.il%2Findex.html

The above tool simulates a simple request and shows the differences
between the two responses.

The favicon.ico is a nice and simple target to cache if the server
serves it in a simple way. When the server starts to complicate things
at the application level and change response headers for certain
clients, it's another story.
The above might be the reason your partners do not want to change
their applications.

The above also puts you in a situation where you might not be able to
cache the object (file) in the simple way Squid offers out of the box,
since Squid is a *general* HTTP caching proxy and might not handle
every very, very deep complexity introduced by the site's developer.

With the above at stake, and since these XML files are only for
machines M1-M9 (right?), the basic approach would be to "inject" these
files into the cache, or to store them on a dedicated *offline cache
server*.
You are not the first to ask for this, but Squid is an *online* cache,
not a storage mechanism.
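If you do want to try the online-cache route first, here is a minimal
refresh_pattern sketch along the lines Amos describes below (the URL
regex and the override flags are my assumptions; verify them against
your partners' real response headers before relying on them):

```
# Treat the partner XML files as fresh for 60 minutes, overriding the
# origin's Expires/Last-Modified hints.
refresh_pattern -i \.xml$ 60 0% 60 override-expire override-lastmod

# Everything else keeps the stock default policy.
refresh_pattern . 0 20% 4320
```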

Since you have a specific issue with specific clients, and a specific
issue with specific servers, you will need to go over Squid's logs,
using the debug_options Amos suggested, to document this case and make
sure that the right solution is delivered for your very specific
scenario, considering the *local* business effect of the so-called
*cache*.
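As a hedged example, such a debug_options line could look like this
(the sections and levels are my guess at a useful starting point:
section 11 covers HTTP handling, section 22 covers the refresh and
freshness calculations):

```
# Keep everything at the quiet default level 1, but trace HTTP
# handling (section 11) and refresh-calculation decisions (section 22)
# in more detail.
debug_options ALL,1 11,3 22,3
```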

I have posted before that a cache maintainer needs to remember that
caching is not the only option for all cases, and that revalidation is
not such a bad thing.
In Squid 3.3.x, if the exact same simple request results in a simple,
cacheable response, a revalidation is expected, and a re-download can
be the right choice. That is not a bad result, since we are talking
about money and real life, right? (This is why all the fuss about
these XML files.)

When you have more details on the couple of different responses, and
more details on the request, feel free to send me or anyone on the
project a PM.
You can mangle them, but note that the less information we have, the
more things can go wrong, and you may get unexpected results.

Eliezer

> Thanks!
>
> On 23-Sep-13 11:21, Amos Jeffries wrote:
>> On 23/09/2013 7:21 p.m., Ron Klein wrote:
>>> My example of favicons was to simplify the question. The real case is
>>> different.
>>
>> Then please tell us the real details. In full if possible.
>> favicon is one of the special-case type of URLs and like Eliezer and I
>> already mentioned there are some specific usage for them which
>> directly causes problems with your stated goals or even using it as a
>> simplified test case. Perhaps your real case is also using similar
>> special-case URLs with other problems - but nobody can assist with
>> that if you hide details.
>>
>> So please at least avoid "favicon" references for the remainder of
>> this discussion. You have indicated that they are irrelevant.
>>
>>> I want to cache all "favicons" (that is, other resources, internally
>>> used) for 60 minutes.
>>> For a given "favicon", I'd like to have the following caching policy:
>>
>> Anywho, ignoring all the protocol and UA special-case behaviour
>> factoids because you said that was a fake example...
>>
>>> The period of 60 minutes should start when the first consumer
>>> consumes the favicon. Let's mark the time for that first request as
>>> T0 (T Zero).
>>
>> Your policy assumes and requires that your proxy is the only one
>> between users and the origin server. If your upstream at any stage
>> have a proxy the object age will not meet your T0 criterion - this is
>> why Last-Modified and Age headers are used in HTTP. To indicate an
>> objects time since creation regardless of whether the object might
>> have been newely generated by the origin, altered by an intermediary
>> or stored for some time by an intermediary or the origin itself
>> (server-side caching or static archive).
>>
>> FWIW: I am working with a client at present who want to do this type
>> of caching for every URL in existence, but only for a few minutes.
>> They have a growing list of domain names where the policy has to be
>> disabled due to problems it causes to user traffic.
>>
>>> During T0 until T0+60minutes, this favicon should be considered as
>>> "fresh", in terms of caching.
>>
>> The single value of 60 in the refresh_pattern line "max" field along
>> with override-expire override-lastmod meets the above criteria.
>>
>> However as I said earlier, freshness does not guarantee a HIT. There
>> are many other HTTP features which need to be considered on top of
>> that freshness to determine whether it HITs or MISSes.
>>
>>> After T0+60minutes, this favicon should be considered as "stale", in
>>> terms of caching, and should be re-fetched by Squid, upon request.
>>
>> There is no such thing as a refetch in HTTP caching.
>> There is only MISS or REFRESH. The revalidation may happen
>> transparently at any time and you never see it.
>>
>>> The favicon would be cached even if the original server explicitly
>>> instructed not to cache nor store the favicon.
>>
>> The refresh_pattern ignore-private and ignore-no-store meet that
>> criteria in a way. The object resulting from the current transaction will
>> be left in the cache regardless of what might happen to it on any
>> future or past ones.
>>
>>> Yes, I know it might be considered a bad practice,
>>
>> As stated your caching policy is not particularly bad. The use/need of
>> ignore-private and ignore-no-store is the only bad thing and the
>> strong sign that you are possibly violating some law...
>>
>>> and perhaps illegal to some readers,
>>
>> ... so consulting a lawyer is recommended.
>>
>> We provide those controls in Squid for specific use-cases. Yours may
>> or may not be one of those it is hard to tell from a fake example.
>>
>>> but I assure you that the other servers (the real web servers) that
>>> provide the responses, are business partners and they gave me their
>>> approval to override their caching policy. However, they don't want
>>> to change their configuration and it's totally up to me to create my
>>> caching layer.
>>
>> They may not be willing to alter their public cache controls, but
>> Surrogate-Control features available in Squid offer an alternative
>> targeted caching policy to be emitted by their servers for your proxy.
>> This assumes they are willing to setup such alternative policy and you
>> configure your proxy as a reverse-proxy for their traffic.
>>
>> Your whole problem would be solved by the upstream simply sending:
>> Surrogate-Control: max-age=3600;your_proxy_fqdn
>>
>>> And another thing: the clients are not web browsers. The clients
>>> consuming these resources ("favicons" for sake of simplicity) are
>>> software components using HTTP as their transport protocol.
>>>
>>> Thanks for any advice on the subject.
>>
>> Well...
>> you have a set of URLs with undefined behaviour differences from the
>> notably special-case ones in your example ...
>> being fetched by clients with undefined but very big behaviour
>> differences from the UA which would be fetching your example URLs ...
>>
>> ... and you want us to help with specific details about why your
>> config is not working as expected?
>> As the old cliche goes "insufficient data".
>>
>> Amos
>>
>
Received on Tue Sep 24 2013 - 05:09:50 MDT

This archive was generated by hypermail 2.2.0 : Tue Sep 24 2013 - 12:00:04 MDT