Re: [squid-users] a miss threshold for certian times of a specified webpages from Amos Jeffries on 2012-07-02 (squid-users)

From: Amos Jeffries <squid3_at_treenet.co.nz>
Date: Tue, 03 Jul 2012 00:36:04 +1200

On 2/07/2012 6:13 p.m., Mustafa Raji wrote:
>
> --- On Mon, 7/2/12, Amos Jeffries <squid3_at_treenet.co.nz> wrote:
>
>> From: Amos Jeffries <squid3_at_treenet.co.nz>
>> Subject: Re: [squid-users] a miss threshold for certian times of a specified webpages
>> To: "Mustafa Raji" <mustafa.raji_at_yahoo.com>
>> Cc: squid-users_at_squid-cache.org
>> Date: Monday, July 2, 2012, 12:29 AM
>> On 02.07.2012 10:24, Mustafa Raji wrote:
>>> --- On Sun, 7/1/12, Amos Jeffries wrote:
>>>
>>>> From: Amos Jeffries
>>>> On 1/07/2012 1:04 a.m., Mustafa Raji wrote:
>>>>> hello
>>>>>
>>>>> is there an option that limits number of access to webpage
>>>>> before it can be consider as a cachable and caches the webpage
>>>>> example
>>>>> some option like a ( miss threshold ) = 30
>>>>> so the user requests the page for a 30 time and this requests
>>>>> of the objects can by consider as a miss requests, after the
>>>>> user request reaches this threshold (30), then squid can
>>>>> consider this webpage objects as a cachable objects and began
>>>>> to cache these objects
>>>>
>>>> Uhm, why are you even considering this? What benefit can you
>>>> gain by wasting bandwidth and server CPU time?
>>>> HTTP servers send out Cache-Control details specifying whether
>>>> and for how long each object can be cached for.
>>>> Replacing these controls (which are often carefully chosen by
>>>> the webmaster) with arbitrary other algorithms like the one you
>>>> suggest is where all the trouble people have with proxies comes
>>>> from.
>>>>
>>>> Amos
>>>>
>>>>
>>> thanks Amos for your reply
>>> what about an option that can consider the first 60 http requests
>>> for google webpage as a miss, and after the 60 requests the google
>>> webpage can be allowed to be cached, is there any option in squid
>>> to do this, of course without time limitation
>>
>> No because HTTP is stateless protocol where each requests
>> MUST be considered in isolation from every other request.
>> Squid can handle tens of thousands of URL per second, each
>> URL being up to 64KB line with multiple letters at each byte
>> position. Keeping counters for every unique URL received by
>> Squid over an unlimited time period would be as bad or worse
>> than simply caching in accordance with HTTP design
>> requirements.
>>
>> Which is why I asked; Why do you think this is a good idea?
>> what are you getting out of it? what possible use would
>> outweigh all the wasted resources?
>>
>>
>> NP: the google webpage (any of them including the front
>> page) changes dynamically, with different displays depending
>> on user browser headers, Cookies and on Geo-IP based
>> information. Storing when not told to is a *bad* idea.
>> Discarding when told storage is possible is a waste of
>> bandwidth.
>>
>> Amos
>>
> thanks Amos for your helpful support
> really i need it for just test, a method to calculate the increasing of squid box hardware (disk space) to get a highly hit ratio,
>
> finding how much hardware i can use to get hit ratio with the calculation of hardware worth to the hit ratio, i hope i was clear in my explanation

Oh. Hit ratio is not something you can test like that. Attempts at
partial-caching will actively *reduce* it.

> simple example/
>
> if i add a hard of 500 gigabyte i can reach a hit ratio 20% this is worth of adding this hardware
> if i add a hard of 500 gigabyte i can reach 2% hit ratio it's not worth to add this hardware

>
> if course in this community, there is a good method to calculate that please can you show me how to do that if you have time or just give me a links to a webpage explain how to do that

Hit ratio is the ratio of cacheable to non-cacheable content in your
HTTP traffic flow. Imagine the cache storage size as a "window" of
traffic over which this HIT ratio is accumulated. There are some complex
feedback effects, in that each 1% of HIT increases the actual traffic
window size by 1%, and things like that.

So as you can see, the best way to calculate HIT ratio is to take a
measure of your users' HTTP traffic (twice as large as the proposed 500GB
cache size [in case you are lucky enough to get ~50% hit ratio]) and see
how many of the requests were repeat requests for the same URL. That
count as a percentage of the total requests is a rough upper-limit HIT
ratio for your users. You can guess that a bit of those were
non-cacheable, but long-term, due to that increasing-window effect, your
request HIT ratio will trend around that number.

NOTE: unfortunately I'm not aware of any tools that make this
calculation easy. If you have an existing Squid, you can do it from the
historic access.log. Otherwise you are left with tricky TCP dumps etc.
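
For what it's worth, the repeat-request count described above can be
roughly sketched in a few lines of Python. This is only an illustration,
not an existing tool: the function name repeat_request_ratio and the
max_bytes parameter are made up for this example, and it assumes the
default native access.log layout, where the fifth whitespace-separated
field is the reply size in bytes and the seventh is the URL. A custom
logformat would need different field positions.

```python
from typing import Iterable, Optional, Tuple


def repeat_request_ratio(
    lines: Iterable[str], max_bytes: Optional[int] = None
) -> Tuple[int, int, float]:
    """Return (total requests, repeat requests, repeats / total).

    Assumes the default native access.log layout:
    time elapsed client code/status bytes method URL ...
    """
    seen = set()   # every URL observed so far
    total = 0      # requests counted
    repeats = 0    # requests whose URL was already seen
    traffic = 0    # running sum of reply sizes, in bytes

    for line in lines:
        fields = line.split()
        if len(fields) < 7:
            continue  # skip blank or truncated lines
        try:
            size = int(fields[4])
        except ValueError:
            continue  # skip lines whose size field is not a number
        url = fields[6]

        total += 1
        traffic += size
        if url in seen:
            repeats += 1
        else:
            seen.add(url)

        # Optionally stop once the sample "window" is large enough,
        # e.g. twice the proposed cache size.
        if max_bytes is not None and traffic >= max_bytes:
            break

    ratio = repeats / total if total else 0.0
    return total, repeats, ratio
```

To try it, open the log file and pass the file object in, for example
repeat_request_ratio(open(path), max_bytes=2 * 500 * 10**9), where path
is wherever your access.log lives. Keep in mind the result is only the
rough upper limit discussed above: it counts every repeated URL as a
potential HIT, whether or not the response was actually cacheable.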

But effectively, the larger your cache storage size, the more HIT
traffic you can achieve. Nobody I know of is proxying real user traffic
and getting less than 5% HIT ratio without something (like a small or no
disk cache) limiting the traffic that can be HIT on. Current Squid
versions are getting 10%-20% ratios out of the box for ISPs that set up a
cache and leave it without any special attention. Special-case mobile
networks with tuning are achieving up to 55% at one ISP.
... and we are constantly doing things to improve cacheability, from
adjusting Squid cache control algorithms to advocating cache-friendly
practices by web designers.

Amos
Received on Mon Jul 02 2012 - 12:36:14 MDT
