Re: [squid-users] Squid with PHP & Apache

From: Amos Jeffries <squid3_at_treenet.co.nz>
Date: Thu, 28 Nov 2013 00:28:30 +1300

On 27/11/2013 5:30 p.m., Ghassan Gharabli wrote:
> On Tue, Nov 26, 2013 at 5:30 AM, Amos Jeffries wrote:
>> On 26/11/2013 10:13 a.m., Ghassan Gharabli wrote:
>>> Hi,
>>>
>>> I have built a PHP script to cache HTTP 1.X 206 Partial Content like
>>> "WindowsUpdates" & Allow seeking through Youtube & many websites .
>>>
>>
>> Ah. So you have written your own HTTP caching proxy in PHP. Well done.
>> Did you read RFC 2616 several times? Your script is expected to obey
>> all the MUST conditions and clauses in there discussing "proxy" or "cache".
>>
>
> Yes, I have read it and I will read it again, but the reason I am
> building such a script is that internet here in Lebanon is really
> expensive and scarce.
>
> As you know, Youtube sends dynamic chunks for each video. For
> example, if you watch a video on Youtube more than 10 times, Squid
> fills up the cache with more than 90 chunks per video. That is why
> allowing seeking to any position of the video through my script
> would save me the headache.
>

Youtube is a special case. They do not strictly use Range requests for
the video seeking. If you are getting those, lucky you.
They are also multiplexing videos across multiple URLs.

>>
>> NOTE: the easy way to do this is to upgrade your Squid to the current
>> series and use ACLs on the range_offset_limit directive. That way Squid
>> will convert Range requests to normal fetch requests and cache the
>> object before sending the requested pieces of it back to the client.
>> http://www.squid-cache.org/Doc/config/range_offset_limit/
>>
>>
>
> I have successfully supported HTTP 206 if the object is cached, and my
> target is to enable Range headers, as I can see that iPhones or Google
> Chrome check whether the server sends an Accept-Ranges: bytes header and
> then request bytes=x-y or multiple ranges like bytes=x-y,x-y.
>

Yes, that is how Range requests and responses work.

What I meant was that Squid already contains a feature to selectively
cause the entire object to be cached so it can generate the 206
responses for clients itself.
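
Roughly, something like this in squid.conf would do it (Squid 3.2 or
later; the ACL name and domain list below are only an example, adjust
them to the sites you care about):

  acl fetchwhole dstdomain .windowsupdate.com .microsoft.com
  range_offset_limit none fetchwhole
  quick_abort_min -1 KB

"none" makes Squid always fetch matching objects from offset 0, and
quick_abort_min -1 tells it to finish the download even if the client
disconnects, so the whole object ends up in cache.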

>>> I am willing to move from PHP to C++ hopefully after a while.
>>>
>>> The script is almost finished, but I have several questions. I have no
>>> idea if I should always grab the HTTP response headers and send them
>>> back to the browsers.
>>
>> The response headers you get when receiving the object are meta data
>> describing that object AND the transaction used to fetch it AND the
>> network conditions/pathway used to fetch it. The cache's job is to store
>> those along with the object itself and deliver only the relevant headers
>> when delivering a HIT.
>>
>>>
>>> 1) Does Squid still grab the "HTTP Response Headers", even if the
>>> object is already in cache or Squid already has a cached copy of the
>>> HTTP response headers? If Squid caches HTTP response headers, then how
>>> do you deal with HTTP code 302 if the object is already cached? I am
>>> asking this question because I have already seen most websites use the
>>> same extensions such as .FLV together with a Location header.
>>
>> Yes. All proxies on the path are expected to relay the end-to-end
>> headers, drop the hop-by-hop headers, and MUST update/generate the
>> feature negotiation and state information headers to match their
>> capabilities in each direction.
>>
>>
>
> Do you mean by "Yes" that the HTTP response headers are grabbed even if
> the object is already in cache, so network latency is always added in
> both MISS and HIT situations?

No. I mean the headers received along with the object need to be stored
with it and sent on HITs.
I see many people thinking they can just store the object by itself, the
same way a web server stores it. But that way loses the vital header
information.

> I have tested Squid and I
> have noticed that reading HIT objects from Squid takes about 0.x ms,
> which I believe means objects are always served offline until expiry
> occurs. Right?
>
> Till now I am using $http_response_header as it is the fastest method
> by far, but I still have an issue with latency, as for each request
> the function takes about 0.30s, which is really high even though my
> network latency is 100~150 ms. That is why I have thought that I could
> grab the HTTP response headers the first time and store them, so if
> the URI is called a second time, then I would send back the cached
> headers instead of grabbing them again,

This is the way you MUST do it, to retain Last-Modified, Age, Date, ETag
and other critical headers. Network latency reduction is just a useful
side effect.
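
A minimal sketch of that idea in PHP (the cache directory, key scheme
and hop-by-hop list are my own illustration, not anything Squid
provides):

  <?php
  // Sketch only: store the upstream headers next to the body on a MISS,
  // replay both from disk on a HIT, and drop hop-by-hop headers either way.
  $cacheDir = 'c:/cache';                    // example location
  $key      = md5($url);                     // $url = the requested URL
  $bodyFile = "$cacheDir/$key.body";
  $headFile = "$cacheDir/$key.headers";
  $hopByHop = array('connection', 'keep-alive', 'te', 'trailer',
                    'transfer-encoding', 'upgrade', 'proxy-authenticate',
                    'proxy-authorization');

  if (is_file($bodyFile) && is_file($headFile)) {
      $headers = json_decode(file_get_contents($headFile), true);   // HIT
  } else {
      $body    = file_get_contents($url);                           // MISS
      $headers = $http_response_header;
      file_put_contents($headFile, json_encode($headers));
      file_put_contents($bodyFile, $body);
  }
  foreach ($headers as $line) {
      $name = strtolower(trim((string) strstr($line, ':', true)));
      if (!in_array($name, $hopByHop, true)) {
          header($line, false);      // relay status line + end-to-end headers
      }
  }
  readfile($bodyFile);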

> to eliminate the network latency. But I still have an issue: how am I
> going to know if the website sends HTTP 302 (because some websites send
> HTTP 302 for the same requested file name), if I am not grabbing the
> headers again in a HIT situation just to improve the latency? A second
> issue is saving the headers of a CDN.

In HTTP the 302 response is an "object" to be cached the same as a 200
when it contains Cache-Control, Expires and/or Last-Modified headers
sufficient to determine freshness/staleness.
 NOTE: it has no meaning for the Range transaction, except perhaps as a
response without Ranges.

Of course you can choose not to cache it. But be aware that Squid will
try to.
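
For example, a 302 that is cacheable in its own right might look like
this (the values are illustrative only):

  HTTP/1.1 302 Found
  Date: Wed, 27 Nov 2013 11:00:00 GMT
  Location: http://cdn.example.com/videoplayback?id=12345
  Cache-Control: max-age=300
  Content-Length: 0

Within those 300 seconds a cache may answer the same request with the
stored 302 without asking the origin server again.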

>>>
>>> 2) Do you also use mime.conf to send the Content-Type to the browser
>>> in case of FTP/HTTP or only FTP ?
>>
>> Only FTP and Gopher *if* Squid is translating from the native FTP/Gopher
>> connection to HTTP. HTTP and protocols relayed using HTTP message format
>> are expected to supply the correct header.
>>
>>>
>>> 3) Does Squid compare the length of the local cached copy with the
>>> remote file if you already have the object file, or do you use
>>> refresh_pattern?
>>
>> Content-Length is a declaration of how many payload bytes are following
>> the response headers. It has no relation to the server's object except in
>> the special case where the entire object is being delivered as payload
>> without any encoding.
>>
>>
>
> I am only caching objects that have a "Content-Length" header, if the
> size is greater than 0, and I have noticed that there are some files
> like XML, CSS, JS which I believe I should save, but do you think I
> must follow the If-Modified-Since header to see if there is a fresh copy?

If you already have an object cached for the URL being requested with any
If-* header then you need to revalidate it following the RFC 2616
instructions for revalidation calculation. Or just MISS - but that makes
caching a bit useless because If-* headers happen a lot in HTTP/1.1.

NOTE: the revalidation calculation is done against the headers you have
cached with the object. The results will determine whether a HIT or MISS
can happen on it.
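
A rough sketch of that revalidation in PHP, assuming $storedHeaders is
the header array you saved with the cached object (this is plain HTTP
mechanics, not a Squid API):

  <?php
  // Build a conditional GET from the validators stored with the object.
  $conditions = '';
  foreach ($storedHeaders as $line) {
      if (stripos($line, 'Last-Modified:') === 0) {
          $conditions .= 'If-Modified-Since: ' . trim(substr($line, 14)) . "\r\n";
      } elseif (stripos($line, 'ETag:') === 0) {
          $conditions .= 'If-None-Match: ' . trim(substr($line, 5)) . "\r\n";
      }
  }
  $ctx = stream_context_create(array('http' => array(
      'method'        => 'GET',
      'header'        => $conditions,
      'ignore_errors' => true,          // keep the 304 instead of failing
  )));
  $body = file_get_contents($url, false, $ctx);

  preg_match('#^HTTP/\S+\s+(\d{3})#', $http_response_header[0], $m);
  if ((int) $m[1] === 304) {
      // Still fresh: keep the stored body, refresh Date/Age/Expires from
      // the 304 headers, and answer the client from cache (a HIT).
  } else {
      // Changed (usually 200): replace the stored body and headers (a MISS).
  }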

>>> I am really confused with this issue, because I am always getting a
>>> headers list from the internet and sending it back to the browser
>>> (using PHP and Apache), even if the object is in cache.
>>
>> I am really confused about what you are describing here. You should only
>> get a headers list from the upstream server if you have contacted one.
>>
>>
>> You say the script is sending to the browser. This is not true at the
>> HTTP transaction level. The script sends to Apache, Apache sends to
>> whichever software requested from it.
>>
>> What is the order you chained the Browser, Apache and Squid ?
>>
>> Browser -> Squid -> Apache -> Script -> Origin server
>> or,
>> Browser -> Apache -> Script -> Squid -> Origin server
>>
>>
>> Amos
>
> Squid configured as:
> Browser -> Squid -> Apache -> Script -> Origin server
>
> url_rewrite_program c:/PHP/php.exe c:/squid/etc/redir.php
> acl dont_pass url_regex ^http:\/\/192\.168\.10\.[0-9]\:312(6|7|8)\/.*?
> acl denymethod method POST
> acl denymethod method PUT
> url_rewrite_access deny dont_pass
> url_rewrite_access deny denymethod
> url_rewrite_access allow all
> url_rewrite_children 10
> #url_rewrite_concurrency 99
>
> I hope I can enable url_rewrite_concurrency, but if I enable
> concurrency then I must always echo back the ID, even if I am hitting
> the cache; or maybe I don't understand the behavior the url_rewrite
> manual describes for reading with fgets(STDIN).

Your helper MUST always return exactly one line of output for every line
of input regardless of concurrency.

Making that line of output contain the concurrency ID number instead of
being empty is trivial, and it allows you to return results out of order
if you want or need to, for example in helpers using threads that take
different lengths of time to complete.
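
In PHP the helper loop can be as small as this (a sketch of the older
pre-3.4 rewrite protocol, where the reply is either a replacement URL or
blank, prefixed by the channel-ID when concurrency is enabled):

  <?php
  // Concurrent url_rewrite helper skeleton: one reply line per request line.
  while (($line = fgets(STDIN)) !== false) {
      $fields = explode(' ', trim($line));
      $id  = array_shift($fields);     // channel-ID added by concurrency
      $url = array_shift($fields);     // requested URL; the remaining fields
                                       // are client ip/fqdn, ident, method...

      $newUrl = '';                    // empty reply = leave the URL untouched
      // ... decide here whether to rewrite $url to point at the local cache ...

      fwrite(STDOUT, $id . ' ' . $newUrl . "\n");
      fflush(STDOUT);                  // never let the reply sit in a buffer
  }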

Amos
Received on Wed Nov 27 2013 - 11:28:37 MST
