Re: [squid-users] Squid with PHP & Apache

From: Ghassan Gharabli <sounarose_at_googlemail.com>
Date: Fri, 29 Nov 2013 00:43:29 +0200

On Wed, Nov 27, 2013 at 7:44 AM, Eliezer Croitoru <eliezer_at_ngtech.co.il> wrote:
> Hey Ghassan,
>
> Moving from PHP to C++ is a nice idea.
> I do not know the size of the cache or its limits, but here are a couple of
> things to consider while implementing the cache:
> * clients latency
> * server overload
> * total cost
> * efficiency of the cache
>
> Bandwidth can cost a lot of money in some cases, which some are willing to
> pay for.
> YouTube by itself is a beast, since the number of visits per video might not
> be worth all the effort invested in a single video file\chunk.
>
> Specifically on YouTube you need to grab the response headers and in some
> cases even filter a couple of them.
> If you are caching and you are 99.5% sure that this "chunk" or "file" is OK
> as it is, then the headers can be considered a side effect of the object,
> but in some cases they are important.
> A compromise between taking the response headers from a saved file and
> taking them "from source" is to fetch new headers when the saved headers
> "file" or container has been deleted, or to fetch a fresh headers\object
> when the expiration headers are out of date.
>

Actually, this is how I do it.

Thanks to Amos again, I am now able to save the response headers with the
object in the case of a static URL.
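
Conceptually it is something like the sketch below; the file naming and the
$upstreamHeaders variable are assumptions made only for illustration:

    <?php
    // Hypothetical sketch: persist the upstream response headers next to the
    // cached object so a later hit can replay them without touching the origin.
    $headersFile = $objectFile . '.headers';     // $objectFile is assumed
    if (!file_exists($headersFile)) {
        file_put_contents($headersFile, serialize($upstreamHeaders));
    }
    // On a cache hit, replay the saved headers before streaming the body.
    foreach (unserialize(file_get_contents($headersFile)) as $line) {
        header($line, false);
    }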

YouTube sends dynamic chunks to both the HTML-5 player and the Flash player.
The first function checks whether the mime argument is set; if so, I send
different headers to the browser to enable HTML-5 playback (that is what I
was investigating, and it now works as I wanted). The headers look something
like this:

        header("Access-Control-Allow-Origin: http://www.youtube.com");
        header("Access-Control-Allow-Credentials: true");
        header("Timing-Allow-Origin: http://www.youtube.com");

For normal playback, such as the Flash player, I send the normal headers (the
branch is sketched just below). As a test I can disable HTML-5 by removing
the user-agent, forcing YouTube to always use the Flash player. The HTML-5
player does not tolerate latency well.
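
Roughly, the branch looks like the sketch below. The "mime" parameter name
and the Flash-side Content-Type are my assumptions here, not necessarily what
the real script uses:

    <?php
    // Pick response headers for HTML-5 vs. Flash playback.
    if (isset($_GET['mime']) &&
        preg_match('#^[\w.+-]+/[\w.+-]+$#', $_GET['mime'])) {
        // HTML-5 player: echo the requested type and add the CORS headers
        // that YouTube's <video> element expects from the media host.
        header('Content-Type: ' . $_GET['mime']);
        header('Access-Control-Allow-Origin: http://www.youtube.com');
        header('Access-Control-Allow-Credentials: true');
        header('Timing-Allow-Origin: http://www.youtube.com');
    } else {
        // Flash player: a plain video Content-Type is enough.
        header('Content-Type: video/x-flv');
    }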

Whatever chunk size YouTube asks for, the script seeks into our locally saved
video and sends that chunk. Note that I don't save chunks; I only serve/stream
them, which I am very happy with, because if I saved chunks I would not always
hit the cache, and I am quite sure some videos are not cacheable.
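
The chunk serving itself is roughly like the sketch below; the cache path and
the use of a standard Range header are assumptions for the example:

    <?php
    // Serve an arbitrary byte range of a fully cached video as a 206 response.
    $path  = '/cache/videos/example.mp4';   // hypothetical cache location
    $size  = filesize($path);
    $start = 0;
    $end   = $size - 1;

    // Parse "Range: bytes=start-end" sent by the player.
    if (isset($_SERVER['HTTP_RANGE']) &&
        preg_match('/bytes=(\d+)-(\d*)/', $_SERVER['HTTP_RANGE'], $m)) {
        $start = (int)$m[1];
        if ($m[2] !== '') {
            $end = (int)$m[2];
        }
    }

    header('HTTP/1.1 206 Partial Content');
    header('Accept-Ranges: bytes');
    header("Content-Range: bytes $start-$end/$size");
    header('Content-Length: ' . ($end - $start + 1));

    // Stream the slice in 8 KB blocks instead of loading it into memory.
    $fp = fopen($path, 'rb');
    fseek($fp, $start);
    $left = $end - $start + 1;
    while ($left > 0 && !feof($fp)) {
        $buf = fread($fp, min(8192, $left));
        echo $buf;
        $left -= strlen($buf);
    }
    fclose($fp);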

Regarding other sites with FLV and MP4 videos: I also allow seeking within
any video, even to a position that has not loaded yet. I had already tried
caching videos with Squid and a Perl rewriter script, but if you seek to any
position of the video that has not been loaded yet, the video restarts from
the beginning.

I only follow arguments like Filename.FLV?start= (the start offset) and
similar ones.
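
For illustration, a minimal sketch of that ?start= handling; the cache path
is an assumption, and the 13-byte FLV header prepend is the one commonly used
by flash pseudo-streaming modules (the flags byte may need to be 0x05 for
files carrying both audio and video):

    <?php
    // Flash-style pseudo-streaming: on seek the player re-requests the file
    // with ?start=<byte offset>, so open the cached copy at that offset.
    $path  = '/cache/videos/example.flv';   // hypothetical cached file
    $start = isset($_GET['start']) ? (int)$_GET['start'] : 0;
    $size  = filesize($path);

    header('Content-Type: video/x-flv');
    header('Content-Length: ' . ($size - $start + ($start > 0 ? 13 : 0)));

    if ($start > 0) {
        // Re-emit a minimal FLV file header so the player can decode mid-file.
        echo "FLV\x01\x01\x00\x00\x00\x09\x00\x00\x00\x09";
    }

    $fp = fopen($path, 'rb');
    fseek($fp, $start);
    fpassthru($fp);
    fclose($fp);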

> The main issue with 302 is the concept behind it.
> I have seen that in the past 302 was used to give the upstream proxy\cdn
> node enough time to fetch more data, but in some cases it was an honest
> redirection towards the best origin server.
>
> In a case where you know a site uses 302 responses, handle them per site
> rather than in a global way.
>
> The Content-Type is taken from the origin server headers, since this is
> probably what the client application expects.
> On a web server the Content-Type can be decided from the file extension,
> but this is not how Squid handles HTTP requests at all.
>
> Squid's algorithms are pretty simple when considering the basic "shape" of
> the object from the headers.
>
> It is indeed an overhead to fetch a couple of headers from the web, and
> there are some cases in which it can be avoided, but re-validation of the
> integrity of the object\file is kind of important.
>
> Back to the beginning of the Email:
> If you do "know" that the object as it is now will not change (for example,
> as the owner of the web service), you can even serve the client "stale"
> content.
>
> There is no force in the world that limits you to do that.
>

Yes, you are right. I have optimized the script for better latency: execution
time was between 0.20 s and 0.30 s before saving the response headers, and
after saving the response headers it dropped to roughly 0 seconds. I am going
to look at the values in milliseconds and optimize again, but this has
progressed quite well.

I am only saving static response headers, as I was wondering how Squid deals
with dynamic URLs. If the URL changes every time you refresh the page, then
we would save headers on every request, which is not a good idea, so I made a
function that detects CDN content using regex; based on the CDN content I
then save the object file and headers using the same method.
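
Something along these lines; the regexes and the key scheme are illustrative
assumptions, not the exact rules in my script:

    <?php
    // Map a dynamic CDN URL to one stable store key so the same object (and
    // its saved headers) is written only once, however the query string varies.
    function cdn_store_key($url) {
        // YouTube video chunks: keep only the video id and the requested range.
        if (preg_match('#googlevideo\.com/videoplayback#', $url) &&
            preg_match('/[?&]id=([^&]+)/', $url, $id)) {
            preg_match('/[?&]range=([^&]+)/', $url, $range);
            return 'youtube/' . $id[1] . (isset($range[1]) ? '/' . $range[1] : '');
        }
        // Anything else: fall back to the URL without its query string.
        return preg_replace('/\?.*$/', '', $url);
    }

    // The key names both the cached object and its saved response headers.
    $key         = cdn_store_key($_SERVER['REQUEST_URI']);
    $objectFile  = '/cache/objects/' . md5($key);
    $headersFile = $objectFile . '.headers';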

> I can say that, for example, for YouTube I was thinking about another
> approach which would "rank" videos and consider removing videos that were
> used only once or twice in two weeks (which depends on the size of the
> storage and the load).
>
> If you have a strong server that can run PHP, you can take Squid with
> StoreID for a spin; it can help you use only Squid for YouTube video
> caching.
>

Good idea. I had already thought of adding a ranking script with the help of
MySQL, so I can then calculate the percentage of HIT and MISS requests.
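
For example, something like the sketch below could record the statistics; the
table and column names are assumptions made for the example:

    <?php
    // Record one row per cached object so HIT/MISS ratios (and a simple rank
    // by recent use) can be computed to decide which videos are worth keeping.
    $db = new PDO('mysql:host=localhost;dbname=cache_stats', 'user', 'pass');

    function record_request(PDO $db, $key, $wasHit) {
        $sql = 'INSERT INTO object_stats (object_key, hits, misses, last_seen)
                VALUES (:k, :h, :m, NOW())
                ON DUPLICATE KEY UPDATE
                    hits = hits + VALUES(hits),
                    misses = misses + VALUES(misses),
                    last_seen = NOW()';
        $st = $db->prepare($sql);
        $st->execute([':k' => $key,
                      ':h' => $wasHit ? 1 : 0,
                      ':m' => $wasHit ? 0 : 1]);
    }

    // Overall HIT percentage across the whole cache.
    $row = $db->query('SELECT SUM(hits) AS h, SUM(misses) AS m FROM object_stats')
              ->fetch(PDO::FETCH_ASSOC);
    $total    = $row['h'] + $row['m'];
    $hitRatio = $total > 0 ? 100.0 * $row['h'] / $total : 0.0;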

> The only thing you will need to take care of is 302 responses, with an
> ICAP service for example.
>
> I do know how tempting it is to use PHP, and in many cases it can be
> better for a network to use another solution than only Squid.
>
> I do not know if you have seen this article:
> http://wiki.squid-cache.org/ConfigExamples/DynamicContent/Coordinator
>
> The article shows a couple of aspects of YouTube caching.
>
> There was some PHP code at:
> http://code.google.com/p/yt-cache/
>
> I saw it a long time ago (2011-12).
>

I have seen this website before, and I think the project is old. They save
chunks, but they made an impressive settings page.

> StoreID is in the 3.4 branch of Squid and is still in the beta stage:
> http://wiki.squid-cache.org/Features/StoreID
>
> The StoreID code itself is very well tested, and I have been using it on a
> daily basis without restarting\reloading my local server even once for a
> very long time.
> I have not yet received reports in my email about a very big (clustered)
> production environment.
>
> The basic idea of StoreID is to take the existing internals of Squid and
> "unleash" them in a way that lets them be exploited\used by an external
> helper.
>
> StoreID is not here to replace PHP or any other method that might fit a
> network; it is here to let the admin see the power of Squid caching even
> in this "dead-end" case which requires acrobatics.
>
> You can just test it in a small testing environment and see whether it
> fits you.
>
> One of the benefits of Apache+PHP is "threading", which allows one service
> such as Apache to utilize as much horsepower as the machine has as "metal".
> Since Squid is already there, the whole internal traffic between Apache and
> Squid can be "spared" while using StoreID.
>
> Note that fetching *only* the headers from the origin server can still
> help you decide whether you want to fetch the whole object from it.
> Fetching a whole header set, which will not exceed 1 KB, is worthwhile in
> many cases even for a 200 KB file.
>

The problem is that I am using Squid (Windows version) on Windows 2008
(R2 x64, 32 GB RAM & 6 TB HDD), not Linux, so I cannot benefit from the new
features Squid provides. The whole idea of building such a script is to
reduce the pain I am still suffering from, and I really hope I will also be
able to cache SSL with better efficiency.

I heard that the BlueCoat system caches SSL, but I am not sure whether it
acts as a man-in-the-middle, which requires a certificate to be installed on
the client's machine.

Thank you again for providing me with more information on the subject.

I really appreciate your correspondence.

> I have tried not to miss anything, but I do not want to write a whole
> scroll about it yet, so if there is more interest I will add more later.
>
> Regards,
> Eliezer
>
>
> On 25/11/13 23:13, Ghassan Gharabli wrote:
>>
>> Hi,
>>
>> I have built a PHP script to cache HTTP 1.x 206 Partial Content responses
>> like "WindowsUpdates", and to allow seeking through YouTube and many other
>> websites.
>>
>> I am willing to move from PHP to C++ hopefully after a while.
>>
>> The script is almost finished, but I have several questions. I have no
>> idea whether I should always grab the HTTP response headers and send them
>> back to the browsers.
>>
>>
>> 1) Does Squid still grab the "HTTP Response Headers" even if the object is
>> already in cache, or if Squid already has a cached copy of the HTTP
>> response headers? If Squid caches HTTP response headers, then how do you
>> deal with HTTP code 302 when the object is already cached? I am asking
>> because I have already seen most websites use the same extensions, such as
>> .FLV, together with a Location header.
>>
>> 2) Do you also use mime.conf to send the Content-Type to the browser in
>> the case of FTP/HTTP, or only FTP?
>>
>> 3) Does Squid compare the length of the local cached copy with the remote
>> file if you already have the object file, or do you rely on
>> refresh_pattern?
>>
>> 4) What happens if the user modifies a refresh_pattern to cache an object,
>> for example .xml, which does not have a [Content-Length] header? Do you
>> still save it, or would you look for the ignore headers used to force
>> caching the object? And what happens when the cached copy expires: do you
>> still refresh the copy even if there is no Content-Length header?
>>
>> I am really confused by this issue, because I am always getting a list of
>> headers from the Internet and sending them back to the browser (using PHP
>> and Apache) even if the object is in cache.
>>
>> Your help and answers will be much appreciated.
>>
>> Thank you
>>
>> Ghassan
>>
>