Re: [squid-users] Caching Pandora

From: Jason Spegal <jspegal_at_comcast.net>
Date: Sun, 26 Jul 2009 11:03:00 -0400

Amos Jeffries wrote:
> Jason Spegal wrote:
>> I am currently using the following for the items in question.
>>
>> refresh_pattern pandora.com 0 300% 31536000
>> refresh_pattern . 0 80% 3156000
>
> The dot (.) pattern matches every URL in existence.
>
> For the pandora files you don't need to go 300%, but you do need to add
> all the available override-* and ignore-* violation options to the
> "pandora.com" pattern.
>
> I'd also try making the pandora pattern:
> -i http://[^a-z\.]*pandora\.com/?
>
Ok, the changes were made so the new line is
refresh_pattern -i http://[^a-z\.]*pandora\.com/? 0 300% 31536000 override-expire reload-into-ims ignore-reload ignore-no-cache ignore-private ignore-no-store ignore-auth
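
Before reading anything into the logs, it may be worth checking whether this
regex matches the audio hostnames at all. The following is only a sketch in
Python, not Squid itself: it assumes Python's re module behaves like Squid's
regex library for a pattern this simple, with re.IGNORECASE standing in for
-i. ANY_SUBDOMAIN is a hypothetical alternative of my own, not something
suggested in this thread.

```python
import re

# The pattern from the refresh_pattern line above; IGNORECASE stands in for -i.
PANDORA = re.compile(r"http://[^a-z\.]*pandora\.com/?", re.IGNORECASE)

# Hypothetical alternative (an assumption, not from this thread):
# any host that is pandora.com or ends in .pandora.com.
ANY_SUBDOMAIN = re.compile(r"^http://([^/]+\.)?pandora\.com/", re.IGNORECASE)

# URLs of the kind that show up in the logs in this thread.
URLS = [
    "http://pandora.com/",
    "http://audio-sjl-t2-1.pandora.com/access/7886817187448819808.mp4?",
    "http://images-sjl-1.pandora.com/images/public/amz/1/2/0/4/727361124021_500W_495H.jpg",
]


def matches(pattern, url):
    """Return True if the pattern matches anywhere in the URL."""
    return pattern.search(url) is not None


if __name__ == "__main__":
    for url in URLS:
        print(matches(PANDORA, url), matches(ANY_SUBDOMAIN, url), url)
```

In this sketch [^a-z\.]* can only consume characters that are neither letters
nor dots, so a hostname such as audio-sjl-t2-1.pandora.com stops the match at
its first letter. If Squid's regex library agrees, the line above would only
apply to hosts like pandora.com itself and not to the audio servers.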

The following are the results from store.log after the change. It
appears that they are still failing to cache.

1248619647.717 RELEASE -1 FFFFFFFF 2DDD8D498CF4C28F60520AA26761A1F6 200 1248619640 -1 1248619640 application/octet-stream 1627255/1627255 GET http://audio-sjl-t2-1.pandora.com/access/7886817187448819808.mp4?
1248619657.439 RELEASE -1 FFFFFFFF 21387EECAF5FFCF61AEE68B2494F7A01 200 1248619621 -1 1248619621 application/octet-stream 6327065/6327065 GET http://audio-sjl-t2-2.pandora.com/access/8544252120326380207.mp3?
1248619860.906 RELEASE -1 FFFFFFFF B838385F620C52ECE3B4F4E3BBC21270 200 1248619847 -1 1248619847 application/octet-stream 2462059/2462059 GET http://audio-sjl-t1-2.pandora.com/access/3264482519687036142.mp4?
1248619895.636 RELEASE -1 FFFFFFFF 86A59F24244895283DC5BE8124F7C248 200 1248619878 -1 1248619878 application/octet-stream 4585429/4585429 GET http://audio-sjl-t3-2.pandora.com/access/7586905698959626071.mp3?
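
To make entries like these easier to read, here is a minimal Python sketch
that splits one store.log line into named fields. The field names and their
order are my reading of the store.log layout (time, action, dir number, file
number, hash, status, date, last-modified, expires, type, sizes, method, URL),
so treat them as an assumption rather than an authoritative description. The
helper never_on_disk() is hypothetical and only encodes the usual
interpretation that a file number of FFFFFFFF means the object never reached
disk.

```python
from typing import NamedTuple


class StoreEntry(NamedTuple):
    timestamp: float
    action: str
    dir_number: int
    file_number: str
    key_hash: str
    status: int
    date: int
    last_modified: int
    expires: int
    mime_type: str
    sizes: str
    method: str
    url: str


def parse_store_line(line: str) -> StoreEntry:
    """Split one whitespace-separated store.log entry into named fields."""
    f = line.split()
    if len(f) != 13:
        raise ValueError(f"expected 13 fields, got {len(f)}")
    return StoreEntry(float(f[0]), f[1], int(f[2]), f[3], f[4], int(f[5]),
                      int(f[6]), int(f[7]), int(f[8]), f[9], f[10], f[11], f[12])


def never_on_disk(entry: StoreEntry) -> bool:
    """Assumed meaning: RELEASE with file number FFFFFFFF was memory-only."""
    return entry.action == "RELEASE" and entry.file_number == "FFFFFFFF"


# First entry from the excerpt above, joined onto one line.
SAMPLE = ("1248619647.717 RELEASE -1 FFFFFFFF 2DDD8D498CF4C28F60520AA26761A1F6 "
          "200 1248619640 -1 1248619640 application/octet-stream "
          "1627255/1627255 GET "
          "http://audio-sjl-t2-1.pandora.com/access/7886817187448819808.mp4?")

if __name__ == "__main__":
    entry = parse_store_line(SAMPLE)
    print(entry.action, entry.last_modified, entry.expires, never_on_disk(entry))
```

Read this way, each audio entry has no Last-Modified value (-1) and an expiry
equal to its Date value, so the object appears to have been treated as already
expired when it arrived.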

>>
>> With violations off these work well. However they fail to cache all
>> the items I would like. When I had violations on I had tried
>> refresh_pattern . 0 0% 0 as well as setting all refresh_pattern to 0
>> 0% 0 which still failed to refresh the pages properly. I had also
>> tried rebuilding the cache from scratch several times.
>>
>> Other relevant patterns I am using:
>>
>> #Dynamic Content
>> refresh_pattern -i cgi-bin 0 0% 0 refresh-ims
>
> The following is a violation even if it works with violations not
> enabled.
>> refresh_pattern -i \? 0 0% 3156000 refresh-ims
>> refresh_pattern -i .(asp|aspx|php|pl|xml|rss|kml|cgi|py|pyc) 0 0% 0
>> refresh-ims
>
>> #HTML
>> refresh_pattern text/html 0 80% 2592000 refresh-ims
>> refresh_pattern text/css 0 80% 2592000 refresh-ims
>>
>> #Java & Javascript
>> refresh_pattern -i .(js|jar|java) 0 100% 31536000
>>
>> #By MIME-Type
>> refresh_pattern application/* 0 300% 31536000
>> refresh_pattern audio/* 0 300% 31536000
>> refresh_pattern images/* 0 300% 31536000
>> refresh_pattern text/* 0 300% 31536000
>> refresh_pattern video/* 0 300% 31536000
>>
>
> ? mime patterns in the URL? with Squid?
>
> Do you have a patch that does this? If so please consider contributing
> back to the project.
>
I take it you're referring to refresh_pattern -i \? 0 0% 3156000
refresh-ims. I was under the impression that Squid supports this. I am
using Squeezzer2 to check how well the patterns work. It does seem to work.

Also the version of squid I am using is 3.0.16 with the following patches
squid-3.0.16-adapted-zph.patch
squid-3.0.16-cross-compile.patch
squid-3.0.16-gentoo.patch

It is compiled through Gentoo's Emerge.
>>
>> When I had violations on the Pandora entry was similar to this...
>>
>> refresh_pattern pandora.com 0 300% 31536000 override-expire
>> reload-into-ims ignore-reload ignore-no-cache ignore-private
>> ignore-no-store ignore-auth
>
> A single pattern like that should be all you need to add.
>
> Some of the non-caching parameters are only able to be overridden in
> the 2.HEAD code though. You may need to grab a copy of the HEAD code
> and use that.
>
>
> PS. all of your file extension patterns above are using the very
> unsafe .XX syntax. The pattern is a regex and matches anywhere in the
> URL. It's likely catching a whole lot of URLs which it should not.
>
> Please use: \.XX(\?.*)?$ instead. ie \.(js|jar|java)(\?.*)?$
>
I'm not sure I understand this example. Can you give a literal example
please? From what I understand, you're saying refresh_pattern -i .jpg
0 300% 31536000 would be bad because http://www.jpgas.com would be
cached with that pattern, which may, for the sake of example, have settings
that would break that site. Are you recommending refresh_pattern -i \.jpg 0
300% 31536000 and not doing something like refresh_pattern -i
.(jpg|gif|png|ico|tga) 0 300% 31536000 ?
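
To make the question concrete, here is a small Python sketch comparing the
three forms under discussion. It assumes Python's re module is close enough
to Squid's regex library for patterns this simple, with re.IGNORECASE
standing in for -i. The example.com URLs are made up for illustration.

```python
import re

# Unescaped dot: "." matches any character, anywhere in the URL.
LOOSE = re.compile(r".jpg", re.IGNORECASE)
# Escaped dot: a literal ".jpg", but still anywhere in the URL.
ESCAPED = re.compile(r"\.jpg", re.IGNORECASE)
# The suggested form: extension at the end, optionally followed by a query.
ANCHORED = re.compile(r"\.(jpg|gif|png|ico|tga)(\?.*)?$", re.IGNORECASE)

# Hypothetical URLs, apart from the jpgas.com host mentioned above.
URLS = [
    "http://www.jpgas.com/index.html",
    "http://example.com/photos/cat.jpg",
    "http://example.com/photos/cat.jpg?size=large",
    "http://example.com/ajpg/report.php",
]


def classify(url):
    """Return (loose, escaped, anchored) match results for one URL."""
    return tuple(p.search(url) is not None for p in (LOOSE, ESCAPED, ANCHORED))


if __name__ == "__main__":
    for url in URLS:
        print(classify(url), url)
```

In this sketch both .jpg and \.jpg match http://www.jpgas.com/index.html,
because the hostname itself contains ".jpg". Only the form ending in
(\?.*)?$ limits the match to URLs whose path actually ends in one of the
listed extensions.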

As far as messing with the code goes I haven't been into doing that as
of yet. For my purposes my goal is to build a server/router/etc that
will turn crappy internet into good internet while being able to service
a number of people. This runs my home network normally these days and
was originally conceived and built to support 1000 users for a dormitory
through a single cable modem. Tweaking it for maximum efficiency is a
hobby for me now. My coding skills are fairly weak and I wouldn't know
where to start for a lot of this. I am willing to help test things out
and such however.

> Amos
>
>
>> Amos Jeffries wrote:
>>> Jason Spegal wrote:
>>>> I would wager it's content control given what they are. However
>>>> with violations on they can be cached. Without they cannot. I just
>>>> haven't been able to figure out how to get squid to behave with
>>>> violations turned on. The only other option I can see is to set up a
>>>> second squid with violations and filter all the traffic to/from
>>>> Pandora through it.
>>>
>>> Use refresh_pattern with a regex that only matches pandora URL.
>>>
>>> I'll wager you have either added all the overrides to the . pattern,
>>> or have an overly-greedy regex in use.
>>>
>>> Amos
>>>
>>>>
>>>> Adrian Chadd wrote:
>>>>> This doesn't surprise me. They may be trying to maximise outbound
>>>>> bits, or try to retain control over content, or not understanding
>>>>> caching, or all/combination of the above.
>>>>>
>>>>> I'd suggest contacting them and asking.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> adrian
>>>>>
>>>>> 2009/7/26 Jason Spegal <jspegal_at_comcast.net>:
>>>>>
>>>>>> A little bit messy but here are some snippets.
>>>>>>
>>>>>> ###Access.log
>>>>>>
>>>>>> 1248572380.275 178 10.10.122.248 TCP_REFRESH_UNMODIFIED/304
>>>>>> 232 GET
>>>>>> http://images-sjl-1.pandora.com/images/public/amz/1/2/0/4/727361124021_500W_495H.jpg
>>>>>>
>>>>>> - DIRECT/208.85.40.13 -
>>>>>> 1248572409.144 8472 10.10.122.241 TCP_MISS/200 1581181 GET
>>>>>> http://audio-sjl-t3-2.pandora.com/access/7008639604707703825.mp4? -
>>>>>> DIRECT/208.85.41.38 application/octet-stream
>>>>>> 1248572439.512 94 10.10.122.241 TCP_MEM_HIT/200 55396 GET
>>>>>> http://images-sjl-2.pandora.com/images/public/amz/3/0/2/3/602498413203_500W_499H.jpg
>>>>>>
>>>>>> - NONE/- image/jpeg
>>>>>> 1248572570.898 300 10.10.122.248 TCP_MISS/200 6521 GET
>>>>>> http://images-sjl-3.pandora.com/images/public/amz/2/2/4/4/039841434422_130W_130H.jpg
>>>>>>
>>>>>> - DIRECT/208.85.41.23 image/jpeg
>>>>>> 1248572600.538 29937 10.10.122.248 TCP_MISS/200 7704188 GET
>>>>>> http://audio-sjl-t3-2.pandora.com/access/3642267922875646389.mp3? -
>>>>>> DIRECT/208.85.41.38 application/octet-stream
>>>>>> 1248572615.735 11507 10.10.122.241 TCP_MISS/200 2109481 GET
>>>>>> http://audio-sjl-t2-2.pandora.com/access/5722981497105294607.mp4? -
>>>>>> DIRECT/208.85.41.36 application/octet-stream
>>>>>> 1248572635.903 179 10.10.122.248 TCP_REFRESH_UNMODIFIED/304
>>>>>> 232 GET
>>>>>> http://images-sjl-3.pandora.com/images/public/amz/2/2/4/4/039841434422_130W_130H.jpg
>>>>>>
>>>>>> - DIRECT/208.85.41.23 -
>>>>>> 1248572641.444 40 10.10.122.241 TCP_HIT/200 21616 GET
>>>>>> http://images-sjl-2.pandora.com/images/public/amz/8/7/6/1/602498611678_300W_273H.jpg
>>>>>>
>>>>>> - NONE/- image/jpeg
>>>>>>
>>>>>> ###Store.log
>>>>>>
>>>>>> 1248572380.275 RELEASE -1 FFFFFFFF
>>>>>> 097EAE1108DCEF192ED1C3BFF1F6C1B5 304
>>>>>> 1248572380 -1 -1 unknown -1/0 GET
>>>>>> http://images-sjl-1.pandora.com/images/public/amz/1/2/0/4/727361124021_500W_495H.jpg
>>>>>>
>>>>>> 1248572409.144 RELEASE -1 FFFFFFFF
>>>>>> 6B93B1BF958703B3FC3CD1ADDD515695 200
>>>>>> 1248572400 -1 1248572400 application/octet-stream
>>>>>> 1580815/1580815 GET
>>>>>> http://audio-sjl-t3-2.pandora.com/access/7008639604707703825.mp4?
>>>>>> 1248572570.897 SWAPOUT 00 0004CF23
>>>>>> BEEE111A39B596B14903743011AF2C36 200
>>>>>> 1248572570 1248490006 -1 image/jpeg 6181/6181 GET
>>>>>> http://images-sjl-3.pandora.com/images/public/amz/2/2/4/4/039841434422_130W_130H.jpg
>>>>>>
>>>>>> 1248572600.538 RELEASE -1 FFFFFFFF
>>>>>> 070416ED935AD18DCA793569D2C6A652 200
>>>>>> 1248572570 -1 1248572570 application/octet-stream
>>>>>> 7703822/7703822 GET
>>>>>> http://audio-sjl-t3-2.pandora.com/access/3642267922875646389.mp3?
>>>>>> 1248572615.735 RELEASE -1 FFFFFFFF
>>>>>> B0EB42B39131DF028BA3BE9A39CC24E4 200
>>>>>> 1248572604 -1 1248572604 application/octet-stream
>>>>>> 2109115/2109115 GET
>>>>>> http://audio-sjl-t2-2.pandora.com/access/5722981497105294607.mp4?
>>>>>> 1248572635.903 RELEASE -1 FFFFFFFF
>>>>>> CDCA0D3510080D121E5578310976676E 304
>>>>>> 1248572635 -1 -1 unknown -1/0 GET
>>>>>> http://images-sjl-3.pandora.com/images/public/amz/2/2/4/4/039841434422_130W_130H.jpg
>>>>>>
>>>>>> 1248572886.822 RELEASE -1 FFFFFFFF
>>>>>> A95C86074129546301911C2FC251071D 200
>>>>>> 1248572872 -1 1248572872 application/octet-stream
>>>>>> 2086824/2086824 GET
>>>>>> http://audio-sjl-t1-1.pandora.com/access/5188159311574708305.mp4?
>>>>>>
>>>>>> ###Wireshark
>>>>>>
>>>>>> Hypertext Transfer Protocol
>>>>>> HTTP/1.0 200 OK\r\n
>>>>>> Date: Sun, 26 Jul 2009 05:12:58 GMT\r\n
>>>>>> Server: Apache\r\n
>>>>>> Content-Length: 6137729\r\n
>>>>>> Cache-Control: no-cache, no-store, must-revalidate, max-age=-1\r\n
>>>>>> Pragma: no-cache, no-store\r\n
>>>>>> Expires: -1\r\n
>>>>>> Content-Type: application/octet-stream\r\n
>>>>>> X-Cache: MISS from ichiban\r\n
>>>>>> X-Cache-Lookup: MISS from ichiban:3128\r\n
>>>>>> Via: 1.0 ichiban (squid)\r\n
>>>>>> Proxy-Connection: keep-alive\r\n
>>>>>> \r\n
>>>>>>
>>>>>> Amos Jeffries wrote:
>>>>>>
>>>>>>> Jason Spegal wrote:
>>>>>>>
>>>>>>>> I was able to cache Pandora by compiling with
>>>>>>>> --enable-http-violations
>>>>>>>> and using a refresh_pattern to cache everything regardless.
>>>>>>>> This however
>>>>>>>> broke everything by preventing proper refreshing of any site.
>>>>>>>> If it could be
>>>>>>>> worked where violations only happened as directly specified in the
>>>>>>>> configuration it would be a workable solution. I did some
>>>>>>>> testing and I
>>>>>>>> could not confirm that it was anything in the configuration
>>>>>>>> file itself that
>>>>>>>> was causing the issue. I wouldn't recommend using this as such.
>>>>>>>>
>>>>>>>>
>>>>>>> Which indicates that there is fine tuning possible to cache
>>>>>>> just Pandora.
>>>>>>> Find yourself one of the Pandora URLs in your access.log and take
>>>>>>> a visit to
>>>>>>> www.redbot.org or the ircache.org cacheability engine.
>>>>>>>
>>>>>>>
>>>>>>> Amos
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Henrik Nordstrom wrote:
>>>>>>>>
>>>>>>>>> Sat 2009-07-25 at 12:05 -0600, Brett Glass wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> One of the largest consumers of our HTTP bandwidth is
>>>>>>>>>> Pandora, the free
>>>>>>>>>> music service. Unfortunately, Pandora marks its streams as
>>>>>>>>>> non-cacheable and
>>>>>>>>>> also puts question marks in the URLs, which is a huge waste
>>>>>>>>>> of bandwidth.
>>>>>>>>>> How can this be overridden?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> The question mark can be ignored. See the "cache" directive.
>>>>>>>>> But if there
>>>>>>>>> are other parameters behind there (normally not logged) that
>>>>>>>>> just may not
>>>>>>>>> help.
>>>>>>>>>
>>>>>>>>> Regarding non-cacheable.. most crap can be overridden by
>>>>>>>>> refresh_pattern.
>>>>>>>>>
>>>>>>>>> But, if it's a streaming service (I know nothing about
>>>>>>>>> Pandora) then you
>>>>>>>>> are quite likely out of luck.
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>> Henrik
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>>
>>>
>>
>
>
Received on Sun Jul 26 2009 - 15:04:00 MDT
