Re: [squid-users] Mime.conf

From: Amos Jeffries <squid3_at_treenet.co.nz>
Date: Tue, 19 Jan 2010 20:49:35 +1300

Jason Spegal wrote:
> On 1/18/2010 8:55 PM, Amos Jeffries wrote:
>> On Mon, 18 Jan 2010 13:18:20 -0500, Jason Spegal<jspegal_at_comcast.net>
>> wrote:
>>
>>> Alrighty. Did some more research and found a solution to my problem
>>> which leads to another issue.
>>>
>>> My problem: I was trying to serve a proxy auto configuration file
>>> (wpad.dat) from an internal webserver (http://wpad/). When the client
>>> down the pipe after squid picked it up the file was served with the mime
>>> type chemical/x-mopac-input. When I went direct to the webserver it
>>> served the correct mime type (which I had forced it to).
>>>
>>> Solution: On Gentoo squid is using the /etc/mime.types file to guess the
>>> mime type instead of what the remote webserver is saying the
>>> file is. I
>>>
>> Point 1: Squid does not do that. Does not use mime.types at all.
>>
>> Content-Type headers are passed through unchanged from what is received
>> unless administratively changed by header_replace.
>>
> Taken from access.log
>
> Before changing mime.types
>
> 1263657638.249 0 10.10.122.248 TCP_MEM_HIT/200 670 GET
> http://wpad/wpad.dat - NONE/- chemical/x-mopac-input
> 1263661679.834 0 10.10.122.239 TCP_MEM_HIT/200 670 GET
> http://wpad/wpad.dat - NONE/- chemical/x-mopac-input
> 1263662648.054 9 10.10.122.248 TCP_CLIENT_REFRESH_MISS/200 654 GET
> http://wpad/wpad.dat - DIRECT/10.10.122.250 chemical/x-mopac-input
> 1263662742.482 4 10.10.122.248 TCP_CLIENT_REFRESH_MISS/200 654 GET
> http://wpad/wpad.dat - DIRECT/10.10.122.250 chemical/x-mopac-input
> 1263662752.973 0 10.10.122.248 TCP_IMS_HIT/304 264 GET
> http://wpad/wpad.dat - NONE/- chemical/x-mopac-input
> 1263664740.203 0 10.10.122.248 TCP_MEM_HIT/200 669 GET
> http://wpad/wpad.dat - NONE/- chemical/x-mopac-input
>
> After changing mime.types
>
> 1263834369.649 1 10.10.122.241 TCP_REFRESH_UNMODIFIED/200 647 GET
> http://wpad/wpad.dat - DIRECT/10.10.122.250
> application/x-ns-proxy-autoconfig
> 1263834539.719 0 10.10.122.241 TCP_MEM_HIT/200 657 GET
> http://wpad/wpad.dat - NONE/- application/x-ns-proxy-autoconfig
> 1263834791.576 0 10.10.122.241 TCP_MEM_HIT/200 657 GET
> http://wpad/wpad.dat - NONE/- application/x-ns-proxy-autoconfig
> 1263834822.423 0 10.10.122.241 TCP_MEM_HIT/200 657 GET
> http://wpad/wpad.dat - NONE/- application/x-ns-proxy-autoconfig

This log shows what the web server passed to Squid, not what Squid
passed to the clients.
Q: Is the WPAD web server on the same box where you are altering mime.types?

>
> I just double checked that (ForceType application/x-ns-proxy-autoconfig)
> in my apache vhost config is working correctly. Also apache's mime.types
> file is setup correctly for this particular item.
>>> fixed the file, which I also noticed has several other issues, answering
>>> my other other issue: why is 95% of my data being caught in the catch-all
>>> refresh_pattern instead of the mime type ones.
>>>
>> Point 2: Squid does not accept mime types in the refresh_pattern
>> directive.
>>
> This explains a few things.
>> Are you _sure_ that:
>> * the PAC file is not cached with old headers from before your changes?
>>
> Yes

I can only get Squid to produce the wrong mime type by altering
refresh_pattern to the values you have in your config. With that done
Squid very consistently insists on producing a HIT with the first mime
header received, no matter how they change on the server or what cache
controls are passed to Squid by the server.

>> * the PAC file is actually being fetched from the web server you are
>> expecting?
>>
> Yes
>> * this is an official build of Squid?
>>
> Yes, see below.
>> * nobody has applied third-party patches to it?
>> (none of the official Gentoo patches change mime.types.
>> http://sources.gentoo.org/viewcvs.py/gentoo-x86/net-proxy/squid/files/)
>>
>>
> Fairly sure.
>> What headers does this produce when run on the Squid box?
>> squidclient -v -h wpad -p 80 /wpad.dat
>>
>>
>>
> I'm posting version and configuration at the bottom of this email.
> Refresh patterns will be changed after this email is sent. This is a
> standard gentoo install with the epoll USE flag.
>
> [ebuild R ] net-proxy/squid-3.0.19 USE="caps epoll ldap mysql pam
> samba sqlite ssl -icap-client (-ipf-transparent) -kerberos -kqueue
> -logrotate* -nis (-pf-transparent) -postgres -radius -sasl (-selinux)
> -snmp -zero-penalty-hit" 0 kB

Okay. So no reason whatsoever why the mime type is changing.

>
> (squidclient -v -h wpad -p 80 /wpad.dat) yields
>
> headers: 'GET /wpad.dat HTTP/1.0
> Accept: */*
>
> '
> HTTP/1.1 404 Not Found
> Date: Tue, 19 Jan 2010 03:27:19 GMT
> Server: Apache
> Content-Length: 265
> Connection: close
> Content-Type: text/html; charset=iso-8859-1
>
> <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
> <html><head>
> <title>404 Not Found</title>
> </head><body>
> <h1>Not Found</h1>
> <p>The requested URL /wpad.dat was not found on this server.</p>
> <hr>
> <address>Apache Server at localhost Port 80</address>
> </body></html>
>
>
> So I used GET instead.
>
> (GET http://wpad/wpad.dat -USed)
>
> GET http://wpad/wpad.dat
> User-Agent: lwp-request/5.827 libwww-perl/5.831
>
> GET http://wpad/wpad.dat --> 200 OK
> Connection: close
> Date: Tue, 19 Jan 2010 03:28:59 GMT
> Accept-Ranges: bytes
> Age: 412
> ETag: "736a9e-119-47d6be3f06d80"
> Server: Apache
> Content-Length: 281
> Content-Type: application/x-ns-proxy-autoconfig
> Last-Modified: Mon, 18 Jan 2010 08:10:46 GMT
> Client-Date: Tue, 19 Jan 2010 03:28:59 GMT
> Client-Peer: 10.10.122.250:80
> Client-Response-Num: 1

That reply appears to have gone through Squid. I'm particularly
interested in the headers going _into_ Squid.

Please try this as well and compare the result to the set above:
   squidclient -v -h wpad -p 80 -j wpad /wpad.dat

>
>>> Of note for other Gentoo & Debian users: From mime.types # This file is
>>>
>>
>>> part of the app-misc/mime-types package, which is based on debian's
>>> "mime-support".
>>>
>>> So my question is now; how do I force squid to use the mime-type
>>> delivered by the remote webserver without killing mime.types and thus
>>> breaking my system in new and unexpected ways?
>>>
>> The official releases of Squid pass content-type headers through
>> unchanged. Something is broken.
>>
>>> On 1/15/2010 8:22 PM, Amos Jeffries wrote:
>>>
>>>> Jason Spegal wrote:
>>>>
>>>>> Is mime.conf what is used by refresh_pattern when mime types are used
>>>>> for the regex?
>>>>>
>>>> No.
>>>>
>>>> refresh_pattern uses a text regex against the requested URL string.
>>>>
>>>> mime.conf is used by FTP and Gopher directory display to show the
>>>>
>> icons.
>>
>>>>
>> Amos
>>
> Squid Cache: Version 3.0.STABLE19
> configure options: '--prefix=/usr' '--build=i686-pc-linux-gnu'
> '--host=i686-pc-linux-gnu' '--mandir=/usr/share/man'
> '--infodir=/usr/share/info' '--datadir=/usr/share' '--sysconfdir=/etc'
> '--localstatedir=/var/lib' '--sysconfdir=/etc/squid'
> '--libexecdir=/usr/libexec/squid' '--localstatedir=/var'
> '--datadir=/usr/share/squid' '--with-default-user=squid'
> '--enable-auth=basic,digest,negotiate,ntlm'
> '--enable-removal-policies=lru,heap'
> '--enable-digest-auth-helpers=password'
> '--enable-basic-auth-helpers=DB,PAM,LDAP,SMB,multi-domain-NTLM,getpwnam,NCSA,MSNT'
> '--enable-external-acl-helpers=ldap_group,wbinfo_group,ip_user,session,unix_group'
> '--enable-ntlm-auth-helpers=SMB,fakeauth'
> '--enable-negotiate-auth-helpers=' '--enable-useragent-log'
> '--enable-cache-digests' '--enable-delay-pools' '--enable-referer-log'
> '--enable-arp-acl' '--with-large-files' '--with-filedescriptors=8192'
> '--enable-caps' '--disable-snmp' '--enable-ssl' '--disable-icap-client'
> '--enable-http-violations' '--with-pthreads' '--with-aio'
> '--enable-storeio=ufs,diskd,aufs,null' '--enable-linux-netfilter'
> '--enable-epoll' 'build_alias=i686-pc-linux-gnu'
> 'host_alias=i686-pc-linux-gnu' 'CC=i686-pc-linux-gnu-gcc'
> 'CFLAGS=-march=pentium4m -O2 -pipe -fomit-frame-pointer'
> 'LDFLAGS=-Wl,-O1' 'CXXFLAGS=-march=pentium4m -O2 -pipe
> -fomit-frame-pointer'
>
>
> From squid.conf:
>
<snip>

Okay, before reading further:

   Please don't take any of the following personally. I have no idea who
configured this Squid, or what company policy constraints they were
working under. I do know that some policies and external websites do
force extreme measures.

I make the following statements with three hats on:
  * an Internet citizen who wants websites to load reliably with the
right and current content shown
  * a webmaster who spends considerable time working to make clients'
dynamic websites cacheable and efficient. (thus the angst, if it shows
too thick)
  * a squid developer who spends considerable time trying to make Squid
do things properly according to the HTTP protocol RFC and helping people
leverage that for faster networks.

> acl dynamic_content urlpath_regex -i
> \.(asp|aspx|php|pl|xml|rss|kml|cgi|py|pyc) #(\?.*)?$

Hmm, any URL containing a "#" at the end. Weird thing to be looking for.

NP: The '#' sign is never sent in transmitted URLs. It's an internal tag
private to the browser. When some data needs to use that sign it is
required to always be URL-encoded for transmission.
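
A minimal sketch of that point, assuming Python's standard urllib.parse
as a stand-in for what a browser does before sending a request. The URL
is hypothetical and only there for illustration:

```python
from urllib.parse import urlsplit, quote

# Hypothetical URL, used only for illustration.
url = "http://example.com/page.php#section2"
parts = urlsplit(url)

# What goes on the request line: the path (plus any query string).
# The fragment after '#' stays in the browser and is not transmitted.
request_target = parts.path + ("?" + parts.query if parts.query else "")

print(request_target)   # path that a proxy would see
print(parts.fragment)   # fragment kept by the browser
print(quote("#"))       # URL-encoded form used when '#' is real data
```

So a pattern that requires a literal "#" will not match ordinary
traffic; only an encoded %23 can show up in what the proxy receives.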

> acl dynamic_content urlpath_regex -i http://audio*pandora.com/*.mp*

That pattern is broken on so many levels I can't even describe them in
less than a page of text. Suffice to say...

It only matches things like:
    http://example.com?urlpath=http://audipandoraZcomZm
or
   http://example.com?urlpath=http://audiooopandoraZcomZmpAIUEHB78GWa

Since...

  '*' means the previous _one_ symbol repeated zero or more times.
       example.com/?http://audiooooooopandora.com///////.mppppppp

  '.' means any symbol at all.
      example.com/?http://audipandoraZcomZm
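
The same thing can be checked quickly outside Squid. This is only a
sketch: it assumes Python's re module treats '*' and '.' the same way
Squid's regex library does for a pattern this simple, and the third
path is a hypothetical example of what the rule was presumably meant
to catch:

```python
import re

# The pattern from the acl line; "-i" in squid.conf means case-insensitive.
pattern = re.compile(r"http://audio*pandora.com/*.mp*", re.IGNORECASE)

# urlpath_regex is tested against the path part of the URL, so the
# literal "http://" has to appear inside the path for a match.
candidates = [
    "/?urlpath=http://audipandoraZcomZm",
    "/?urlpath=http://audiooopandoraZcomZmpAIUEHB78GWa",
    "/access/12345.mp3",  # hypothetical path of an audio file
]

for path in candidates:
    print(path, "->", bool(pattern.search(path)))
```

The first two paths match and the third does not, which is the
opposite of what the rule looks like it was written for.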

> acl dynamic_content urlpath_regex -i cgi-bin
> cache deny dynamic_content

Well, let's say that once upon a time, whole decades ago in another
century, that was recommended by the developers. Since 2.7 and 3.0 came
out it is not.

Of course, with the things refresh_pattern is doing, I'd hate to be a
customer who gets anything from this proxy's cache.

> cache allow all
> refresh_pattern -i kh*.google.com/? 43200 80% 259200 ignore-no-cache
> ignore-private ignore-no-store ignore-auth override-expire
> override-lastmod ignore-reload
> refresh_pattern -i virtualearth.net/? 43200 80% 259200 ignore-no-cache
> ignore-private ignore-no-store ignore-auth override-expire
> override-lastmod ignore-reload

Meh. Well, yes, some websites do force radical measures due to their design.

> refresh_pattern application/* 43200 80% 259200 ignore-no-cache
> ignore-private ignore-no-store ignore-auth
> refresh_pattern audio/* 43200 80% 259200 ignore-no-cache ignore-private
> ignore-no-store ignore-auth

I've never seen a website that uses application/ and audio/ in their
folder paths. But if your users ever visit one, the pages will be stored
for 6 months.

That _may_ catch some Java WAR websites which expose the
~/application/name/pages.html path bits. But I would think most of those
are hiding behind Apache and doing path re-writing.
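
For reference, the min and max fields of refresh_pattern are given in
minutes. A small sketch of the arithmetic behind the "6 months" figure
(the helper name is made up for illustration):

```python
MINUTES_PER_DAY = 60 * 24

def minutes_to_days(minutes: int) -> float:
    """Convert a refresh_pattern min/max value (minutes) to days."""
    return minutes / MINUTES_PER_DAY

# Values taken from the refresh_pattern lines quoted above.
for label, minutes in [("min", 43200), ("max", 259200)]:
    print(f"{label}: {minutes} minutes = {minutes_to_days(minutes):g} days")
```

That works out to a 30 day minimum and a 180 day maximum.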

> refresh_pattern images/* 10080 16% 259200 ignore-no-cache ignore-private
> ignore-no-store ignore-auth override-expire override-lastmod

Any website which uses the standard technique of placing common images
into a shared folder:
For example:
   http://example.com/images/spacer.gif

NP: The irony here is that _these_ images are almost guaranteed to have
correct long-term cacheability information attached by the originating
web server.

> refresh_pattern text/* 0 16% 259200 refresh-ims
> refresh_pattern video/* 43200 80% 259200 ignore-no-cache ignore-private
> ignore-no-store ignore-auth

All URLs containing a folder called video/ or text/.
For example:
   http://example.com/video/index.html
   http://example.com/plaintext/index.html
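
A rough sketch of how those lines behave, assuming first-match-wins
ordering and that Python's re module is close enough to Squid's regex
for these simple patterns. The function name is hypothetical; the
patterns are the ones from the config above, tested against the URL
string rather than the Content-Type:

```python
import re

# The mime-looking refresh_pattern regexes, in config order.
patterns = ["application/*", "audio/*", "images/*", "text/*", "video/*"]

def first_match(url: str) -> str:
    """Return the first pattern that matches the URL, else the catch-all."""
    for p in patterns:
        if re.search(p, url):
            return p
    return "."

urls = [
    "http://example.com/video/index.html",
    "http://example.com/plaintext/index.html",
    "http://example.com/images/spacer.gif",
    "http://wpad/wpad.dat",
]

for url in urls:
    print(url, "->", first_match(url))
```

The wpad.dat URL falls through to the catch-all "." pattern no matter
what Content-Type the server sends, which fits with most traffic
landing in the catch-all rule.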

> refresh_pattern . 0 80% 259200 ignore-no-cache ignore-private
> ignore-no-store ignore-auth

So... _everything_ that is not already stored for 6 months ... gets
stored for 6 months unless clients explicitly send flush requests with
Ctrl+Reload.

Regardless of what the original website is designed for!!!

Be it a captcha security image, someone's bank account details, or a
picture of their kitten.

And you are doing this on a transparent proxy.... Pretty much a textbook
example of information leak via man-in-the-middle attack.

> reply_header_access Pragma deny all
> reply_header_access Cache-Control deny all

?? force browsers and downstream caches to think they can store anything
and everything?
Careful. This is generally not a good idea.

The effect _overall_ is that most dynamic content passes straight
through the proxy and gets cached however the client browser wants to
cache it (because you stripped the expiry and privacy information). The
rest of the content will be stored in your Squid for very long periods
and clients who request new updated data will be sent the old version
and told it has not changed.

There will be some overlap in websites which generate static content at
shorter intervals (e.g. Facebook and mailing list archives) from which
your clients never seem to get the new versions in a timely manner. Only
the rather broken ones which serve static content through very
inefficient dynamic re-processors will look right all the time.

> deny_info about:blank blocked_sites

oooh nasty. You get a lot of phone calls about the Internet being "down"
with no explanation?

Amos

-- 
Please be using
   Current Stable Squid 2.7.STABLE7 or 3.0.STABLE21
   Current Beta Squid 3.1.0.15
Received on Tue Jan 19 2010 - 07:49:45 MST