Summary of Transparent proxy/How-to

From: Erik T. Brandsberg <ebrandsb@dont-contact.us>
Date: Tue, 9 Dec 1997 12:10:52 -0600 (CST)

In the past year and a half, I have used transparent squid configurations
that have varied from a Linux box with its fairly nice transparent
configuration to IP Filter (http://cheops.anu.edu.au/~avalon)
configurations. I can now say: The best configuration is with IP Filter
on any of its supported OSes (now including Linux) with the patches that
I submitted last night. Below is a discussion of how transparent proxy
works, and how it can be implemented in a wide variety of situations.

First, what is a transparent proxy? A normal proxy is a software program,
used either for security or for speed, that allows a single computer
to do a task on the internet as a "proxy" for the system that really wants
the task done. In the case of Squid, this means making the HTTP request to
the remote site, and delivering the response to the local client. In a case where
you are using a proxy, the client software TELLS the proxy software where
to go to get the data it wants, and the software does it. With squid, it
also uses this request to cache web pages, assuming that the web page is
referenced by the EXACT SAME name by the client. In a case where
www1.somedomain.com and www2.somedomain.com exist, even if the sites are
mirrors of each other, squid will consider them to be distinct. This
plays an important role in transparent proxy configurations (see below).
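
To make the "exact same name" point concrete, here is a minimal Python
sketch of a cache keyed on the literal URL string. It is a toy stand-in,
not Squid's actual cache; fetch() and fake_origin() are made-up names for
illustration only.

```python
# Toy cache keyed by the exact URL string (a stand-in, not Squid's code).
cache = {}

def fetch(url, origin_fetch):
    """Return (body, was_hit); origin_fetch is only called on a miss."""
    if url in cache:
        return cache[url], True
    body = origin_fetch(url)
    cache[url] = body
    return body, False

def fake_origin(url):
    # Hypothetical origin: both mirrors serve identical content.
    return "<html>same page</html>"

first = fetch("http://www1.somedomain.com/index.html", fake_origin)
mirror = fetch("http://www2.somedomain.com/index.html", fake_origin)
again = fetch("http://www1.somedomain.com/index.html", fake_origin)
```

The mirror request is a miss even though the content is identical; only
the repeat of the first URL is a hit, and the cache ends up holding two
copies of the same page.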

Most squid configurations are run (inefficiently at that) so that a user
has to specify the proxy server in order to use the proxy. This is fine
in a security setup--they can't get out anyway if they don't configure it.
However, in ISP setups where vast numbers of users are hitting the web,
they might need to be talked through the configuration to make it work...
and as any ISP knows, ANYTHING that involves talking customers through a
configuration is bad news. So ISPs want to do transparent proxy.

Currently, I know of two ways to implement transparent proxy with
generally available free OSes and squid: a) via Linux's transparent proxy
config, or b) through IP Filter. Without patching Squid, both have their
advantages and disadvantages.

With Linux, you can use ipfw to redirect a connection being routed through
a system to a local port, EVEN THOUGH IT IS ADDRESSED TO A REMOTE SERVER.
Squid, in its Virtual Accelerator configuration, will pull the IP address
out of the socket, and "forward" the request (cached if possible) to the
real web server. This was initially intended to accelerate a web server
behind the proxy, reducing the load on the web server. However, it also
works in reverse... acting as an outgoing gateway. You have to run the
RunAccel script with the -V command line option on squid to activate this.
In addition, you want to turn on "Acceleration & caching" at the same
time, as some people might want to specify the proxy directly for ftp
etc. transactions. Also turn on the "uses host header" option.
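
As a rough illustration of "pulling the IP address out of the socket",
the Python sketch below reads the local address of an accepted
connection and builds an IP-based URL from it. It assumes, as the text
above implies, that a redirected socket reports the originally addressed
server as its local address. url_from_socket() is a made-up helper, not
Squid code, and the loopback demo has no redirect in place, so it just
sees 127.0.0.1.

```python
import socket

def url_from_socket(conn, path):
    """Build an IP-based URL from the local address of an accepted socket."""
    ip, port = conn.getsockname()[:2]
    # Leave the port out of the URL when it is the default HTTP port.
    host = ip if port == 80 else f"{ip}:{port}"
    return f"http://{host}{path}"

# Loopback demo: listen on an ephemeral port, connect, and inspect the
# accepted socket. With no redirect rule the address is plain 127.0.0.1.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
client = socket.create_connection(listener.getsockname())
conn, _ = listener.accept()
demo_url = url_from_socket(conn, "/index.html")
for s in (conn, client, listener):
    s.close()
```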

Now, this is fine and dandy for http/1.1 clients that use a Host header:
squid will build a URL request to be forwarded, of the format
http://123.123.123.123/url, with the Host header left unchanged (say, Host:
www.somedomain.com). Clients that don't send a Host header (such as older
MSIE and Netscape versions, Quicken's stock quote download, etc) will
still get to the desired site, but as always, they won't be able to access
sites that require the Host field. The main drawback of this is that
squid uses "http://123.123.123.123/url" for caching, thus sites with
several IP addresses will get redundantly cached, reducing hit rates.
Virtual configurations override the host header in the stock squid, thus
it won't fill in the server name in the rewritten URL.
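
The rewrite described above can be sketched in a few lines of Python.
This is only an illustration of the idea, not Squid's implementation;
rewrite_for_virtual_mode() is a hypothetical name, and the sketch assumes
a well-formed request line with single spaces.

```python
def rewrite_for_virtual_mode(raw_request, dest_ip):
    """Turn 'GET /path' into 'GET http://<dest_ip>/path'; headers untouched."""
    head, _, rest = raw_request.partition("\r\n")
    method, path, version = head.split(" ")
    return f"{method} http://{dest_ip}{path} {version}\r\n{rest}"

request = (
    "GET /index.html HTTP/1.0\r\n"
    "Host: www.somedomain.com\r\n"
    "\r\n"
)
rewritten = rewrite_for_virtual_mode(request, "123.123.123.123")

# The URL in the request line is what the object ends up cached under.
cache_key = rewritten.split(" ")[1]
```

Here cache_key comes out as http://123.123.123.123/index.html while the
Host header still says www.somedomain.com, which is exactly why a site
with several addresses gets cached once per address.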

IP Filter, on the other hand, actually changes the IP addresses in the
socket to be the local IP address (it can do other IP's too, but that
blows away the patches). This makes Squid unable to pull the desired IP
address from the connection--it's always the local host. As such, you CAN
NOT use the virtual configuration; you have to use the standard
accelerator with the Host flag turned on. Then squid will fill in the URL
server name based on the Host header, and work properly. Note--this ONLY
works if the client sends a Host header--and a lot of oddball http software
like Quicken and other stock quote updaters don't do this. They can be
used if they have a proxy setting though (but this means support work for
the ISP). In addition, a lot of computers still come with old versions of
MSIE, so it's a constant problem getting people updated for an ISP--the
voice of experience.
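
A small Python sketch of the Host-based approach, and of how it fails for
clients that send no Host header. Again this is an illustration under
simple assumptions (CRLF line endings, one header per line), not Squid's
parser; parse_host() and accelerator_url() are made-up names.

```python
def parse_host(raw_request):
    """Return the Host header value from a raw HTTP request, or None."""
    lines = raw_request.split("\r\n")
    for line in lines[1:]:
        if not line:
            break  # a blank line ends the header section
        name, _, value = line.partition(":")
        if name.strip().lower() == "host":
            return value.strip()
    return None

def accelerator_url(raw_request):
    """Build the URL from the Host header; None means no usable server name."""
    path = raw_request.split(" ")[1]
    host = parse_host(raw_request)
    return f"http://{host}{path}" if host else None

with_host = accelerator_url(
    "GET /quotes HTTP/1.0\r\nHost: www.somedomain.com\r\n\r\n"
)
without_host = accelerator_url("GET /quotes HTTP/1.0\r\n\r\n")
```

The first request yields a name-based URL; the second yields None, which
is the Quicken/old-MSIE case where there is nothing to go on.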

IP Filter DOES, however, have a way for a piece of software to query the
redirect database to find the true destination IP address. The patches
submitted to this mailing list do just this. In addition, the patches
rearrange the order of precedence: if the request has a Host field, squid
won't even bother with the lookup--it will use the Host field by default,
as it is the easiest way to get the info. Otherwise it will pull the IP
from the socket, and override that IP if IP Filter has a redirect listed.
In this way, all setups are handled more efficiently.
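
The order of precedence can be written out as a short Python sketch.
This is my reading of the behavior described above, not the patch itself.
The redirect_table dict is a hypothetical stand-in for IP Filter's
redirect database, keyed here by the client's (ip, port) pair; the real
query interface is not shown.

```python
def resolve_destination(host_header, socket_ip, client_addr, redirect_table):
    """Pick the server name or IP to put into the rewritten URL."""
    if host_header:
        return host_header      # cheapest source, no lookup needed
    real_ip = redirect_table.get(client_addr)
    if real_ip is not None:
        return real_ip          # redirect listed: override the socket IP
    return socket_ip            # no redirect entry: trust the socket

# Hypothetical redirect entry for one client connection.
table = {("10.1.1.5", 1042): "123.123.123.123"}

by_host = resolve_destination(
    "www.somedomain.com", "127.0.0.1", ("10.1.1.5", 1042), table)
by_redirect = resolve_destination(
    None, "127.0.0.1", ("10.1.1.5", 1042), table)
by_socket = resolve_destination(
    None, "123.123.123.123", ("10.1.1.9", 2000), table)
```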

Using the Host header, even in virtual configurations, isn't too
important for local web server acceleration--the web server is
probably one machine. But when using it to accelerate the entire
internet, it can make a dramatic difference in the efficiency--up to 10%
depending on the activity of users. If the URL is written with the IP
address, then that is what it caches by... CNN has 8 IP's,
www.microsoft.com has 15, Netscape has a bunch too, but only returns 1 at
a time when you query it (nice for proxies, bad for reliability),
www.weather.com has 19, etc. You get the point. If you cache on each of
the IP's, then obviously you won't get as high a hit rate. In addition,
you can't usefully query an upstream server, as your queries won't match
your upstream's URLs, even for the same site. Not cool...
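
To see why caching by IP hurts, here is a deliberately exaggerated toy
calculation in Python: one page, 16 requests, with the client's DNS
answers rotating over 8 addresses. The addresses and request counts are
invented for illustration; real traffic is far more mixed, which is why
the overall difference I quote above is much smaller.

```python
def hit_rate(keys):
    """Fraction of requests served from a cache keyed by the given strings."""
    seen = set()
    hits = 0
    for key in keys:
        if key in seen:
            hits += 1
        else:
            seen.add(key)
    return hits / len(keys)

# Hypothetical: the same page requested 16 times via 8 rotating addresses.
ips = [f"10.0.0.{n}" for n in range(1, 9)]
by_ip = [f"http://{ips[i % 8]}/index.html" for i in range(16)]
by_name = ["http://www.somedomain.com/index.html"] * 16

ip_rate = hit_rate(by_ip)      # 8 misses, then 8 hits
name_rate = hit_rate(by_name)  # 1 miss, then 15 hits
```

Keyed by IP, every address has to miss once before it can hit; keyed by
name, only the very first request misses.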

Currently, we are online with an unpatched IP Filter configuration (I just
finished testing the code last night), with about 2 gigs of cache,
servicing 260+ dialin customers and assorted dedicated line customers off
of a P200 with squid (128 megs of ram, NoVM config), with about a 36% hit
rate, about 13,000 hits/hour (24/7 average) at a load average of about .1
on a FreeBSD 2.2.2-Release (with a few patches) system. It is also
routing out to 4 T1 lines, and a 100Mb ethernet to service most of the
dialins. This is just a reference point of what you can do with a box as
a router, even with squid online.

Erik Brandsberg
CIO, The Link
Received on Tue Dec 09 1997 - 10:41:20 MST
