[squid-users] Re: Logfile analyzing

From: Adam Aube <aaube01@dont-contact.us>
Date: Wed, 26 Jan 2005 20:42:40 -0500

airplays55@yahoo.com wrote:

> I checked out the squid log analyzer programs, but
> haven't found one that can provide a sample output
> like what I need to see on the report.

> Say for example I go to microsoft.com, click on
> "products", then click on "visual studio .NET"

> I'd like to see this in the logfile:

> http://www.microsoft.com
> http://www.microsoft.com/products
> http://www.microsoft.com/products/visual_studio

> This is a theoretical example as if those are the
> actual URL locations typed into the address bar, or
> clicked via hyperlink.

> I don't see how the access.log can be used to provide
> this kind of report.

In this case, the first request Squid sees (and logs in access.log) will be
for the URL typed into the address bar. Any redirects or additional content
will be logged after it.

> For example, if I simply type microsoft.com in my
> address bar and click on "office" in the left pane,
> then check my access.log, I see 35 entries have been
> added just by clicking the "office" link once.

The first one will be for the page the hyperlink points to, and the rest
will be for any redirects and/or additional content needed for the page.

> the access.log doesn't seem to differentiate between what
> the user clicked, and what the webpage requested to
> display the whole page correctly.

Because Squid doesn't see what the user clicked (in this case, "Office") -
Squid sees the URL the hyperlink points to (which is what the browser
actually requests).

> More specifically, the first 3 entries say:
>
> 127.0.0.1 - - [22/Jan/2005:15:56:31 -0500] "GET
> http://g.microsoft.com/mh_mshp/2 HTTP/1.1" 301 538
> TCP_MISS:DIRECT

If you check in the browser, this is the URL the "Office" hyperlink points
to. Again, Squid sees requested URLs, not how the hyperlink was displayed
to the user by the browser.

In this case, the HTTP status is 301 (Moved Permanently), which means this
is a redirect.

> 127.0.0.1 - - [22/Jan/2005:15:56:32 -0500] "GET
> http://office.microsoft.com/home/default.aspx
> HTTP/1.1" 301 467 TCP_MISS:DIRECT

This is another redirect.

> 127.0.0.1 - - [22/Jan/2005:15:56:32 -0500] "GET
> http://office.microsoft.com/en-us/default.aspx
> HTTP/1.1" 200 52134 TCP_MISS:DIRECT

The HTTP status code of 200 indicates that this is the page that was
ultimately shown to the user.
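
As a rough illustration of reading those three entries, here is a minimal
Python sketch. It assumes each entry sits on one line in the log (they are
wrapped above only by the mail client) and that the redirects and the final
page are logged consecutively. The names CLF, parse and landing_url are made
up for this example; they are not part of Squid or of any analyzer. Note that
access.log does not record the Location header, so the sketch cannot prove
that one entry caused the next.

```python
import re

# Pattern for the Common Logfile style lines quoted above, plus the
# trailing Squid result field (e.g. TCP_MISS:DIRECT).
CLF = re.compile(
    r'(?P<client>\S+) \S+ \S+ \[(?P<when>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+) (?P<result>\S+)'
)

# The three entries from the original message, unwrapped to one line each.
LINES = [
    '127.0.0.1 - - [22/Jan/2005:15:56:31 -0500] '
    '"GET http://g.microsoft.com/mh_mshp/2 HTTP/1.1" 301 538 TCP_MISS:DIRECT',
    '127.0.0.1 - - [22/Jan/2005:15:56:32 -0500] '
    '"GET http://office.microsoft.com/home/default.aspx HTTP/1.1" 301 467 TCP_MISS:DIRECT',
    '127.0.0.1 - - [22/Jan/2005:15:56:32 -0500] '
    '"GET http://office.microsoft.com/en-us/default.aspx HTTP/1.1" 200 52134 TCP_MISS:DIRECT',
]

REDIRECTS = {301, 302, 303, 307}


def parse(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    m = CLF.match(line)
    if m is None:
        return None
    entry = m.groupdict()
    entry["status"] = int(entry["status"])
    entry["size"] = int(entry["size"])
    return entry


def landing_url(lines):
    """Return the URL of the first entry that is not a redirect."""
    for entry in map(parse, lines):
        if entry is not None and entry["status"] not in REDIRECTS:
            return entry["url"]
    return None


print(landing_url(LINES))
```

For the entries above this skips the two 301 responses and reports the
en-us/default.aspx URL.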

> I don't see how the access.log can be used to provide
> this kind of report.

It can't. All Squid sees (and logs) is a series of HTTP requests from the
browser. It doesn't know which requests were triggered by the user or how
the responses were rendered by the browser.

Also, I see you are using the Common Logfile format. I would really
recommend you use the Squid native log format - most log analyzers can use
both, and the Squid native log format provides a great deal more detail.
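
To give an idea of the extra detail, a native-format entry carries a
timestamp with milliseconds, the elapsed time, the client address, the result
code and HTTP status, the size, the method, the URL, the ident field, the
hierarchy code and the content type. A minimal Python sketch for splitting
such a line follows. The SAMPLE line is invented for illustration (it is not
taken from a real log, and the peer address is a documentation address), and
Entry and parse_native are hypothetical names.

```python
from collections import namedtuple

Entry = namedtuple(
    "Entry",
    "time elapsed_ms client result status size method url ident hierarchy mime",
)

# Invented example line in Squid's native access.log layout.
SAMPLE = (
    "1106427392.104    245 127.0.0.1 TCP_MISS/200 52134 GET "
    "http://office.microsoft.com/en-us/default.aspx - DIRECT/192.0.2.10 text/html"
)


def parse_native(line):
    """Split one native-format line into an Entry, or None if it is too short."""
    fields = line.split()
    if len(fields) < 10:
        return None
    # The fourth field combines the Squid result code and the HTTP status.
    result, _, status = fields[3].partition("/")
    return Entry(
        time=float(fields[0]),
        elapsed_ms=int(fields[1]),
        client=fields[2],
        result=result,
        status=int(status),
        size=int(fields[4]),
        method=fields[5],
        url=fields[6],
        ident=fields[7],
        hierarchy=fields[8],
        mime=fields[9],
    )


entry = parse_native(SAMPLE)
print(entry.status, entry.mime, entry.elapsed_ms)
```

The content type and the sub-second timestamp are the two fields the Common
Logfile format lines above do not show.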

> How is ANY logfile analyzer going to tell the
> difference between the first entry (which the user
> clicked on) and the second/third entries (which were
> requested by the html from the first entry)?

Perhaps by content-type and timing (look at the first text/html request in a
series of requests within a small window of time from the same client). But
there's no way to know with 100% certainty. If you need that level of
certainty, you should be looking at the browser history and not your proxy
logs.
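
A minimal Python sketch of that heuristic, under stated assumptions: the
entries have already been parsed into (time, client, status, url, mime)
tuples, a "series" is a run of requests from one client with no gap longer
than `window` seconds, and only status 200 responses count as pages (a
redirect may also carry a text/html body). The 2-second window, the function
name probable_pages and the sample data are all invented for illustration. It
is a guess at what the user clicked, not a certainty.

```python
def probable_pages(entries, window=2.0):
    """Pick the first 200 text/html request in each burst of requests per client."""
    last_seen = {}  # client -> time of that client's most recent request
    claimed = {}    # client -> True once the current burst has its page
    pages = []
    for when, client, status, url, mime in sorted(entries):
        # A long enough silence from this client starts a new burst.
        if client not in last_seen or when - last_seen[client] > window:
            claimed[client] = False
        last_seen[client] = when
        if not claimed[client] and status == 200 and mime.startswith("text/html"):
            pages.append((client, url))
            claimed[client] = True
    return pages


# Invented sample data: one redirect, a page with embedded content
# (including an embedded HTML frame), then a second page 30 seconds later.
SAMPLE = [
    (100.0, "10.0.0.1", 301, "http://example.com/", "text/html"),
    (100.2, "10.0.0.1", 200, "http://www.example.com/", "text/html"),
    (100.5, "10.0.0.1", 200, "http://www.example.com/style.css", "text/css"),
    (100.6, "10.0.0.1", 200, "http://www.example.com/ad.html", "text/html"),
    (130.0, "10.0.0.1", 200, "http://www.example.com/products", "text/html"),
    (130.3, "10.0.0.1", 200, "http://www.example.com/logo.gif", "image/gif"),
]

for client, url in probable_pages(SAMPLE):
    print(client, url)
```

On the sample data this reports the home page and the products page and
ignores the redirect, the stylesheet, the image and the embedded frame. It
will still guess wrong when a user clicks twice within the window, or when a
page loads slowly enough to split into two bursts.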

> Is there a squid configuration parameter that will
> allow the logs to be filtered appropriately?

No - because what the browser sends to Squid and what the browser shows the
user are two entirely different things. Again, for the information you
want, the browser's history is the best place to look.

Adam
Received on Wed Jan 26 2005 - 18:43:02 MST
