[SQU] identifying network failures from squid logs

From: Mike Dahlin <dahlin@dont-contact.us>
Date: Wed, 06 Sep 2000 18:18:16 -0500

I'm trying to analyze the NLANR cache logs
(ftp://ftp.ircache.net/Traces/) to understand the network and server
failure patterns. The idea is to look at the requests, classify each
request as "contacted the server" or "failed to contact the server" and
from that get some idea of the network failure rate seen.

The question is how to identify "failed" requests from the squid logs,
meaning requests that failed at the network layer rather than with an
application-layer error such as 404 Not Found.

It seems like the following would work: count trace records that report
code "504 " ("Gateway time out") as "network failure to server"; filter
out cache hits and requests satisfied by siblings; filter out strange
codes like 500 and 400 (not many of them, anyhow), and treat the rest as
"network successful connection to server".
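
To make the rule above concrete, here is a rough Python sketch of the
classification. It assumes squid's native access.log layout, where the
fourth whitespace-separated field is result-code/status (for example
TCP_MISS/504) and the ninth is hierarchy-code/peer. The names classify,
failure_rate, and EXCLUDED_STATUS are made up for illustration, and
treating any result code containing "HIT" as a cache hit is a
simplification on my part, not something the logs guarantee.

```python
from collections import Counter

# Assumption: the "strange" status codes to drop. The post names
# 500 and 400 as examples; the full set is a guess.
EXCLUDED_STATUS = {"400", "500"}


def classify(line):
    """Return "failure", "success", or None if the record is filtered out."""
    fields = line.split()
    if len(fields) < 9:
        return None  # malformed or truncated record
    result_code, _, status = fields[3].partition("/")
    hierarchy = fields[8].split("/")[0]
    if "HIT" in result_code:
        return None  # assumed to be served from this cache
    if "SIBLING" in hierarchy:
        return None  # satisfied by a sibling cache
    if status in EXCLUDED_STATUS:
        return None  # strange codes are dropped, not counted
    if status == "504":
        return "failure"  # counted as network failure to server
    return "success"  # counted as a successful connection to server


def failure_rate(lines):
    """Fraction of counted requests that were classified as failures."""
    counts = Counter(classify(line) for line in lines)
    total = counts["failure"] + counts["success"]
    return counts["failure"] / total if total else 0.0
```

Running failure_rate over each cache's log separately would give the
per-cache percentages, under the same assumptions.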

The only problem is that some of the results are a bit surprising. (1) I
see a lot of variability in failure rates across the installed squid
caches and (2) some of the caches report unexpectedly high
fail-to-connect-to-server rates. E.g., over one recent 3-day period, I
see sensible failure rates of 1% from pa, rtp, and sv and 3% for sd, but
I see remarkable failure rates of 11% for uc, 12% for pb, and (gulp)
32-34% for bo1 and bo2.

I have some 1-week traces of bo1 lying around from a month earlier, and
the failure rate for bo1 over that week was 20%, so the problem doesn't
seem to be just a "bad day".

These numbers seem implausible, but I can't figure out what I'm doing
wrong.

Any suggestions of why this wouldn't work (or better yet, the right way
to do this) would be much appreciated.

Thanks.
-mike

Received on Wed Sep 06 2000 - 17:19:52 MDT
