squid log analysis

From: Oskar Pearson <oskar@dont-contact.us>
Date: Mon, 19 Jan 1998 20:42:47 +0200

Hi

I have been working on a log-analysis script for squid.

Basically:

The squid logs are damn big. We want to see funky stats for the whole
time the caches have been running, but the logs come in at 1/4 gig a day,
so you cannot keep a month's worth of stats.
(Does this problem sound familiar?)

We also want to be able to say 'show me the hit rate by bytes between
1st September and 1st of October... last year'.

Now - since the logs are huge, the second option becomes impossible for a
heavily loaded cache.

Solution:

Split the logs into 100-second segments. Summarize each 100 seconds
and dump the data into a database. While keeping almost every stat
I can think of offhand (with some exceptions below), this reduces my
400MB of data to a grand total of 800KB a day. And I can pull all of the
stats out of it in 6 seconds.

(analysis is about 3->4 times faster than calamaris - see
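
A rough sketch of the idea, in Python rather than the Perl the scripts are written in. It assumes squid's native access.log layout (timestamp first, result code in the fourth field, size in the fifth) and keeps only three counters; the field names and the sample lines are made up for illustration and are not the fields the real scripts store.

```python
from collections import defaultdict

BUCKET_SECONDS = 100  # segment length described above


def summarise(lines):
    """Collapse access.log lines into one small record per 100-second bucket."""
    buckets = defaultdict(lambda: {"requests": 0, "bytes": 0, "hit_bytes": 0})
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip malformed lines
        timestamp = float(fields[0])
        result_code = fields[3].split("/")[0]  # e.g. TCP_HIT
        size = int(fields[4])
        # Key is the start of the 100-second window the request falls into.
        key = int(timestamp) // BUCKET_SECONDS * BUCKET_SECONDS
        rec = buckets[key]
        rec["requests"] += 1
        rec["bytes"] += size
        # Assumption: any TCP_* code containing "HIT" counts as a local hit.
        if result_code.startswith("TCP_") and "HIT" in result_code:
            rec["hit_bytes"] += size
    return dict(buckets)


def byte_hit_rate(buckets, start, end):
    """Hit rate by bytes over the buckets whose key lies in [start, end)."""
    total = hit = 0
    for key, rec in buckets.items():
        if start <= key < end:
            total += rec["bytes"]
            hit += rec["hit_bytes"]
    return hit / total if total else 0.0


# Made-up sample lines in the assumed format.
sample = [
    "885234100.100 120 10.0.0.1 TCP_HIT/200 1000 GET http://a.example/ - NONE/- text/html",
    "885234150.500 300 10.0.0.2 TCP_MISS/200 3000 GET http://b.example/ - DIRECT/b.example text/html",
    "885234205.000 90 10.0.0.1 TCP_HIT/200 500 GET http://a.example/ - NONE/- text/html",
]
buckets = summarise(sample)
rate = byte_hit_rate(buckets, 885234100, 885234200)
```

Because the key is just the start time of the segment, a question like 'hit rate by bytes between two dates' becomes a sum over the keys in that range, without touching the raw logs again.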

Problems:

1) You can't keep 'the top site this week was'. Since the key of the
        database is the time this causes problems. I have toyed with
        the idea of keeping a separate database for site statistics
        (I keep a separate one to keep stats as to which IPs are denied access).

   Since I am mostly interested in the ratio of '.com','.net' vs local
        sites I just summarise that.

2) Things may slow down with a large dataset. I have no idea how
        db format files are going to handle a huge database with keys
        that are numeric (and always increasing). I have only done
        stuff with a day's worth of data.
        You could always rotate databases though.

3) Perl does some weird stuff with associative arrays. In normal Perl
        you can have a 'hash of arrays' (one key, multiple elements
        associated with it). When you bind it to a DB file you can't
        do this. You have to 'split' them.
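
The same restriction shows up in other languages, since a db file only stores flat strings. Here is a small illustration of the 'split' workaround in Python (not the Perl from the scripts), using the standard dbm.dumb module; the key, the three numbers and the space separator are made up for the example.

```python
import dbm.dumb
import os
import tempfile

SEP = " "  # assumed separator; safe here because the values are plain integers


def pack(values):
    """Flatten a list of counters into one string the db file can hold."""
    return SEP.join(str(v) for v in values)


def unpack(raw):
    """Split the stored string back into a list of integers."""
    return [int(v) for v in raw.decode("ascii").split(SEP)]


path = os.path.join(tempfile.mkdtemp(), "stats")

# One key (the segment start time), several elements packed into one value.
db = dbm.dumb.open(path, "c")
db["885234100"] = pack([2, 4000, 1000])
db.close()

# Reopen and split the value back into its elements.
db = dbm.dumb.open(path, "c")
restored = unpack(db["885234100"])
db.close()
```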

Problems with my current code:

1) The user interface sucks. I don't really have one. I am going to
        write a web-interface for it so you can do funky things. Please
        don't concentrate on this for the moment. I am going to write
        this soon.
2) I can't open a database for reading only. This means that if you
        try and read stats from a database that doesn't exist it
        creates an empty one. :) The fix will take less time than
        this message.
3) I don't know much about perl modules. Some of my code is probably
        way off course!
4) I use global variables... if you can give me a reasonable 'struct'
        like thingum that works with a DB tie/opendb please let
        me know! My current method sucks, since you have to
        write the fields to the database in the same order every
        time or you start messing with values that aren't supposed
        to be messed with...
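
For what it is worth, one way a 'struct'-like record can look, sketched in Python rather than Perl: the field order is written down once, next to the packing code, and everything else refers to fields by name. The three field names are invented for the example and are not the fields the scripts actually keep.

```python
import struct
from collections import namedtuple

# Field order is defined in exactly one place (hypothetical field names).
Summary = namedtuple("Summary", ["requests", "total_bytes", "hit_bytes"])

# Three unsigned 64-bit integers in network byte order, matching Summary.
_LAYOUT = struct.Struct("!3Q")


def encode(summary):
    """Pack a Summary into the fixed-size byte string stored in the database."""
    return _LAYOUT.pack(*summary)


def decode(raw):
    """Unpack a stored byte string back into a Summary with named fields."""
    return Summary._make(_LAYOUT.unpack(raw))


stored = encode(Summary(requests=2, total_bytes=4000, hit_bytes=1000))
rec = decode(stored)
# Update by name; the position of the field never appears in calling code.
rec = rec._replace(hit_bytes=rec.hit_bytes + 500)
```

Adding a field then means changing the record definition and the layout string together, and nothing that reads or writes by name has to care about the order.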

I would appreciate it if you guys could have a look. I would like to
include this in squid-1.2 if Duane thinks it's worthwhile.

For you to do:

1) Check that we can calculate all the stats that you want from the
        info we keep.
2) Write scripts to get the stats out. output-template.pl is a good
        place to start, I guess.
3) Create a separate config file.
4) Generalise the 'coza' and 'za' cases that I have added for my use.
        Perhaps create a separate database that keeps track of domains
        (along with their bytes). You could create a 'watch'
        key that keeps a list of IPs that you want to keep an eye
        on. You would have to have a util to add and remove sites
        from this list...
5) Keep track of errors.
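
To make item 4 a bit more concrete, here is a sketch (Python, not the Perl in the tarball) of counting bytes per domain class from a configurable suffix list instead of hard-coded 'coza' and 'za' cases. The suffix list, the 'other' bucket and the sample URLs are all made up.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical suffix list; this would come from the config file.
SUFFIXES = [".za", ".co.za", ".com", ".net"]

# Check longer suffixes first so ".co.za" wins over ".za".
_ORDERED = sorted(SUFFIXES, key=len, reverse=True)


def classify(url):
    """Return the first configured suffix the URL's host ends with."""
    host = urlsplit(url).hostname or ""
    for suffix in _ORDERED:
        if host.endswith(suffix):
            return suffix
    return "other"


def bytes_by_class(entries):
    """Sum bytes per domain class for (url, size) pairs."""
    totals = Counter()
    for url, size in entries:
        totals[classify(url)] += size
    return totals


totals = bytes_by_class([
    ("http://www.is.co.za/", 1200),
    ("http://www.example.com/index.html", 800),
    ("http://ftp.example.ac.za/pub/", 300),
    ("http://www.example.org/", 50),
])
```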

Anything else?

I throw some fields away to suit our cache setup here... things like
sibling hits aren't counted (since we actually analyse the logs of the
caches that sent out that hit, counting them would merely double some stats).

The analysis scripts are currently v0.3. I expect to have another
version out tomorrow or the next day... so don't expect things to be
static - even the fields may change... :)

ftp://ftp.is.co.za/private/oskar/database-stats-0.3.tar.gz

Oskar

-- 
"Haven't slept at all. I don't see why people insist on sleeping. You feel
so much better if you don't. And how can anyone want to lose a minute -
a single minute of being alive?"				-- Think Twice
Received on Tue Jul 29 2003 - 13:15:45 MDT
