Extended Cache Statistics - Annex I to Contract
 


Project description: Extended Cache Statistics

Accepted for project funding by the TERENA Technical Committee, 4th December 1998.

1. Summary of proposal

A large number of administrators in today's caching community maintain a cache server based on the squid software. Squid development over the last six months has resulted in new versions with impressive improvements in performance, flexibility and, not least, stability. In contrast to the development of the caching software itself, there is an increasing lack of accompanying tools to administer a squid cache and to measure its effectiveness. The purpose of this project is to develop tools to gather and extract statistics from a number of squid caches. Additionally, the visualization of certain results is to be facilitated.

The squid software generates logfiles from which the administrator can learn much about the caching service - provided the necessary numbers can be extracted from the sheer amount of data. Due to the size of the logfiles and the cost of generating long-term statistics, many cache maintainers only look into these logfiles occasionally to get an impression of the current situation.
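
For illustration, the following minimal C++ sketch shows the kind of extraction involved. It assumes squid's native access.log layout, whose leading fields are timestamp, elapsed time, client, action/code, size, method and URL; the hit counting is deliberately crude, and the sketch is not part of any deliverable.

    // Minimal sketch: count TCP_HIT entries in squid's native access.log,
    // read from standard input.  Only the leading fields are parsed.
    #include <iostream>
    #include <sstream>
    #include <string>

    struct LogEntry {
        double      timestamp;    // UNIX time with millisecond fraction
        long        elapsed_ms;   // duration of the request
        std::string client;       // client address
        std::string action_code;  // e.g. TCP_HIT/200
        long        size;         // bytes delivered to the client
        std::string method;       // e.g. GET
        std::string url;
    };

    // Returns false if the line does not have the expected shape.
    bool parse_line(const std::string& line, LogEntry& e) {
        std::istringstream in(line);
        return static_cast<bool>(in >> e.timestamp >> e.elapsed_ms >> e.client
                                    >> e.action_code >> e.size >> e.method >> e.url);
    }

    int main() {
        std::string line;
        LogEntry e;
        long hits = 0, total = 0;
        while (std::getline(std::cin, line)) {
            if (!parse_line(line, e)) continue;  // skip malformed lines
            ++total;
            if (e.action_code.compare(0, 7, "TCP_HIT") == 0) ++hits;
        }
        std::cout << hits << " hits out of " << total << " requests\n";
        return 0;
    }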

The logfile is usually processed offline, and this approach has its advantages: online processing would require constantly polling the data of interest from squid, at a rate that must not be too high, whereas offline processing allows an arbitrarily fine granularity while working through the file.

Still, most solutions suffer from the sheer volume of possible and interesting data. Moreover, most data becomes more interesting when combined with other related data. The logfile parsing results must therefore be easy to incorporate into an SQL database. From this database, different sets of simple data can be combined into a more complex view of the behaviour of a cache or a set of caches, all with simple SQL statements. Still, most users and even administrators prefer easy-to-handle tools, so a custom-tailored web interface on top of the database should return graphs on the fly for the user-selected combination of views.
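
As a hedged sketch of this idea, the following C++ fragment simply prints the kind of SQL involved; all table and column names are invented for illustration, and in the real tool the statements would be issued through the database's client library.

    // Sketch: a simple per-cache, per-hour table as the parser could fill it,
    // and one "complex view" built from it with a plain SQL statement.
    // All table and column names are hypothetical.
    #include <iostream>

    int main() {
        const char* schema =
            "CREATE TABLE requests_per_hour ("
            "  cache    VARCHAR(32),"   // cache host name
            "  hour     INTEGER,"       // hours since the epoch
            "  requests INTEGER,"
            "  hits     INTEGER,"
            "  bytes    BIGINT )";

        // Hourly hit rate over all caches combined.
        const char* view =
            "SELECT hour,"
            "       SUM(hits) * 100.0 / SUM(requests) AS hit_rate_percent"
            "  FROM requests_per_hour"
            " GROUP BY hour"
            " ORDER BY hour";

        std::cout << schema << "\n\n" << view << "\n";
        return 0;
    }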

Currently, a few logfile processors are available, but each suffers from its own limitations. Some report numbers that others do not, some generate long ASCII-based reports, others are designed to produce images with coloured graphs. Moreover, there is no squid statistics tool which offers an interface to standard databases.

A very well known processor is 'calamaris', which suffers, not least, from being implemented in Perl. For instance, each weekday our 10 caches in the DFN caching service accumulate well over 4 GB of logfile data; calamaris spends over 18 hours on a high-performance workstation processing these data. Other logfile processors may be faster, but are less detailed in their output.

For performance reasons, and to get an impression of calamaris' potential, we developed a prototype implementation of calamaris in C++. This port is considerably faster, and might be sped up further for multiprocessor and/or multihost environments, but it currently lacks, among other things, support for the up-to-date version 2.x of squid.

2. Objectives

A promising prototype port of the well-known calamaris tool to the better-performing C++ already exists, as far as squid 1.x logfiles are concerned. So far, this prototype only produces textual results. The proposed project is to extend the parsing to squid 2.x and (perhaps) netcache logfiles, and to return the results in a way which is easy to incorporate into a database. Furthermore, a prototypical web interface module is to be prepared which shows how different on-the-fly views of the data can be achieved.
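
One possible output format, sketched below under the assumption of a simple per-cache, per-hour aggregation, is tab-separated rows, which the bulk loaders of common relational databases accept directly; the names and numbers are purely illustrative.

    // Sketch: emit aggregated counters as tab-separated rows for bulk loading.
    #include <iostream>
    #include <map>
    #include <string>
    #include <utility>

    struct Counters { long requests = 0, hits = 0, bytes = 0; };

    int main() {
        // Keyed by cache name and hour bucket; the parser would fill this.
        std::map<std::pair<std::string, long>, Counters> table;
        table[{"cache1", 254072}] = {1200, 480, 9500000};  // illustrative row

        for (const auto& row : table)
            std::cout << row.first.first     << '\t'   // cache
                      << row.first.second    << '\t'   // hour since the epoch
                      << row.second.requests << '\t'
                      << row.second.hits     << '\t'
                      << row.second.bytes    << '\n';
        return 0;
    }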

3. Deliverables

The scope of the work comprises several tasks:
  1. in co-operation with the TF-CACHE group, identify a set of interesting yet simple pieces of information to be extracted from the logfiles, preferably based on the information already available in calamaris.

  2. extend the C++ version to parse squid 2.x logfiles, and preferably also netcache logfiles. The results from (1) should be incorporated into this phase.

  3. design of an efficient scheme to store results in a relational database. Linear growth of the database may or may not be desirable; the topic will have to be discussed with TF-CACHE. If linear growth is not desirable, a scheme to reduce older data will have to be developed (one possible reduction step is sketched after this list).

  4. prototypical implementation of a web interface to extract interesting views from the database and demonstration of the on-the-fly generation of graphs.
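
To illustrate item 3, one conceivable (purely illustrative) reduction step collapses hourly rows older than a chosen horizon into daily rows and then deletes the originals. The sketch below only prints the SQL; table and column names are hypothetical, and :horizon stands for a bind parameter supplied at run time.

    // Sketch of a data-reduction step: roll hourly rows older than a horizon
    // up into daily rows, then purge the hourly originals.  Names hypothetical.
    #include <iostream>

    int main() {
        const char* reduce =
            "INSERT INTO requests_per_day (cache, day, requests, hits, bytes)"
            " SELECT cache, hour / 24, SUM(requests), SUM(hits), SUM(bytes)"
            "   FROM requests_per_hour"
            "  WHERE hour < :horizon"
            "  GROUP BY cache, hour / 24";

        const char* purge =
            "DELETE FROM requests_per_hour WHERE hour < :horizon";

        std::cout << reduce << "\n\n" << purge << "\n";
        return 0;
    }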

Deliverable schedule:

Phase                 Del. nr.  Date due  Title / Description
Initial               D1        SD+1      Brief outline of the minimal set of information which the logfile analyser should report.
Development           D2        SD+3      Version of Calamaris/C++ which can parse squid 2.x and netcache logfiles.
Database development  D3        SD+5      Design of a database to store compressed logfile data in an efficient manner; format to be agreed with TF-CACHE.
Web interface         D4        SD+6      Prototype web interface with automatic graph generation.

The project cut-off point will be 9 months after the project start. Any follow-up work identified during the project will be considered a new project.

4. Contribution commitments to the project

The TERENA caching task force (TF-CACHE) has a considerable number of members who are interested in the results of this project. All deliverables will be reviewed by the task force and, where possible, any recommendations will be taken into account.

5. Evaluation criteria

The results of the project will be evaluated by the TERENA caching task force (TF-CACHE) and their views reported back to the TERENA Technical Committee (TTC).

6. Change control mechanism

Modification of the project during its lifetime is subject to the following procedure:

  1. Preliminary consensus on the TF-CACHE discussion list.
  2. Initial approval by the TERENA PDO associated with TF-CACHE and by the Contractors.
  3. Final approval by the TERENA Technical Committee.


cache-admins@rvs.uni-hannover.de