Ideas for the future of Squid

From: Henrik Nordstrom <hno@dont-contact.us>
Date: Sun, 26 Mar 2000 01:17:50 +0100

As you all probably know, Squid has a number of performance bottlenecks.
Disk I/O is one and is currently being addressed, but there obviously
are other major bottlenecks as well, presumably in the
networking/select/poll part and in the amount of data copying and
header parsing done.

The memory footprint to disk size ratio is currently also way too large,
making memory a major investment for any large cache. Memory is
also quite hard to scale up once the system is installed, making system
growth quite painful.

Stability is also a problem. Today a single fault in any part of the
code can bring the whole process down, forcing a quite lengthy restart
and breaking all ongoing/pending requests.

Squid consists of a number of major functional components:

* The client interface accepting requests from the clients

* The object database, keeping track of the current cache contents

* Different protocol modules for retrieving contents

* Storage policy

* On-disk storage/retrieval

* Access control

Around these functions there is also a large library of supporting code:

* DNS resolving / caching

* Redirectors

* Proxy authentication verification

* Memory handling

* and lots more

I think this should be divided into a number of "independent" processes:

* A number of networking processes accepting client requests, performing
access control and fetching content from other servers.

* A number of disk processes (one per fs/spindle)

* A process for DNS, proxy auth caching, long term client statistics
(for delay pools) and other shared services.

* A master process monitoring all the parts, restarting components as
necessary (a rough sketch of such a supervisor follows below).
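
To make the master process idea a bit more concrete, here is a very
rough sketch of what such a supervisor could look like, using plain
fork()/waitpid(). The worker roles and restart policy are made up for
illustration only; nothing of this exists in Squid today:

/* Hypothetical sketch only: a master process that forks a few worker
 * processes and restarts any that exit. The worker roles and restart
 * policy are invented for illustration. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

#define N_WORKERS 3
static const char *roles[N_WORKERS] = { "network", "disk", "services" };
static pid_t pids[N_WORKERS];

static void run_worker(const char *role)
{
    /* A real worker would exec its binary or enter its event loop here. */
    printf("worker %s running as pid %d\n", role, (int) getpid());
    pause();                    /* placeholder for the worker's main loop */
    exit(0);
}

static pid_t spawn(const char *role)
{
    pid_t pid = fork();
    if (pid == 0)
        run_worker(role);
    return pid;
}

int main(void)
{
    int i;
    for (i = 0; i < N_WORKERS; i++)
        pids[i] = spawn(roles[i]);

    /* When a worker dies, restart it. The other workers, and the
     * requests they are handling, are not touched. */
    for (;;) {
        int status;
        pid_t dead = waitpid(-1, &status, 0);
        if (dead < 0)
            break;              /* no children left */
        for (i = 0; i < N_WORKERS; i++) {
            if (pids[i] == dead) {
                fprintf(stderr, "restarting %s worker\n", roles[i]);
                pids[i] = spawn(roles[i]);
            }
        }
    }
    return 0;
}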

The main problem with having multiple processes is how to make efficient
inter-process calls, and to do this we probably have to make a large
sacrifice in portability. Not all UNIXes are capable of efficient
inter-process communication at the level required, and most require
some tuning. However, if layered properly we might be able to provide
full portability at the cost of some performance on platforms with
limited IPC capabilities.
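
One way to layer it could be to hide the transport behind a small
interface and let each platform plug in its fastest mechanism, with
socketpair() over AF_UNIX as the portable fallback. A rough sketch;
the names are made up for illustration:

/* Hypothetical IPC layer: callers only see this small interface, and
 * each platform installs its fastest transport behind it. socketpair()
 * over AF_UNIX is the portable fallback shown here. */
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

typedef struct ipc_channel {
    int fd;                     /* transport handle */
    ssize_t (*send)(struct ipc_channel *, const void *, size_t);
    ssize_t (*recv)(struct ipc_channel *, void *, size_t);
} ipc_channel;

static ssize_t unix_send(ipc_channel *c, const void *buf, size_t len)
{
    return write(c->fd, buf, len);
}

static ssize_t unix_recv(ipc_channel *c, void *buf, size_t len)
{
    return read(c->fd, buf, len);
}

/* Create a connected pair of channels, e.g. master <-> worker. A platform
 * with something faster (shared memory, doors, ...) would install other
 * function pointers here without the callers noticing. */
int ipc_channel_pair(ipc_channel *a, ipc_channel *b)
{
    int fds[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0)
        return -1;
    a->fd = fds[0]; a->send = unix_send; a->recv = unix_recv;
    b->fd = fds[1]; b->send = unix_send; b->recv = unix_recv;
    return 0;
}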

The object database I'd like to see distributed to the disk processes,
where each process maintains the database for the objects it has, with
only a rough estimate (e.g. something like a cache digest) collected
centrally.
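
Roughly what I have in mind for the central estimate is something along
the lines of the sketch below: a small Bloom-filter style bitmap per
disk process that can only answer "possibly here" or "definitely not
here". The size and hash seeds are just placeholders:

/* Rough sketch of the kind of digest a disk process could publish: a
 * small Bloom-filter style bitmap answering "possibly here" or
 * "definitely not here". Size and hash seeds are placeholders. */
#include <stdint.h>

#define DIGEST_BITS (1 << 20)   /* about 128 KB per disk process */

typedef struct {
    unsigned char bits[DIGEST_BITS / 8];
} cache_digest;

static uint32_t digest_hash(const char *key, uint32_t h)
{
    while (*key)
        h = h * 31 + (unsigned char) *key++;
    return h % DIGEST_BITS;
}

/* The disk process sets bits for every object it stores ... */
static void digest_add(cache_digest *d, const char *key)
{
    uint32_t a = digest_hash(key, 0x811c9dc5);
    uint32_t b = digest_hash(key, 0x01000193);
    d->bits[a / 8] |= 1 << (a % 8);
    d->bits[b / 8] |= 1 << (b % 8);
}

/* ... and the networking processes test a centrally collected copy
 * before bothering a disk process. A miss here is authoritative, a
 * hit only means "probably". */
static int digest_maybe_has(const cache_digest *d, const char *key)
{
    uint32_t a = digest_hash(key, 0x811c9dc5);
    uint32_t b = digest_hash(key, 0x01000193);
    return (d->bits[a / 8] & (1 << (a % 8))) &&
           (d->bits[b / 8] & (1 << (b % 8)));
}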

Any IPC should be carefully planned and performed at a macro level,
with as large operations as feasible and with proper error recovery in
case one of the components fails. If a networking process fails, only
the requests currently processed by that process should be affected;
similarly, if a disk process fails, only the requests currently served
from that disk process should be affected.
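
To illustrate what I mean by macro level: a single message could cover
a whole object fetch rather than a stream of small calls, and every
reply carries a status the caller can act on if the component on the
other end has failed. The field names below are invented for the
sketch, nothing more:

/* Invented message format for one macro-level operation: a networking
 * process asks a disk process for a whole object (or a large chunk of
 * it) in a single round trip instead of many small calls. */
#include <stdint.h>

enum ipc_op {
    IPC_OBJECT_GET,             /* fetch object data for a cache hit */
    IPC_OBJECT_PUT,             /* store a freshly fetched object */
    IPC_OBJECT_REPLY            /* data or an error coming back */
};

enum ipc_status {
    IPC_OK,
    IPC_NOT_FOUND,              /* object not held by this disk process */
    IPC_IO_ERROR,               /* disk trouble: caller goes to the origin */
    IPC_PEER_DEAD               /* filled in locally when the peer has died */
};

struct ipc_msg_header {
    uint32_t op;                /* enum ipc_op */
    uint32_t status;            /* enum ipc_status, used in replies */
    uint64_t request_id;        /* lets the caller match replies and time out */
    uint64_t offset;            /* starting offset within the object */
    uint32_t length;            /* payload bytes following this header */
    uint32_t key_len;           /* object key bytes following the payload */
};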

For DNS, proxy_auth and whatever else, some limited distributed caching
in the networking processes might be required to cut down on the number
of IPC calls, but the bulk of these caches should be managed centrally.
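
Something like the sketch below is what I have in mind for the
networking processes: a tiny lookaside table consulted before making an
IPC call to the shared services process. dns_ipc_lookup() is only a
placeholder for that call, it does not exist anywhere today:

/* Sketch of a per-process lookaside cache for DNS answers: a tiny
 * direct-mapped table consulted before making an IPC call to the
 * shared services process. */
#include <string.h>
#include <time.h>
#include <netinet/in.h>

#define LOCAL_DNS_SLOTS 512
#define LOCAL_DNS_TTL   60      /* keep local copies short-lived */

struct dns_slot {
    char name[256];
    struct in_addr addr;
    time_t expires;
};

static struct dns_slot local_dns[LOCAL_DNS_SLOTS];

/* the actual IPC call to the central DNS cache (placeholder) */
extern int dns_ipc_lookup(const char *name, struct in_addr *out);

static unsigned name_hash(const char *s)
{
    unsigned h = 5381;
    while (*s)
        h = h * 33 + (unsigned char) *s++;
    return h % LOCAL_DNS_SLOTS;
}

int dns_lookup(const char *name, struct in_addr *out)
{
    struct dns_slot *slot = &local_dns[name_hash(name)];
    time_t now = time(NULL);

    /* Fast path: a recently used answer kept in this process. */
    if (slot->expires > now && strcmp(slot->name, name) == 0) {
        *out = slot->addr;
        return 0;
    }

    /* Slow path: one IPC round trip to the central cache. */
    if (dns_ipc_lookup(name, out) != 0)
        return -1;

    strncpy(slot->name, name, sizeof(slot->name) - 1);
    slot->name[sizeof(slot->name) - 1] = '\0';
    slot->addr = *out;
    slot->expires = now + LOCAL_DNS_TTL;
    return 0;
}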

This requires a number of major changes to the code design. For
example, there will be no globally available StoreEntry structure to
connect things together.

Am I on the right track here, or am I completely out dreaming?

/Henrik