Re: Thoughts about moving work off the main squid thread to achieve parallelism

From: Alex Rousskov <rousskov_at_measurement-factory.com>
Date: Tue, 08 Mar 2011 10:20:16 -0700

On 03/07/2011 12:35 PM, Ming Fu wrote:

> My Name is Ming Fu. I have worked on squid on and off since 2000.

> My current interest is to improve the performance of squid 3.

> What I am thinking of doing is to move the non-cacheable processing
> off the main squid thread.

> My assumption is the following:

> 1. A significant portion of replies from web servers are not cacheable.

Agreed.

> 2. Offloading non-cacheable processing from the main squid thread
> can save some CPU load. This is similar to what already happens for
> disk writes and unlink.

Not exactly. Removing CPU work from the main Squid thread is only
helpful if there is a spare CPU core to which that work can be moved
(and if moving/synchronization does not cost more than we gain from the
added parallelism). Using multiple CPU cores with low synchronization
overheads is what SMP Squid already does.

(Also keep in mind that direct disk access blocks the Squid process,
often for a long time, wasting available CPU cycles. Non-cacheable
processing does not do that. You could consider code execution itself as
"CPU blocking", in which case the above reasoning about multiple cores
still applies.)

> Two approaches I can think of:

> 1. Move the processing of non-cacheable replies to separate threads;
> these threads need not access the cache.

> 2. Push the work down to the kernel's socket layer. Some kind of
> kernel filter that is able to associate two sockets and copy the
> incoming data from one socket to another. Squid establishes the
> association and provides information for the kernel filter to detect
> the end of a reply (chunked encoding or content-length). The kernel
> breaks the association when one reply is processed and squid regains
> control of the sockets.

> Option 2 could potentially be faster than option 1, but it will
> depend on the OS platform. I come from a BSD background, so I have
> some confidence that this will be possible on FreeBSD.

You are correct: Non-cacheable responses currently suffer from some
caching-related overheads. Removing those overheads would help make
Squid faster.

I do not think it is a good idea to somehow move processing of
non-cacheable responses to a different thread or process because the
problem is _not_ that non-cacheable responses are blocked on cacheable
responses (there should be no blocking disk I/O in a
performance-sensitive Squid worker, even if it caches). The problem is
that non-cacheable responses have to go through some useless (for them)
caching code.

Removal of that useless processing is the right solution, IMO. Moving
transactions to a different process or thread (beyond what SMP Squid
already does) will just add overheads.

The primary obstacle to the optimization you want is Squid's
assumption that all objects coming from the server side go through
Store. This adds a lot of needless processing for non-cacheable
responses, including multiple memory copies.

IMO, a better design would be to make the server side capable of feeding
responses to the client side directly. In other words, separate the
"subscribe to receive response" interface from Store and have both Store
and the server side implement that interface.
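A rough sketch of that split might look like the following. All class and function names here are invented for illustration; they are not actual Squid classes.

```cpp
#include <functional>
#include <string>

// Hypothetical "subscribe to receive response" interface, separated
// from Store so the client side need not know where bytes come from.
class BodySource {
public:
    virtual ~BodySource() = default;
    // The consumer callback receives each chunk of response body.
    virtual void subscribe(std::function<void(const std::string &)> consumer) = 0;
};

// Server side feeds bytes straight to the subscriber: no Store copies.
class ServerSide : public BodySource {
public:
    explicit ServerSide(std::string body) : body_(std::move(body)) {}
    void subscribe(std::function<void(const std::string &)> consumer) override {
        consumer(body_); // in real code, delivered as data arrives
    }
private:
    std::string body_;
};

// Store implements the same interface for cache hits.
class StoreEntrySource : public BodySource {
public:
    explicit StoreEntrySource(std::string cached) : cached_(std::move(cached)) {}
    void subscribe(std::function<void(const std::string &)> consumer) override {
        consumer(cached_);
    }
private:
    std::string cached_;
};

// The client side is written once, against BodySource only.
std::string clientSideReceive(BodySource &src) {
    std::string reply;
    src.subscribe([&reply](const std::string &chunk) { reply += chunk; });
    return reply;
}
```

The point of the design is that `clientSideReceive` works identically for a cache hit and for a non-cacheable miss, while the miss path skips Store entirely.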

Moreover, we have already done pretty much the same thing for requests:
The server-side code can receive requests from multiple sources
(client-side, ICAP, eCAP), without really knowing where the request is
coming from. I believe the same should be done for handling traffic in
the opposite direction.

Many pieces of the required interface are already implemented and can be
reused. If you want to work on this, let's discuss specifics!

As for TCP splicing, sendfile(), and other low-level optimizations, they
can happen on top of the streamlined processing outlined above. As Amos
has already noted, those optimizations will need to be mindful of ACLs,
adaptation, and other code that wants to retain some control over
response handling, though not all environments have such code enabled.

Moreover, we may use the same low-level optimizations for to-HTTP,
to-ICAP, and from-ICAP traffic streams as well! The key is to have the
single "message passing" interface mentioned above so that low-level
optimizations can be inserted between any appropriate "sides" without
duplicating optimization code or side code.

Cheers,

Alex.
Received on Tue Mar 08 2011 - 17:20:36 MST