Re: [squid-users] Squid Performance Issues - reproduced

From: Henrik Nordstrom <hno@dont-contact.us>
Date: Sun, 05 Jan 2003 00:04:11 +0100

Andres Kroonmaa wrote:

> We gain most if we have all our IO requests pending in kernel. For that
> all threads must be active and blocked in kernel. In current squid this is
> done by sending cond_signal per request. Either all threads run or one
> thread is awakened per request. If we lose a signal, we lose concurrency.

Actually not as much as one may think.

In the assumed worst case only one new thread gets scheduled per comm
loop. However, assuming there is concurrent traffic triggering new I/O
events and a small queue of I/O requests building up due to blocked
threads, the signals will gradually kick more and more threads alive.

As an end result all threads will most likely be running if you push the
system a little, even if only one thread is awakened per comm loop.

Remember that the threads only block on the cond if there are no pending
I/O requests.
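For reference, the scheme boils down to something like the sketch below
(hypothetical names, not the actual aiops code): the worker only sleeps on
the cond while the queue is empty, so a missed signal can at worst delay a
wakeup, never lose a queued request.

    #include <pthread.h>

    /* Hypothetical request type and queue; not the real Squid structures. */
    struct io_request {
        struct io_request *next;
        /* op, fd, buffer, callback, ... */
    };

    static pthread_mutex_t queue_mutex = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t queue_cond = PTHREAD_COND_INITIALIZER;
    static struct io_request *queue_head = NULL;

    static void *io_thread(void *arg)
    {
        for (;;) {
            struct io_request *req;
            pthread_mutex_lock(&queue_mutex);
            /* Sleep only while nothing is queued; a spurious wakeup or a
             * request stolen by another thread just loops back here. */
            while (queue_head == NULL)
                pthread_cond_wait(&queue_cond, &queue_mutex);
            req = queue_head;
            queue_head = req->next;
            pthread_mutex_unlock(&queue_mutex);
            /* ... perform the blocking open()/read()/write() here ... */
        }
        return NULL;
    }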

There is a theoretically even worse case where no rescheduling takes place
for many comm loops because new network events keep arriving all the time,
keeping the main thread running. But as I said previously, I have no
evidence of this happening, at least not to a level where it is noticeable.

> But if this can cause thread switch, then this eats cpu of main
> thread. And if actual IO request is so short in userspace cpu terms,
> we don't want to waste main thread time on scheduling.

Agreed, but from what I can see based on linuxthreads the CPU overhead
of cond_signal is very small, and does not by itself trigger a
reschedule.

The exact conditions on when and why linuxthreads reschedules "quickly"
(i.e. before the "main" thread blocks) are not yet identified. I have only
seen this happen in Squid, not in simpler test programs.

I do not like delaying the initial signal. If an I/O thread can run, it
should be given the chance to start executing the I/O operation as soon as
possible to reduce latency. But I do agree that the additional signalling
to get more threads running when a queue has built up can be moved down to
the threads. It can be done as simply as signalling from the main thread
only for the first request, and then having each thread signal one
additional thread for every request it picks up while more requests remain
in the queue, with a counter to avoid signalling altogether when there are
no idle threads (assuming maintaining the counter is cheaper than
signalling an unused cond variable; I am not sure about this).
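Continuing the hypothetical names from the sketch above, the cascaded
wakeup could look roughly like this; whether the idle counter actually pays
for itself is exactly the open question:

    /* idle_threads is protected by queue_mutex. */
    static int idle_threads = 0;

    /* Main thread, with queue_mutex held: signal only when the queue was
     * empty, and only if somebody is actually waiting on the cond. */
    static void enqueue_request(struct io_request *req)
    {
        int was_empty = (queue_head == NULL);
        req->next = queue_head;
        queue_head = req;
        if (was_empty && idle_threads > 0)
            pthread_cond_signal(&queue_cond);
    }

    /* Worker, with queue_mutex held inside its loop. */
    static struct io_request *dequeue_request(void)
    {
        struct io_request *req;
        while (queue_head == NULL) {
            idle_threads++;
            pthread_cond_wait(&queue_cond, &queue_mutex);
            idle_threads--;
        }
        req = queue_head;
        queue_head = req->next;
        /* Spin off one follower before going into the kernel, but only if
         * more work is queued and an idle thread exists to take it. */
        if (queue_head != NULL && idle_threads > 0)
            pthread_cond_signal(&queue_cond);
        return req;
    }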

Anyway, considering the assumption that when the I/O system is busy all
threads should be blocked most of the time, I think the current scheme
works out quite well as far as the signalling goes, assuming signalling an
empty cond variable is not a heavy operation.

Quick check.. cond_signal of an empty cond is a rather heavy operation
and should be avoided if possible. Changes things a little.
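For anyone wanting to repeat the quick check, a loop along these lines is
enough; the absolute numbers will of course depend on the threads library
and kernel:

    #include <pthread.h>
    #include <stdio.h>
    #include <sys/time.h>

    int main(void)
    {
        pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
        struct timeval t0, t1;
        double usec;
        int i, n = 1000000;

        gettimeofday(&t0, NULL);
        for (i = 0; i < n; i++)
            pthread_cond_signal(&cond);     /* no waiters on this cond */
        gettimeofday(&t1, NULL);

        usec = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
        printf("%.3f usec per cond_signal\n", usec / n);
        return 0;
    }

(Link with -lpthread; on glibc at least, leaving it out gets you the do-nothing
stub versions of the pthread calls and the result is meaningless.)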

> SMP gives very little, because userspace cpu-time of IO-thread is
> almost nil, most of the time system waits for IO hardware. But we
> need threads for that to happen. So we have IO threads. But whether
> they are bound to same cpu as main thread or any other cpu is
> rather irrelevant.

Not entirely irrelevant when discussing when an I/O thread may get
scheduled onto the CPU.

> More fruitful would be to make sure that thread awakening
> and thus its overhead would be wasted outside main
> thread, which means during its sleep-time.

Perhaps, if it can be shown that this is less overhead than keeping track
of what is needed to do it externally.

> There is question whether it makes at all sense to pass such ops to thread
> instead of completing from main thread. If thread sw/scheduling overhead
> is higher than cpu overhead of nonblocking io from main thread, then no.

The problem is how to find out. We cannot reliably tell whether an I/O
request will block or not, but in quite many cases it won't, because the
OS-level cache often already has the needed metadata in memory.

The problem is that when requests do block, they can block for quite an
extended period of time.

> What I tried to craft was a design where main thread signals only once
> a fact that there is request list waiting for service - one thread switch.
> Then main thread could continue or poll, and thread would pop 1 req
> and if it's not the last, signal one more thread before going into kernel.
> That way just before blocking on io, thread would spin off follower. Even
> if this signalling eats cpu, it would happen more likely during time when
> main thread has nothing left to do but poll.

Yes, a reasonable scheme can probably be designed to avoid signalling from
the main thread if the first signal has not been fully dealt with yet. The
mutex is already there, so adding a counter is not expensive; the problem
is to find a suitable filter for when to signal.

> btw, pipe is bidirectional. how about threads blocking on read from pipe?
> How that differs from mutex/cond signalling in terms of overhead?

My gut feeling is that the cond signalling is a lot cheaper than having
multiple threads read from the same pipe.

The proposed patch makes very sparse use of the pipe, using it as a
boolean signal whose only purpose is to unblock the poll/select loop.
Only the first thread to finish will signal the pipe.

This reminds me that there is quite some optimization to be done there in
filtering the pipe signal, to avoid "false" signals triggering the comm
loop for events already dealt with. Taking care of that now.
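For the record, the idea is roughly as below (again hypothetical names, not
the actual patch): the pipe carries no data of its own, it only kicks
poll(), and a flag keeps later finishers from writing redundant bytes.

    #include <fcntl.h>
    #include <pthread.h>
    #include <unistd.h>

    static int done_pipe[2];          /* done_pipe[0] sits in the poll() set */
    static int done_signalled = 0;    /* protected by done_mutex */
    static pthread_mutex_t done_mutex = PTHREAD_MUTEX_INITIALIZER;

    static void init_done_pipe(void)
    {
        pipe(done_pipe);
        fcntl(done_pipe[0], F_SETFL, O_NONBLOCK);
    }

    /* I/O thread, after completing a request: only the first finisher
     * writes, so the comm loop gets at most one wakeup per pass. */
    static void signal_done(void)
    {
        pthread_mutex_lock(&done_mutex);
        if (!done_signalled) {
            done_signalled = 1;
            write(done_pipe[1], "x", 1);
        }
        pthread_mutex_unlock(&done_mutex);
    }

    /* Main thread, when poll() reports done_pipe[0] readable: clear the
     * flag first, then drain and reap, so a completion racing in after
     * the drain still produces a fresh wakeup. */
    static void drain_done(void)
    {
        char buf[16];
        pthread_mutex_lock(&done_mutex);
        done_signalled = 0;
        pthread_mutex_unlock(&done_mutex);
        while (read(done_pipe[0], buf, sizeof(buf)) > 0)
            ;
        /* ... collect the completed requests here ... */
    }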

> There are spurious wakeups. To my understanding, currently there can't
> be false signalling. All we'll see is spurious wakeups. That can be quite
> large a number. Empty queue would mean some "old" thread managed
> to grab the request before this thread unblocked.

Yes, but it also means we signalled a thread which was not needed.

> Gotta put a lot of counters there, not only these. I for eg would like
> to measure time between main thread signalling and iothread grabbing req.
> cpuProf maybe. Could show a lot of how much real overhead there is.

Not sure if cpuProf is at all usable on SMP. It depends on whether these
CPU-based counters are included in the thread context or not.
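A plain wall-clock measurement would sidestep that: stamp the request when
it is queued and look at the delta when an I/O thread picks it up.
Something like this (hypothetical field names, and gettimeofday resolution
is the obvious limit):

    #include <sys/time.h>

    struct io_request {
        struct io_request *next;
        struct timeval queued_at;   /* set by the main thread, e.g.
                                       gettimeofday(&req->queued_at, NULL)
                                       just before signalling */
        /* op, fd, buffer, ... */
    };

    /* I/O thread, right after popping the request off the queue. */
    static long grab_delay_usec(const struct io_request *req)
    {
        struct timeval now;
        gettimeofday(&now, NULL);
        return (now.tv_sec - req->queued_at.tv_sec) * 1000000L
             + (now.tv_usec - req->queued_at.tv_usec);
    }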

Regards
Henrik