Re: [squid-users] Squid Performance Issues - reproduced

From: Andres Kroonmaa <andre@dont-contact.us>
Date: Sun, 05 Jan 2003 15:43:29 +0200

On 5 Jan 2003 at 0:04, Henrik Nordstrom wrote:

> Andres Kroonmaa wrote:
>
> > We gain most if we have all our IO requests pending in kernel. For that
> > all threads must be active and blocked in kernel. In current squid this is
> > done by sending cond_signal per request. Either all threads run or one
> > thread is awaken per request. If we loose signal, we loose concurrency.
>
> Actually not as much as one may think.
>
> In the assumed worst case only one new thread gets scheduled per comm
> loop. However, assuming there is concurrent traffic triggering new I/O
> events and a small queue of I/O requests due to blocked threads then the
> signals will gradually kick more and more threads alive.
>
> As an end result all threads will most likely be running if you push the
> system a little, even if only one thread is awakened per comm loop.

 Yes, but that's like what our 200kb/sec case showed. We rely on other traffic.
 If it isn't there, we lag. Suppose a spike of requests, say 100, after which clients
 just sit there and wait for data, and new poll events don't relate to aufs.
 If we have spun off only 1-2 threads, they'll wait on each other. One slow
 open might block already-opened fast file I/O. A new request from one client
 slows down a fast client's stream.

> There is a theoretical even worse case if no rescheduling takes place for
> many comm loops because new network events are received all the
> time, keeping the main thread running. But as I said previously I have no
> evidence of this happening, not to the level that it is noticeable at least.

 This can't really happen for more than 1 systick. The main thread's CPU time
 expires and the CPU is taken away from it forcibly. Things can get worse if the
 whole system is at load 20-90, when there is a constant lack of CPU resources.
 But squid alone is unable to trigger that.

 During short spikes of net I/O this might happen. It would be interesting to
 detect that somehow. But quite often the main thread makes a switch to aio
 at cond_signal time.

> The exact conditions on when and why linuxthreads reschedules "quickly"
> (i.e. before the "main" thread blocks) are not yet identified. I have only
> seen this happen in Squid, not in simpler test programs.

 Linux is fancy. It uses CPU profiling counters, if available, to measure what
 percentage of its systick a given thread has consumed, and based on that and
 some weighting it decides whether it is ok to make a thread switch or not. So it
 may be that if a thread has just unblocked, it will not make a thread switch,
 while if it has consumed about 10-20% of its systick, a thread switch becomes
 increasingly probable. Optimisations...

 As we do quite a lot in comm loops, it might be that the first aio requests don't
 cause a thread switch, while at the end of the comm loop they do. Quite nice actually.

> I do not like delaying the initial signal. If the I/O thread can it
> should be given a chance to start the execution of the I/O operation as
> soon as possible to reduce latency. But I do agree that the additional
> signalling to get more threads running if there is a queue built up can
> be moved down to the threads. Can be done as simple as only signal from
> the main thread if this is the first request, and then have the threads
> signal one additional thread for each request they pick up while there are
> more requests in the queue, with a counter to avoid signalling all
> together if there are no idle threads (if maintaining the counter is
> cheaper than signalling an unused cond variable, not sure about this).

 exactly.
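
 To make sure we mean the same thing, here is a rough sketch of that scheme
 (names like aio_enqueue and idle_threads are mine, not the actual async_io.c
 code):

    #include <pthread.h>
    #include <stddef.h>

    typedef struct aio_request {
        struct aio_request *next;
        /* ... fd, operation, buffer, offset ... */
    } aio_request;

    static pthread_mutex_t queue_mutex = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  queue_cond  = PTHREAD_COND_INITIALIZER;
    static aio_request *queue_head = NULL;
    static aio_request *queue_tail = NULL;
    static int idle_threads = 0;

    /* Main thread: enqueue, and signal only on the empty->non-empty
     * transition, and only if somebody is actually idle. */
    void aio_enqueue(aio_request *req)
    {
        int was_empty;
        pthread_mutex_lock(&queue_mutex);
        was_empty = (queue_head == NULL);
        req->next = NULL;
        if (queue_tail)
            queue_tail->next = req;
        else
            queue_head = req;
        queue_tail = req;
        if (was_empty && idle_threads > 0)
            pthread_cond_signal(&queue_cond);
        pthread_mutex_unlock(&queue_mutex);
    }

    /* I/O thread: take one request; if more remain and idle threads
     * exist, wake exactly one more before going off to do the I/O. */
    aio_request *aio_dequeue(void)
    {
        aio_request *req;
        pthread_mutex_lock(&queue_mutex);
        idle_threads++;
        while (queue_head == NULL)
            pthread_cond_wait(&queue_cond, &queue_mutex);
        idle_threads--;
        req = queue_head;
        queue_head = req->next;
        if (queue_head == NULL)
            queue_tail = NULL;
        else if (idle_threads > 0)
            pthread_cond_signal(&queue_cond);   /* cascade the wakeup */
        pthread_mutex_unlock(&queue_mutex);
        return req;
    }

 The main thread only pays for the empty-to-non-empty signal; the rest of the
 wakeups cascade between the iothreads, which is exactly the point.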

> > SMP gives very little, because userspace cpu-time of IO-thread is
> > almost nil, most of the time the system waits for IO hardware. But we
> > need threads for that to happen. So we have IO threads. But whether
> > they are bound to same cpu as main thread or any other cpu is
> > rather irrelevant.
>
> Not entirely, when discussing when an I/O thread may be scheduled to
> the CPU.

 It isn't, most of the time. My experience shows very little CPU usage for the
 second CPU on a dual system when running a single squid, aufs or diskd.

> > More fruitful would be make sure that thread awakening
> > and thus its overhead would be wasted outside main
> > thread, which means during its sleep-time.
>
> Perhaps, if it can be shown that this is less overhead than to keep
> track of what is needed to do it externally.

 I would like to see aio request queueing made extremely lightweight, to the
 point that we just overwrite a linked-list head pointer and that's it. Let the
 aio subsystem wake itself up, and enough of the workers to handle the request
 list. Imo this would consume the least CPU in the main thread, and would give
 the OS the best chance to schedule all of the aio onto another CPU.
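
 Roughly what I have in mind, as an illustration only (the names and the
 C-atomics are mine, not a patch against the real code):

    #include <stdatomic.h>
    #include <stddef.h>

    typedef struct aio_req {
        struct aio_req *next;
        /* ... fd, operation, buffer, offset ... */
    } aio_req;

    static _Atomic(aio_req *) pending_head;

    /* Main thread: push the request and return. No mutex, no cond_signal,
     * no thread switch. */
    void aio_push(aio_req *req)
    {
        aio_req *old = atomic_load(&pending_head);
        do {
            req->next = old;
        } while (!atomic_compare_exchange_weak(&pending_head, &old, req));
    }

    /* aio side: grab the whole list in one exchange and hand it to however
     * many workers are needed. Waking them up is the aio subsystem's
     * problem, not the main thread's. */
    aio_req *aio_grab_all(void)
    {
        return atomic_exchange(&pending_head, NULL);
    }

 The producer side is just a pointer swap; waking enough workers becomes the
 aio subsystem's problem, outside the main thread.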

> > There is a question whether it makes sense at all to pass such ops to a
> > thread instead of completing them from the main thread. If the thread
> > switch/scheduling overhead is higher than the cpu overhead of nonblocking
> > io from the main thread, then no.
>
> The problem is how to find out. We cannot reliably tell if an I/O
> request will block or not, but in quite many cases it won't because of
> os level caching already having the needed metadata cached in memory.

 Yes, for reads, sure. But writes? Unless they are raw unbuffered writes,
 all that happens is a buffer copy to kernel space. If that buffer is full, the
 write would return with EAGAIN. Only then perhaps pass it on to the threads.
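
 Something along these lines (a sketch only; queue_to_io_thread is a made-up
 placeholder for handing the job to the thread pool):

    #include <errno.h>
    #include <unistd.h>

    /* Hypothetical helper that hands the job to an I/O thread. */
    extern ssize_t queue_to_io_thread(int fd, const void *buf, size_t len);

    /* Try the write inline first; only fall back to a thread if the
     * kernel could not take the data right away. */
    ssize_t write_inline_or_queue(int fd, const void *buf, size_t len)
    {
        ssize_t n = write(fd, buf, len);
        if (n >= 0)
            return n;                    /* completed from the main thread */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return queue_to_io_thread(fd, buf, len);
        return -1;                       /* real error, report upwards */
    }

 If the write completes inline we never pay for the thread switch at all.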

> > btw, pipe is bidirectional. how about threads blocking on read from pipe?
> > How does that differ from mutex/cond signalling in terms of overhead?
>
> My gut feeling is that the cond signalling is a lot cheaper than having
> multiple threads read from the same pipe.

 Yes, it seems so. Why I thought of a pipe is that it is a sort of cond_signal
 in the OS that's buffered, i.e. it does not cause an immediate thread switch,
 but needs a scheduler pass, i.e. it more likely allows the other CPU to kick in.
 But it would cause the iothreads to wake up later. Seems we have to make a
 choice: either no latency, and iojobs start quickly but we waste main-thread
 CPU time, or we have latency but the main thread has more time to run.
 By threading style we should not care, but we do, because our main
 thread cannot utilise MP; it runs on a single CPU only.
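
 For the record, what I was picturing with the pipe (a sketch, not tested
 against the real aufs code):

    #include <pthread.h>
    #include <unistd.h>

    static int wake_pipe[2];        /* set up once with pipe(wake_pipe) */

    /* Main thread: queue the request (not shown), then write one byte.
     * The byte sits in the pipe buffer, so no immediate thread switch;
     * an iothread picks it up whenever the scheduler gets to it,
     * possibly on the other cpu. */
    void aio_kick(void)
    {
        char c = 0;
        (void)write(wake_pipe[1], &c, 1);
    }

    /* iothread main loop: one byte read per request taken. */
    void *aio_thread_loop(void *arg)
    {
        char c;
        (void)arg;
        for (;;) {
            if (read(wake_pipe[0], &c, 1) != 1)
                continue;           /* EINTR and friends */
            /* pop one request off the queue (under its mutex) and do the IO */
        }
        return NULL;
    }

 The kernel buffering of that one byte is where the extra latency, but also
 the extra scheduling freedom, comes from.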

 Btw, interesting bit in the Linux manpage. It claims that if multiple threads
 possibly sleep on cond_wait, cond_signal is marginally faster only "if it
 can be proved that exactly one thread needs to be awakened. In doubt,
 use pthread_cond_broadcast." I read this as: if we may need to wake more
 than one thread, we'd better use cond_broadcast. I have not seen such advice
 before. Upon broadcast, all threads must respin on the mutex, and those that
 miss the variable see a spurious wakeup. Sounds like quite some waste of CPU.
 Dunno.
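
 The saving grace is the usual waiter loop (same made-up names as in my
 sketch above): whichever way the wakeup arrives, signal, broadcast or
 spurious, the thread just re-checks the queue and goes back to sleep if
 there is nothing for it:

    pthread_mutex_lock(&queue_mutex);
    while (queue_head == NULL)              /* re-check after every wakeup */
        pthread_cond_wait(&queue_cond, &queue_mutex);
    /* ... take one request off the queue ... */
    pthread_mutex_unlock(&queue_mutex);

 So broadcast is safe; it just makes every sleeping thread do that mutex
 round trip, which is the cpu waste I mean.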

> > There are spurious wakeups. To my understanding, currently there can't
> > be false signalling. All we'll see is spurious wakeups. That can be quite
> > large a number. Empty queue would mean some "old" thread managed
> > to grab the request before this thread unblocked.
>
> Yes, but also means we signalled a thread which was not needed.

 No, spurious wakeups happen on every kill signal, VM fault, etc. I have
 no idea how to distinguish a spurious wakeup from our signalling.
 And currently an old thread can't actually grab anything, because the
 request queue is protected by the mutex.

> > Gotta put quite a lot of counters there, not only these. I for eg would like
> > to measure the time between the main thread signalling and the iothread
> > grabbing the req. cpuProf maybe. Could show a lot about how much real overhead there is.
>
> Not sure if cpuProf is at all usable in SMP. Depends on whether these
> CPU-based counters are included in the thread context or not.

 These CPU counters are hardware, like eax. On SMP systems they tick in
 sync, since the CPUs use the same clock. So they are always available, and
 they read the same value no matter which CPU. The only thing to solve is the
 shared structs.
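
 For reference, on ia32 reading them is basically just the rdtsc instruction
 (gcc inline asm, sketch):

    /* Read the CPU timestamp counter; rdtsc puts the result in edx:eax. */
    static inline unsigned long long read_tsc(void)
    {
        unsigned int lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((unsigned long long)hi << 32) | lo;
    }

 Per-thread accumulators would sidestep the shared-struct problem; only the
 final totals need a lock.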
Received on Sun Jan 05 2003 - 05:29:50 MST

This archive was generated by hypermail pre-2.1.9 : Tue Dec 09 2003 - 16:19:05 MST