Re: [squid-users] Squid Performance Issues - reproduced

From: Henrik Nordstrom <hno@dont-contact.us>
Date: Thu, 02 Jan 2003 19:45:18 +0100

Andres Kroonmaa wrote:

> > What can be said is also that the likelihood that this won't happen
> > decreases a lot on SMP machines, as the main thread then has no
> > competition with the I/O threads for the CPU.
>
> I'd not be sure of that. Competition for cpu is not so much of an
> issue here, it's thread scheduling. Until the main thread's LWP
> reaches the scheduler, no IO thread rescheduling will happen. I see
> not much difference.

The likelihood of the I/O thread getting scheduled "immediately" is
rather high on a UP machine. Same thing on an SMP machine, but on an SMP
machine the I/O thread won't steal the CPU from the main thread, making
it much less likely that the I/O thread has finished before the main
thread gets back into the poll/select loop.

> Signal is heavy. Can't we have one dummy FD in the fd_set that is
> always polled for read, and when an IO thread is ready, write 0-1 bytes
> into it? Like a pipe? That would cause poll to unblock, and it allows
> multiple threads to write into the same FD.

Exactly.
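
In outline, the idea looks something like this (a minimal standalone
sketch with made-up names, not the Squid internals; the actual aufs
patch is attached below):

/*
 * Sketch: a "completion pipe" whose read end is always in the poll set.
 * An I/O thread writes one byte when a request completes, which makes
 * poll() return right away instead of sleeping out its full timeout.
 */
#include <poll.h>
#include <unistd.h>

static int notify_pipe[2];	/* [0] = read end (polled), [1] = write end */

static void
notify_init(void)
{
    pipe(notify_pipe);
}

/* called from an I/O thread when a request has completed */
static void
notify_main_thread(void)
{
    (void) write(notify_pipe[1], "!", 1);
}

/* main thread: the pipe is just one more fd among the network fds */
static void
main_loop_iteration(struct pollfd *net_fds, int n_net, int timeout_ms)
{
    struct pollfd fds[n_net + 1];
    char buf[256];
    int i;

    for (i = 0; i < n_net; i++)
	fds[i] = net_fds[i];
    fds[n_net].fd = notify_pipe[0];
    fds[n_net].events = POLLIN;

    poll(fds, n_net + 1, timeout_ms);

    if (fds[n_net].revents & POLLIN)
	(void) read(notify_pipe[0], buf, sizeof(buf));	/* drain the wakeup bytes */
    /* ... then service ready network fds and reap completed aio requests ... */
}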

> That's how threads work. Wishful thinking, ;) been there too. A thread
> switch will happen only when it is unavoidable. SMP optimisations..
> For that reason, yield() is almost 100% a no-op. Mutex unlock after
> cond_signal is much the same. It is about the async nature of threads:
> the OS "assumes" that you are going to cond_signal many threads before
> you block, so that it can then schedule the signalled threads on a cpu
> (batching).

On the other hand, I have other measurements which contradict this. Most
of the time the I/O threads get scheduled "immediately" here. Still
investigating.

> The only way to reliably kick a specific thread is through solid
> mutex handshaking. Even blocking in poll does not guarantee that
> after return the signalled thread has been run, especially on SMP
> systems that try to keep threads from migrating cpus. If we are
> blocked in poll long enough, they all obviously would have to get to
> run. But here lies another problem: we can't slow down network IO to
> make sure the aio threads get run. We may try to tweak priorities, but
> that's not enough unless we run the IO threads in the realtime class,
> I suppose.
>
> If I recall right, cond_wait is boolean, and if many threads are
> blocked on the cond, cond_signal does not make a thread switch, and
> another cond_signal is sent, then only 1 thread would be unblocked
> eventually.

yep.

> I assume that the mutex is put into a to-be-locked state upon the
> first cond_signal (owner unspecified, but one of the threads on wait),
> and a second attempt to signal would cause a solid thread switch to
> consume the first cond (because the mutex is "locked").

The documentation says the mutex should be locked prior to cond_signal
to avoid races.

I have also seen LWP systems crash if the mutex is not locked prior to
cond_signal.
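
To make the locking discipline concrete, a rough generic
producer/consumer sketch (not the actual aiops.c code) would look like
this:

#include <pthread.h>
#include <stddef.h>

struct request {
    struct request *next;
    /* ... payload ... */
};

static pthread_mutex_t queue_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t queue_cond = PTHREAD_COND_INITIALIZER;
static struct request *queue_head = NULL;

/* producer: enqueue under the mutex and signal while still holding it */
void
enqueue_request(struct request *r)
{
    pthread_mutex_lock(&queue_mutex);
    r->next = queue_head;
    queue_head = r;
    pthread_cond_signal(&queue_cond);
    pthread_mutex_unlock(&queue_mutex);
}

/* consumer (I/O thread): always re-check the predicate after waking up,
 * so a coalesced or "lost" signal costs only latency, never correctness */
struct request *
dequeue_request(void)
{
    struct request *r;

    pthread_mutex_lock(&queue_mutex);
    while (queue_head == NULL)
	pthread_cond_wait(&queue_cond, &queue_mutex);
    r = queue_head;
    queue_head = r->next;
    pthread_mutex_unlock(&queue_mutex);
    return r;
}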

> That most probably
> happens when there is a lot of activity with IO. But with only 1
> client, a thread switch would not happen until we block in poll.
> In this case yield is also a nop because the io threads are not yet
> scheduled on a cpu, thus there is nothing to run at yield() time.
> Using yield is bad coding in the SMP world; I'd suggest avoiding it.

The yield is mainly for UP systems.

> Basically, we have 2 problems to solve.

> 1) We need a reliable, lowest-overhead kickstart of the aufs IO
> threads at the end of a comm_poll run. Poll can return immediately
> without running the scheduler if there are FDs ready. Forcibly
> blocking in poll would cost a lost systick for network io. Therefore
> I think we'd need to think of some other way to get the io-threads
> running before going into poll. We only need to make sure the
> io-threads have grabbed their job and are on the cpu queue. Maybe
> even only the last cond_signal is an issue, if my above guess is
> right.
>
> 2) We need semi-reliable, lowest-latency notification of aio
> completion while poll is blocking. The latter is probably the more
> important. Could the pipe FD do the trick? A signal would, but at
> high loads it would cause a lot of overhead.

And of these, 2 is the most important, I think. From what I can tell the
I/O thread latency is fine even at high loads (measured on UP), but this
may or may not be due to the yield.

The signal implementation I had was shielded from load conditions by
only signalling if the main thread was in select/poll (save for races).
The signalling does not need to be foolproof, as things automatically
recover on a timeout (10ms). As long as it gets it mostly right, it will
be fine.

Using a pipe is also fine, and can easily be filtered for load
avoidance.

Attached is a small patch which introduces a completion "signal" pipe.

Comments?

And yes, you seem to be correct about the yield. This should probably be
ripped out of the code to save some small amount of CPU.

What to do about '1' I do not know, but I am not sure it is actually
needed either. In such a case the main thread will get a rather low
priority due to repeatedly eating up its quantum, so most schedulers
should give the I/O threads priority, I think.

Regards
Henrik

Index: src/fs/aufs/aiops.c
===================================================================
RCS file: /server/cvs-server/squid/squid/src/fs/aufs/aiops.c,v
retrieving revision 1.12.2.3
diff -u -w -r1.12.2.3 aiops.c
--- src/fs/aufs/aiops.c 2 Jan 2003 05:04:27 -0000 1.12.2.3
+++ src/fs/aufs/aiops.c 2 Jan 2003 17:45:54 -0000
@@ -156,9 +156,10 @@
 static struct {
     squidaio_request_t *head, **tailp;
 } done_requests = {
-
     NULL, &done_requests.head
 };
+static int done_fd = 0;
+static int done_signalled = 0;
 static pthread_attr_t globattr;
 #if HAVE_SCHED_H
 static struct sched_param globsched;
@@ -235,9 +236,19 @@
 }
 
 static void
+squidaio_fdhandler(int fd, void *data)
+{
+    char buf[256];
+    done_signalled = 0;
+    read(fd, buf, sizeof(buf));
+    commSetSelect(fd, COMM_SELECT_READ, squidaio_fdhandler, NULL, 0);
+}
+
+static void
 squidaio_init(void)
 {
     int i;
+    int done_pipe[2];
     squidaio_thread_t *threadp;
 
     if (squidaio_initialised)
@@ -281,6 +292,15 @@
     done_queue.requests = 0;
     done_queue.blocked = 0;
 
+    /* Initialize done pipe signal */
+    pipe(done_pipe);
+    done_fd = done_pipe[1];
+    fd_open(done_pipe[0], FD_PIPE, "async-io completetion event: main");
+    fd_open(done_pipe[1], FD_PIPE, "async-io completetion event: threads");
+    commSetNonBlocking(done_pipe[0]);
+    commSetNonBlocking(done_pipe[1]);
+    commSetSelect(done_pipe[0], COMM_SELECT_READ, squidaio_fdhandler, NULL, 0);
+
     /* Create threads and get them to sit in their wait loop */
     squidaio_thread_pool = memPoolCreate("aio_thread", sizeof(squidaio_thread_t));
     for (i = 0; i < NUMTHREADS; i++) {
@@ -401,6 +421,10 @@
         *done_queue.tailp = request;
         done_queue.tailp = &request->next;
         pthread_mutex_unlock(&done_queue.mutex);
+        if (!done_signalled) {
+            done_signalled = 1;
+            write(done_fd, "!", 1);
+        }
         threadp->requests++;
     } /* while forever */
     return NULL;
Index: src/fs/aufs/async_io.c
===================================================================
RCS file: /server/cvs-server/squid/squid/src/fs/aufs/async_io.c,v
retrieving revision 1.10.2.4
diff -u -w -r1.10.2.4 async_io.c
--- src/fs/aufs/async_io.c 10 Nov 2002 12:06:07 -0000 1.10.2.4
+++ src/fs/aufs/async_io.c 2 Jan 2003 17:45:54 -0000
@@ -88,6 +88,7 @@
         fd_close(fd);
 }
 
+
 void
 aioInit(void)
 {
@@ -97,7 +98,6 @@
     cachemgrRegister("squidaio_counts", "Async IO Function Counters",
         aioStats, 0, 1);
     initialised = 1;
-    comm_quick_poll_required();
 }
 
 void