eventio and thoughts

From: Andres Kroonmaa <andre@dont-contact.us>
Date: Fri, 14 Sep 2001 18:33:20 +0200

 Folks,

 I've been thinking about efficient IO models for some time now, and
 I'd like to share some thoughts.

 What any high-performance "C10K" IO model ultimately tries to do is
 reduce overhead that contributes nothing to the final goal - the
 throughput. If we eliminate all disk bottlenecks, optimise all code
 paths and omit acl and regex processing, we'll still be left with a
 considerable amount of overhead from network IO under high loads.
 Most problems appear near the so-called C10K level, as Dan Kegel
 describes on his page - actually already at C2K levels.
 That page covers most of the current models, I think, but none of
 them seems perfect to me, although some are pretty close.

 It seems that in the current state the main effort goes into reducing
 the overhead of state or event notification. The funny thing is that
 almost none of the solutions actually handles the IO itself; they
 only handle either readiness or completion notification. IO seems to
 be taken for granted.
 But it's not free. Every IO syscall burns more CPU than expected, due
 to context switching, the OS initialising the action internally, and
 the process/thread scheduling checks that run along the way.

 - threads are expensive if doing little work at a time, mostly
   because of synchronisation and context-switching overhead.
 - frequent poll() is expensive if only a fraction of the fd array is
   active.
 - signals are expensive when most fds are active.
 - small data sizes are inefficient due to excessive syscalls.
 - syscalls are expensive if doing little work, mostly for reasons
   similar to threads.

 The points above rule out several IO models mentioned on Kegel's
 page, like process-per-request (obviously), thread-per-request, RT
 signals, and select/poll.
 Based on Henrik's comments it now seems that most aio implementations
 fall back to the thread-per-request model.
 Squid uses single-threaded nonblocking IO with poll notification.
 What we are trying to do is get rid of the poll overhead by trying
 other methods once eventio is mature: kqueues, devpoll, RT signals,
 etc. This will save some CPU in the main Squid thread, but the win
 won't be as big as we hope, I'm afraid. The main Squid thread is
 already loaded, and the pure IO syscalls together with the FD ioctl
 calls consume quite a lot. Eventually we'll hit a syscall rate that
 wastes most CPU cycles on context switches and cache misses.
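
 To make that cost concrete, here is a minimal sketch of the pattern
 in C (not Squid's actual comm loop): one poll() per cycle, then one
 read()/write() per ready FD, each a separate trip into the kernel.

    #include <poll.h>
    #include <unistd.h>

    /* Sketch of the pattern described above: each cycle costs one
     * poll() plus one read() per ready FD, so the syscall count
     * grows with the number of active sockets. */
    void event_loop(struct pollfd *pfds, int nfds)
    {
        char buf[4096];

        for (;;) {
            int ready = poll(pfds, nfds, 1000); /* learn readiness */
            if (ready <= 0)
                continue;                       /* timeout or error */
            for (int i = 0; i < nfds; i++) {
                if (pfds[i].revents & POLLIN) {
                    /* one more syscall per FD, for a few KB at most */
                    ssize_t n = read(pfds[i].fd, buf, sizeof(buf));
                    (void)n;   /* hand the data to the application */
                }
            }
        }
    }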

 The more I think about it, the more I see the need to consolidate
 actions. Poll is a wakeup function, and most IO models revolve around
 the wakeup overhead. There are a few very nice and effective
 solutions for that, but handling the IO itself is not covered, except
 perhaps by in-kernel servers.

 Under extremely high loads we inevitably have to account for the
 overhead of context switches between the kernel and Squid, and the
 more work we can hand to the kernel in one shot, the less noticeable
 this overhead becomes.
 Unfortunately, there seems to be no way of doing this in current
 OSes...

 Ideally, the kernel should be given a list of sockets to work on, not
 just in terms of readiness detection, but for the actual IO itself.
 Just as in devpoll, where the kernel updates the ready-FD list as
 events occur, it should be made to actually do the IO as events
 occur. Squid would provide a list of FDs, commands, timeout values
 and buffer space per FD, and enqueue the whole list. This is like
 kernel aio, but not quite.
 At the other end it would sleep in a wakeup function that returns a
 list of completed events, thus dequeueing events that have either
 completed or errored out. It would then handle all the data in
 whatever way Squid is coded to, and enqueue another list of work.
 Again, the point is returning a list of completed events, not one
 event at a time - much like poll returns possibly several ready FDs.
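
 Purely to illustrate the shape of that idea (no such interface exists
 in today's kernels; every name below is made up), the enqueue/dequeue
 pair might look roughly like this:

    /* Hypothetical interface for "hand the kernel a list of work".
     * None of these types or calls exist; they only sketch the idea. */
    #include <sys/types.h>

    typedef enum { IO_READ, IO_WRITE } io_op_t;

    struct io_request {
        int      fd;          /* socket or file descriptor */
        io_op_t  op;          /* what the kernel should do */
        void    *buf;         /* userspace buffer to fill or drain */
        size_t   len;
        int      timeout_ms;  /* per-request timeout */
        void    *tag;         /* caller's cookie, echoed on completion */
    };

    struct io_completion {
        void    *tag;         /* matches the request's cookie */
        ssize_t  result;      /* bytes moved, or -errno */
    };

    /* enqueue a whole list of work in one call ... */
    int io_submit_list(const struct io_request *reqs, int nreqs);

    /* ... and sleep until a list of completions can be returned */
    int io_wait_list(struct io_completion *done, int maxdone,
                     int timeout_ms);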

 To get rid of the wakeup (poll) overhead, we need to stop poll()ing
 like crazy; devpoll does this for us nicely. But we solve this only
 in order to be able to issue the actual IO itself. We don't really
 care about efficient polling, we care about efficient IO. And as the
 load on Squid increases, we start calling IO syscalls like crazy: one
 FD at a time, with as few bytes at a time as we happen to have at
 that moment, and never more than a few KB. We do this to keep latency
 low, although under high loads it is exactly the reason latency goes
 up.
 With aio we'd omit poll altogether, but we'd start enqueueing IO like
 crazy, and correspondingly dequeueing completed IO like crazy. More
 efficient, but only up to a point: the thread-per-aio-request
 overhead eats up the savings.
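
 POSIX aio does have one batching primitive, lio_listio(), which
 submits a whole list of requests in a single call; a rough sketch is
 below. But it only batches the submission side, and as noted above
 most implementations still service each request with a thread behind
 the scenes.

    /* Sketch: batch submission with POSIX aio's lio_listio().
     * Submission is batched; completion still arrives per request
     * (signal, notification thread, or aio_error()/aio_suspend()). */
    #include <aio.h>
    #include <string.h>

    #define NREQ 64

    int submit_batch(int *fds, char bufs[][4096], int n)
    {
        static struct aiocb cbs[NREQ];
        struct aiocb *list[NREQ];

        if (n > NREQ)
            n = NREQ;
        for (int i = 0; i < n; i++) {
            memset(&cbs[i], 0, sizeof(cbs[i]));
            cbs[i].aio_fildes = fds[i];
            cbs[i].aio_buf = bufs[i];
            cbs[i].aio_nbytes = sizeof(bufs[i]);
            cbs[i].aio_lio_opcode = LIO_READ;  /* or LIO_WRITE */
            list[i] = &cbs[i];
        }
        /* one call submits the whole list; completions are not batched */
        return lio_listio(LIO_NOWAIT, list, n, NULL);
    }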

 I dunno, many IO models come quite close to what I'm describing, but
 it seems they all stop halfway. You can either register interest in
 FD events and then handle the IO one call at a time, or
 enqueue/dequeue actions one at a time, but you can't do either in
 bulk.
 It all reminds me of a tape drive that can only transfer small chunks
 of data at a time, with no ability to stream.

 I think what is needed is a combination of kernel queues (or devpoll)
 and kaio: schedule actions in bulk and dequeue them in bulk. Together
 with an appropriate number of worker threads, any MP scalability can
 be achieved. This needs kernel support.
 I believe it is worthwhile, because only the kernel can really do the
 work asynchronously - at packet arrival time. It could even skip
 buffering packets in kernel space and append data directly to
 userspace buffers, or put data on the wire straight from userspace.
 The same goes for disk IO.
 Today this can't be done (I'm not sure about TUX). Maybe in the
 future - I think so. I just wonder if anyone is already working in
 that direction.
 So the wakeup function is not about readiness, but directly about
 popping completed work, and scheduled work is gathered into batches
 before being passed to the workhorse - the kernel. A pipeline.
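
 Kernel queues already have the right shape on the notification side:
 a single kevent() call can both register a whole changelist and
 return a whole list of ready events, roughly as below. The per-FD
 read()/write() is still left to the caller, which is the half that
 is missing.

    /* Sketch: kqueue batches notification both ways -- one kevent()
     * call registers a list of interests and returns a list of ready
     * descriptors.  The IO itself is still one syscall per FD. */
    #include <sys/types.h>
    #include <sys/event.h>
    #include <sys/time.h>
    #include <unistd.h>

    #define BATCH 256

    void kq_cycle(int kq, const int *fds, int nfds)
    {
        struct kevent changes[BATCH], events[BATCH];
        char buf[4096];

        if (nfds > BATCH)
            nfds = BATCH;
        for (int i = 0; i < nfds; i++)
            EV_SET(&changes[i], fds[i], EVFILT_READ, EV_ADD, 0, 0, NULL);

        /* register the changelist and collect ready events in one call */
        int n = kevent(kq, changes, nfds, events, BATCH, NULL);

        for (int i = 0; i < n; i++)
            read((int)events[i].ident, buf, sizeof(buf)); /* still per-FD */
    }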

 Regarding the eventio branch and the new network API: it seems to
 allow almost any IO model to be implemented behind the scenes. What
 does seem to stick is the FD-centric, one-action-at-a-time approach.
 It also seems that it could be made more general and extensible,
 possibly covering disk IO as well. And some calls assume they are
 fulfilled immediately - no async nature, no callbacks (close, for
 example) - which makes it awkward to issue a close while a read/write
 is pending.
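
 For example (a purely hypothetical signature, not the actual eventio
 call), a close that takes a completion callback would let the caller
 request the close immediately and only tear down its state once
 everything pending on the FD has drained:

    /* Hypothetical: a close in the same callback style as the rest
     * of the API; the callback fires only after pending reads/writes
     * on the FD have completed or been cancelled. */
    typedef void CLOSECB(int fd, int errcode, void *cbdata);

    void io_close(int fd, CLOSECB *callback, void *cbdata);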

 One thing that bothers me a bit is that you can't proceed before the
 FD is known. For disk IO, for example, it would help if you could
 schedule open/read/close in one shot. Some kind of abstract session
 ID could be used for that, I guess. Such triplets could then be
 scheduled onto the same worker thread, avoiding several context
 switches.
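
 As a rough illustration (every name here is hypothetical), scheduling
 against an abstract session instead of a real FD might look like
 this, with the whole chain pinned to one worker thread:

    /* Hypothetical API: queue open/read/close in one shot against an
     * abstract session id, so nothing has to wait for the real FD. */
    #include <sys/types.h>

    typedef int io_session_t;              /* abstract handle, not an FD */

    io_session_t io_session_new(void);
    void io_queue_open(io_session_t s, const char *path, int flags);
    void io_queue_read(io_session_t s, void *buf, size_t len, off_t off);
    void io_queue_close(io_session_t s);
    void io_session_submit(io_session_t s); /* whole chain to one worker */

    void example(void)
    {
        static char buf[4096];
        io_session_t s = io_session_new();

        io_queue_open(s, "/cache/somefile", 0 /* O_RDONLY */);
        io_queue_read(s, buf, sizeof(buf), 0);
        io_queue_close(s);
        io_session_submit(s);  /* one trip to a worker instead of three */
    }
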
 Also, how about a more general ioControlBlock struct that defines the
 callback, cbdata, iotype, size, offset, etc., and is expandable in
 the future?
 Hmm, it would probably even be possible to generalise the API to the
 point where you could schedule ACL checks, DNS and redirector lookups
 all through the same API. That might become useful if we want the
 main Squid thread to do nothing but act as a broker between worker
 threads. Not sure that makes sense though, just a wild thought.
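
 A rough sketch of what such a control block could contain (the field
 names are only a guess at the shape, not a proposal for the real
 API):

    /* Hypothetical ioControlBlock: one struct describing any queued
     * action, expandable without changing the calls around it. */
    #include <sys/types.h>

    struct ioControlBlock;
    typedef void IOCB_HANDLER(struct ioControlBlock *iocb, int errcode);

    struct ioControlBlock {
        int           iotype;     /* IO_OPEN, IO_READ, IO_WRITE,
                                     IO_CLOSE, maybe IO_DNS, IO_ACL ... */
        int           fd;         /* or an abstract session id */
        void         *buf;
        size_t        size;
        off_t         offset;
        int           timeout_ms;
        IOCB_HANDLER *callback;   /* completion callback */
        void         *cbdata;     /* caller's state, validated before use */
        ssize_t       result;     /* filled in on completion */
    };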

 Also, I think we should try to be more thread-safe. Having one
 compact IOCB helps here. Maybe we could even allow passing a list of
 IOCBs.
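
 Passing them as a list would then be a single, easily lockable entry
 point, again purely hypothetical:

    /* Hypothetical: one entry point taking a whole batch of IOCBs, so
     * worker threads contend for the queue once per batch, not once
     * per element. */
    int ioSubmitList(struct ioControlBlock *list[], int n);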

 ouch, I waste bandwidth again...

------------------------------------
 Andres Kroonmaa <andre@online.ee>
 CTO, Microlink Online
 Tel: 6501 731, Fax: 6501 725
 Pärnu mnt. 158, Tallinn,
 11317 Estonia