Re: eventio and thoughts

From: Henrik Nordstrom <hno@dont-contact.us>
Date: Sat, 15 Sep 2001 01:41:00 +0200

Andres Kroonmaa wrote:

> What any high-performance "C10k" io-model eventually tries to do is
> reduce overhead that is useless for the final goal - the throughput.

Also latency. The two issues (throughput and latency) are related but
not identical.

> On that page I think most of current models are covered, but none of
> them seems perfect to me, although some are pretty close.

None of the existing network I/O models are perfect, and as I said it is
surprising to see the amount of new imperfect models being proposed all
the time, often based on inconclusive results.. A way too common comment
in the research is along the lines of "the other model does not perform
very well. We do not understand exactly why it performs badly but ours
performs better in the tests".

> It seems that in the current state the main effort is put into resolving
> state or event notification overhead. The funny thing is that almost none
> of the solutions actually handles IO itself, only readiness or
> completion notification. IO seems to be taken for granted.

Not taken for granted, but it is a much more complex issue than
notifications. There are a couple of attempts at I/O optimization (see
zero-copy, sendfile() and other threads), but it is very hard to find a
generally suitable I/O model with low overhead... especially considering
that applications usually need to actually act on the data, not only
read/write/forward it.
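
To illustrate the zero-copy point: something along these lines (a rough
sketch only, assuming the Linux sendfile() signature; the function name
and the minimal error handling are mine) pushes file data to a socket
without it ever passing through a userspace buffer - which is also why
it only helps when the application does not need to look at the data:

/* Sketch: the kernel copies file data straight to the socket. */
#include <stdio.h>
#include <sys/sendfile.h>
#include <sys/types.h>

static int send_file_to_socket(int sock_fd, int file_fd, off_t size)
{
    off_t offset = 0;

    while (offset < size) {
        ssize_t n = sendfile(sock_fd, file_fd, &offset,
                             (size_t) (size - offset));
        if (n <= 0) {
            perror("sendfile");     /* real code would handle EAGAIN/EINTR */
            return -1;
        }
        /* sendfile() advances 'offset' by the number of bytes it sent */
    }
    return 0;
}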

> - threads are expensive if doing little work at a time. Mostly because
> of synchronisation and context switching overhead.

true.

synchronisation is bad, and should be avoided where possible.

thread switching is also bad, and should be avoided.

End result: Use threads for horsepower scaling, not I/O design.

> - frequent poll() is expensive if only a fraction of the fd array is active.

true.

> - signals are expensive when most fds are active.

well.. not entirely sure on this one. Linux RT signals are a very
lightweight notification model; it is only the implementation that sucks
due to worthless notification storms.. (if you have already received a
notification that there is data available for reading, there is
absolutely no value in receiving yet another notification when more data
arrives before you have acted on the first.. and similarly for writing)
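
For reference, the model looks roughly like this (a sketch assuming the
Linux F_SETSIG interface; handle_fd() is an invented placeholder for
whatever the event loop does with the notification):

#define _GNU_SOURCE             /* F_SETSIG and si_fd are Linux extensions */
#include <fcntl.h>
#include <signal.h>
#include <unistd.h>

static void handle_fd(int fd, long band) { (void) fd; (void) band; }

static void arm_rtsig(int fd, int signo)
{
    fcntl(fd, F_SETOWN, getpid());      /* deliver the signal to us          */
    fcntl(fd, F_SETSIG, signo);         /* use a queued RT signal, not SIGIO */
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_ASYNC | O_NONBLOCK);
}

static void event_loop(int signo)
{
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, signo);
    sigprocmask(SIG_BLOCK, &set, NULL); /* collect the signals synchronously */

    for (;;) {
        siginfo_t info;
        if (sigwaitinfo(&set, &info) < 0)
            continue;
        /* si_fd says which descriptor, si_band what happened.  This is
         * where the storm hits: every arriving packet queues yet another
         * signal even though we already know the fd is readable. */
        handle_fd(info.si_fd, info.si_band);
    }
}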

> - small data sizes are inefficient due to excessive syscalls

And more: excessive network processing and in some cases even traffic.

> - syscalls are expensive if doing little work, mostly for similar
> reasons as threads.

depends on platform and syscall, but generally true. The actual syscall
overhead is, however, often overestimated in discussions like this. A
typical syscall consists of:
  * light context switch
  * argument verification
  * data copying
  * processing

What you can optimize by aggregation is the light context switches. The
rest will still be there.
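
writev() is the classic example: the aggregation saves the extra light
context switches, but the verification, copying and processing still
happen for every buffer (the function and its arguments below are only
illustrative):

#include <sys/types.h>
#include <sys/uio.h>

/* Send header, body and trailer in one syscall instead of three. */
static ssize_t send_reply(int fd,
                          const char *hdr, size_t hdr_len,
                          const char *body, size_t body_len,
                          const char *trailer, size_t trailer_len)
{
    struct iovec iov[3];

    iov[0].iov_base = (void *) hdr;     iov[0].iov_len = hdr_len;
    iov[1].iov_base = (void *) body;    iov[1].iov_len = body_len;
    iov[2].iov_base = (void *) trailer; iov[2].iov_len = trailer_len;

    return writev(fd, iov, 3);          /* may still be a short write */
}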

> Squid is using single-threaded nonblocking IO with poll notification.
> What we are trying to do is get rid of poll overhead by trying other
> methods after eventio is mature: kqueues, devpoll, rt-signals, etc.

correct, but we are also trying to get rid of many other old stupidities
like:
a) A large amount of copying
b) Excessive use of I/O notifications even when it can be assumed we
already know the state or don't actually need to know it

> This will save some CPU in the main squid thread, but the win won't be as
> high as we hope, I'm afraid. The Squid main thread is already loaded, and
> all pure IO syscalls together with FD ioctl calls consume quite a lot.

true, and that is why there is the other long-term goal of splitting
Squid into several execution units capable of efficiently sharing the
same backend cache, to allow more horsepower to be easily added to the
equation.

> Eventually we'll strike syscall rate that wastes most CPU cycles in
> context-switches and cache-misses.

The I/O syscall overhead should stay fairly linear with the I/O request
rate, I think. I don't see how context switches and cache misses can
increase a lot only because the rate increases. It is still the same
amount of code running in the same number of execution units.

> Under extremely high loads we inevitably have to account for overhead of
> context switches between kernel and squid, and the more work we can give
> to kernel in a shot the less this overhead is notable.
> Unfortunately, there seems to be no method for doing this in current OSes...

Note: Networking is different from disk I/O in the sense that most of
the network processing is done by the kernel independently of the
application. This is what allows non-blocking I/O to be used for
networking. The penalty paid for this is that data must be copied
between userspace and kernelspace all the time.
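
In other words, by the time a non-blocking read() succeeds the kernel
has already done the TCP work on its own; the syscall is little more
than the kernelspace-to-userspace copy. Roughly (an illustrative helper,
not eventio code; process_data() is a made-up placeholder):

#include <errno.h>
#include <unistd.h>

static void process_data(const char *buf, size_t len) { (void) buf; (void) len; }

/* Drain whatever the kernel has already buffered for this socket.
 * EAGAIN means the buffer is empty; the kernel keeps receiving packets
 * while we go back to the event loop and do other work. */
static void drain_socket(int fd)
{
    char buf[4096];

    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n > 0) {
            process_data(buf, (size_t) n);
            continue;
        }
        if (n < 0 && errno == EINTR)
            continue;                   /* interrupted, just retry */
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            break;                      /* nothing buffered yet */
        break;                          /* EOF or real error */
    }
}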

> Ideally, kernel should be given a list of sockets to work on, not in just
> terms of readiness detection, but actual IO itself. Just like in devpoll,
> where kernel updates ready fd list as events occur, it should be made to
> actually do the IO as events occur. Squid should provide a list of FDs,
> commands, timeout values and bufferspace per FD and enqueue the list.
> This is like kernel-aio, but not quite.

Sounds very much like LIO. The main theoretical problem is what kind of
notification mechanism to use for good latency.
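
Something like the POSIX lio_listio() interface, for example (a sketch
only: the descriptor/buffer arrays are invented, and the NULL sigevent
argument is exactly where the open notification/latency question sits):

#include <aio.h>
#include <stddef.h>

#define BATCH 4

/* The control blocks must stay allocated until the requests complete. */
static struct aiocb cbs[BATCH];
static struct aiocb *list[BATCH];

/* Queue one read per descriptor in a single call. */
static int submit_batch(const int fds[BATCH], char *bufs[BATCH], size_t bufsz)
{
    int i;

    for (i = 0; i < BATCH; i++) {
        cbs[i].aio_fildes = fds[i];
        cbs[i].aio_buf = bufs[i];
        cbs[i].aio_nbytes = bufsz;
        cbs[i].aio_offset = 0;
        cbs[i].aio_lio_opcode = LIO_READ;
        list[i] = &cbs[i];
    }

    /* LIO_NOWAIT queues everything and returns at once; the completion
     * notification would go in the last (sigevent) argument. */
    return lio_listio(LIO_NOWAIT, list, BATCH, NULL);
}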

> From other end sleep in a wakeup function which returns a list of completed
> events, thus dequeueing events that either complete or error. Then handle
> all data in a manner coded into Squid, and enqueue another list of work.
> Again, point is on returning list of completed events, not 1 event at a
> time. Much like poll returns possibly several ready FD's.

I am not sure I get this part.. are you talking about I/O or only
notifications?

> To get rid of wakeup (poll) overhead, we need to stop poll()ing like crazy.

Agreed.

> devpoll does this for us nicely. But we solve this only to be able to issue
> actual IO itself. Actually we don't really care about efficient polling,
> we care about efficient IO. And as load on Squid increases, we start
> calling IO syscalls like crazy. One FD at a time, as few bytes at a time
> as we have at the moment, but no more than a few KB's. We do this
> to keep latency low, although under high loads this is the exact reason why
> latency goes up.

Now you remind me of one more thing the eventio branch changes from the
current Squid I/O model. It eliminates the "poll" unless absolutely
needed. If we have data to send it is sent, unless the socket buffer is
known to be full. Similarly for reads.
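
Roughly (register_write_interest() is an invented stand-in for whatever
readiness registration the notification layer provides):

#include <errno.h>
#include <unistd.h>

static void register_write_interest(int fd) { (void) fd; }

/* Try the write immediately; fall back to a writability notification
 * only if the socket buffer turns out to be full. */
static ssize_t optimistic_write(int fd, const char *buf, size_t len)
{
    ssize_t n = write(fd, buf, len);

    if (n >= 0) {
        if ((size_t) n < len)
            register_write_interest(fd);    /* short write: buffer is full */
        return n;
    }
    if (errno == EAGAIN || errno == EWOULDBLOCK) {
        register_write_interest(fd);        /* only now do we need "poll" */
        return 0;
    }
    return -1;                              /* real error */
}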

> I believe this is useful, because only kernel can really do work in
> async manner - at packet arrival time. It could even skip buffering
> packets in kernel space, but append data directly to userspace buffs,
> or put data on wire from userspace. Same for disk io.

Apart from the direct userspace copy, this is already what modern
kernels do for networking..

> Today this can't be done (not sure of TUX). Maybe in future. I think so.
> I just wonder if anyone is working in that direction already.
> So, wakeup function is not about readiness, but directly about popping
> up completed work. And scheduled work gathered in bulk before passing
> to the horse - the kernel. Pipeline.

> In regards to eventio branch, new network API, seems it allows to
> implement almost any io model behind the scenes. What seems to stick
> is FD-centric and one-action-at-a-time approach. Also it seems that
> it could be made more general and expandable, possibly covering also
> disk io. Also, some calls assume that they are fulfilled immediately,
> no async nature, no callbacks (close, for example). This makes it awkward
> to issue close while read/write is pending.

eventio may well be applied to disk I/O for some models, but does not
talk about it at this time due to the vast implementation differences in
today's OSes. How to implement efficient disk I/O is very different from
efficient network I/O in today's kernels.

Regarding the eventio close call: This does not close, it only signals
EOF. You can enqueue N writes, then close to signal that you are done.
And there is a callback, registered when the filehandle is created.
Serialization is guaranteed.

> One thing that bothers me a bit is that you can't proceed before the FD
> is known. For disk io, for example, it would help if you could schedule
> open/read/close in one shot. For that some kind of abstract session
> ID could be used I guess. Then such triplets could be scheduled to
> the same worker-thread avoiding several context-switches.

The eventio does not actually care about the Unix FD. The exact same API
can be used just fine with asynchronous file opens or even aggregated
low-level functions if you like (well.. aggregation of close may be a bit
hard unless there is a pending I/O queue)

> Also, how about some more general ioControlBlock struct, that defines
> all the callback, cbdata, iotype, size, offset, etc... And is possibly
> expandable in future.

???

> Hmm, probably it would be even possible to generalise the api to such
> extent, that you could schedule acl-checks, dns, redirect lookups all
> via same api. Might become useful if we want main squid thread to do
> nothing else but be a broker between worker-threads. Not sure if that
> makes sense though, just a wild thought.

Define threads in this context.

> Also, I think we should think about trying to be more threadsafe.
> Having one compact IOCB helps here. Maybe even allowing to pass a
> list of IOCB's.

See previous discussions about threading. My view is that threading is
good to a certain extent, but locality should be kept strong. The same
filehandle should not be touched by more than one thread (with the
exception of accept()).

The main goal of threading is scalability on SMP.

The same goal can be achieved by a multi-process design, which also
scales on asymmetric architectures, but for this we need some form of
low-overhead shared object broker (mainly for the disk cache).

--
Henrik