Re: Deferred reads

From: Andres Kroonmaa <andre@dont-contact.us>
Date: Sat, 3 Jun 2000 00:39:08 +0200

On 31 May 2000, at 11:13, Henrik Nordstrom <hno@hem.passagen.se> wrote:

> > Since this is part of the commloops development, I'm going to
> > start removing these and replacing them with suitable interfaces.
> > Can anyone think of a reason to keep deferred reads?
>
> Not if you have properly scheduled read/writes like we have discussed
> before a couple of times (See for example
> http://www.squid-cache.org/mail-archive/squid-dev/199911/0049.html).

 These posts made me think about why and how polling is actually useful.

 ...sorry for the long rant, I tend to think aloud... and I'm writing
 more to learn than to lecture, so if I'm spitting crap, beat me. ;)

 What does poll() really do as an OS function? My understanding is that
 all it does is find the file structure related to the FD, test whether
 the buffers are filled or empty, and return that fact. All of that is a
 fast memory-traversal task. What does read() do? A lot that is
 redundant, i.e. it has to find the file structure related to the FD,
 test whether there is any data ready or buffer space free, and either
 return EAGAIN or block, or copy data to/from the buffers and update the
 pointers (sure, it does a lot more, but it does those tests first).
 So, in my understanding, the OS read() is actually poll()+_read() anyway.
 In that case, if we find that poll() on a single FD followed by
 read()/write() makes Squid run faster than polling all open FDs
 together, we could ask why we need poll() at all. While thinking about
 it I came to a few thoughts.
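 Just to illustrate the point (a toy sketch, not Squid code): with a
 non-blocking socket, read() itself performs the readiness test that
 poll() would do, and reports "not ready" via EAGAIN:

    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>

    static void try_read(int fd)
    {
        char buf[4096];
        ssize_t n;

        /* make sure read() returns EAGAIN instead of blocking */
        fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);

        n = read(fd, buf, sizeof(buf));
        if (n > 0) {
            /* data was ready: poll() would have reported POLLIN anyway */
        } else if (n == 0) {
            /* EOF */
        } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
            /* nothing buffered: poll() would not have reported this FD */
        } else {
            /* real error */
        }
    }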

> A typical call pattern for a data forwarding operation today is
> basically:
>
> 0. poll the receiving socket for read
> 1. Read from the receiving socket
> 2. Mark the sending socket for poll
> 3. poll the sending socket for write
> 4. write to the sending socket
> 5. Mark the receiving socket for poll
> 6. poll the receiving socket for read
> [now back to 1]

 Perhaps we should loop in commSelect and optimistically launch the
 reader handlers, followed by the related writer handlers? We skip
 poll() and just try to read from the origin server; if that succeeds,
 we try to write to the client. If that succeeds too, we have a very
 fast forwarding of data; if it doesn't, we just skip to the next
 receiving FD and return to this one on the next loop run.

 Well, this could result in a hell of a fast proxy, but obviously with
 100% CPU usage at all times. On a dedicated box that could even be OK;
 if the box is not dedicated, we could run the squid process under
 "nice".
 Other than "not elegant", are there any more important reasons not to
 do so?
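 A rough sketch of one such optimistic forwarding step (made-up names,
 not the real commSelect handlers; both sockets assumed non-blocking,
 and a real implementation has to keep unwritten data buffered):

    #include <errno.h>
    #include <unistd.h>

    /* Try the io without any poll(): read from the server side and, if
     * that produced data, immediately try to push it to the client. */
    static void forward_once(int server_fd, int client_fd)
    {
        char buf[16384];
        ssize_t n, w;

        n = read(server_fd, buf, sizeof(buf));
        if (n <= 0)
            return;                /* EAGAIN, EOF or error: skip this FD,
                                      come back to it on the next loop */

        w = write(client_fd, buf, n);
        if (w < n) {
            /* client is slower (EAGAIN or short write): the leftover
             * bytes must stay buffered and this FD deferred until the
             * client side catches up */
        }
    }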

 Obviously, we burn CPU uselessly most of the time. But does such waste
 have any additional bad side effects that would hurt performance?

 For one thing, context switches skyrocket. How much overhead does this
 add in reality? Does a high context-switch rate erode usable CPU time
 measurably (with modern CPUs)?

 Then, we probably can't hope that disk io returns with EAGAIN and no
 delay. In most cases the process is blocked for the duration of the
 real disk io. This means that network io is handled only between disk
 io operations, and total squid throughput is directly tied to the
 amount of disk io (even for sessions needing no disk io). That could be
 a needless limiting factor. If we move disk io to async io with either
 threads or helper processes, squid will have more time to service
 sessions with network-only traffic.

 Still, if all consumers are slower than squid's server side, we'll
 quickly move all available data from the server-side buffers into the
 client-side buffers and end up in a situation where the server-side
 handlers can't read more data from the network but have to wait for the
 slower client side to catch up. So we'd like some sort of blocking on
 write (to buffers) internally in squid (defer) until the slower parts
 catch up. Also, if the server side is slow, there seems to be no point
 in trying to read from the socket some 1000 times only to get EAGAIN.
 So we'd like to move such slow sockets into some sort of waiting state.

 If we have no sockets ready for io, we don't want to burn CPU
 uselessly; we'd rather relinquish the CPU to other processes, yet we
 want to be notified as soon as we have work to do.
 poll() is the obvious choice for that.

 We could build the commSelect loop not around poll, but around
 optimistic io with a fallback to poll. Suppose we start servicing
 sockets with optimistic io, and if we detect EAGAIN a few times in a
 row, we add that FD to a polling list. At the start of each loop, we
 poll this list with a zero timeout to see if any FDs are coming out of
 the waiting state. Those that are ready for io we take off the polling
 list and start servicing with optimistic io again. If we have no FDs
 left for optimistic io, we can increase the poll timeout, reducing the
 impact on the CPU and leaving more time to "gather work to do".

 Basically, we should classify sockets into busy and idle, omit polling
 for sockets known to be busy, and poll the idle sockets together with
 those that were busy until lately.
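 Roughly, such bookkeeping could look like this (a sketch with made-up
 names, not Squid's real fde structures):

    #include <errno.h>
    #include <poll.h>
    #include <sys/types.h>

    #define EAGAIN_DEMOTE 3            /* consecutive EAGAINs before we demote */

    /* hypothetical per-FD bookkeeping */
    struct fde_state {
        int fd;
        int eagain_count;
        int on_poll_list;              /* 0 = optimistic list, 1 = poll list */
    };

    /* record the outcome of one optimistic read()/write() on this FD */
    static void note_io_result(struct fde_state *f, ssize_t n)
    {
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            if (++f->eagain_count >= EAGAIN_DEMOTE)
                f->on_poll_list = 1;   /* gone idle: fall back to poll() */
        } else {
            f->eagain_count = 0;       /* still busy: keep it optimistic */
        }
    }

    /* at the top of each loop run: zero-timeout poll over the idle list,
     * promoting any FD that has become ready back to optimistic io */
    static void promote_ready(struct pollfd *pfds, struct fde_state **map, int n)
    {
        int i;

        if (poll(pfds, n, 0) <= 0)
            return;
        for (i = 0; i < n; i++) {
            if (pfds[i].revents & (POLLIN | POLLOUT)) {
                map[i]->on_poll_list = 0;
                map[i]->eagain_count = 0;
            }
        }
    }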

 We'll end up with pretty much the same implementation as we have now,
 but somewhat more complex. Is there any point in doing so? Would we
 gain anything compared to the current design?

 I think that by omitting the poll of busy sockets we can reduce the
 load on the kernel, which has to check and return the status of each
 and every socket passed to poll(). That matters especially when there
 are few busy sockets and lots of idle ones. I believe a poll of a
 single FD or a small subset of FDs is faster because the kernel has to
 find and traverse far fewer file structures and critical sections to
 determine whether a socket is ready. Therefore, if we simply split the
 pollable sockets into groups and poll them in sequence, the end
 performance shouldn't differ too much from what it is now.

 1. if we know a socket is busy and probably ready for io, we want to
    avoid polling all the other idle sockets just to make sure.
 2. we should resort to poll only when we are idle, or when we think we
    last rechecked the other idle sockets too long ago.

 In other words, we should try to avoid polls that return immediately
 with very few ready sockets; we would rather have each poll return with
 many sockets ready.

 Perhaps there is even a point in adding an artificial few-msec sleep
 before polling all sockets in the current design. The reasoning would
 be to let the OS gather incoming traffic from the network and flush
 outgoing traffic from the buffers, so that more sockets are ready at
 the next poll and data moves in larger chunks. The effect should be
 somewhat similar to what we see with disks - you get better throughput
 with fewer ops on larger chunks of data.
 If I read the "Server-side network read() size histograms" in cachemgr
 right, over 60% of reads from the network are under 2KB. It seems there
 is no need for much larger socket buffers in squid. At the same time we
 know that to get decent performance the tcp stack should be able to
 accept windows of 64K or more (for satellite links much more), and a
 real squid cache would most probably be tuned for that. So we can quite
 safely assume that the tcp stack can buffer at least 32K without any
 trouble. Similarly, on the client side, squid would be tuned so that
 the tcp stack is ready to buffer up to 64K of data. So we really don't
 need to pump data a few bytes at a time; we could reduce the rate and
 increase the amount of data pumped each time.

 A quick look at tcpdump output suggests that most sessions proceed a
 few packets at a time, only 0.5-5 msec apart, before a tcp ack is
 awaited. This means we can decide either to handle every packet as soon
 as it arrives, or to "wait" a little until a bunch has arrived and then
 handle them together. The current code does it asap, meaning that we
 jump back and forth for every slightest activity on the sockets. If we
 took it calmly, we would possibly increase forwarding latency somewhat,
 but whether that is critical is questionable.
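 For reference, the kind of per-socket buffer tuning assumed above would
 look roughly like this (sizes purely illustrative, not what squid
 actually configures):

    #include <sys/types.h>
    #include <sys/socket.h>

    /* ask the tcp stack for ~64K of buffer space in each direction, so
     * data can be moved in larger, less frequent chunks */
    static void tune_socket_buffers(int fd)
    {
        int sz = 64 * 1024;

        setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &sz, sizeof(sz));
        setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sz, sizeof(sz));
    }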

 What kind of bad impact could it have if we polled all sockets no more
 frequently than, say, every 2-5 msec? Given that we'd flush up to 4-32K
 in one go, bandwidth for large objects isn't a problem. For small
 objects we might increase latency from 1 msec to 6; is that detectable
 by a human at the browser?

 btw, if we poll and service sockets at a constant rate, we could
 implement real-time traffic shaping and rate limiting. Also, deferring
 actual io would have much less impact, since we'd reevaluate the
 deferral state after a constant interval anyway.

 Say we define that we poll all sockets every 5 msec. We start the loop,
 note the sub-second time, poll all sockets with a zero timeout, service
 everything that's ready, note the time again, work out how much time is
 left before the next poll is due, and sleep for that long before
 closing the loop on the next iteration. The overhead of the loop itself
 we measure separately.
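 A rough sketch of such a fixed-interval loop (made-up names, FD
 servicing elided):

    #include <poll.h>
    #include <sys/time.h>
    #include <unistd.h>

    #define TICK_USEC (5 * 1000L)      /* poll all sockets every 5 msec */

    /* microseconds elapsed since *t0 */
    static long usec_since(const struct timeval *t0)
    {
        struct timeval now;

        gettimeofday(&now, NULL);
        return (now.tv_sec - t0->tv_sec) * 1000000L
             + (now.tv_usec - t0->tv_usec);
    }

    static void fixed_rate_loop(struct pollfd *pfds, int nfds)
    {
        for (;;) {
            struct timeval start;
            long spent;

            gettimeofday(&start, NULL);
            poll(pfds, nfds, 0);       /* zero timeout: just collect state */
            /* ... service every FD that poll() reported as ready ... */

            spent = usec_since(&start);
            if (spent < TICK_USEC)
                usleep(TICK_USEC - spent); /* sleep away the rest of the tick */
        }
    }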

 ?

------------------------------------
 Andres Kroonmaa <andre@online.ee>
 Network Development Manager
 Delfi Online
 Tel: 6501 731, Fax: 6501 708
 Pärnu mnt. 158, Tallinn,
 11317 Estonia