Re: Deferred reads

From: Henrik Nordstrom <hno@dont-contact.us>
Date: Sat, 03 Jun 2000 12:18:30 +0200

Andres Kroonmaa wrote:

> So, in my understanding OS read() is actually poll()+_read() anyway.
> In this case, if we find that using poll() on a single FD followed by
> read()/write() makes Squid run faster than polling all open FD's
> together, then we could ask why we need poll() at all. While thinking
> about it I came to a few thoughts.

The benefit of poll() is that it batches the readiness check for
multiple file descriptors far more efficiently than calling read() on
each of them individually. However, with lots of file descriptors even
poll() becomes a bottleneck. That is a CPU bottleneck though, and doing
a similar amount of work in user space won't do you any good.
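
To illustrate the difference (just a sketch; fds[], nfds and handle_io()
are made-up names, not Squid code):

    #include <errno.h>
    #include <poll.h>
    #include <unistd.h>

    /* Hypothetical helpers, only for the sketch. */
    extern int fds[];              /* open non-blocking sockets */
    extern int nfds;
    extern void handle_io(int fd);

    void check_by_reading(void)    /* one syscall per idle socket */
    {
        char buf[4096];
        int i;
        for (i = 0; i < nfds; i++) {
            ssize_t n = read(fds[i], buf, sizeof(buf));
            if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
                continue;          /* idle socket: that read() was wasted */
            handle_io(fds[i]);
        }
    }

    void check_by_polling(void)    /* one syscall covers all of them */
    {
        struct pollfd pfds[1024];
        int i;
        for (i = 0; i < nfds; i++) {
            pfds[i].fd = fds[i];
            pfds[i].events = POLLIN;
        }
        if (poll(pfds, nfds, 0) > 0)
            for (i = 0; i < nfds; i++)
                if (pfds[i].revents & POLLIN)
                    handle_io(pfds[i].fd);
    }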

> Probably we should loop in commSelect and optimistically launch
> reader handlers, followed by related writer handlers? We skip poll,
> we just try to read from the origin server, and if that succeeds,
> we try to write to a client. If that succeeds also, we have a very
> fast forward of data, if it does not succeed, we just skip to the
> next receiving FD and return to that one in the next loop run.

Didn't you see the message about comm_read/write?

My idea is that poll() should only be called when it is known that the
file descriptor is "blocking".

Basically the thread I want is:

poll(accept) for client connect
accept()
poll(client) for request data
read()
connect(server)
poll(server) for connection
write(server)
poll(server) for response
read(server)
write(client)
poll(server) for more data
read(server)
write(client) (buffer full, blocking)
poll(client)
write(client) (no more data to send)
read(server)
write(client) (again full buffer)
poll(client)
write(client)
read(server) (partial reply only..)
write(client) (not full buffer this time)
poll(server)
read(server)
write(client)

Each poll() statement here is a comm_poll loop roundtrip where multiple
sockets are checked, not only this one. What is important is that the
client and server are rarely scheduled for poll() at the same time, and
that there is no poll() in between the two.

One simple way to make this happen is to have comm_read/write schedule
for poll() when required, instead of having comm_poll() call
comm_read/write.
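
Roughly like this (a sketch only; io_handler, schedule_for_poll() and
FOR_READ are made-up names, not the real comm_* API):

    #include <errno.h>
    #include <unistd.h>

    typedef void io_handler(int fd, char *buf, ssize_t len, void *data);
    extern void schedule_for_poll(int fd, int dir, io_handler *h, void *data);
    #define FOR_READ 1

    static void
    comm_read_optimistic(int fd, char *buf, int size, io_handler *handler,
        void *data)
    {
        ssize_t n = read(fd, buf, size);

        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            /* The descriptor is "blocking": only now is it handed to
             * the comm_poll() loop, so busy sockets never wait for a
             * poll() roundtrip and idle ones are polled in one batch. */
            schedule_for_poll(fd, FOR_READ, handler, data);
            return;
        }
        /* Data, EOF or a real error: call the handler immediately. */
        handler(fd, buf, n, data);
    }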

> Well, this could result in a hell of a fast proxy, but obviously with
> 100% cpu usage at all times. If that is a dedicated box, then this
> could even be ok. If it is not dedicated, we could use "nice" for the
> squid process.

I am afraid the kernel might require some available CPU time for various
background tasks like TCP/IP processing. On many OSes, system time
blocks much of the other kernel processing.

> Obviously, we burn CPU uselessly for most of the time. But does such
> waste have any additional bad side effects that would influence
> performance?

It might.

> For one thing, context switches skyrocket. How much overhead does
> this take in reality? Is a high context-switch rate eroding usable CPU
> time measurably (with modern CPUs)?

Context switching eats system time. See above.

> Then, we probably can't hope that disk io returns with EAGAIN and
> no delay.

True. Few if any implement non-blocking disk I/O. That is why we have
async-io and diskd these days.

> In most cases the process is blocked for the duration of the
> real disk io. This means that network io is handled only between
> disk io operations, and total squid throughput is directly related to
> the amount of disk io (even for sessions not needing disk io).

And with async-io or diskd the disk I/O depends on getting free CPU
time to do its processing, putting you in an odd situation that is hard
to balance.

> Still, if all consumers are slower than squid's server side, then we'll
> quickly move all available data from server-side buffers into client-side
> buffers and we'll end up with a situation where all server-side handlers
> can't read more data from the network but need to wait for the slower
> client side to catch up.

In real life this is rarely the case. Why:
a) Most requests are for small objects which fit fully in the TCP/IP
window or transmit buffers.
b) Internet connectivity to many servers is poor.

> So, we'd like to add some sort of blocking on
> write (to buffers) internally in squid (defer) until the slower parts
> catch up. Also, if the server side is slow, there seems no point in
> trying to read from the socket some 1000 times only to get EAGAIN. So,
> we'd like to move such slow sockets to some sort of waiting state.

True. And that is usually the case.

> If we have no sockets ready for io, we don't want to burn CPU uselessly;
> we'd rather relinquish the cpu to other processes, yet we want to be
> notified as soon as we have work to do.
> For that, poll() is quite an obvious choice.

Yes. And quite good at it.

> We could build the commSelect loop not around poll, but around optimistic
> io with a fallback to poll. Suppose we start servicing sockets with
> optimistic io, and if we detect EAGAIN a few times in a row, we add
> the FD to a polling list. At the start of each loop, we poll this list
> with zero timeout to see if there are any FD's coming out of the waiting
> state. Those that are ready for io we can take out of this polling
> list and start servicing with optimistic io again. If we have no FDs
> left for optimistic io, we can increase the poll timeout, thus reducing
> the impact on CPU and leaving more time to "gather work to do".

That is the whole idea this discussion circles around.

How to efficiently detect if poll is required: The previous operation
returned partial data or EAGAIN/EWOULDBLOCK.
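
In code that test is tiny (sketch):

    #include <errno.h>
    #include <sys/types.h>

    /* After a read() or write() of up to "wanted" bytes returned n:
     * does this fd have to go back onto the poll() list? */
    static int
    needs_poll(ssize_t n, size_t wanted)
    {
        if (n < 0)
            return errno == EAGAIN || errno == EWOULDBLOCK;
        return (size_t) n < wanted;   /* partial: kernel buffer drained/full */
    }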

> Basically, we should classify sockets into busy and idle sockets, omit
> polling for known busy sockets, and poll idle sockets together with the
> (lately) busy ones.

I prefer to look at it like a state machine with different kinds of
transitions, where poll() is one type of transition.
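
Something like this, where a poll() readiness notification is just one
of the possible transitions between per-fd states (names are only
illustrative):

    enum fd_state {
        FD_IDLE,        /* nothing pending on this fd */
        FD_READY,       /* optimistic read()/write() allowed, no poll() */
        FD_POLLING,     /* waiting for poll() to report readiness */
        FD_DEFERRED     /* blocked on the other side, e.g. full client buffer */
    };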

> We'll end up with pretty much the same implementation as we have now,
> but somewhat more complex. Is there any point in doing so? Would we
> gain anything compared to the current design?

Not sure the design needs to be much more complex. Different, sure, but
not more complex.

CPU is a bottleneck. By limiting the amount of CPU we burn away doing
nothing, we raise the limit on the number of requests per second we can
process with a given amount of CPU power.

> In other words, we should try to avoid polls that result in an
> immediate return with very few ready sockets. We would like to have
> poll return with many sockets ready.

;-)

> Perhaps there is even a point in adding an artificial few-mSec sleep
> before the polling of all sockets in the current design. The reasoning
> would be to allow the OS to gather incoming traffic from the network
> and to flush outgoing traffic from buffers, thus allowing more
> sockets to be ready at the next poll and moving data in larger chunks.
> A somewhat similar effect should probably be seen as with disks -
> you get better throughput if using fewer ops with larger chunks
> of data.

True. However here it is quite likely more important to optimize the
sizes of read() operations to keep a nice saw-tooth pattern in the TCP
window sizes when congested.

> If I understand the "Server-side network read() size histograms"
> in cachemgr right, then over 60% of reads from the network are under 2Kb.
> Seems that there is no need to have much larger socket buffers
> in squid. At the same time we know that to get decent performance,
> the tcp stack should be able to accept windows of 64K or more (for SAT
> links much more), and most probably a real squid cache would be
> tuned to that. So we can assume quite safely that the tcp stack
> would be able to buffer at least 32K without any trouble.
> Similarly on the client side, squid would be tuned to be ready to
> buffer up to 64K of data in the tcp stack. So we really don't need
> to pump data a few bytes at a time, we could reduce the rate and
> increase the amount of data pumped at a time.

True. The trick is to know that there is enough data in the client
buffers to fill the wire. If this is known then we can safely back off
a little and let the OS perform all the work for a while. This will only
happen for "large" downloads to slow clients.
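
(For reference, the per-socket buffering the quoted paragraph relies on
is what SO_SNDBUF/SO_RCVBUF control; sketch below, with the 64K figure
taken straight from the quote:)

    #include <sys/socket.h>

    /* Ask the TCP stack to buffer up to 64K per direction. */
    static void
    tune_socket_buffers(int fd)
    {
        int size = 64 * 1024;
        setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &size, sizeof(size));
        setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &size, sizeof(size));
    }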

> A simple look at tcpdump output suggests that most sessions occur
> a few packets at a time, only 0.5-5 mSec apart, before a
> tcp-ack is awaited. This means that we can decide to either
> handle every packet as soon as it arrives, or "wait" a little
> until a bunch gets here and then handle them together. The current
> code seems to do it asap, meaning that we jump back and forth for
> every slightest activity on the sockets. If we took it calmly, we
> could possibly increase forward latency somewhat, but whether
> this is critical is questionable.

Partly agreed. Latency is an important issue. However, sending lots of
small packets to a congested link won't help latency, nor will delaying
transmission of small packets on an unused link.

> What kind of bad impact could this have if we poll all sockets no
> more frequently than, say, every 2-5 mSec? Given that we'd flush
> up to 4-32K in a go, bandwidth for large objects isn't a problem.
> For small objects, we could increase latency from 1 mSec to 6;
> is that detectable by a human at the browser?

Don't know.

> btw, if we poll and service sockets at constant rate, we could
> implement real-time traffic-shaping and rate limiting.

Not sure I quite follow you there. In what way can this not be done
without a constant rate poll()?

> Also, deferring of actual io would have much less impact, as we'd
> reevaluate the deferral state after a constant interval in the
> future.

Deferral based on buffer overflows does not need to be reevaluated. It
will be "immediately" known when the situation has ended.

> Say we define that we poll all sockets every 5 mSec. We start the
> loop, note the subsec time and poll all sockets with zero timeout,
> service all that's ready, note the subsec time again, find the amount
> of time left before the next poll should be done, and sleep for that
> time, then close the loop on the next iteration. The timeout of the
> loop itself we measure separately.

As the amount of I/O builds up the processing time will soon approach
the poll interval.
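
For completeness, the loop described above would look roughly like this
(sketch only; service_ready() is made up and the 5 mSec figure is taken
from the quote):

    #include <poll.h>
    #include <sys/time.h>
    #include <unistd.h>

    extern struct pollfd pfds[];
    extern int nfds;
    extern void service_ready(struct pollfd *pfds, int nfds);

    static void
    constant_rate_loop(void)
    {
        const long interval_usec = 5000;        /* 5 mSec per iteration */

        for (;;) {
            struct timeval start, now;
            long elapsed_usec;

            gettimeofday(&start, NULL);
            if (poll(pfds, nfds, 0) > 0)        /* zero timeout snapshot */
                service_ready(pfds, nfds);
            gettimeofday(&now, NULL);

            elapsed_usec = (now.tv_sec - start.tv_sec) * 1000000L
                         + (now.tv_usec - start.tv_usec);
            if (elapsed_usec < interval_usec)   /* sleep the rest away */
                usleep(interval_usec - elapsed_usec);
        }
    }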

/Henrik