Re: Deferred reads

From: Andres Kroonmaa <andre@dont-contact.us>
Date: Sat, 3 Jun 2000 17:58:01 +0200

On 3 Jun 2000, at 12:18, Henrik Nordstrom <hno@hem.passagen.se> wrote:

> > In this case, if we are finding that using poll() on a single FD
> > and successive read()/write() makes Squid run faster than polling
> > all open FDs together, then we could ask why we need poll()
> > at all? While thinking about it I came to a few thoughts.
>
> The benefit of poll is that it batches together the poll part for
> multiple filedescriptors far more efficiently than calling read() on
> them individually. However, with lots of filedescriptors even poll()
> becomes a bottleneck. That is a CPU bottleneck, though, and doing a
> similar amount of work in user-space won't do you any good.

 I'm just wondering if there might be some difference depending on
 whether the CPU bottleneck happens in userland or in the kernel. If the
 kernel is written so that there are a few shared locks, then pushing
 too much work onto it may block other tasks from proceeding. In that
 sense a bottleneck in userland has less impact on the whole system.

> Didn't you see the message about comm_read/write?
 Guess not.

> My idea is that poll() should only be called for if it is known that the
> filedescriptor is "blocking".
>
> write(client) (no more data to send)
> read(server)
> write(client) (again full buffer)
> poll(client)
> write(client)
> read(server) (partial reply only..)
> write(client) (not full buffer this time)

 we'd need to address a problem with that: what if neither socket ever
 blocks? Won't this cause a burst of forwarding for a single session,
 leaving all the others to wait? True, it's supposed to be very rare,
 but if a large file is pumped from a fast server to a fast client, it
 can happen. All it takes is some load on the system that makes reads
 and writes take longer than the network needs to fill/empty the
 buffers. Then by the time we are done with the write, the read socket
 is ready for reading again. So we need to break up the
 read/write/read/write cycle to let others speak too. Currently it is
 broken up after every operation, naturally, by polling. I think we
 should settle somewhere in between. One way to do it is to limit the
 amount of work to a few read+write rounds per socket in a pass: loop
 around all sockets, and when no one is left for I/O, go poll() them
 all together.

> One simple way to make this happen is to have comm_read/write schedule
> for poll() when required, instead of having comm_poll() call
> comm_read/write.

 Yes, this is what I mean by optimistic I/O with fallback to poll.

>> Still, if all consumers are slower than squid server-side, then we'll
>> move fast all available data from server-side buffers into client-side
>> buffers and we'll end up with situation where all server-side handlers
>> can't read more data from the network but need to wait for the slower
>> client-side to catch up.

> In real life this is rarely the case. Why:
> a) Most requests are for small objects which fits fully in the TCP/IP
> window or transmit buffers.
> b) Internet connectivity to many servers is poor.

 I disagree. If you have 1000 dialup users at about 33.6K each, you'd
 need a 10-20M international link and a pretty good backbone. This moves
 the per-session bottleneck to the client side most of the time.
 Persistent connections are only increasing, meaning potentially lots
 of traffic via a single client socket. The actual servers can be
 pretty numerous, so objects being small doesn't mean much any more.

> > list and start servicing with optimistic io. If we have no FDs left for
> > optimistic io, we can increase poll timeout, thus reducing impact on
> > CPU and leaving more time to "gather work to do".
>
> That is the whole idea this discussion circles around.
>
> How to efficiently detect if poll is required: The previous operation
> returned partial data or EAGAIN/EWOULDBLOCK.

 Sure. I'm just trying to look further. Of course poll is required when
 a socket gets EWOULDBLOCK. My guess is that we waste more CPU and gain
 little performance if we immediately poll all blocked sockets: at
 least one would probably be ready in under 1 msec, yet we ask the
 kernel to check all of them. CPU is wasted checking all of them,
 although only a small fraction gets ready. We service the ready ones
 and poll again. In the end we are polling every 1 msec with, say, 1-3
 sockets ready. But we could poll every 5 msec and get 10-15 sockets
 ready instead. For that we can insert some sleep time before the
 actual poll(). IMHO we can this way reduce CPU usage without really
 hurting per-session performance.

> > Somewhat similar effect should be probably seen as with disks -
> > you get better throughput if using less ops with larger chunks
> > of data.
>
> True. However here it is quite likely more important to optimize the
> sizes of read() operations to keep a nice saw-tooth pattern in the TCP
> window sizes when congested.

 Not sure what you mean. As I understand it, TCP is most efficient
 when receive buffers are empty and transmit buffers are full.
 I'd make the read size as large as possible. I don't think socket
 buffers take any considerable memory, so I'd increase them if
 that helps.

> > every slightest activity on sockets. If we'd take it calm, we
> > possibly could increase forward latency somewhat, but whether
> > this is critical is questionable.
>
> Partly agreed. Latency is an important issue. However, sending lots of
> small packets to a congested link won't help latency, nor will delaying
> transmission of small packets on a unused link.

 I agree that large latency is an issue, but a small one? Look at this
 as gathering: if we don't get additional traffic within a few msec, we
 give up and send what's gathered so far. If we do get more traffic in
 that time, we send more in one shot and feel efficient ;)

> > btw, if we poll and service sockets at constant rate, we could
> > implement real-time traffic-shaping and rate limiting.
>
> Not sure I quite follow you there. In what way cannot this be done
> without constant rate poll()?

 Not sure either ;) Perhaps it's because I don't understand how it's
 done right now; I'd be thankful if someone described it a bit.
 With a constant-rate poll it's just easier? If we limit a session
 to, say, 32 kbit/s, we have 4KB per second, no more. I guess it's no
 good for TCP if we let through, say, 3 full-sized packets and then
 defer the transmission for 1+ seconds. It's also no good if we send
 1.5KB in one pass, then after 0.5 msec find that we can send 2 more
 bytes, then after 30 msec find that we can send 120 bytes, then
 after 500 msec another 2000 bytes.
 See, my understanding is that without a constant-rate poll (service)
 we end up with either randomly sized gaps between packets or
 randomly sized data packets, both of which I think are bad for TCP
 efficiency. We want constant-sized packets at a constant rate.
 Of course, we can arrange for that without a constant-rate poll too,
 but it would be less straightforward, IMHO.

> > Also, defering of actual io would have much less impact as
> > we'd need to reevaluate deferral state after constant
> > interval in the future.
>
> Deferral based on buffer overflows does not need to be reevaluated. It
> will be "immediately" known when the situation has ended.

 Yes, but deferral based on rate limiting is tied to time, and that
 will need to be reevaluated.

> > Say we define that we poll all sockets every 5 mSec. We startup
> > loop, note subsec time and poll all sockets with zero timeout,
> > service all thats ready, note subsec time, find amount of time
> > left before next poll should be done, and sleep for that time,
> then close the loop on the next iteration. The timeout of the loop
> > itself we measure separately.
>
> As the amount of I/O builds up the processing time will soon approach
> the poll interval.

 This means that the delay-before-poll time reaches zero and Squid is
 saturated. I'd stop accepting more requests in that case, for example.

------------------------------------
 Andres Kroonmaa <andre@online.ee>
 Network Development Manager
 Delfi Online
 Tel: 6501 731, Fax: 6501 708
 Pärnu mnt. 158, Tallinn,
 11317 Estonia
Received on Sat Jun 03 2000 - 09:59:55 MDT
