Re: Async I/O

From: Henrik Nordstrom
Date: Mon, 14 Sep 1998 06:25:54 +0200

Alex Rousskov wrote:

> Let me clarify. When multiple IOs are _queued_ aioRead/Write routines
> _detect_ this condition by scanning all pending requests(!) and return
> WOULDBLOCK to diskHandleWriteComplete(). diskHandleWriteComplete registers

Now I understand what you were referring to. This happens if the I/O
queue is too large for async_io.c to handle (more than SQUID_MAXFD
pending I/O operations in total). It is not what happens when there is
more than one outstanding I/O operation for the same FD.

Note that the "queue" in async_io.c is only a pool of free request
structures, not limited to any particular file descriptor.

If you are seeing this pool dry out then there is certainly a large
problem somewhere.
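
As an illustration only (this is not the async_io.c code), the pool idea can be sketched in C as below. The names io_req, pool_get and pool_put are made up for the example, and POOL_SIZE is a small stand-in for SQUID_MAXFD. The point is that the pool is shared by all file descriptors and that a "would block" condition only arises when every request structure is in use.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SIZE 4   /* hypothetical; stands in for SQUID_MAXFD */

/* Hypothetical request structure; a real one carries more state. */
struct io_req {
    int fd;
    int in_use;
};

static struct io_req pool[POOL_SIZE];

/* Take a free request from the shared pool, for any fd.
 * Returns NULL when the pool has run dry, which is the point where
 * the caller would have to report a "would block" condition. */
static struct io_req *pool_get(int fd)
{
    for (size_t i = 0; i < POOL_SIZE; i++) {
        if (!pool[i].in_use) {
            pool[i].in_use = 1;
            pool[i].fd = fd;
            return &pool[i];
        }
    }
    return NULL;
}

/* Return a request to the pool once its operation has completed. */
static void pool_put(struct io_req *r)
{
    assert(r != NULL && r->in_use);
    r->in_use = 0;
}
```

Note that two requests for the same fd simply take two slots; nothing in the sketch is keyed on the descriptor.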

> > And I do not believe select is currently used on async-io disk
> > operations..
> See above. Please correct me if I am wrong.

See above.

> How does it "wait" then??

By having write() calls queued in disk.c:file_write, and by never having
more than one pending read() per fd, by Squid design.

When a write() is completed, diskHandleWriteComplete is called, which
schedules the next aioWrite call. No select() here.
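
The pattern described here (writes queued per fd, at most one in flight, the completion callback starting the next one) can be sketched roughly as follows. This is a simplified model, not the disk.c code: fd_state, queue_write, write_done and start_next are hypothetical names, and start_next only counts where a real implementation would hand the buffer to the async layer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical queued write; names are illustrative, not from disk.c. */
struct wr {
    const char *buf;
    size_t len;
    struct wr *next;
};

/* Per-fd state: a FIFO of pending writes and at most one in flight. */
struct fd_state {
    struct wr *head, *tail;
    int busy;        /* 1 while a write is outstanding */
    int started;     /* counts writes handed to the I/O layer */
};

/* Stand-in for handing the head of the queue to the async layer. */
static void start_next(struct fd_state *f)
{
    if (f->busy || f->head == NULL)
        return;
    f->busy = 1;
    f->started++;
}

/* Queue a write; it is started at once only if nothing is in flight. */
static void queue_write(struct fd_state *f, struct wr *w)
{
    w->next = NULL;
    if (f->tail)
        f->tail->next = w;
    else
        f->head = w;
    f->tail = w;
    start_next(f);
}

/* Completion callback: drop the finished write, start the next one. */
static void write_done(struct fd_state *f)
{
    assert(f->busy && f->head != NULL);
    f->head = f->head->next;
    if (f->head == NULL)
        f->tail = NULL;
    f->busy = 0;
    start_next(f);
}
```

No readiness polling is involved: the only event that moves the queue forward is the completion of the previous write.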

> > We do not need something perfect. Something reasonable is enough.
> Sure. Although I would prefer "perfect" if possible :)

How do you define perfect? ;)

> OK. Separate queues are not 100% necessary, IMO, but relatively easy to
> support. Tuning will be harder though. Note that we need to limit not just
> the number of threads per disk (the easy part), but the queue length per disk
> (the cumbersome knob).

True, but IMO having separate threads (and separate queues) simplifies
matters a bit.

> > * Multiple outstanding writes are joined into one (I think this is
> > already partially done..), or handled as one large writev().
> > * Read prefetching is possible when time permits, as the thread can
> > maintain some state info about the files it maintains.
> Agree. IMO, both read/write calls should be linked to the memory buffer and
> perform according to the space/data available. This will be tricky to
> implement, but not impossible, I hope. However, note that this approach
> increases threads understanding about the rest of Squid, which could be
> objected by some.

It does not necessarily increase the need for understanding in the
threads, not more than the average OS understands about how an
application uses data sent/received by read/write anyway.

Read prefetching can as easily be initiated by the main thread as by the
I/O thread. Actually it is preferable to have it initiated by the main
thread to isolate knowledge and memory use.

I think the hardest part is to ensure that the memory is not reused
until the call is completed. The current async-io code copies all data
between the main thread and I/O threads, avoiding any locking of memory.

I do not like that async-io does all these copies, but it is a simple
implementation with minimal impact on the base code.
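
The copy-on-submit approach can be illustrated with a small sketch. It is not taken from the async-io code; copy_req, copy_req_init and copy_req_free are hypothetical names. The request owns a private copy of the data, so the main thread can reuse its buffer as soon as the request has been created, at the price of one extra memcpy() per operation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical request that owns a private copy of the caller's data. */
struct copy_req {
    char *data;
    size_t len;
};

/* Copy the caller's buffer into the request. After this returns the
 * caller may reuse or free its own buffer immediately.
 * Returns 0 on success, -1 if memory could not be allocated. */
static int copy_req_init(struct copy_req *r, const char *buf, size_t len)
{
    assert(r != NULL && buf != NULL);
    r->data = malloc(len ? len : 1);
    if (r->data == NULL)
        return -1;
    memcpy(r->data, buf, len);
    r->len = len;
    return 0;
}

/* Release the private copy once the I/O thread is done with it. */
static void copy_req_free(struct copy_req *r)
{
    free(r->data);
    r->data = NULL;
    r->len = 0;
}
```

Avoiding the copy would mean the main thread must keep its buffer untouched until completion, which is exactly the hard part mentioned above.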

> > * In order for an object to be cached on disk there must be at least
> > one available thread. If there is no thread available on the most
> > wanted cache_dir then try the next. Repeated until all available
> > cache_dirs are tried.
> Disagree. From my experiments waiting for an idle thread is absolutely
> prohibitive since it leads to severe under-utilization. All threads should
> have something queued. The max length of the queue is a tuning parameter. The
> queues should have priorities (the response_time-hit tradeoff discussed
> before).

Possibly true. Regard that as a definition of available (I didn't say
idle ;-)

In my tests I gained more from having a huge number of threads than from
queueing several operations on one thread, but this was without the
disk bypass issue, and on a platform where context switches are quite

> > * Select/poll should not be used for disk files (I don't think the
> > current async-io code uses select either)
> Agree 100%.
> Moreover, if possible we should avoid all sync-IOs when async-io is enabled.
> The current code is happy to call sync-ios if something goes a bit wrong.

Are you referring to the select() issue here, or something else?

> Not exactly sure what you mean here, but I think this is covered by
> read/write section above.

Today all swapins/swapouts are done in 8K chunks, and I still believe
that there is a gain in having larger I/O chunks for async-io.
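
One way to get larger I/O operations without changing the 8K buffers themselves is to gather several queued chunks into a single writev() call, as suggested in the earlier list. The sketch below only builds the iovec array; it is an assumption about how this could look, not existing Squid code. The names chunk, gather_chunks and MAX_IOV are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>   /* struct iovec, writev() */

#define CHUNK 8192     /* the 8K swap chunk size mentioned above */
#define MAX_IOV 16     /* hypothetical cap; must not exceed IOV_MAX */

/* Hypothetical pending chunk, linked in file order. */
struct chunk {
    char *buf;
    size_t len;
    struct chunk *next;
};

/* Fill iov[] from the queued chunks, up to max entries.
 * Returns the number of entries used; *total gets the byte count.
 * The caller would then issue one writev(fd, iov, n) instead of
 * n separate write() calls. */
static int gather_chunks(const struct chunk *c, struct iovec *iov,
                         int max, size_t *total)
{
    int n = 0;
    assert(iov != NULL && total != NULL);
    *total = 0;
    for (; c != NULL && n < max; c = c->next, n++) {
        iov[n].iov_base = c->buf;
        iov[n].iov_len = c->len;
        *total += c->len;
    }
    return n;
}
```

The buffers must of course stay valid until the writev() has completed, so this interacts with the memory-reuse question above.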

> > Effects:
> > 1. Swapouts are not initiated on saturated disks
> Yes, but we disagree on what "saturated" means.

No, we don't. Saturated is when the disk's response time is too high.
This is most easily measured in number of outstanding operations (on
threads + queued for threads).

To maintain priorities between operations you need queues. To actually
saturate the disk you only need a number of threads banging at it. The
balance between number of threads and queue size is a matter of CPU
usage and how the threads interact with the I/O queues.

> > 2. The I/O queue for one disk is limited by the number of assigned
> > threads, giving a somewhat guaranteed maximum service time.
> Yes, the wait queue should be limited to a [configurable] maximum (definitely
> not zero!).

It is never zero. Even if you do not queue requests waiting for idle
threads, the threads themselves queue I/O operations on the disk.
But I do agree that there may be performance gained from queuing
operations waiting for idle threads.

> > 6. The current "magic" (round-robin of some of the less filled disks)
> > for disk load balancing is unneeded. Simply try to fill the first
> > available disk with most available space.
> Will have to think about that. Load/space balancing is tricky..

If we get everything correct with disk queue limiting then there is
no problem if one disk takes all the load, as long as hits from it
are served within the specified limit.

If we then define when a disk is available, we get three cases:
1) There is an idle thread
2) The wait queue for the disk is not too large
3) The disk is saturated (too large a queue)

This can then be used to get a much more even distribution that extends
to all disks when load grows, regardless of space distribution. Pick, in
order of preference:
a) the disk that has the most free space and an idle thread
b) the disk that has the most free space and is not saturated
(or some weighting between (a) and (b))
c) bypass disk
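
The selection order above could be sketched as follows. This is only one possible reading of the proposal, with a strict preference of (a) over (b) and no weighting between them. The cache_disk structure, its fields and choose_disk are hypothetical and do not correspond to existing Squid structures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-cache_dir state used for the selection. */
struct cache_disk {
    long free_space;    /* available space, in any consistent unit */
    int idle_threads;   /* threads with nothing to do */
    int queue_len;      /* operations waiting for a thread */
    int queue_max;      /* wait queue limit; at or above it = saturated */
};

/* Returns the index of the chosen disk, or -1 to bypass disk (c). */
static int choose_disk(const struct cache_disk *d, size_t n)
{
    int best_idle = -1;    /* (a) most free space among idle-thread disks */
    int best_queue = -1;   /* (b) most free space among unsaturated disks */

    assert(d != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (d[i].idle_threads > 0) {
            if (best_idle < 0 ||
                d[i].free_space > d[best_idle].free_space)
                best_idle = (int)i;
        } else if (d[i].queue_len < d[i].queue_max) {
            if (best_queue < 0 ||
                d[i].free_space > d[best_queue].free_space)
                best_queue = (int)i;
        }
        /* otherwise the disk is saturated and is skipped */
    }
    return best_idle >= 0 ? best_idle : best_queue;
}
```

Under light load everything lands on the disk with the most free space; as its threads fill up, the choice moves on to the next disk, which is the spreading effect described above.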

> > * How should HITs be handled on saturated disks?

> I think this should be configurable. Some people pay $$$ for outgoing
> bandwidth, and they will kill for a hit. Some will get killed if response
> time is too slow. :)


> Overall, I think we agree on major issues. I will do more testing in the
> beginning of the week, and will be back with some concrete stuff shortly
> after.

Yes, we do agree on most things, except when we don't understand
each other or the current code.

> Look who is talking! The huge-out-of-the-blue-patch man! ;-)

That patch should be seen as a statement. I do hate the client
with pump.c.

Received on Tue Jul 29 2003 - 13:15:53 MDT