Re: Async I/O

From: Alex Rousskov <rousskov@dont-contact.us>
Date: Sun, 13 Sep 1998 19:55:27 -0600 (MDT)

On Mon, 14 Sep 1998, Henrik Nordstrom wrote:

> Alex Rousskov wrote:
>
> > Right, when multiple IOs are detected on a FD, the async-code
> > panics and starts registering that FD with select, effectively
> > "waiting" for a thread to process the IO that was queued first.
> > Expensive and breaks the async/sync code separation, IMO.
>
> When does this happen? (multiple outstanding I/O operations on
> one FD). Writes are queued, and there should never be more than
> one read operation on one fd, and no read and writes on the same
> fd.

Let me clarify. When multiple IOs are _queued_, the aioRead/Write routines
_detect_ this condition by scanning all pending requests(!) and return
WOULDBLOCK to diskHandleWriteComplete(). diskHandleWriteComplete then registers
the FD with select to try again later. Of course, I could have missed some
static variables and other nice side effects :); I apologize if I did.

(If we did not register with select(), then how could we retry the IO later?
And we have to retry, because aioRead/Write did not let the request through.)

Thus, actual multiple IOs do not happen; but when they are attempted, they
get resolved by falling back on select().
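
To make the path I am describing concrete, here is a rough sketch of the
pattern as I read it. The names (aio_write_attempt, disk_write_complete,
register_with_select, and the two externs) are simplified stand-ins I made up
for this mail, not the real aioWrite/diskHandleWriteComplete/commSetSelect
chain, so treat it as an illustration only:

    /* Sketch only: simplified stand-ins for the async-io fallback path. */
    #include <errno.h>

    typedef void RetryHandler(int fd, void *data);

    extern int  pending_request_on_fd(int fd);   /* scans the pending-request list */
    extern void queue_async_write(int fd, const char *buf, int len);
    extern void register_with_select(int fd, RetryHandler * cb, void *data);

    /* aioWrite-style entry point: refuses a second outstanding IO on one fd. */
    static int
    aio_write_attempt(int fd, const char *buf, int len)
    {
        if (pending_request_on_fd(fd))  /* another IO already queued on this fd */
            return EWOULDBLOCK;         /* caller has to retry later */
        queue_async_write(fd, buf, len);        /* hand it to a worker thread */
        return 0;
    }

    /* diskHandleWriteComplete-style caller: on WOULDBLOCK, falls back to select(). */
    static void
    disk_write_complete(int fd, const char *buf, int len, RetryHandler * retry)
    {
        if (aio_write_attempt(fd, buf, len) == EWOULDBLOCK)
            register_with_select(fd, retry, NULL);  /* "wait" via the select loop */
    }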

> As we all know, using select() on disk files is mostly useless.
> On most (all?) platforms it returns ready on every call...

Agree.

> And I do not believe select is currently used on async-io disk
> operations..

See above. Please correct me if I am wrong.
 
> Yes, the code waits for the previous operation to finish, but
> not by registering the fd with select (if it does then the code
> is/has been broken).

How does it "wait" then??

> We do not need something perfect. Something reasonable is enough.

Sure. Although I would prefer "perfect" if possible :)
 
> Suggestion for both a reasonable load balancing and bypass on high
> load:
>
> * A number of threads assigned to each cache_dir, set by the
> cache_dir directive as different disks have different speeds..
> This tuning capability is especially needed when normal disks and
> RAID disks are combined.

OK. Separate queues are not 100% necessary, IMO, but relatively easy to
support. Tuning will be harder, though. Note that we need to limit not just
the number of threads per disk (the easy part), but also the queue length per
disk (the cumbersome knob).
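
Something like the following per-cache_dir bookkeeping is what I have in mind;
the struct, the field names, and the hypothetical threads=/queue= cache_dir
options are all made up for this mail, nothing of it exists in the code:

    /* Hypothetical per-cache_dir limits, e.g. filled in from an (imaginary)
     * "cache_dir /cache1 1000 16 256 threads=8 queue=16" directive. */
    typedef struct _DiskDir {
        int n_threads;          /* worker threads assigned to this cache_dir */
        int max_queue;          /* per-disk queue limit: the cumbersome knob */
        int queued;             /* requests currently waiting for a worker */
    } DiskDir;

    /* Nonzero if this cache_dir can accept another request right now. */
    static int
    dir_can_accept(const DiskDir * d)
    {
        return d->queued < d->max_queue;
    }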

> * Once a file is opened all I/O operations on this fd are handled by
> the same thread to minimise the amount of strange effects that can
> be expected on various strange OS:es.

Yes, that was my plan as well. This approach has a lot of advantages. Note
that it would also solve the multiple-IOs-through-select problem. I hope this
can be done without re-introducing per-thread mutexes, but I have not checked
that yet.
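
A rough sketch of what "same thread for the life of the fd" could look like;
the names, the table size, and the modulo assignment are made up for
illustration, a real version would probably pick the least loaded thread of
the right cache_dir:

    #define N_THREADS 16

    typedef struct _AsyncRequest AsyncRequest;  /* opaque here */

    extern void thread_enqueue(int thread_id, AsyncRequest * r);

    static int fd_owner[1024];  /* fd -> worker thread; size is illustrative */

    /* Pin the fd to one worker at open time... */
    static void
    pin_fd(int fd)
    {
        fd_owner[fd] = fd % N_THREADS;
    }

    /* ...and route every later read/write on that fd to the same worker,
     * so per-fd ordering comes for free from that thread's queue. */
    static void
    submit_on_fd(int fd, AsyncRequest * r)
    {
        thread_enqueue(fd_owner[fd], r);
    }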

> * Multiple outstanding writes are joined into one (I think this is
> already partially done..), or handled as one large writev().
> * Read prefetching is possible when time permits, as the thread can
> maintain some state info about the files it maintains.

Agree. IMO, both read/write calls should be linked to the memory buffer and
perform according to the space/data available. This will be tricky to
implement, but not impossible, I hope. However, note that this approach
increases the threads' understanding of the rest of Squid, which some could
object to.
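
For the write-joining part, this is roughly what I picture a worker thread
doing when it drains the write queue of one of its fds; the PendingWrite list
is made up for this mail:

    #include <sys/types.h>
    #include <sys/uio.h>
    #include <unistd.h>

    #define MAX_IOV 16

    typedef struct _PendingWrite {
        char *buf;
        size_t len;
        struct _PendingWrite *next;
    } PendingWrite;

    /* Join up to MAX_IOV queued writes into a single writev(). The caller
     * unlinks whatever was actually written (including short writes). */
    static ssize_t
    flush_queued_writes(int fd, PendingWrite * head)
    {
        struct iovec iov[MAX_IOV];
        PendingWrite *w;
        int n = 0;
        for (w = head; w != NULL && n < MAX_IOV; w = w->next) {
            iov[n].iov_base = w->buf;
            iov[n].iov_len = w->len;
            n++;
        }
        return n ? writev(fd, iov, n) : 0;
    }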

> * In order for a object to be cached on disk there must be at least
> one available thread. If there is no thread available on the most
> wanted cache_dir then try the next. Repeated until all available
> cache_dirs are tried.

Disagree. From my experiments, waiting for an idle thread is absolutely
prohibitive since it leads to severe under-utilization. All threads should
have something queued. The max length of the queue is a tuning parameter. The
queues should have priorities (the response_time-hit tradeoff discussed
before).
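
To show the admission rule I mean (a bounded backlog per disk instead of
waiting for an idle thread), here is a sketch; the names and the
reserve-a-slot-for-hits trick are illustrative, not a proposal for the exact
knobs:

    enum ReqPriority { PRI_HIT, PRI_SWAPOUT };  /* the response_time-hit tradeoff */

    typedef struct _DiskQueue {
        int queued;             /* requests already waiting on this disk */
        int max_queue;          /* tunable maximum; definitely not zero */
    } DiskQueue;

    /* Return the first cache_dir whose backlog is below its limit, or -1
     * to signal "bypass the disk entirely". Swapouts leave one slot in
     * reserve so a hit is never turned away by a burst of swapouts. */
    static int
    pick_dir(DiskQueue * dirs, int ndirs, enum ReqPriority pri)
    {
        int i, limit;
        for (i = 0; i < ndirs; i++) {
            limit = (pri == PRI_HIT) ? dirs[i].max_queue : dirs[i].max_queue - 1;
            if (dirs[i].queued < limit)
                return i;
        }
        return -1;
    }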

> * Select/poll should not be used for disk files (I don't think the
> current async-io code uses select either)

Agree 100%.
Moreover, if possible, we should avoid all sync IOs when async-io is enabled.
The current code is happy to fall back on sync IOs whenever something goes a
bit wrong.

> * Larger I/O operations than one page should be allowed and swapout
> delayed accordingly. 32K is probably a reasonable figure. Disks
> are quick on sequential read/writes, and having larger I/O
> operations hints the OS to spend some extra effort to try to have
> larger files less fragmented.

Not exactly sure what you mean here, but I think this is covered by the
read/write section above.
 
> Effects:
> 1. Swapouts are not initiated on saturated disks

Yes, but we disagree on what "saturated" means.

> 2. The I/O queue for one disk is limited by the number of assigned
> threads, giving a somewhat guaranteed maximum service time.

Yes, the wait queue should be limited to a [configurable] maximum (definitely
not zero!).

> 3. If all disks are saturated then disk is completely bypassed.
> 4. Less work for select/poll.
> 5. Larger I/O sizes allows for more efficient use of the available
> disk spindles (less seeks -> less iops -> higher througput).

Agree.

> 6. The current "magic" (round-robin of some of the less filled disks)
> for disk load balancing is unneeded. Simply try to fill the first
> available disk with the most available space.

Will have to think about that. Load/space balancing is tricky..
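
For reference, here is how I read your rule (pick the non-saturated disk with
the most free space), which is what would have to be compared against the
current round-robin magic; sketch only, the names are made up:

    typedef struct _SwapDirInfo {
        int free_blocks;        /* space left on this cache_dir */
        int saturated;          /* nonzero when its IO queue is full */
    } SwapDirInfo;

    /* Return the non-saturated cache_dir with the most free space,
     * or -1 when every disk is saturated (i.e. bypass the disk). */
    static int
    dir_with_most_space(const SwapDirInfo * dirs, int ndirs)
    {
        int i, best = -1;
        for (i = 0; i < ndirs; i++) {
            if (dirs[i].saturated)
                continue;
            if (best < 0 || dirs[i].free_blocks > dirs[best].free_blocks)
                best = i;
        }
        return best;
    }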

> Open issues:
> * How should HITs be handled on saturated disks?
> 1) Should only-if-cached be denied?
> 2) Should other requests be handled as misses?
> My vote is that only-if-cached is denied, and other requests are
> handled as normal (slightly delayed).

I think this should be configurable. Some people pay $$$ for outgoing
bandwidth, and they will kill for a hit. Some will get killed if response
time is too slow. :)

Overall, I think we agree on the major issues. I will do more testing at the
beginning of the week, and will be back with some concrete stuff shortly
after.

> Comments are welcome. We should think this through before any major
> coding begins.

Look who is talking! The huge-out-of-the-blue-patch man! ;-)

Thanks,

Alex.