Re: Async I/O from Henrik Nordstrom on 1998-09-15 (squid-dev)

From: Henrik Nordstrom <hno@dont-contact.us>
Date: Wed, 16 Sep 1998 04:49:48 +0200

Alex Rousskov wrote:
>
> Both, unfortunately. There are two "if (wrong) return EWOULDBLOCK" statements
> in aioRead/Write, I think.

Looking at the code again I found that there are three, but one of them can
only occur if Squid runs out of memory (the aio_XXX call fails), which by
itself is a fatal error.

> The pool "dried out" almost immediately under load in our tests.
> Unfortunately, it took me a while to discover that...

I have a hard time believing the request structure pool in async_io.c dried
out in your tests. See below.

> Two cases: fixed-size pool of request structures is empty OR two IOs
> got submitted for one FD.

Looking at the code yet another time I see that aioRead/aioWrite is not very
well done. There is an attempt at handling multiple IOs for one FD, but if
both an aioRead and an aioWrite are issued on the same FD then the result is
unpredictable. But then again, it should not happen...

I did some testing, and it looks like I thought it would. Squid never issues
two IO operations on the same FD. What I did discover (and should have noted
earlier) is that async_io.c request structures are eaten by finished closes
until the main thread notices that the close is completed, but this should
not be a problem unless you are running with a huge number of threads
compared to the number of file descriptors (SQUID_MAX_FD).

> Here is how I see the select being called:
> - diskHandleWrite() calls aioWrite() with diskHandleWriteComplete() as a
> callback
> - aioWrite() may call diskHandleWriteComplete() with EWOULDBLOCK

Only if the async_io.c request pool is empty, or if there is already an
active operation on this FD, neither of which should happen.

> - diskHandleWriteComplete() seems to call commSetSelect() if
> /* another block is queued */

True. This is most likely not the right thing to do here. I actually
believed this called aioWrite again, but it doesn't.

I have no idea why the code is written like it is, async or not.

   write one request
   (wait)
   if more writes pending then
     join remaining requests
     wait for select
   endif

It should be more like

   join all pending requests
   write
   (wait)
   if more writes pending (can only happen if async)
      do it again (call diskHandleWrite again)
   endif
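
To make the proposed flow concrete, here is a minimal C sketch of it. This
is not the Squid code: disk_file, disk_queue_write(), disk_handle_write(),
disk_write_complete() and submit_write() are made-up names, the queue and
buffer sizes are arbitrary, and submit_write() only stands in for whatever
actually issues the write (aioWrite in the async case). The data is copied
into a joined buffer, in line with how the current async-io code copies
everything it hands over.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_PENDING 16      /* hypothetical queue depth */
#define JOIN_BUF_SIZE 4096  /* hypothetical size of one joined write */

typedef struct {
    const char *data;
    size_t len;
} pending_write;

typedef struct {
    pending_write pending[MAX_PENDING]; /* requests not yet written */
    size_t npending;
    int busy;                   /* one write is in flight */
    char joined[JOIN_BUF_SIZE]; /* copy handed to the writer */
    size_t joined_len;
    int submits;                /* number of writes issued so far */
} disk_file;

/* Stand-in for the real write call. It only counts the call; completion
 * is signalled later through disk_write_complete(). */
static void submit_write(disk_file *f)
{
    f->submits++;
}

/* Join every pending request that fits, then issue a single write. */
static void disk_handle_write(disk_file *f)
{
    size_t used = 0;

    assert(!f->busy && f->npending > 0);
    f->joined_len = 0;
    while (used < f->npending) {
        const pending_write *p = &f->pending[used];
        if (f->joined_len + p->len > JOIN_BUF_SIZE)
            break;
        memcpy(f->joined + f->joined_len, p->data, p->len);
        f->joined_len += p->len;
        used++;
    }
    /* Drop the joined requests from the front of the queue. */
    memmove(f->pending, f->pending + used,
            (f->npending - used) * sizeof(f->pending[0]));
    f->npending -= used;
    f->busy = 1;
    submit_write(f);
}

/* Queue a request; start a write only if none is in flight. */
static int disk_queue_write(disk_file *f, const char *data, size_t len)
{
    assert(len <= JOIN_BUF_SIZE);
    if (f->npending == MAX_PENDING)
        return -1;
    f->pending[f->npending].data = data;
    f->pending[f->npending].len = len;
    f->npending++;
    if (!f->busy)
        disk_handle_write(f);
    return 0;
}

/* Completion callback. More requests can only be pending here if the
 * write was asynchronous; if so, do it again. */
static void disk_write_complete(disk_file *f)
{
    f->busy = 0;
    if (f->npending > 0)
        disk_handle_write(f);
}
```

Note that the completion callback never waits for select; it just calls
the write handler again for whatever was queued in the meantime.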

> > I think the hardest part is to ensured that the memory is not reused
> > until the call is completed. The current async-io code copies all data
> > between the main thread and I/O threads, avoiding any locking of memory
> > buffers.
>
> Yes, this would be the hardest part along with updating the offsets. Simple
> but not efficient if done in the main thread. Tricky but efficient if done by
> other threads.

I do not agree here. I would say both simple and efficient if done in the
main thread.
* The main thread knows which writes are being made, and when they complete.
  About the only thing we need to add locking to is the object data.
* If done by the threads then more locking would be needed, making it both
  inefficient and complex.

> OK. I have implemented the queue limits, but found that they do not work very
> well, probably because information about requests-to-come and actual queue
> length cannot be synchronized well. The queue keeps fluctuating and going
> off-limits a *lot*. We (and the user) might be better off configuring Squid
> in terms of outstanding swap-ins/swap-outs (not individual requests). I have
> implemented the limit on swap-outs and found it working well, much better that
> the limit on queue length. The two approaches could be combined though.

Interesting, but not too unexpected.

I think queue limits are a better measure than pending swapouts, but
obviously not the raw queue-length values. A more appropriate value
is probably (max?) queue length average for some period of time
(perhaps 10 seconds), much like system load average of running
processes as reported by uptime.

Example function:
FRAC = 1/exp(1/10) (or some other fraction)
load = load * FRAC + current_load * (1 - FRAC);
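
Spelled out in C, the averaging could look roughly like the sketch below.
It assumes one sample per second, so the 10 in the formula is a 10 second
time constant; queue_load, queue_load_update() and LOAD_FRAC_10S are names
invented for this example, and the constant is simply 1/exp(1/10) written
out so that no math library is needed.

```c
#include <assert.h>

/* 1/exp(1/10): decay per sample for a 10-sample time constant. */
#define LOAD_FRAC_10S 0.9048374180359595

/* Hypothetical smoothing state for one request queue. */
typedef struct {
    double load; /* smoothed queue length */
} queue_load;

/* Fold the current raw queue length into the running average:
 * load = load * FRAC + current_load * (1 - FRAC) */
static void queue_load_update(queue_load *q, double current_load, double frac)
{
    assert(frac >= 0.0 && frac < 1.0);
    q->load = q->load * frac + current_load * (1.0 - frac);
}
```

Called once per sampling interval with the raw queue length, the value
moves toward the current length gradually, so short spikes barely move it
while a sustained queue shows up within a few time constants.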

> Sounds a bit complicated to me, but we can try it.

I wouldn't say it is very complicated. It is 2 passes through the
available disks until a suitable disk is found. If none is found then
there is no suitable disk available.

Sorting the disks is trivial.

> Just note that there is no up-to-date information on actual queue length
> that will not change while you are making your decisions and queueing
> requests (see the problem with queue length limit above).

See my answer above. It is possible to get quite reliable load values
even if the value is fluctuating.

And for the first pass, fast-fluctuating values are wanted, to quickly
push requests onto the second, third, ... disk when the load creeps up,
while keeping a higher probability for the disk(s) with the most free space.
This is to maintain a good HIT time on all disks.
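
A rough C sketch of such a two-pass selection follows. Several things in
it are assumptions made only for illustration: the swap_disk structure,
the idea that the first pass compares the raw queue length against a soft
limit, and the idea that the second pass compares a smoothed load against
a hard limit. Only the overall shape (sort by free space, two passes, give
up if nothing qualifies) is taken from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-disk state. */
typedef struct {
    int id;
    long free_kb;    /* free space on the disk */
    int queue_len;   /* raw, fast-fluctuating queue length */
    double load_avg; /* smoothed queue length */
} swap_disk;

/* qsort comparator: most free space first. */
static int by_free_space_desc(const void *a, const void *b)
{
    const swap_disk *x = a, *y = b;
    return (x->free_kb < y->free_kb) - (x->free_kb > y->free_kb);
}

/* Returns the chosen disk, or NULL if no disk is suitable.
 * soft_limit and hard_limit are assumed tuning parameters. */
static const swap_disk *select_disk(swap_disk *disks, size_t n,
                                    int soft_limit, double hard_limit)
{
    size_t i;

    assert(disks != NULL || n == 0);
    qsort(disks, n, sizeof(disks[0]), by_free_space_desc);

    /* Pass 1: prefer the emptiest disk whose raw queue is short, so
     * requests spill over to the next disk as soon as load creeps up. */
    for (i = 0; i < n; i++)
        if (disks[i].queue_len < soft_limit)
            return &disks[i];

    /* Pass 2 (assumed): accept any disk whose smoothed load is still
     * below the hard limit. */
    for (i = 0; i < n; i++)
        if (disks[i].load_avg < hard_limit)
            return &disks[i];

    return NULL;
}
```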

/Henrik
Received on Tue Jul 29 2003 - 13:15:54 MDT
