Re: io assumptions

From: Robert Collins <robertc@dont-contact.us>
Date: 05 Dec 2002 19:49:17 +1100

On Thu, 2002-12-05 at 19:26, Henrik Nordstrom wrote:
> Robert Collins wrote:
>
> > storeDiskdOpen for instance, if the diskd shmget fails, cleans up the
> > request and returns NULL - indicating a failure. If the request is
> > queued, then yes it currently returns after the next IO loop. BUT:
> > overlapped IO (or any OS-callback based IO) could potentially call back
> > immediately if the file metadata is in cache - breaking the current calling code.
>
> I don't agree here.
>
> The callback on the SIO may only occur when the store is being polled
> or there is specific I/O activity, not randomly at any time. Should
> probably be limited to polling only to avoid I/O error races.

That limitation is easy. I'm talking about the following cycle:

storeOpen()->SIO.open()->OS.Open()->SIO.Completed()->storeClosed(error);

So it is callback-based OS behaviour, but occurring immediately.
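
To make that concrete, here is a minimal sketch (the types and names are
mine, not the current storeio code) of the kind of caller that breaks if
the completion callback is allowed to fire before the open call returns:

/* hypothetical sketch - not the real storeio code */
typedef void OPENCB(void *cbdata, int errflag);

extern void *sio_open(const char *path, OPENCB * callback, void *cbdata);

typedef struct {
    void *sio;
    int open_pending;
} ClientState;

static void
open_done(void *cbdata, int errflag)
{
    ClientState *cs = cbdata;
    if (errflag) {
        /* assumes cs->sio and cs->open_pending were set by the caller;
         * if this callback fired from inside sio_open() (metadata in
         * cache, overlapped IO completing at once), neither is true yet */
        cs->open_pending = 0;
        cs->sio = NULL;
    }
}

static void
client_start(ClientState * cs, const char *path)
{
    cs->sio = sio_open(path, open_done, cs);    /* callback may fire in here */
    cs->open_pending = 1;                       /* too late if it already did */
}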

> If your underlying I/O mechanism has a builtin callback mechanism where
> the callback is made asynchronously without polling then this callback
> must only be into the "fs" driver, not on the SIO object, and the "fs"
> implementation needs to queue the event until it can be processed in a
> sane manner. You also need to employ some kind of safe locking in such
> cases; such designs are probably only safe when the callback occurs as a
> new thread.

Well, once we have clear semantics, this can definitely be done. The
semantics are not clear to me on first principles, and while I can study
the current behaviour, at the moment I think it's somewhat arbitrary and
inconsistent.

> > Another example: storeUfsOpen returns NULL on open failure, an object on
> > open success.
> >
> > And storeAufsOpen returns NULL to shed IO load, and an object that can
> > have reads queued - but that may not actually open successfully - if the
> > request gets queued.
>
> Yes?
>
> Same thing on create.

The point is that:
Ufs returns NULL on *IO failure during open()*.
Aufs calls the close callback on *IO failure during open()*.

That is inconsistent.
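
Roughly, in sketch form (not the actual storeUfsOpen/storeAufsOpen bodies,
and the helper names are made up), the two conventions look like this:

#include <fcntl.h>

typedef struct _SIO SIO;
typedef void STCLOSECB(void *cbdata, int errflag);

extern SIO *sio_new(int fd, STCLOSECB * callback, void *cbdata);
extern void aio_queue_open(SIO * sio, const char *path);

/* ufs style: an open failure is reported through the return value */
SIO *
ufs_style_open(const char *path, STCLOSECB * callback, void *cbdata)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;            /* the caller sees the failure directly */
    return sio_new(fd, callback, cbdata);
}

/* aufs style: an SIO is handed back as soon as the request is queued;
 * an open failure surfaces later through the close callback */
SIO *
aufs_style_open(const char *path, STCLOSECB * callback, void *cbdata)
{
    SIO *sio = sio_new(-1, callback, cbdata);
    aio_queue_open(sio, path);  /* may later call callback(cbdata, error) */
    return sio;
}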

> > I think we would be better served by:
> > * void return type.
> > * open always calls back, with error (failure of some sort) or good object (success on open).
> > * And the callback is allowed to occur immediately.
>
> The second part of the second point is not acceptable (callback to
> return "good" object) and won't solve your problem unless you are
> willing to paint us into a FS design corner which the current design is
> deliberately designed to avoid.

What design corner? I'm not trying to be difficult here; I'm just not
clear on the issue.

However, assuming that it's to allow 'speculative' queueing of reads or
writes before the physical IO has occurred, let me make another
suggestion:

* Always return a SIO object - even on failure.
* Call forward to an error callback as soon as an error is detected,
whether during the open/queueing stage or on a physical error (see the
sketch below).
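
In code terms, the convention I'm suggesting would look something like
this (a sketch only, with made-up helper names):

typedef struct _SIO SIO;
typedef void STERRCB(void *cbdata, int errflag);

extern SIO *sio_alloc(STERRCB * error_callback, void *cbdata);
extern int queue_open(SIO * sio, const char *path); /* 0 on queue failure */
extern void sio_signal_error(SIO * sio, int errflag);

SIO *
store_open_sketch(const char *path, STERRCB * error_callback, void *cbdata)
{
    SIO *sio = sio_alloc(error_callback, cbdata);
    if (!queue_open(sio, path))
        sio_signal_error(sio, -1);  /* the error still travels via the
                                       callback, possibly immediately */
    return sio;                     /* never NULL */
}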

> There is also a callback on close. This callback is also used for
> signalling I/O errors, including failed open/create.
>
> What should perhaps be done is to separate I/O errors from close, and
> not automatically destroy the SIO on I/O errors.

Yes.

> In terms of the storeio
> API a failed open/create is just a kind of I/O error. There is nothing
> special about a failed open/create. The same thing happens if a read or
> write fails.
>
> This design is intentionally done such that the core does not rely on
> when/how/why the FS layer opens/closes files, assigns object identities
> etc, only that it gets done.

This all makes sense.

> Yes, this makes life slightly more complex in the storeio layer, but is
> very much intentional as other designs paint you into corners where
> many interesting object store designs cannot be done without a great deal
> of complexity.

This is the bit I need to grok more. I don't perceive the corners.
 
> The property I am defending here:
>
> The object identity assignment part of storeCreate() should be allowed
> to be delayed until a sufficient amount of data has been sent to the
> storeio layer, possibly the whole object contents. This is to allow for
> storeio implementations which use the object identity as a pointer to
> where the data is stored and not as an indirect name (a UNIX file name
> is an indirect name, a block pointer is not), and to be allowed to assign
> this when the data can be laid out on disk.

Ok. This is fine, and fits with my 'speculative reads or writes'
assumption above. I think we need to be clearer about data loss.

> What I can accept as a change in the storeio API here is that
> storeOpen() always returns a SIO and only the callback is used for
> signalling "I/O errors" such as load shedding or other events where
> the object cannot be accessed. But I see no good reason to do this,
> and it increases the overhead significantly in the load shedding case.

Ok, some reasons:
1) It's a more consistent API. Consistency makes it easier to work with
and adapt.
2) It allows for more advanced error handling - such as retrying
open() calls - if desired.

On the performance side, a shared NullFoo object can be returned in the
shedding case, making the overhead nearly zero:
  if (shed()) {
    /* hand back a shared, static "null" SIO; tellshed() arranges for
     * the shed notification to reach the caller via its callback */
    NullMyType.tellshed(callback, callback_data);
    return &NullMyType;
  }
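
Spelled out a little more (the names are made up, and schedule_callback()
stands in for whatever deferred-call mechanism is available), NullMyType
only has to remember the callback and deliver the shed error from the
event loop rather than from inside open():

typedef void STSHEDCB(void *cbdata, int errflag);

/* deliver callback(cbdata, errflag) on the next event-loop pass, so the
 * caller still gets a uniform callback-based error notification */
extern void schedule_callback(STSHEDCB * callback, void *cbdata, int errflag);

static void
null_tellshed(STSHEDCB * callback, void *callback_data)
{
    schedule_callback(callback, callback_data, -1); /* -1 == shed / IO error */
}

static struct {
    void (*tellshed) (STSHEDCB *, void *);
} NullMyType = { null_tellshed };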
 
> I can also accept that I/O errors do not automatically close the SIO
> and storeClose() must be called unless it has already been called for
> the SIO, but this mainly makes the storeio implementation more complex
> as it then needs to have two slightly different paths in how to deal
> with I/O errors (one if the SIO is currently open, another if the SIO
> has already been closed and the caller is waiting for all writes to
> complete)

What about this (thinking out loud):
storeClose() just signals that the core doesn't want the object anymore.
If we reference count the object instead, it will detect when it's no
longer wanted and can just quietly vanish. It's up to the store layer's
IO object to clean up properly.
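
Roughly (a hypothetical sketch of the shape of it):

typedef struct _SIO {
    int refcount;
    void (*destroy) (struct _SIO *);    /* fs-specific cleanup */
} SIO;

/* the fs layer takes a reference while physical IO is outstanding */
static void
sio_ref(SIO * sio)
{
    sio->refcount++;
}

static void
sio_unref(SIO * sio)
{
    if (--sio->refcount == 0)
        sio->destroy(sio);              /* quietly vanishes */
}

/* storeClose() reduces to "the core no longer wants this object" */
void
store_close_sketch(SIO * sio)
{
    sio_unref(sio);
}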

Rob

Received on Thu Dec 05 2002 - 01:49:21 MST
