Re: false hit recovery?

From: Henrik Nordstrom <hno@dont-contact.us>
Date: Tue, 24 Nov 1998 22:42:07 +0100

Alex Rousskov wrote:

> > Is there anyone working on false hit recovery?
>
> Yes.

Good. May I ask who?
 
> > forward.c should then cycle through the list of addresses, and fail
> > when the complete list has been tried two or three times.
..
> I do not think it is the best algorithm for several reasons:
> - in many cases it does not make sense to try several times
>   (if the object is not there, it is unlikely to appear)

Sorry for being unclear. I should have said that this cycling through
addresses is only to find a place where we may forward the request,
not where the object is. Each peer that is probed and found not to
have the object should of course not be probed again. As you say,
if the object isn't there the first time, then it is not likely
to be there on the second try either.

> - users do not care about #attempts, they care about time

True, but most users prefer a slow reply rather than an error.
(some may prefer a fast error, but I choose not to pay attention to
those ;-)

> - parents and origin servers are special

Yes. I tried to say that, but failed ;-)

> I am not working on this algorithm myself, but that is what I
> would suggest as a tentative plan.
>
> - form a list of servers to try
> - note start time
> - success = false
> - while (!success && there-are-servers-left) {
>       if (time-passed >= opt.retry_timeout)
>           select next "guaranteed" server
>       else
>           select next server
>       success = try selected server
>   }
> - if !success report ERR and maybe list servers tried (in the
>   headers?)
>
> "guaranteed" server is the server that must handle misses from us OR the
> only-server-left if the server list contains only one member. Origin servers
> and parents are examples of "guaranteed" servers.

Sounds reasonable. It is close to what I have been thinking of, with
the addition of skipping sibling probes if things get delayed. I was
only thinking about the issue of reaching the object, not so much
about the speed issue.
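
As a rough sketch of the loop quoted above, in Python rather than
Squid's C. Server, forward() and try_server are hypothetical names
made up for illustration, not anything in forward.c. One assumption
goes beyond the quoted plan: if the timeout has passed and no
"guaranteed" server is left, the sketch keeps trying whatever remains
instead of giving up.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Server:
    name: str
    guaranteed: bool  # parent or origin server: must handle our misses


def forward(servers: List[Server],
            try_server: Callable[[Server], bool],
            retry_timeout: float,
            clock: Callable[[], float] = time.monotonic,
            ) -> Tuple[Optional[Server], List[Server]]:
    """Return (server that answered, or None) and the servers tried."""
    remaining = list(servers)
    tried: List[Server] = []
    start = clock()
    while remaining:
        candidates = remaining
        if clock() - start >= retry_timeout:
            # Out of time: prefer servers that must accept our misses.
            guaranteed = [s for s in remaining if s.guaranteed]
            if guaranteed:
                candidates = guaranteed
            # Assumption: with no guaranteed server left we fall back
            # to the remaining servers rather than stopping.
        server = candidates[0]
        remaining.remove(server)  # each server is tried at most once
        tried.append(server)
        if try_server(server):
            return server, tried
    return None, tried  # caller reports ERR, maybe listing `tried`
```

The list of tried servers is returned so that the caller can log it or
put it in the error reply, as suggested in the quoted plan.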

Perhaps a more appropriate approach here is a quite short timeout for
each individual sibling queried, to quickly skip the siblings that are
slow to respond, or perhaps a combination that allows more peers to be
probed when the network is fast and fewer when things slow down.
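
One way to read the "combination" is a short per-sibling timeout
capped by an overall probing budget. The sketch below assumes that
reading; probe_siblings(), probe and both timeout parameters are
hypothetical names, and the probe callback is assumed to return True
only on a real hit.

```python
import time
from typing import Callable, List, Optional


def probe_siblings(siblings: List[str],
                   probe: Callable[[str, float], bool],
                   per_probe_timeout: float,
                   total_budget: float,
                   clock: Callable[[], float] = time.monotonic,
                   ) -> Optional[str]:
    """Return the first sibling whose probe succeeds, or None."""
    deadline = clock() + total_budget
    for sibling in siblings:
        left = deadline - clock()
        if left <= 0:
            break  # things are slow: skip the remaining siblings
        # Never wait longer on one sibling than the short limit,
        # and never past the overall budget.
        if probe(sibling, min(per_probe_timeout, left)):
            return sibling
    return None
```

With fast replies the budget is barely touched and every sibling gets
probed; when replies are slow the budget runs out after a few probes
and the request moves on to the parents or the origin server.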

> I am not sure if we should STOP trying if there are guaranteed servers left,
> but (we ran out of time and/or we tried at least one guaranteed server).
> Doing so may result in more ERR messages propagated to the user.

We should not stop trying until we have nothing further to try. That
is why we need to cycle through the non-responding (or "overloaded")
"guaranteed" servers a few times before giving up. This is primarily
the origin server, but also parents to some extent.

And don't forget that one origin server may have any number of IP
addresses that need to be tried.
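
A minimal sketch of that cycling, assuming an address is retried only
when it did not respond, and dropped once it gives a definite refusal.
Outcome, try_guaranteed() and max_rounds are hypothetical names; the
default of three rounds just follows the "two or three times"
mentioned earlier.

```python
from enum import Enum
from typing import Callable, List, Optional


class Outcome(Enum):
    OK = "ok"
    NO_RESPONSE = "no response"  # connect failure or overload: retry
    REFUSED = "refused"          # definite answer: do not ask again


def try_guaranteed(addresses: List[str],
                   attempt: Callable[[str], Outcome],
                   max_rounds: int = 3) -> Optional[str]:
    """Cycle through all addresses up to max_rounds times."""
    pending = list(addresses)
    for _ in range(max_rounds):
        still_pending = []
        for addr in pending:
            outcome = attempt(addr)
            if outcome is Outcome.OK:
                return addr
            if outcome is Outcome.NO_RESPONSE:
                still_pending.append(addr)  # worth another round
        pending = still_pending
        if not pending:
            break  # nothing left that could still answer
    return None
```

The address list would hold every IP address of the origin server (and
of any parents), so one unreachable address does not fail the request.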

> It would be useful to forward an X-Already-Tried: header to peers so they do
> not re-try some caches we have already checked out. "Some" because they may
> still try caches that we had as siblings and they have as parents. Thus,
> X-Already-Tried-Hitonly and X-Already-Tried-Missallowed may be in order. Ick.

I think a single header listing the false hit servers should be enough
here, assuming that each server keeps track of which of its peers are
alive or dead. We also need an option to filter this list at
border connections where organisational privacy may be an issue.
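
To make the idea concrete, a small sketch of building such a list and
filtering it at a border. The comma-separated format, the function
names and the suffix-based privacy rule are all assumptions made for
illustration; nothing like this is specified anywhere yet.

```python
from typing import Iterable, List


def build_already_tried(false_hits: Iterable[str]) -> str:
    """Join the false hit servers into one header value, no repeats."""
    seen: List[str] = []
    for host in false_hits:
        if host not in seen:
            seen.append(host)
    return ", ".join(seen)


def filter_for_border(header_value: str,
                      private_suffixes: Iterable[str]) -> str:
    """Drop hosts whose names end in one of the private suffixes."""
    suffixes = tuple(private_suffixes)
    hosts = [h.strip() for h in header_value.split(",") if h.strip()]
    kept = [h for h in hosts if not h.endswith(suffixes)]
    return ", ".join(kept)
```

A cache at the edge of an organisation would run the filter before
forwarding, so internal cache names never leave the organisation.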
 
> Clearly the scheme can be extended/improved to handle errors other
> than false hits. For example, access denied errors and connection
> errors.

Is there a difference? Not while forwarding a request anyway. A
false hit is when we thought that we could request the object from
a peer (sibling or even parent), but it turns out not to be possible
for one reason or another.

The difference is perhaps what we note of the error for future
requests to that address. If we get a lot of "permission denied"
messages from a peer, then they are most likely not willing to
peer with us. If we get a lot of connection failures to a peer
then the peer is either overloaded or has bad network connectivity.
None of these errors should ever be seen by the end user, only
in our logs.
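
A sketch of the bookkeeping this implies: count errors per peer and
per kind, and turn a run of them into a note for the logs.
PeerErrorLog, the error kind strings and the threshold of 10 are
hypothetical choices, picked only to make the example run.

```python
from collections import Counter
from typing import Dict, Optional


class PeerErrorLog:
    """Per-peer error counters; results go to the logs, not users."""

    def __init__(self, threshold: int = 10) -> None:
        self.threshold = threshold  # assumed cut-off for "a lot"
        self.counts: Dict[str, Counter] = {}

    def record(self, peer: str, kind: str) -> None:
        self.counts.setdefault(peer, Counter())[kind] += 1

    def diagnosis(self, peer: str) -> Optional[str]:
        seen = self.counts.get(peer, Counter())
        if seen["denied"] >= self.threshold:
            return "peer is probably not willing to peer with us"
        if seen["connect_failed"] >= self.threshold:
            return "peer is overloaded or has bad network connectivity"
        return None
```

While forwarding, both kinds of error are handled the same way (try
the next server); only what gets noted for future requests differs.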

The final message sent when we find nowhere to forward the
request should probably be a kind one, saying something like
"the server is either down, or the network is overloaded". The
current message sent when never_direct fails is a bit too
technical to be sent to end users. Instead we should provide
more information in the logs on why the request failed (and
perhaps as comments in the error page for the technically
minded person).

/Henrik
Received on Tue Jul 29 2003 - 13:15:54 MDT
