Re: Explaining internal errors

From: Alex Rousskov <rousskov_at_measurement-factory.com>
Date: Tue, 29 Jun 2010 14:47:43 -0600

On 06/28/2010 05:41 PM, Amos Jeffries wrote:
> On Mon, 28 Jun 2010 13:12:27 -0600, Alex Rousskov
> <rousskov_at_measurement-factory.com> wrote:
>> Hello,
>>
>> Squid may respond with HTTP 500 Internal Server Error
>> (HTTP_INTERNAL_SERVER_ERROR) when we hit a bug or when the communication
>> with another server fails. The admin may need to distinguish these and
>> other cases (and their sub-cases) because they may require different
>> administration actions (e.g., fix a broken ICAP server or configure more
>> ephemeral ports).
>
> NP: some of these are not 500 errors. The correct HTTP/1.1 status changes
> have not yet been ported from 2.7.
>
>> Some of the corresponding errors are also reported in cache.log but many
>> are not. Moreover, it is often difficult to correlate cache.log and
>> access.log information, especially on a busy proxy. When we are asked to
>> triage or explain various TCP_MISS/500 and NONE/500 log entries, we have
>> very few usable sources of information.
>>
>>
>> I suggest adding two new access.log format codes that will contain the
>> error page ID returned by Squid and the corresponding "error details".
>>
>> The "error page ID" field will explain, in most cases, what happened. We
>> will log such strings as ERR_SHUTTING_DOWN and ERR_UNSUP_REQ. These will
>> come from the err_type enum and the error page ID already maintained by
>> Squid.
>>
>> The "error detail" field will detail why the error happened, helping to
>> distinguish different causes for the same error page ID. The exact
>> interpretation of the details will depend on the error page ID. For
>> example, it could be the OS error number for I/O-related errors.
>>
>> Any objections or better ideas?
>
> +1 on the pageid being logged. It's fairly often asked for.
>
> Not so sure on the reason text. Some of the pages are very fuzzy, though
> that just means more specific pages are needed I suppose. If you can find
> some consistent reason explanation for each case it should be set to go in
> the page %E code for the user as well.

The error page text, even a perfect one, is irrelevant here. The error
page goes to the user. The project is about helping the admin. In
general, the admin sees millions of access.log entries and has no way to
guess what the corresponding error pages contained. Thus, we need to log
sufficient information into access.log.

Page ID alone is not sufficient because the same error pages may be used
under rather different conditions. This is fine for the end-user because
they usually do not care about the true reason for an internal failure.
This is bad for the admin because they need to explain 1% of failures to
their bosses or, worse, fix things.

The "error detail" field will not be text in the initial implementation.
I plan to store xerrno and various error-specific decision codes there,
all integers. I will report whether that was sufficient for a specific
triage case I am working on.

> With both tags, don't forget the %icap:: / %http:: scoping so we can have
> multiple servers error on one request (particularly with icap failover
> possible when working with a dead http origin).

These new log fields are for the master transaction as a whole, I think,
and so they should not use a protocol prefix. The logged error codes may
be specific to a protocol, but the logging code would not know that.

> Being more consistent with use of the ERROR: and FATAL: prefixes to
> cache.log (as you noted, many errors are not even logged in the
> administrative cache.log!). With the error page details there to link
> specific problems to URL and error pageid and reason.

I do not think that would solve the problem, even if all our cache.log
messages are perfect. Most transaction errors are not and should not be
reported to cache.log!

Thank you,

Alex.
P.S. Squid3 has a LFT_SQUID_ERROR placeholder for the error page ID already.
Received on Tue Jun 29 2010 - 20:48:32 MDT

This archive was generated by hypermail 2.2.0 : Wed Jun 30 2010 - 12:00:08 MDT