Re: Explaining internal errors from Amos Jeffries on 2010-06-29 (squid-dev)

From: Amos Jeffries <squid3_at_treenet.co.nz>
Date: Tue, 29 Jun 2010 23:31:42 +0000

On Tue, 29 Jun 2010 14:47:43 -0600, Alex Rousskov
<rousskov_at_measurement-factory.com> wrote:
> On 06/28/2010 05:41 PM, Amos Jeffries wrote:
>> On Mon, 28 Jun 2010 13:12:27 -0600, Alex Rousskov
>> <rousskov_at_measurement-factory.com> wrote:
>>> Hello,
>>>
>>> Squid may respond with HTTP 500 Internal Server Error
>>> (HTTP_INTERNAL_SERVER_ERROR) when we hit a bug or when the
communication
>>> with another server fails. The admin may need to distinguish these and
>>> other cases (and their sub-cases) because they may require different
>>> administration actions (e.g., fix a broken ICAP server or configure
more
>>> ephemeral ports).
>>
>> NP: some of these are not 500 errors. The correct HTTP/1.1 status
changes
>> have not yet been ported from 2.7.
>>
>>> Some of the corresponding errors are also reported in cache.log but
many
>>> are not. Moreover, it is often difficult to correlate cache.log and
>>> access.log information, especially on a busy proxy. When we are asked
to
>>> triage or explain various TCP_MISS/500 and NONE/500 log entries, we
have
>>> very few usable sources of information.
>>>
>>>
>>> I suggest adding two new access.log format codes that will contain the
>>> error page ID returned by Squid and the corresponding "error details".
>>>
>>> The "error page ID" field will explain, in most cases, what happened.
We
>>> will log such strings as ERR_SHUTTING_DOWN and ERR_UNSUP_REQ. These
will
>>> come from the err_type enum and the error page ID already maintained
by
>>> Squid.
>>>
>>> The "error detail" field will detail why the error happened, helping
to
>>> distinguish different causes for the same error page ID. The exact
>>> interpretation of the details will depend on the error page ID. For
>>> example, it could be the OS error number for I/O-related errors.
>>>
>>> Any objections or better ideas?
>>
>> +1 on the pageid being logged. It's fairly often asked for.
>>
>> Not so sure on the reason text. Some of the pages are very fuzzy,
though
>> that just means more specific pages are needed I suppose. If you can
find
>> some consistent reason explanation for each case it should be set to go
>> in
>> the page %E code for the user as well.
>
> The error page text, even a perfect one, is irrelevant here. The error
> page goes to the user. The project is about helping the admin. In
> general, the admin sees millions of access.log entries and has no way to
> guess what the corresponding error pages contained. Thus, we need to log
> sufficient information into access.log.
>
> Page ID alone is not sufficient because the same error pages may be used
> under rather different conditions. This is fine for the end-user because
> they usually do not care about the true reason for an internal failure.
> This is bad for the admin because they need to explain 1% of failures to
> their bosses or, worse, fix things.
>
> The "error detail" field will not be text in the initial implementation.
> I plan to store xerrno and various error-specific decision codes there,
> all integers. I will report whether that was sufficient for a specific
> triage case I am working on.
>

Okay.

>
>> With both tags, don't forget the %icap:: / %http:: scoping so we can
have
>> multiple servers error on one request (particularly with icap failover
>> possible when working with a dead http origin).
>
> These new log fields are for the master transaction as a whole, I think,
> and so they should not use a protocol prefix. The logged error codes may
> be specific to a protocol, but the logging code would not know that.
>

Logging knows. The AccessLogEntry is broken down into protocol sub-structs
and the tag code indicates exactly where the data is sourced. This is as
much as logging needs to know.

Anyway, you seem to be intending to store a formatted field layout inside
the one code. That wasn't clear earlier. As long as we can easily
distinguish which if the many servers a particular error came from.

>
>> Being more consistent with use of the ERROR: and FATAL: prefixes to
>> cache.log (as you noted, many errors are not even logged in the
>> administrative cache.log!). With the error page details there to link
>> specific problems to URL and error pageid and reason.
>
> I do not think that would solve the problem, even if all our cache.log
> messages are perfect. Most transaction errors are not and should not be
> reported to cache.log!

I disagree. Log level 2 or 3 would be appropriate for all such transaction
errors IMO, along with the main transaction states. Normally out of sight,
but configurable to visible.
This is regardless and on top of any other information anywhere more
useful. The debug/administrative log is a very appropriate place for this
kind of thing.

Amos
Received on Tue Jun 29 2010 - 23:31:46 MDT

This archive was generated by hypermail 2.2.0 : Thu Jul 01 2010 - 12:00:08 MDT