Re: Multiple issues in Squid-3.2.3 SMP + rock + aufs + a bit of load

From: Alex Rousskov <rousskov_at_measurement-factory.com>
Date: Mon, 03 Dec 2012 23:12:40 -0700

On 12/03/2012 04:18 PM, Amos Jeffries wrote:

> Would it make sense to do one of:
>
> * have the AsyncCall queue drain with a timeout, so that no sequence of
> calls can lag any of the sensitive engines too much. That goes along with
> the event engines scheduling their heavyweight actions as Calls instead
> of executing them immediately.
>
> * maintain a real-time benchmark of each event by the AsyncCalls engine,
> marking the min/max duration of time spent servicing each call, so that
> we can identify the heavyweight AsyncCalls and schedule a Call checking
> for signals and events after each big one.
>
> * split the single AsyncCalls queue into separate Xaction and Bg queues. Such
> that active transaction calls, signals, and events are not held up by
> the background events which are not sensitive to lag.
>
> With a main loop something like:
> + signals scheduled into one Queue
> + timed events scheduled into one Queue
> + signals scheduled into one Queue
> + Xaction Queue draining
> + signals scheduled into one Queue
> + one Bg Call drained
> + ... repeat
>
> With two queues, we can schedule as many heavy-weight Calls as needed at
> any point without blocking the loop on all of them happening, while still
> 'blocking' the loop for the other, more important Calls queue.
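
For concreteness, I read the two-queue idea as roughly the loop below
(just a sketch; the queue type and helper names are invented, not
existing Squid code):

    #include <queue>
    #include <functional>

    typedef std::queue< std::function<void()> > CallQueue;

    // One iteration of the hypothetical two-queue main loop.
    void runOneIteration(CallQueue &xactionCalls, CallQueue &bgCalls)
    {
        // (signal and timed-event checks would schedule Calls into the
        // two queues here)

        // drain all transaction-critical Calls ...
        while (!xactionCalls.empty()) {
            xactionCalls.front()();
            xactionCalls.pop();
        }

        // ... but fire at most one background Call per iteration
        if (!bgCalls.empty()) {
            bgCalls.front()();
            bgCalls.pop();
        }
    }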

Overall, I sincerely doubt both (a) our ability to correctly engineer
such a complex auto-tuning system and (b) that it would yield a
significant performance improvement over much simpler approaches. Here
is why.

On a conceptual level, Squid needs to do some amount W of required work
for a given transaction. That work is unavoidable (by definition), but
we also add some processing overhead O on top of it. While W stays the
same, the O component, and hence the W+O total, only grows with each
additional transaction processing pause.
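
For illustration only (the numbers are made up): if W is 10 ms of
unavoidable work and every extra processing pause adds roughly 0.5 ms
of queuing and scheduling overhead, then three extra pauses turn a
10 ms transaction into an 11.5 ms one while accomplishing exactly the
same useful work.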

Is delaying the I/O loop for "too long" a real design concern when
dealing with transactions? Not really. If such a delay happens, it just
means that Squid is overloaded: there is simply "too much" work, and
delaying some of it is likely to make things worse. We should focus on
reducing O instead.

Can bugs cause "too long" I/O loop delays even when there is not too
much required work? Of course, they can! There are things beyond
transaction handling that can and should be delayed if needed because
they have flexible deadlines and because delaying them does not
immediately create more work (via disconnecting and reconnecting users
and such). Index rebuilding is one such background task.
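
As a rough illustration of the kind of self-pacing background step I
mean (a sketch only; the names, chunk size, and delay are made up, and
the real rebuild code is more involved):

    #include "event.h" // for eventAdd() and EVH

    static int entriesLeft = 100000; // stand-in for remaining rebuild work

    // Do a bounded chunk of background work, then reschedule the rest so
    // that transaction Calls and I/O get serviced in between the chunks.
    static void bgRebuildStep(void *)
    {
        const int chunk = entriesLeft < 100 ? entriesLeft : 100;
        entriesLeft -= chunk; // "rebuild" a bounded number of entries

        if (entriesLeft > 0)
            eventAdd("bgRebuildStep", &bgRebuildStep, NULL, 0.05, 1, false);
    }

    // kicked off once, e.g. right after startup:
    //   eventAdd("bgRebuildStep", &bgRebuildStep, NULL, 0.0, 1, false);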

Sizing such background tasks correctly is difficult. Just because we
fixed the heavy event processing does not mean that the index rebuild
is working fine now. It is probably still too slow or too invasive,
depending on the specific environment.

However, tuning that is outside the main loop's scope as long as there
is a working API to postpone non-urgent things until some progress has
been made on urgent ones. The main loop just does not know enough to
guess how long these background tasks should be postponed for,
especially when some of them affect transaction-related things like hit
ratio.

We may need to extend or modify the existing events API to support more
scheduling criteria, in addition to the current "time" and "some
progress has been made" ones. However, this should not affect how async
calls and the Comm loop interact.
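
Purely as an illustration of the kind of extra criteria I mean (this is
an invented sketch, not an existing or proposed Squid API):

    // Invented for illustration: an entry that becomes ready either when
    // its deadline passes or when enough foreground progress was reported.
    class DeferrableTask
    {
    public:
        DeferrableTask(double aDeadline, int aProgressGoal):
            deadline(aDeadline), progressGoal(aProgressGoal), progressSeen(0) {}

        // transaction code would call this after finishing urgent work
        void noteProgress() { ++progressSeen; }

        // the scheduler would fire the task when either criterion holds
        bool ready(double now) const {
            return now >= deadline || progressSeen >= progressGoal;
        }

    private:
        double deadline;     // "no later than" (absolute time)
        int progressGoal;    // "or after this much foreground progress"
        int progressSeen;
    };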

BTW, once zero-delay events are removed from the code, the event queue
will no longer need to be a part of the sawActivity loop, and that loop
will disappear. The code will then look very similar to what you have
outlined, but it would still be the responsibility of the higher-level
code to identify heavy/background events and schedule them accordingly.
That approach yields a much simpler, but still efficient and tunable,
design IMO.
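
In code terms, the iteration I would expect looks roughly like this
(the helper names are placeholders, not current Squid functions):

    void dispatchQueuedAsyncCalls();          // fire all scheduled Calls
    double checkTimedEvents();                // run due events; return next deadline
    void waitForCommActivity(double howLong); // select/poll until I/O or timeout

    // One simplified iteration: no re-looping on "saw activity" needed,
    // because heavy work is only ever scheduled, never run inline.
    void simplifiedIteration()
    {
        dispatchQueuedAsyncCalls();
        const double nextEvent = checkTimedEvents(); // may schedule more Calls
        dispatchQueuedAsyncCalls();
        waitForCommActivity(nextEvent);
    }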

Cheers,

Alex.