Re: Multiple issues in Squid-3.2.3 SMP + rock + aufs + a bit of load

From: Alex Rousskov <rousskov_at_measurement-factory.com>
Date: Mon, 03 Dec 2012 22:25:46 -0700

On 12/03/2012 04:16 PM, Henrik Nordström wrote:
> mån 2012-12-03 klockan 15:53 -0700 skrev Alex Rousskov:
>> Ah, I see another problem with the "each non-waiting engine runs once"
>> approach. One engine may create work for other engines or for itself,
>> including async calls and "run this now" lightweight events. This means
>> we should really continue to use the sawActivity loop so that all
>> lightweight events are processed before I/O wait starts. This design is
>> more complex, but it is actually "more correct".

> Not sure. Same reasoning applies to comm events.

Yes, but we should not intermix comm events and async calls because some
comm work is done outside the AsyncCallQueue order and async callback
recipients may be surprised by it.

> There is some good reasons not to do this. A chain of small lightweight
> events quickly adds up to a heavy event.

Yes, but breaking that chain by injecting new actions in the middle of
it is worse than delaying the next select(2) call. YMMV, but wrong
actions order has caused some of the most difficult to triage and fix
Squid bugs that I can remember.

Only when all I/O loop callbacks go through the AsyncCallQueue APIs, we
can suspend call chains for I/O. And those callbacks probably include
many Comm-internal synchronously called I/O handlers because they also
change FD state and the new state may conflict with the broken chain's
expectations.

If some chain is "too heavy" then either we can carefully break it by
inserting an eventAdd() pause and by using other chain-specific tricks
OR the chain just naturally has to be that long because Squid was
configured to do something complex with a given piece of data.

> But might make sense to drain the AsyncCall queue more than once.

I do not see how that would help. We drain it completely before we start
I/O.

>> The new condition for the sawActivity loop should be:
>>
>> while(sawActivity && !sawHeavyEvent)
>
> What is a heavy event in your view?
>
>> One call is sufficient -- the async call queue drains itself completely,
>> including any calls scheduled during the draining process itself, but
>> the dispatchCalls() call should remain inside the sawActivity loop
>> (because events create calls create "now" events create calls create
>> "now" events ...).
>
> would it be possible to make that only fire already queued calls and let
> the main event loop worry about dealing with chained calls?

While possible, it may create bugs because comm events will interfere
with already scheduled (but postponed) async calls. We must completely
drain all async calls before doing comm I/O.

>> Does the attached patch makes sense to you? Does it solve the "I/O
>> starvation during rebuild" problem you found?
>
> It probably solves the starvation, but I do not like that sawActivity
> loop at all. It too easily creates comm starvation if one is careless
> with event chains. It it is to stay then it need to be limited in number
> of iterations accepted per main loop.

I disagree with this logic. Yes, sawActivity approach supports long
chains, but they are not the loop's fault! If some chain is too long,
the bug is in that chain's code and should be fixed there. If it cannot
be fixed there, then the chain is not "too long", but Squid is just
being asked to do "too much".

> The patch looks good if sawActivity loop-in-the-loop is to stay.

It is difficult to resolve this disagreement. I argue that the
sawActivity loop should stay because breaking call chains by I/O
callbacks will lead to bugs, but I cannot show you any specific bug it
will cause. You argue that some chains might be so long that they will
starve I/O, but you cannot show any specific "too long" chain (after the
heavy event handling bug is fixed).

Perhaps we can agree to let the sleeping dog lie? The bug you needed to
fix was with the heavy events handling. My patch or your next patch fix
that bug. Let's not remove the sawActivity loop until its removal
becomes the best solution for some other bug. We can discuss the loop's
pros and cons in the context of that specific bug then.

Thank you,

Alex.
Received on Tue Dec 04 2012 - 05:25:58 MST

This archive was generated by hypermail 2.2.0 : Tue Dec 04 2012 - 12:00:05 MST