Re: [PATCH] Bug 2680: ** helper errors after -k rotate

From: Amos Jeffries <squid3_at_treenet.co.nz>
Date: Thu, 16 Jul 2009 14:08:24 +1200

On Thu, 16 Jul 2009 11:13:14 +1000, Robert Collins
<robertc_at_robertcollins.net> wrote:
> On Thu, 2009-07-16 at 12:44 +1200, Amos Jeffries wrote:
>>
>>
>> Well, as I'm seeing it now, the rotate/reconfigure cases are where you
>> both argue it's reasonable to have more than N helpers running.
>
> Yes. Trivial example of problems: assume that the user configures one
> stateful helper. They then change the command line and reconfigure.
> We must start a new helper or its stderr goes to the wrong log
> indefinitely.
> We are not meant to send new requests to the old helper (we're trying to
> shut it down).
> If we do not permit multiple helpers, then we cannot start the new
> helper until the old one completes, and as that is [modulo timeouts]
> under client control, we could be waiting for an arbitrary time with all
> new requests getting queued up waiting for the helper to shut down.
> That could be several seconds.
>
> Note that with NTLM this is realistic - there's no need for multiple
> helpers as a single helper can serve many concurrent requests.
>
>> The hidden bugs I've seen: first, the NTLM helper infinite-reservation
>> bug. It shows up clearly when Squid is checking the number of helpers
>> running. It shows up rarely and only under high load, when Squid is
>> allowed to open more helpers to fill the gap each time they run low.
>> Then it crashes claiming too many helpers are running, with nothing to
>> indicate the reservation hangs that caused it.
>
> So two things:
> - there's no reason that the overload support code can't hard-limit the
> number of *non-shutting-down* helpers (reconfigure marks all old
> helpers as being queued to shut down). So you can fix/address that
> DOS case without altering the reconfigure case.
> - The above sounds like a bug that isn't related to reconfigure.

Both reconfigure and helper recovery use startHelpers(), which is where the
limit needs to be enforced.
The DOS bug fix broke *rotate* (reconfigure has an async step added by Alex
that prevents it from being a problem).

>
>> Also, the sudden memory-swapping issues mentioned when either the above
>> or a rotate/reconfigure hits. OpenWRT and OLPC are the limited-memory
>> cases here. If OpenWRT gets bloated memory use, the network effects are
>> unpredictable, leading to loss.
>
> If someone is running hundreds of helpers on openwrt/olpc then things
> are broken already :). I'd really suggest that such environments
> pipeline through a single helper rather than many concurrent helpers.
> Such platforms are single-core and you'll get better usage of memory
> doing many requests in a single helper than one request each to many
> helpers.

lol, NTLM concurrent? try it!

>
>> We would need to document that during rotate 200% of the allocated
>> helpers are running, and that at any other time 150% of the helpers may
>> be running. But I don't see a real need to do it this way other than
>> "it's easy".
>
> I don't see what the 150% is about - that sounds unconnected to
> reconfigure ;). Perhaps the helper limit you added is just failing to
> ignore a 'shutting_down' helper?

This:
helpers.cc, in helperServerFree() and helperStatefulServerFree():

    assert(hlp->n_running >= 0); // AYJ: kills squid in bug 2680 when async
                                 // shutdown happens after limited startup.

    if (!srv->flags.shutdown) { // AYJ: prevents properly shut-down helpers
                                // being restarted
        hlp->n_active--;
        assert(hlp->n_active >= 0);
        debugs(84, 0, "WARNING: " << hlp->id_name << " #" << srv->index + 1 <<
               " (FD " << fd << ") exited");

        if (hlp->n_active < hlp->n_to_start / 2) {
            debugs(80, 0, "Too few " << hlp->id_name << " processes are running");

            if (hlp->last_restart > squid_curtime - 30)
                fatalf("The %s helpers are crashing too rapidly, need help!\n",
                       hlp->id_name);

            debugs(80, 0, "Starting new helpers");

            // AYJ: attempts to start N helpers when 50% of N have died
            // (0-50% + 100% = 100-150% of N)
            helperOpenServers(hlp);
        }
    }

This is triggered on helper death, OR when a helper is needed and none is
available for use (one overly long in-use helper is kicked into death).

>
>> Would it be reasonable to do this for a clean bounded rollover:
>> send a flag to shutdown indicating a shutdown schedule of M helpers,
>> then a restart back to N, then another M helpers to shut down. This
>> would all be scheduled from the shutdown loop, to keep track of which
>> helpers are in which set. Without the flag, shutdown skips the restart
>> attempts and does the current behavior.
>> So the rollover happens async and cleanly, but we never have excess
>> helpers running?
>
> No. That makes reconfigure _more_ of a critical section. reconfigure
> needs to create a new config, activate for all new requests, and then
> let refcounting take care of letting the old config (and all the things
> it used) go away).

So we do need to track how many helpers are shutting down and ignore them
when starting new ones.
Okay, I'll work up a patch that does it that way and see if you agree with
it.

Amos
Received on Thu Jul 16 2009 - 02:08:33 MDT

This archive was generated by hypermail 2.2.0 : Thu Jul 16 2009 - 12:00:05 MDT