Re: [RFC] shutdown stages

From: Alex Rousskov <rousskov_at_measurement-factory.com>
Date: Mon, 08 Aug 2011 18:12:47 -0600

On 08/08/2011 04:49 PM, Amos Jeffries wrote:
> On Mon, 08 Aug 2011 11:08:17 -0600, Alex Rousskov wrote:
>> On 08/07/2011 05:22 AM, Amos Jeffries wrote:
>>
>>> At present shutdown happens in two stages: the server sockets and auth
>>> shut down immediately, and everything else 30 seconds or so later.
>>>
>>> We have a number of bugs open about this, ranging from complaints that
>>> an idle Squid still takes a full 30-second shutdown when it should
>>> fast-track the process and close quickly, through to crashes when helper
>>> shutdown or the swap.state dump is started too late and not given enough
>>> time to complete.
>>>
>>> I would like to split that process into several steps, with as much as
>>> possible shut down as early as is reasonable.
>>>
>>> stage 1:
>>> * TIMEOUT: shutdown_lifetime/2 converted to a timeout, to trigger
>>> stage-2 early if needed.
>>> * close ICP, HTCP, HTTP, HTTPS ports, but leave DNS open
>>> * prevent keep-alive, and close all connections on completion
>>> * disable the background cyclic events (i.e., garbage collection). We can
>>> trigger these "manually" as part of the shutdown if they are needed at
>>> certain points.
>>> * begin swap.state dump of index
>>> * call stage 2 immediately IF no active client connections AND stage<2.
>>>
>>> stage 2:
>>> * TIMEOUT: shutdown_lifetime/2 converted to a timeout, to trigger
>>> stage-3 early if needed.
>>> * disable new reads.
>>> * disconnect delay pools. push any buffered information out to clients
>>> at full speed and terminate. (bad idea? good idea?)
>>> * call stage 3 immediately IF no active client connections AND stage<3.
>>>
>>> stage 3:
>>> * terminate remaining incomplete active clients using their close
>>> handlers.
>>> * loop until async call queue is empty.
>>> * skip to stage N immediately.
>>>
>>> stage N:
>>> * terminate the event loop and exit.
>>>
>>>
>>> Attached is a proposed patch which alters shutting_down into an
>>> incremental stage counter and updates the documentation, but does not
>>> alter the fundamentals of the current behaviour.
>>>
>>> I believe we can easily add several shutdown functions to the
>>> SignalEngine. Or move these to the cache manager as a series of async
>>> calls, with the SignalEngine simply triggering the cache manager shutdown
>>> event.
>>
>> I do not know whether the proposed stages are right or wrong overall
>> because I do not see what specific principle was used to divide things
>> up this way and not some other way. I do think the concept of stages is
>> applicable, but I would define it differently:
>
> So far I was thinking we should start the operations which may take
> several AsyncCall cycles as early as possible, like closing idle helpers
> and connections. Those will take one, maybe two, AsyncCall loops plus a
> TCP RTT.

I do not like this because main.cc and individual operations/modules do
not really know whether they will be needed during the first "nice"
shutdown stage: A helper that is idle now may be needed to finish a
pending transaction a second later; we should not be shutting such a
helper down while there are still active master transactions.

I think we agree that there are at least two different problems shutdown
polishing should address:

1. Shutdown sometimes waits for Squid to do nothing. We should always
shut down ASAP instead.

2. Squid crashes during shutdown because things are shut in the wrong
order.

The first problem is not due to many main loop iterations. It is due to
waiting for N seconds before doing something even if all master
transactions are gone during the first shutdown second. Main loop
iterations themselves are very fast. Waiting for something takes time.
We can fix that if we focus on detecting the end of all master
transactions (which is not hard once the concept is introduced).

The second problem can be partially addressed by forcefully killing all
master transactions before we shut down the modules they use. After
that, the modules should be shut down as they become idle. We would
probably still have to hand-break a few ties and circular dependencies,
but hopefully not too many (and we should avoid the temptation to do
this by hand, IMO).

> Kicking as many as possible during the grace period minimises the
> TIME_WAIT and RAM usage quickly.
> Similar to what you have below, but with more focus on the
> resource-reduction growth curve.

I do not think it is important (or even a good idea) to free as much as
we can during the grace period. It is important to end the grace period
as soon as possible and not crash or lose functionality while in that
period. That is why I think it is best to let things run normally during
this stage, with the exception of accepting new master transactions.

>> Stage 1; "Time-limited no-harm winding down": Stop accepting/creating
>> new master transactions. All existing master transactions should start
>> shutting down if possible. No harm to any existing transactions at this
>> stage. The maximum duration of this stage is controlled by
>> shutdown_lifetime in squid.conf.
>>
>> Stage 2; "Controlled or high-level tear down": All existing master
>> transactions are forcefully terminated. All modules are asked to shut
>> down if possible. If some cache_dirs need to start writing their index
>> to disk, they will do this at this time. Event loop ends at the very end
>> of this stage, when there are no new events to process. The maximum
duration of this stage is unknown/unlimited because a module may take a
long time to clean up (e.g., write its index file to disk).
>>
>> Stage 3; "Uncontrolled or low-level tear down": Squid main() exits.
>> Static destruction/cleanup starts. The maximum duration of this stage is
>> unknown/unlimited but no slow activities are expected at this stage.
>>
>>
>> We could limit Stage 2+3 duration by forcefully killing our own process,
>> I guess. I do not know whether that is needed. It would require a
>> separate timeout value, IMO.
>
> Came to a similar conclusion, although I was hoping we could avoid both.
> The only thing which does not complete in a short period is the
> swap.state dump. IIRC you have that going in steps of N objects, so at
> the queue-drain stage it could still self-abort and log a failure
> encouraging a longer shutdown_lifetime.

Rock Store does not dump swap state, but yes, UFS does. I think it
should be killed externally if it is too slow (via a second timeout).

My understanding is that the existing shutdown sequence and the existing
timeout are meant for things to end nicely. We do not want to break
anything except for new master transactions. If we want to do something
harmful such as aborting a clean swap state dump, there should be a
separate option/timeout for that. Otherwise, an admin who is not in a
rush does not have a way to quit Squid without risking major long-term
problems (such as a lack of a clean swap state for UFS).

> I see no reason why we can't push the swap.state dump to start as soon
> as the accept queue is drained. No new requests could add things to the
> index. That can be left until the rockstore diskers are merged though.

Existing master transactions could still alter the index (e.g., a DELETE
transaction that is waiting for a slow ACL to proceed or a MISS
transaction that adds a new cached object). I agree that we _can_
exclude current transactions from being counted, but I think it would be
better to start with a simpler, safer scheme.

> The main push now is just to get the shutdown global counting stages
> instead of being an on/off toggle. The change does not affect any of the
> code which depends on the on/off behaviour. That will keep happening in
> early stage 1 as it is now, until updated.

I do not see the advantage of going from a boolean to an integer. The
stages I have described do not need an integer count at all, and any "if
(shutdown_stage > 7) ..." kind of code is not going to make things
clearer, IMO.

>> 2) The event loop (including async call queue) should run as long as
>> somebody is using them. Their too-early termination is one of the
>> biggest design flaws causing bugs now. I believe you have documented
>> that they should run longer, but I am not sure. I tried to be more
>> explicit about it in my stage descriptions.
>
>
> My stage 2+3 is your stage 2, due to those bug reports involving
> AsyncCalls which are queued but never run. We need some time after the
> force-close to drain the AsyncCall queue(s) before we can start the
> forced exit of event->loop() and main().

I do not think we should ever force-exit the main loop. If we decide
that a second "drop everything and quit" timeout is needed, that
last-resort timeout should simply kill the Squid process via low-level
tools such as signal() and/or _exit(), IMO.

The main loop is so fundamental to proper Squid operation that
force-quitting it is essentially equivalent to killing the Squid process.

> So the purpose of stage-3 being to wait for the queue to drain.

Yes, we agree that there should be a stage where we wait for Squid to
quit after all master transactions are gone or killed.

>> 3) I would not "disable new reads" as a separate step. At some point
>> (when the shutdown timeout expires), we have to kill all still-existing
>> transactions, if any. At that point, all I/O activity will cease. This
>> should be driven from transactions to Comm, and not by some global Comm
>> flags, IMO.
>>
>
> The read schedule can check if (shutdown > N) and return
> COMM_ERR_SHUTDOWN or somesuch. No new globals.
> No problem either, yay.

I agree that Comm could use the shutdown global to call back with
errors, but I do not see the point of this complication: If we are
explicitly killing all master transactions, there should be no need to
poke at them from behind like that. And if we are not killing them yet,
they should be allowed to run normally.

HTH,

Alex.
Received on Tue Aug 09 2011 - 00:13:05 MDT

This archive was generated by hypermail 2.2.0 : Tue Aug 09 2011 - 12:00:03 MDT