Re: squid-smp: synchronization issue & solutions

From: Amos Jeffries <squid3_at_treenet.co.nz>
Date: Wed, 25 Nov 2009 15:18:13 +1300

On Tue, 24 Nov 2009 16:13:37 -0700, Alex Rousskov
<rousskov_at_measurement-factory.com> wrote:
> On 11/20/2009 10:59 PM, Robert Collins wrote:
>> On Tue, 2009-11-17 at 08:45 -0700, Alex Rousskov wrote:
>>>>> Q1. What are the major areas or units of asynchronous code
execution?
>>>>> Some of us may prefer large areas such as "http_port acceptor" or
>>>>> "cache" or "server side". Others may root for AsyncJob as the
largest
>>>>> asynchronous unit of execution. These two approaches and their
>>>>> implications differ a lot. There may be other designs worth
>>>>> considering.
>
>> I'd like to let people start writing (and perf testing!) patches. To
>> unblock people. I think the primary questions are:
>> - do we permit multiple approaches inside the same code base. E.g.
>> OpenMP in some bits, pthreads / windows threads elsewhere, and 'job
>> queues' or some such abstraction elsewhere ?
>> (I vote yes, but with caution: someone trying something we don't
>> already do should keep it on a branch and really measure it well until
>> its got plenty of buy in).
>
> I vote for multiple approaches at lower levels of the architecture and
> against multiple approaches at highest level of the architecture. My Q1
> was only about the highest levels, BTW.
>
> For example, I do not think it is a good idea to allow a combination of
> OpenMP, ACE, and something else as a top-level design. Understanding,
> supporting, and tuning such a mix would be a nightmare, IMO.
>
> On the other hand, using threads within some disk storage schemes while
> using processes for things like "cache" may make a lot of sense, and we
> already have examples of some of that working.
>

OpenMP has drawn an almost unanimously negative reaction from the people who know it.

>
> This is why I believe that the decision of processes versus threads *at
> the highest level* of the architecture is so important. Yes, we are,
> can, and will use threads at lower levels. There is no argument there.
> The question is whether we can also use threads to split Squid into
> several instances of "major areas" like client side(s), cache(s), and
> server side(s).
>
> See Henrik's email on why it is difficult to use threads at highest
> levels. I am not convinced yet, but I do see Henrik's point, and I
> consider the dangers he cites critical for the right Q1 answer.
>
>
>> - If we do *not* permit multiple approaches, then what approach do we
>> want for parallelisation. E.g. a number of long lived threads that take
>> on work, or many transient threads as particular bits of the code need
>> threads. I favour the former (long lived 'worker' threads).
>
> For highest-level models, I do not think that "one job per
> thread/process", "one call per thread/process", or any other "one little
> short-lived something per thread/process" is a good idea. I do believe
> we have to parallelize "major areas", and I think we should support
> multiple instances of some of those "areas" (e.g., multiple client
> sides). Each "major area" would be long-lived process/thread, of course.

Agreed, mostly.

As Rob points out, the idea is for one smallish pathway of the code to be
run N times, with different state data each time, by a single thread.

Sachin's initial AcceptFD thread proposal is perhaps the exemplar for this
type of thread: one thread does the comm layer, from accept() through to the
scheduling call that hands off to handlers outside comm, then goes back for
the next accept().

The only performance issue raised (by you) was that this particular case
might flood the slower main process if done first. Not all code can be
handled this way.

The overheads are simply those of moving the state data in and out of the
thread. IMO starting/stopping threads too often is a fairly bad idea. Most
events will end up being grouped together into types (perhaps categorized by
component, perhaps by client request, perhaps by pathway) with a small
thread dedicated to handling each type of call.

>
> Again for higher-level models, I am also skeptical that it is a good
> idea to just split Squid into N mostly non-cooperating nearly identical
> instances. It may be the right first step, but I would like to offer
> more than that in terms of overall performance and tunability.

The answer to that is: of all the SMP models we theorize, that one is the
only proven model so far.
Administrators are already doing it on quad+ core machines, with all the
instance management handled manually, and with a lot of performance success.

In last night's discussion on IRC we covered the issues outstanding in
making this automatic; all are resolvable except the cache index, which is
not easily shareable between instances.

>
> I hope the above explains why I consider Q1 critical for the meant
> "highest level" scope and why "we already use processes and threads" is
> certainly true but irrelevant within that scope.
>
>
> Thank you,
>
> Alex.

Thank you for clarifying that. I now think we are all more or less headed
in the same direction(s), with three models proposed for the overall
architecture.

In the order they were brought up... (NP: each TODO only applies if we work
towards that goal)

MODEL: * fully threaded, with some helper child processes
PROS:
  smaller memory resource footprint when running.

CONS:
  potentially larger CPU footprint from swapping data between threads.
  potential problems if threaded pathways are made too small relative to
the overheads.

TODO:
  continue polishing the code into distinct calls
  determine which code is thread-safe
  determine shared data and add appropriate locking
  make the above segments into threads
  add some way to pass events/calls to existing long-term threads:
  either ... a super-lock as described by Henrik,
  or ... a 2-queue alternative as described by Amos

MODEL: * process chunks, with sub-threads and sometimes helper child
processes
PROS:
  it's known to be very fast, though not amazingly so. (ref: postfix) (ref:
squid helpers)

CONS:
  current code uses a LOT of data sharing between components, particularly
of small 1-32 byte chunks of random data (config flags, stats, shared cache
data snippets).
  identifying distinct chunks is a big, time-consuming task.

TODO:
  identify the major process chunks and split them out from the main binary
  add efficient ways to pass data cleanly between processes (at capacity)
  copy relevant external shared data into the state objects, to pass along
with the request data
  plus all the same TODOs from the fully-threaded model, for the
sub-threads within each process

MODEL: * separate instances, with sub-threads and helper child processes
PROS:
  we can almost do the macro change today (sub-threads later).
  it can scale the base app speed up by a reasonable percentage (ref:
apache2)

CONS:
  duplication of data, particularly in the storage, is very wasteful of
resources.
  NP: apache evades this with effectively read-only disk data; all dynamics
are in the instance memory.

TODO:
  the -I option needs porting so the master can open the main ports and the
children can share the listening.
  finish the logging TCP module ideas (for reliable shared logging).
  some code to make the master process handle multiple children.
  some alterations to safely handle the shared config file settings
(cache_dir etc.).

MODEL: * status quo.
  Where we continue to work on all the above TODOs as time permits and
needs require; wait and see which model gets finished first.

PROS:
  the way forward is already well known.

CONS:
  it's not reaching multi-CPU usage fast enough.

The easiest way forward seems to be toward separate instances, with
finer-grained threading and/or process chunking done later, after deeper
analysis of the extra gains at each change.

This makes me think that we are not in fact proposing competing models,
but simply looking at different levels of the code. Each approach that has
come up may be best used at a particular level: upper (instances), middle
(processes, threads, jobs), and low (signals, events, cbdata, async calls).

It also seems to me that the top-level instances choice is the most easily
reversed if it's found to actually be a bad idea. The major supporting
change is in the parent main() code setting up the several child instances,
with possibilities there for configuring it on/off, or how many instances
to run.

Amos
Received on Wed Nov 25 2009 - 02:18:18 MST
