Re: apache style squid?

From: Andres Kroonmaa <andre@dont-contact.us>
Date: Wed, 8 Oct 1997 19:24:05 +0200 (EETDST)

> > > > Yep. Lots of thought. I'd like to see it done as threads though.
> > Ideally, IMHO, there should be multiple processes running with a shared index,
> > each serving multiple concurrent sessions. If configured as
> > zillions-of-single-thread-processes it might work well on OSes that have
> > no thread support or better per-process memory usage, while
> > other OSes that love threads more than processes can run as
> > zillions-of-threads-in-few-processes.
>
> Yes, but that still needs to have locking on data structures. The
> University of Melbourne experimented with threading squid and mutex()'ing
> the index updates and found that the locking overheads meant there was only
> a minor speedup on a multi-CPU machine. This is my reasoning for wanting
> to avoid any form of locking.

 Do you have experience with a multi-CPU threaded program that can live without locking?
 If you are sure that this works, that's good news; my experience recalls debugging
 nightmares of totally strange and random crashes...
 
> being used. I'm proposing ONLY using thread_create() to create a pool of
> threads. mutex() locks can then be used to control scheduling of those
> threads when their context areas have been updated.

 I think keeping a few thousand threads on standby will create a lot of wasteful
 overhead, as they are still considered by the scheduler even when idle.
 I'd propose creating threads as needed and letting them die if they are
 not reused for some time. I think cond_timedwait() is perfect for both waking
 up idle threads and timing them out, as in the sketch below.
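 Something like this is what I have in mind. This is only a minimal sketch using
 POSIX pthread_cond_timedwait() (the analogue of Solaris cond_timedwait());
 queue_lock, queue_nonempty, dequeue_job() and handle_job() are made-up names,
 not squid code:

    /* Sketch only: an on-demand worker that exits after idling too long. */
    #include <pthread.h>
    #include <errno.h>
    #include <time.h>

    #define IDLE_TIMEOUT_SEC 30

    struct job;                                  /* opaque unit of work */
    extern pthread_mutex_t queue_lock;
    extern pthread_cond_t  queue_nonempty;
    extern struct job *dequeue_job(void);        /* NULL when queue is empty */
    extern void handle_job(struct job *j);

    static void *worker(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&queue_lock);
        for (;;) {
            struct job *j = dequeue_job();
            if (j == NULL) {
                struct timespec deadline;
                deadline.tv_sec = time(NULL) + IDLE_TIMEOUT_SEC;
                deadline.tv_nsec = 0;
                /* sleep until work arrives or the idle timeout expires */
                if (pthread_cond_timedwait(&queue_nonempty, &queue_lock,
                                           &deadline) == ETIMEDOUT) {
                    pthread_mutex_unlock(&queue_lock);
                    return NULL;     /* idle too long: let the thread die */
                }
                continue;            /* woken up: re-check the queue */
            }
            pthread_mutex_unlock(&queue_lock);
            handle_job(j);           /* do the work without holding the lock */
            pthread_mutex_lock(&queue_lock);
        }
    }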
 
> > Well, depends. Thread switch of kernel threads is comparable, but not
> > thread creation/exit; threads are much faster here.
>
> We've found threads to be quite lightweight. It depends on the type
> of thread you're creating, and whether or not you're allocating out of a bound
> thread pool. It's not that simple.

 Right, the difference depends on the thread type. Here it came up as user-mode
 threads vs kernel threads, that is, unbound vs bound threads in Solaris terms.
 A user-mode thread switch is very fast, basically as fast as a library call, while
 a kernel-level thread switch costs about as much as a context switch, though it is
 still faster than a process switch, I think.
 The best thing about threads is that you can mix and match both user and kernel
 threads: use user threads where blocking system calls are not used, so we can have
 very many independent tasks running "in parallel" while each keeps a kind of
 classical program model. Where you need blocking calls, use kernel threads and/or
 processes and expect a higher cost per thread switch. A small sketch follows.
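 In Solaris terms the choice would look roughly like this. Illustration only:
 thr_create() and THR_BOUND come from <thread.h>, do_network_io() and do_cpu_work()
 are made-up names; the POSIX route would be pthread_attr_setscope().

    /* Sketch: requesting a bound (kernel) vs unbound (user) thread. */
    #include <thread.h>

    extern void *do_network_io(void *arg);   /* uses blocking syscalls */
    extern void *do_cpu_work(void *arg);     /* pure computation       */

    void spawn_examples(void)
    {
        thread_t tid;

        /* bound thread: gets its own LWP, so a blocking syscall only
         * blocks this thread, but each switch is a kernel context switch */
        thr_create(NULL, 0, do_network_io, NULL, THR_BOUND, &tid);

        /* unbound thread: multiplexed onto LWPs by the user-level library,
         * so a switch costs about as much as a library call */
        thr_create(NULL, 0, do_cpu_work, NULL, 0, &tid);
    }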
 
> > Not only a little, you'd need to lock every shared structure that can
> > change for both - writing AND reading. Assuming that a thread (process)
> > switch will NOT occur while we read a multi-item struct is unsafe,
> > and unless this struct is 100% read-only, the possibility of
> > data corruption is present. Consider hardware interrupts.
>
> I propose that a single controlling thread does all the write updates.
 I'm afraid a single control thread will be a bottleneck and quite limiting in other
 ways. Basically you serialize access algorithmically and thus lose concurrency.
 There should be other ways possible to minimize lock contention without
 losing concurrency.

> That way, it will know what it shouldn't write to if there are readers
> involved. No locking is therefore required. Depending on how efficient you
> can make this parent thread, this may be the best option. If the parent
> thread is too heavy-weight, then we lose. If it's fast, we win. All it needs
> to do is ICP replies, accept(), index updates and thread context issuing. That
> can be quite fast.

 Actually, Solaris has a special optimized lock for exactly that: reader/writer locks.
 All threads can read the locked data, but a thread that write-locks it blocks until
 all readers leave. A sketch follows.
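 Roughly like this. Sketch only, using the Solaris rwlock_t from <synch.h>;
 StoreEntry is squid's index entry type, but store_index, hash_lookup() and
 hash_insert() are stand-ins here, not the real squid functions:

    #include <synch.h>

    typedef struct _StoreEntry StoreEntry;        /* squid's index entry */
    extern void *store_index;                     /* stand-in for the hash table */
    extern StoreEntry *hash_lookup(void *tbl, const char *key);
    extern void hash_insert(void *tbl, StoreEntry *e);

    /* initialised once: rwlock_init(&index_rwlock, USYNC_THREAD, NULL) */
    static rwlock_t index_rwlock;

    StoreEntry *index_lookup(const char *key)
    {
        StoreEntry *e;
        rw_rdlock(&index_rwlock);        /* any number of readers at once */
        e = hash_lookup(store_index, key);
        rw_unlock(&index_rwlock);
        return e;
    }

    void index_insert(StoreEntry *e)
    {
        rw_wrlock(&index_rwlock);        /* blocks until all readers leave */
        hash_insert(store_index, e);
        rw_unlock(&index_rwlock);
    }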
 
> > Lock misses are cheap, blocks on locks may not be. Lock bottlenecks
> > become an issue, thus the more locks, the more concurrency and
> > efficiency.
>
> I suppose if you had one lock per bucket of the hash table you could
> avoid conflicts almost entirely (assuming the hash algorithm's good). This
> is another way to go and I admit may be a good way to go.

   Exactly that is what I have in mind, as in the sketch below.
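 Something along these lines. This is a made-up layout, not squid's actual hash
 code; hash_fn() and the struct names are assumptions:

    #include <pthread.h>
    #include <string.h>

    #define HASH_BUCKETS 7951                /* pick a prime; this one is just an example */

    struct hash_link {
        char *key;
        struct hash_link *next;
        /* ... entry data ... */
    };

    struct bucket {
        pthread_mutex_t lock;                /* one lock per bucket */
        struct hash_link *chain;
    };

    static struct bucket table[HASH_BUCKETS];

    extern unsigned int hash_fn(const char *key);   /* made-up hash function */

    struct hash_link *locked_lookup(const char *key)
    {
        struct bucket *b = &table[hash_fn(key) % HASH_BUCKETS];
        struct hash_link *h;
        /* two threads only contend when they hash to the same bucket */
        pthread_mutex_lock(&b->lock);
        for (h = b->chain; h != NULL; h = h->next)
            if (strcmp(h->key, key) == 0)
                break;
        pthread_mutex_unlock(&b->lock);
        return h;
    }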
 
> > > > > Possibility of instant startup. i.e. mmap() the index, and
> > > > > start using it immediately.
> >
> > Slow. The first miss wants to sweep the whole mmap()ed area, so you see a spike
> > of page-ins, during which everything else in squid is stalled.
>
> mmap() can be dangerous for just these reasons. We do want to avoid
> excessive pageins which can stall squid.

 BTW, kernel threads would love such a scenario: while a page-in occurs, other kernel
 threads run, and if many of them page-fault, the kernel can optimise disk access a lot.
 What I actually meant with my statement about this being slow was squid's
 select model. Of course, for both the threaded and forked models it doesn't matter,
 as has been pointed out already.

> > You can skip readlocking ONLY if your changes are atomic and
> > writelocked. If your changes are not atomic, you must lock for reads
> > also.
>
> I proposed that all changes happen by the parent which is also then
> aware of any readers involved. I still believe that the parent thread can
> basically give an unconnected mem_object to a sub thread to fill out when its
> work is done, and when the sub-thread signals its completion, it "hands-over"
> the mem_object to the parent thread which can then create a store_object from
> it and update the index.

 How would you implement the "hand-over" between threads, especially without locks?
 I'm afraid it makes things too complex. What we need is to be sure that when one
 thread looks at data, no other thread modifies it, and this is simplest with locks.
 A single global lock will be a bottleneck, but with careful thought we can minimise
 lock collisions a lot while still preserving simplicity and concurrency. A sketch
 of a lock-based hand-over follows.
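 Done the simple way, the hand-over is just a mutex-protected completion list plus a
 condition variable. Sketch only: mem_object stands for squid's MemObject, and the
 rest of the names are made up.

    #include <pthread.h>

    struct mem_object;                          /* filled in by the worker */

    struct done_node {
        struct mem_object *obj;
        struct done_node *next;
    };

    static pthread_mutex_t done_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  done_cond = PTHREAD_COND_INITIALIZER;
    static struct done_node *done_list = NULL;

    /* worker side: publish the finished object and wake the parent */
    void hand_over(struct done_node *n)
    {
        pthread_mutex_lock(&done_lock);
        n->next = done_list;
        done_list = n;
        pthread_cond_signal(&done_cond);
        pthread_mutex_unlock(&done_lock);
    }

    /* parent side: collect everything handed over so far */
    struct done_node *collect_done(void)
    {
        struct done_node *list;
        pthread_mutex_lock(&done_lock);
        while (done_list == NULL)
            pthread_cond_wait(&done_cond, &done_lock);
        list = done_list;
        done_list = NULL;
        pthread_mutex_unlock(&done_lock);
        return list;
    }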
 
> Like I said, EXCLUSIVE rights to an object for a sub thread. Any
> readers piggybacking on can just poll for updates or select on the incoming
> socket and get woken up when more MAY be there. They can just read the
> "size of mem_obj" so far. It's an atomic write by the writer. If they get in
   
 Don't be so sure. Even "count++" is not atomic and can trash a lot if not surrounded
 by mutexes. Of course, in your case it would not matter for the reader... but keep in
 mind that when 2 threads both do count++ at the same time without locks, the
 result may become count+1 instead of count+2. Although very unlikely, the probability
 is not zero and it will happen from time to time. An example of this could be an
 object lock count; it would be fatal to miss an increment there. See the sketch below.
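 To illustrate (refcount and obj_lock are just illustrative names):

    /* Why "count++" alone is unsafe: it compiles to load/add/store, and two
     * threads can interleave those steps so that one increment is lost. */
    #include <pthread.h>

    static int refcount;                         /* e.g. an object lock count */
    static pthread_mutex_t obj_lock = PTHREAD_MUTEX_INITIALIZER;

    void unsafe_ref(void)
    {
        refcount++;          /* threads A and B can both read the old value,
                                both add 1, both store: only +1 in total */
    }

    void safe_ref(void)
    {
        pthread_mutex_lock(&obj_lock);
        refcount++;          /* load/add/store now happen atomically with
                                respect to other threads taking obj_lock */
        pthread_mutex_unlock(&obj_lock);
    }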

> before the writer, they'll catch up next time round and write two blocks
> instead of one back to the requester. As long as the writer organises the
> order of its operations, this is perfectly safe. The readers (which will have
> been created by the parent) will finish off their transmission when the socket
> closes, and the parent thread will be aware that they are still reading and won't
> disturb their structures until the readers are done.
>
> This is what I mean by avoiding all explicit locking.

  yes, maybe.

> > > The real question is: Can you do without read locking? I suspect the
> > > answer here is yes, and that's a much more important thing to get
> > > right.
> >
> > I believe not.
>
> I believe so. Read above. Write locking is implicit as it is
> controlled by the parent.

 I'm basing my thoughts on some experiments with threads, where I stress-tested
 inter-thread communications with locks and hit very high context-switch rates, then
 tried to avoid read locks. After that I got some very bizarre headaches with threads
 crashing, even the thread library crashing and taking the whole system along. I tried
 to find algorithmic ways to solve this but always lost either speed or stability.
 Of course, I'm in no way a cool programmer, so I'm happy to learn more.

 
 I always liked to think of threaded programming from the point of view of Intel's
 ASCI Red (or a similar machine) - you know, that little biggie with 7264 PPros? -
 where each CPU is hungry and eager to crunch some code, and every time they hit a
 wall with a sign saying "one person at a time please" they must be pissed and think
 "gee, what a lame program today" ;)) This applies equally to excessive locking
 and to too much serial code, and thinking of zillions of CPUs often inspires a lot.
 I personally like mutexes and massively parallel code more than algorithmic
 serialisation, and would avoid mutex overhead by other optimizations.
 
> We should talk more about this. A process model can work. A thread
> model can work. A process model without locking is almost impossible. A
> thread model without locking is possible, and with locking a definite.
>
> What we are doing here is isolating the pro's and con's of each
> method.
>
> Process model -
> Heavy-weight.
> Can kill machines under large loads.
> Expensive context switching. Worse when we have to add locking.
> Need to carefully manage mmap() shared memory.
> May not work on all OS's that don't do optimistic swap alloc.
> Most portable mechanism.
> Shared memory may cause portability issues.
  Simple code - good news for contributors.
  No benefit: no speedup; per-process limits are merely traded for per-system process limits.
  IMHO a dead end.
 
> Thread model (with locking)
> Light weight.
> Still have FD resource limitations.
> Possible issue with locking killing benefits.
> Portability problems.
> Maps easily to a process model.
> Doesn't require shared memory.
   Also simple, although harder to understand and requiring more care.
   Per-process limits are still here.
   In theory the fastest possible.
 
> Thread Model (without locking)
> Light weight.
> Still have FD resource limitations.
> Removes issue of locking resource contention killing SMP.
> Adds issue of needing a fast parent controlling thread. May
> impact on reachable levels of SMP.
> Maps less easily to a process model.
> Doesn't require shared memory.
   More complex, more bug-prone.
   Same per-process limits.
 
> Mixed process/thread model
> Medium weight.
> Can tune to be all process or all thread. Helps with
> portability of threads portion. Can code each process with
> the ability to cope with only a single thread. #define in
> ability to use threads.
> Allows user to tune thread/process ratio.
> Requires shared memory.
> Must use locking.
> Accept (on Solaris) is safe across multiple processes (ie, can
> have multiple accept()'ors on a single shared network socket).
> More complicated coding. Need to code for both models.
> Some portability issues, but no more than the straight process
> model at the lowest level (1 thread per process).
   Most complex.
   Breaks the FD limits.
   Can find the best ratio of processes to threads.
   Possibly more robust (a single process can die without total impact).
   Allows using a process per disk/directory or per IP, which can be useful.
   Mixes the best of both: simple program flow, fast, and non-stop service can be arranged.
   In theory, with one process per cache dir, we can add/remove disks online or tolerate disk failures.
   Scales well on a Cray ;)
   My favourite. ;)
 
> So, now that we have nutted out what the issues and possibilities are,
> which model do we aim for? Once we get our model and specification right, then
> comes the sticky issue of who wants to take which portion of the coding.

 Uhh, the first step is the hardest: rewrite almost all of it, produce a buggy alpha,
 and then fix, fix, fix. The core skeleton design is also a hard part...

 ----------------------------------------------------------------------
  Andres Kroonmaa                      mail: andre@online.ee
  Network Manager
  Organization: MicroLink Online       Tel: +372 6308 909
  Tallinn, Sakala 19                   Fax: +372 6308 901
  Estonia, EE0001                      http://www.online.ee
 ----------------------------------------------------------------------
