Re: apache style squid?

From: Andres Kroonmaa <andre@dont-contact.us>
Date: Tue, 7 Oct 1997 15:28:59 +0200 (EETDST)

> > > Has anyone given any thought to building an apache style squid?
> >
> > Yep. Lots of thought. I'd like to see it done as threads though.
 
 Both have their pros and gotchas. Basically I believe neither of these
 models is good by itself. With the plain apache style you have a huge
 number of processes, each servicing just a single request: a waste of
 resources. With threads-only you have the same per-process limits you
 wish to break out of. Both models introduce locking and IPC troubles.
 IMHO, once you solve those problems for one, you are ready to use
 either. Or both.

 Ideally, IMHO, there should be multiple processes with a shared index,
 each serving multiple concurrent sessions. Configured as
 zillions-of-single-threaded-processes it could work well on OSes that
 have no thread support or better per-process memory usage, while
 OSes that love threads more than processes could run it as
 zillions-of-threads-in-few-processes.
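
 A minimal sketch of that hybrid, assuming plain fork() and POSIX
 threads; NUM_PROCS, NUM_THREADS and serve_sessions() are made-up
 names, not anything squid has today:

    #include <pthread.h>
    #include <unistd.h>

    #define NUM_PROCS   4    /* turn up on process-happy OSes */
    #define NUM_THREADS 64   /* turn up on thread-happy OSes */

    /* hypothetical per-thread loop: accept() on a shared listen
     * socket and service each request start-to-finish */
    static void *
    serve_sessions(void *arg)
    {
        return arg;
    }

    int
    main(void)
    {
        pthread_t tid[NUM_THREADS];
        int i;

        for (i = 1; i < NUM_PROCS; i++)
            if (fork() == 0)
                break;              /* each child runs a pool too */
        for (i = 0; i < NUM_THREADS; i++)
            pthread_create(&tid[i], NULL, serve_sessions, NULL);
        for (i = 0; i < NUM_THREADS; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }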
 
> I thought about threads, and indeed I did originally implement some of
> this using the LinuxThreads package (with kernel threads).
>
> The biggest issue is portability.

   It may be, yes. But for me, on Solaris, threads are clearly the right
 way to go.
 
> > forking is VERY bad. LOTS of swap used. Lots of context to switch.
> > Bad news all over.
>
> This is the part about requiring intelligent memory
> management. (i.e. if you're a 256Meg process, and 250meg of that is
> mmap()'ed files, it doesn't make sense to allocate backing
> store... etc etc. )
>
> The context switching shouldn't be too bad given that the vast majority
> of the time, the context switch should take place when a process is
> sleeping, so it's already switched context into the kernel anyway.

 IPC? Shared-memory locks? These all add process switches, although
 the overhead might be small.
  
> > Using threads means no mmap() and no fork()'ing. Very efficient.
>
> I guess I'm not convinced. Kernel threads (at least in sunos and
> linux) are very close to the process weight anyway. I suspect this may
> be a religious issue. :)

 Well, it depends. Context switch times for kernel threads are
 comparable, but thread creation/exit is not: threads are much faster
 there.
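
 A rough way to see the difference, assuming POSIX APIs; this only
 shows how to measure, the numbers vary wildly per OS:

    #include <pthread.h>
    #include <stdio.h>
    #include <sys/time.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define N 1000

    static void *noop(void *arg) { return arg; }

    static double
    now(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1e6;
    }

    int
    main(void)
    {
        pthread_t tid;
        double t0;
        int i;

        t0 = now();
        for (i = 0; i < N; i++) {
            pthread_create(&tid, NULL, noop, NULL);
            pthread_join(tid, NULL);
        }
        printf("%d thread create/join: %.3fs\n", N, now() - t0);

        t0 = now();
        for (i = 0; i < N; i++) {
            if (fork() == 0)
                _exit(0);           /* child dies immediately */
            wait(NULL);
        }
        printf("%d fork/wait:         %.3fs\n", N, now() - t0);
        return 0;
    }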
 
> One of the big advantages of fork()/kernel threads is that you can
> afford to sleep on things like accept() and recvfrom(). i.e. you don't
> need to use select() or poll() to keep checking if things are ready
> yet.
  
 A very nice trick for client reads when servicing hits is to mmap
 the disk file into memory and then issue a single full-object-sized
 write to the client from that memory in one go. No kernel-to-userspace
 buffer copies; the kernel does the file I/O and the socket writes.
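
 A minimal sketch of that hit path; serve_hit() is a made-up name,
 and a blocking client socket is assumed (real code must also handle
 short writes):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* serve a cache hit straight from the mapped object file */
    static int
    serve_hit(int client_fd, const char *object_path)
    {
        struct stat st;
        void *obj;
        int fd = open(object_path, O_RDONLY);

        if (fd < 0)
            return -1;
        if (fstat(fd, &st) < 0) {
            close(fd);
            return -1;
        }
        obj = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        close(fd);                  /* the mapping survives close() */
        if (obj == MAP_FAILED)
            return -1;
        /* the whole object in one write, no user-space buffer */
        write(client_fd, obj, st.st_size);
        munmap(obj, st.st_size);
        return 0;
    }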
 
> > When an accept comes in, the appropriate context is created for a
> > sub-thread to EXCLUSIVELY control that connection without any subsequent
> > interaction to locking until that thread has completed.
>
> Nod. Noting that thread creation is expensive, you'd normally wake a
> thread from a pool of idle threads.
 
 Agreed. Even though thread creation overhead is small, there's no
 point in creating and killing threads per request.
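
 A sketch of such a pool, assuming pthreads; a one-deep handoff slot
 keeps it short, a real pool would queue FDs:

    #include <pthread.h>
    #include <unistd.h>

    static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  qwait = PTHREAD_COND_INITIALIZER;
    static int pending_fd = -1;     /* one-deep handoff slot */

    /* stand-in for real request servicing */
    static void
    service_request(int fd)
    {
        close(fd);
    }

    /* each pre-created worker sleeps here between requests */
    static void *
    worker(void *arg)
    {
        (void) arg;
        for (;;) {
            int fd;
            pthread_mutex_lock(&qlock);
            while (pending_fd < 0)
                pthread_cond_wait(&qwait, &qlock);
            fd = pending_fd;
            pending_fd = -1;
            pthread_mutex_unlock(&qlock);
            service_request(fd);
        }
        return NULL;
    }

    /* master thread hands a freshly accept()ed fd to the pool */
    static void
    dispatch(int fd)
    {
        pthread_mutex_lock(&qlock);
        pending_fd = fd;
        pthread_cond_signal(&qwait);
        pthread_mutex_unlock(&qlock);
    }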
  
> > We want to have NO locking going on at all. Just a master thread
> > firing off completely independent sub-threads as fast as it can. Firing
> > off threads would be about 10% of total work to do, so I'm guessing that
> > this sort of thing could scale to 10 CPU's happily.

> Note that you will need a little bit of locking happening with
> regard to updating the cache index, but because it's purely a memory
> structure I'd be _really_ surprised if you saw a lock busy
> more than .1% of the time. (you'd normally only hold the lock for a
> few instructions, and you'd normally just have woken up, so you're
> unlikely to get switched out due to end of timeslice).
 
 Not only a little: you'd need to lock every shared structure that can
 change, for both writing AND reading. The assumption that a thread
 (or process) switch will NOT occur while we read a multi-item struct
 is far from safe, and unless the struct is 100% read-only, the
 possibility of data corruption is there. Consider hardware interrupts.
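
 For instance, with rwlocks (a sketch assuming the POSIX flavour;
 Solaris UI threads have an equivalent): even readers take the lock,
 because a multi-word entry can be caught mid-update otherwise:

    #include <pthread.h>
    #include <sys/types.h>

    /* made-up index entry: several words, so an unlocked reader
     * could see it half-updated */
    struct index_entry {
        off_t offset;
        size_t size;
        int flags;
    };

    static pthread_rwlock_t index_lock = PTHREAD_RWLOCK_INITIALIZER;
    static struct index_entry entry;

    static void
    read_entry(struct index_entry *out)
    {
        pthread_rwlock_rdlock(&index_lock);   /* readers lock too */
        *out = entry;                         /* multi-word copy */
        pthread_rwlock_unlock(&index_lock);
    }

    static void
    update_entry(const struct index_entry *in)
    {
        pthread_rwlock_wrlock(&index_lock);
        entry = *in;
        pthread_rwlock_unlock(&index_lock);
    }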
 
> I really don't see the locking as a very serious issue.
 
 Lock misses are cheap; blocking on a lock may not be. Lock
 bottlenecks become the issue, thus the more (finer-grained) locks,
 the more concurrency and efficiency.
  
> > Still applies. In fact with our threaded code we've seen a 10x
> > speedup due to better disk utilisation/non-blocking on disk reads. We don't
> > have full threads yet, just threaded read/write/open/close/unlink.
>
> Is that using the asyncio stuff, or firing off kernel threads? If the
> latter, I really wouldn't mind a copy of that code... :)
   me too ;)
 
> > > Some of the above could be summarised by "the kernel scheduler
> > > automagically handles request load balancing for you".
> >
> > Threads scheduling is WAY more efficient.
>
> Hmm. This depends a bit I think. Since many thread schedulers do FIFO,
> it is fairly simple, and yes, you'll win a bit. If you mean context
> switching rather than scheduling, I seem to recall the solaris kernel
> thread switching times as being the same or higher than the process
> context switch times. (whip out lmbench I guess and measure it).

 User threads are fast and let you write a straightforward
 start-to-finish request service model. Kernel threads are like
 processes: independently scheduled, "fire-and-forget" style.
  
> > Forking will EAT memory by the bucket loads. Each sub-process needs
> its own register set, stack, volatile memory set, etc, etc.
>
> The base usage is about 24K per process on my box. So 200 * 24 ==
> 5megs of ram == 1 or 2%.

 On Solaris you never know what the actual RAM usage is, but this eats
 swap. Bad news for the Solaris guys. And let's not get into OS wars,
 please...
  
> > > Possibility of instant startup. i.e. mmap() the index, and
> > > start using it immediately.

 Slow. The first miss wants to sweep the whole mmap()ed area, so you
 see a spike of page-ins, during which everything else in squid is
 stalled.
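
 If one went this way anyway, a possible way to soften the spike (my
 assumption, not anything squid does) is to hint the VM system about
 the sweep up front:

    #include <stddef.h>
    #include <sys/mman.h>

    /* map the on-disk index and tell the VM system the first
     * sweep will be sequential, so it can read ahead and drop
     * pages behind instead of paging everything in at once */
    static void *
    map_index(int idx_fd, size_t idx_size)
    {
        void *idx = mmap(NULL, idx_size, PROT_READ | PROT_WRITE,
                         MAP_SHARED, idx_fd, 0);
        if (idx != MAP_FAILED)
            madvise(idx, idx_size, MADV_SEQUENTIAL);
        return idx;
    }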

> > > Disadvantages:
> > > Slightly more complicated memory management. Need to make sure
> > > shared memory is allocated out of the mmap() pool.
> >
> > Threads avoid this completely. All memory is automatically shared
> > if you so wish.
>
> Yes and no. You do get advantages from true private resources. I'll
> grant that threads do avoid some of the complexities, but do it by
> trading off on the safety issues.
 
   What do you mean by safety tradeoffs?
  
> > > Locking. In theory, this should be minimal. Only operations
> > > that change the index should need locking. There's
> > > also a little issue here with passing control if the
> > > lock you want is locked by someone else. Linux has
> > > sched_yield(), and I guess

 You can skip read-locking ONLY if your changes are atomic and
 write-locked. If your changes are not atomic, you must lock for reads
 also.
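
 The atomic case can look like this (a sketch; it assumes aligned
 pointer stores are atomic, which the usual SMP targets give you):
 build the new entry aside and publish it with a single pointer store,
 so readers never see a half-written struct:

    #include <pthread.h>
    #include <stdlib.h>

    struct entry {
        long hits;
        long size;
    };

    /* volatile so the compiler doesn't cache or reorder the load */
    static struct entry *volatile current_entry;
    static pthread_mutex_t publish_lock = PTHREAD_MUTEX_INITIALIZER;

    /* readers: one atomic pointer load, no lock at all */
    static long
    read_hits(void)
    {
        struct entry *e = current_entry;
        return e ? e->hits : 0;
    }

    /* writers still serialize against each other */
    static void
    publish(long hits, long size)
    {
        struct entry *e = malloc(sizeof(*e));
        if (!e)
            return;
        e->hits = hits;
        e->size = size;
        pthread_mutex_lock(&publish_lock);
        current_entry = e;          /* the single atomic change */
        pthread_mutex_unlock(&publish_lock);
        /* the old entry is leaked here; real code needs safe
         * reclamation before it can free it */
    }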

> > I'd like to avoid all locking, and I believe that's 100% possible.
> > Locking involves waiting on semaphores which can mean lots of time wasted
> > waiting to be rescheduled to a two-bit operation.

  Use many locks, avoid lock bottlenecks, and you are safe and fast.
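
 For example, one lock per hash bucket instead of one big index lock
 (a sketch; the bucket count is made up):

    #include <pthread.h>

    #define NBUCKETS 256

    static pthread_mutex_t bucket_lock[NBUCKETS];

    static void
    bucket_locks_init(void)
    {
        int i;
        for (i = 0; i < NBUCKETS; i++)
            pthread_mutex_init(&bucket_lock[i], NULL);
    }

    /* run op on one bucket; threads working on different
     * buckets never contend with each other */
    static void
    with_bucket(unsigned hash, void (*op)(unsigned bucket))
    {
        unsigned b = hash % NBUCKETS;
        pthread_mutex_lock(&bucket_lock[b]);
        op(b);
        pthread_mutex_unlock(&bucket_lock[b]);
    }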

> Well, your normal lock should be:
>
> while (test_and_set(&var)) yield();  /* spin until we own the lock */
> ++counter;
> reset(&var);
>
> Now that code segment shouldn't be more than a few instructions long,
> and most of the time you'll have just woken from sleep'ing, so the
> yield() should almost _never_ be called. Not a big issue AFAIK.

 yield() is not a good solution; mutex locks are meant for this.
 yield() in many circumstances returns immediately, turning the
 example above into a while-do-nothing busy loop.
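
 The same counter with a real mutex, as a minimal pthreads sketch; a
 contended locker sleeps instead of spinning through yield():

    #include <pthread.h>

    static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
    static long counter;

    static void
    bump_counter(void)
    {
        /* a contended caller is put to sleep and woken when the
         * holder unlocks; no spinning, no yield() */
        pthread_mutex_lock(&counter_lock);
        ++counter;
        pthread_mutex_unlock(&counter_lock);
    }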
  
> > - A thread type to serve a user. These threads can either transfer
> > an object that's currently in transit by READING the status of the
> > object get thread, or can simply serve an object from disk.
> >
> > No locking would be required. Each thread that's piggy-backed from
> > a GET thread can poll on the GET thread's incoming socket to wake themselves
> > up when more data arrives. Alternatively you could do thread mutex stuff
> > here, but I'd like to avoid that.
 Doesn't the kernel mutex-lock per FD? There is a finite possibility
 that the client thread wakes up before the GET thread has got the
 actual data. Better to handle this yourself.
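
 Handling it yourself could look like this (a sketch with made-up
 names): the GET thread signals a condition variable only after bytes
 have actually landed, and client threads re-check before sending, so
 an early or spurious wakeup just puts them back to sleep:

    #include <pthread.h>
    #include <sys/types.h>

    static pthread_mutex_t obj_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  obj_more = PTHREAD_COND_INITIALIZER;
    static size_t bytes_ready;

    /* GET thread, after each network read of n bytes */
    static void
    got_bytes(size_t n)
    {
        pthread_mutex_lock(&obj_lock);
        bytes_ready += n;
        pthread_cond_broadcast(&obj_more);
        pthread_mutex_unlock(&obj_lock);
    }

    /* client thread, waiting for data past offset 'sent' */
    static size_t
    wait_for_more(size_t sent)
    {
        size_t avail;
        pthread_mutex_lock(&obj_lock);
        while (bytes_ready <= sent)        /* re-check predicate */
            pthread_cond_wait(&obj_more, &obj_lock);
        avail = bytes_ready;
        pthread_mutex_unlock(&obj_lock);
        return avail;
    }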

> The real question is: Can you do without read locking? I suspect the
> answer here is yes, and that's a much more important thing to get
> right.

   I believe not.
 
> I'm not 100% sold on fork()'ing per se, it's just more portable which
> should increase the code reliability/usability in the longer term.

 Apache-style forking is a no-go. A few processes, each using either a
 select or a thread model, is much better, IMHO.
  
> > > I had a quick start at hacking the code to do it, and was struck by
> > > how much code I was deleting ("remove these 4 functions, replace
> > > with 1 function that's a fraction of the size.. ").
> > >
> > > Comments? Anyone else interested in coding this up?

  I am. Definitely.

 ----------------------------------------------------------------------
  Andres Kroonmaa                         mail: andre@online.ee
  Network Manager
  Organization: MicroLink Online          Tel:  6308 909
  Tallinn, Sakala 19                      Pho:  +372 6308 909
  Estonia, EE0001  http://www.online.ee   Fax:  +372 6308 901
 ----------------------------------------------------------------------
