Re: apache style squid?

From: Stewart Forster <slf@dont-contact.us>
Date: Wed, 08 Oct 1997 12:05:46 +1000


        Right, this posting includes a defence of threads without locking, but
more importantly, at the end there is an unbiased summary of the views expressed
in posts so far, presented in a "by-each-model" format, to help choose what
we're really aiming to do.

>
> > > > Has anyone given any thought to building an apache style squid?
> > >
> > > Yep. Lots of thought. I'd like to see it done as threads though.
> Ideally, imho, there should run multiple processes with shared index,
> each serving multiple concurrent sessions. If configured to use either
> zillions-single-thread-processes it might work well on OS'es that have
> no thread support or better perprocess memory usage, while
> other OS'es that love threads more than processes can run as
> zillions-threads-in-few-processes.

        Yes, but that still needs to have locking on data structures. The
University of Melbourne experimented with threading squid and mutex()'ing
the index updates, and found that the locking overheads meant there was only
a minor speedup on a multi-CPU machine. This is my reasoning for wanting
to avoid any form of locking.

> > The biggest issue is portability.
> It may be, yes. But to me, on Solaris it is really right to go threads.

        Solaris's kernel threads are good, but they are also slightly broken
(though not so badly that you can't work around it). It shouldn't be too much
effort, however, to provide portability across a wide range of OS's through
#defines, calling the appropriate thread call based on the thread package
being used. I'm proposing ONLY using thread_create() to create a pool of
threads. mutex() locks can then be used to control scheduling of those
threads when their context areas have been updated.

> IPC? Shared memory locks, all add process switches. Although this
> might be small overhead.

        They do. For a LARGE and heavily used cache, what looks small to
the average user can KILL heavy resource sharing. Believe me, I've seen it.

> > > Using threads means no mmap() and no fork()'ing. Very efficient.
> >
> > I guess I'm not convinced. Kernel threads (at least in sunos and
> > linux) are very close to the process weight anyway. I suspect this may
> > be a religious issue. :)
>
> Well, depends. thread switch of kernel threads is comparable, but not
> thread creation/exit. threads are much faster here.

        We've found threads to be quite lightweight. It depends on the type
of thread you're creating, and on whether you're allocating out of a bound
thread pool. It's not that simple.

> > > We want to have NO locking going on at all. Just a master thread
> > > firing off completely independent sub-threads as fast as it can. Firing
> > > off threads would be about 10% of total work to do, so I'm guessing that
> > > this sort of thing could scale to 10 CPU's happily.
>
> > Note that you will need a little bit of locking happening with
> > regard to updating the cache index, but because it's purely a memory
> > structure I'd be _really_ surprised if you saw a lock busy more
> > than .1% of the time. (you'd normally only hold the lock for a
> > few instructions, and you'd normally just have woken up, so you're
> > unlikely to get switched out due to end of timeslice).
>
> Not only a little, you'd need to lock every shared structure that can
> change, for both writing AND reading. The assumption that a thread
> (process) switch will NOT occur while we read a multi-item struct is far
> from safe, and unless this struct is 100% read-only, the possibility of
> data corruption is present. Consider hardware interrupts.

        I propose that a single controlling thread does all the write updates.
That way, it will know what it shouldn't write to if there are readers
involved. No locking is therefore required. Depending on how efficient you
can make this parent thread, this is the best option. If the parent
thread is too heavy-weight, then we lose. If it's fast, we win. All it needs
to do is ICP replies, accept(), index updates and thread context issuing. That
can be quite fast.

> Lock misses are cheap, blocks on locks may not be. Lock bottlenecks
> become an issue, thus the more locks, the more concurrency and
> efficiency.

        I suppose if you had one lock per bucket of the hash table you could
avoid conflicts almost entirely (assuming the hash algorithm's good). This
is another way to go, and I admit it may be a good one.

> > > Still applies. In fact with our threaded code we've seen a 10x
> > > speedup due to better disk utilisation/non-blocking on disk reads. We don't
> > > have full threads yet, just threaded read/write/open/close/unlink.
> >
> > Is that using the asyncio stuff, or firing off kernel threads? If the
> > latter, I really wouldn't mind a copy of that code... :)
> me too ;)

        In the squid 1.2 alpha stuff. I've done more since, but we work from a
1.1.1 base here. I've ripped apart quite substantial chunks of squid to get it
to work, by adding callbacks to each function that calls
read/write/open/close/unlink.
If you're running Solaris I can give you a copy of the binary, but it treats
the cache swap index file as a transaction log, and relies on the directory
cleanup procedure (which is also a sub-thread) and squid's ability to treat
missing swap files as a MISS to get around any filesystem errors. There is
simply no need to stat each and every file ever. Squid has a nice robust
model to work from that doesn't really care if the disk is stuffed. I didn't
see the need to make so much effort when it isn't required.

> user-threads are fast - they allow you to write a straightforward
> from-start-to-end request service model. kernel threads are like
> processes - independently scheduled, "fire-and-forget" type.

        User threads will be difficult to handle since they can still block
the whole process on a system call.

> > > > Possibility of instant startup. i.e. mmap() the index, and
> > > > start using it immediately.
>
> Slow. The first miss wants to sweep the whole mmaped area, so you see a
> spike of pageins, at which time everything else in squid is stalled.

        mmap() can be dangerous for just these reasons. We do want to avoid
excessive pageins which can stall squid.

> > Yes and no. You do get advantages from true private resources. I'll
> > grant that threads do avoid some of the complexities, but do it by
> > trading off on the safety issues.
>
> What do you mean by safety tradeoffs?

        I think he means a thread going rampant and scribbling over another
thread's space. This is not really an issue since anything with that much
wrong with it is going to be unstable, threaded or not.

> You can skip readlocking ONLY if your changes are atomic and
> writelocked. If your changes are not atomic, you must lock for reads
> also.

        I proposed that all changes happen via the parent, which is then also
aware of any readers involved. I still believe that the parent thread can
basically give an unconnected mem_object to a sub-thread to fill out, and when
the sub-thread signals its completion, it "hands over" the mem_object to the
parent thread, which can then create a store_object from it and update the
index.

        Like I said, EXCLUSIVE rights to an object for a sub-thread. Any
readers piggy-backing on can just poll for updates, or select on the incoming
socket and get woken up when more data MAY be there. They can just read the
"size of mem_obj" so far. It's an atomic write by the writer. If they get in
before the writer, they'll catch up next time round and write two blocks
instead of one back to the requester. As long as the writer organises the
order of its operations, this is perfectly safe. The readers (which will have
been created by the parent) will finish off their transmission when the socket
closes, and the parent thread will be aware that they are still reading and
won't disturb their structures until the readers are done.

        This is what I mean by avoiding all explicit locking.

> Use many locks, avoid lock bottlenecks, and your are safe and fast.

        Assuming that there are many distinct locks to create.

>
> > > - A thread type to serve a user. These threads can either transfer
> > > an object that's currently in transit by READING the status of the
> > > object get thread, or can simply serve an object from disk.
> > >
> > > No locking would be required. Each thread that's piggy-backed from
> > > a GET thread can poll on the GET thread's incoming socket to wake themselves
> > > up when more data arrives. Alternatively you could do thread mutex stuff
> > > here, but I'd like to avoid that.
> Doesn't the kernel mutex_lock per FD? There is a finite possibility that
> the client thread awakes before the GET thread has got actual data. Better
> to handle this yourself.
>
> > The real question is: Can you do without read locking? I suspect the
> > answer here is yes, and that's a much more important thing to get
> > right.
>
> I believe not.

        I believe so. Read above. Write locking is implicit as it is
controlled by the parent.

> > > > Comments? Anyone else interested in coding this up?
>
> I am. Definitely.

        We should talk more about this. A process model can work. A thread
model can work. A process model without locking is almost impossible. A
thread model without locking is possible, and with locking a certainty.

        What we are doing here is isolating the pro's and con's of each
method.

        Process model -
                Heavy-weight.
                Can kill machines under large loads.
                Expensive context switching. Worse when we have to add locking.
                Need to carefully manage mmap() shared memory.
                May not work on all OS's that don't do optimistic swap alloc.
                Most portable mechanism.
                Shared memory may cause portability issues.

        Thread model (with locking)
                Light weight.
                Still have FD resource limitations.
                Possible issue with locking killing benefits.
                Portability problems.
                Maps easily to a process model.
                Doesn't require shared memory.

        Thread Model (without locking)
                Light weight.
                Still have FD resource limitations.
                Removes issue of locking resource contention killing SMP.
                Adds issue of needing a fast parent controlling thread. May
                impact on reachable levels of SMP.
                Maps less easily to a process model.
                Doesn't require shared memory.

        Mixed process/thread model
                Medium weight.
                Can tune to be all process or all thread. Helps with
                portability of threads portion. Can code each process with
                the ability to cope with only a single thread. #define in
                ability to use threads.
                Allows user to tune thread/process ratio.
                Requires shared memory.
                Must use locking.
                Accept (on Solaris) is safe across multiple processes (ie, can
                have multiple accept()'ors on a single shared network socket).
                More complicated coding. Need to code for both models.
                Some portability issues, but no more than the straight process
                model at the lowest level (1 thread per process).

        So, now that we have nutted out the issues and possibilities, which
model do we aim for? Once we get our model and specification right, then
comes the sticky issue of who wants to take which portion of the coding. Also
there's co-ordination with Duane. Sorry Duane, I know you haven't even
released 1.2, and already we're talking about implementing squid 2.0.

        Cheers,

                Stew.

Received on Tue Jul 29 2003 - 13:15:43 MDT

This archive was generated by hypermail pre-2.1.9 : Tue Dec 09 2003 - 16:11:26 MST