Re: apache style squid?

From: Michael O'Reilly <michael@dont-contact.us>
Date: 07 Oct 1997 16:36:52 +0800

Stewart Forster <slf@connect.com.au> writes:
> > Has anyone given any thought to building an apache style squid?
>
> Yep. Lots of thought. I'd like to see it done as threads though.

I thought about threads, and indeed I did originally implement some of
this using the LinuxThreads package (with kernel threads).

The biggest issue is portability.

> > i.e.
> > mmap() big chunk of memory for cache index.
> > fork() off a bunch of children.
>
> forking is VERY bad. LOTS of swap used. Lots of context to switch.
> Bad news all over.

This is the part that requires intelligent memory
management. (i.e. if you're a 256MB process and 250MB of that is
mmap()'ed files, it doesn't make sense to allocate swap backing
store for those pages... etc etc. )
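To illustrate (a sketch only; the function name is mine): a
file-backed MAP_SHARED mapping is paged from the file itself, so the
kernel needn't reserve swap for it the way it must for anonymous/heap
memory.

        /* Map a cache object file instead of read()ing it into the
         * heap; the pages are backed by the file, not by swap. */
        #include <sys/mman.h>
        #include <sys/stat.h>
        #include <fcntl.h>
        #include <unistd.h>

        void *map_object(const char *path, size_t *len)
        {
            struct stat st;
            void *p;
            int fd = open(path, O_RDONLY);

            if (fd < 0)
                return NULL;
            if (fstat(fd, &st) < 0) {
                close(fd);
                return NULL;
            }
            p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
            close(fd);          /* the mapping outlives the descriptor */
            *len = st.st_size;
            return p == MAP_FAILED ? NULL : p;
        }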

The context switching shouldn't be too bad, given that the vast
majority of the time the context switch takes place when a process
goes to sleep, so it's already switched into the kernel anyway.
 
> Using threads means no mmap() and no fork()'ing. Very efficient.

I guess I'm not convinced. Kernel threads (at least in SunOS and
Linux) are very close to process weight anyway. I suspect this may
be a religious issue. :)

> > One of the children does:
> > sleep on recvfrom(), and handle ICP requests.
>
> I'd have 1 thread doing the ICP like you say. But it should also be
> the thread responsible for controlling the cache index. It does the recvfrom
> and accept() via a poll() or select() (as is done now).

One of the big advantages of fork()/kernel threads is that you can
afford to sleep on things like accept() and recvfrom(), i.e. you
don't need to use select() or poll() to keep checking whether things
are ready yet.
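To make that concrete, here's roughly the accept loop I have in mind
(a sketch, not squid code; spawn_workers() and handle_request() are
hypothetical names). Every child blocks in accept() on the shared
listen socket, and the kernel hands each new connection to one of
them:

        #include <sys/socket.h>
        #include <unistd.h>

        extern void handle_request(int fd);   /* supplied elsewhere */

        /* Pre-fork nworkers children that all block in accept() on
         * the same listening socket; no select()/poll() loop needed. */
        static void spawn_workers(int listen_fd, int nworkers)
        {
            int i;
            for (i = 0; i < nworkers; ++i) {
                if (fork() == 0) {                  /* child */
                    for (;;) {
                        int fd = accept(listen_fd, NULL, NULL);
                        if (fd < 0)
                            continue;               /* e.g. EINTR */
                        handle_request(fd);  /* one request at a time */
                        close(fd);
                    }
                }
            }
        }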

> When an accept comes in, the appropriate context is created for a
> sub-thread to EXCLUSIVELY control that connection, without any
> subsequent recourse to locking until that thread has completed.

Nod. Noting that thread creation is expensive, you'd normally wake a
thread from a pool of idle threads.
 
> We want to have NO locking going on at all. Just a master thread
> firing off completely independent sub-threads as fast as it can. Firing
> off threads would be about 10% of total work to do, so I'm guessing that
> this sort of thing could scale to 10 CPU's happily.

Oh, that's an advantage of fork()'ing. You get to use those cute SMP
machines. :) [ do Solaris kernel threads use the SMP stuff? I guess
they must ]. I'd forgotten about that.

Note that you will need a little bit of locking happening with
regard to updating the cache index, but because it's purely a memory
structure I'd be _really_ surprised if you saw a lock busy more
than 0.1% of the time. (you'd normally only hold the lock for a
few instructions, and you'd normally just have woken up, so you're
unlikely to get switched out due to end of timeslice).

I really don't see the locking as a very serious issue.
 
> > At startup, you can also fork off a child to read in the cache index.
>
> The cache index needs to be a transaction log too. I've already
> modified our squid to do this. There IS no validation stage or FAST or SLOW
> startups. The index is just sucked in as fast as it can be. This
> should be done by the index/accept/ICP controlling master thread.

So you're doing lazy validation of the index?
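If so, the transaction log itself seems simple enough to sketch
(purely illustrative; this record layout is mine, not Stewart's
actual format): one record is appended per index change, and startup
is a straight sequential replay where the last record for a given
object wins.

        /* One record per index change, appended to the log. */
        enum log_op { LOG_ADD, LOG_DEL };

        struct log_rec {
            unsigned char  op;      /* LOG_ADD or LOG_DEL          */
            unsigned int   file_no; /* on-disk object identifier   */
            unsigned int   size;    /* object size in bytes        */
            unsigned short urllen;  /* length of URL that follows  */
            /* urllen bytes of URL follow this fixed header */
        };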

> > More efficient disk use. You can now queue multiple requests
> > into the disk subsystem (because you don't need to
> > wait until one completes). This means that if you have
> > RAID'ed disks, or multiple disks, you actually get
> > higher throughput out of them.
>
> Still applies. In fact with our threaded code we've seen a 10x
> speedup due to better disk utilisation/non-blocking on disk reads. We don't
> have full threads yet, just threaded read/write/open/close/unlink.

Is that using the asyncio stuff, or firing off kernel threads? If the
latter, I really wouldn't mind a copy of that code... :)

I've had a couple of goes at that myself, but ran out of time to debug
it.
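The shape I was attempting was roughly this (a sketch only, assuming
pthreads; the struct and names are mine): the main thread hands a
request to a disk thread, which does the blocking read and then fires
a completion callback.

        #include <pthread.h>
        #include <unistd.h>

        struct disk_req {
            int     fd;
            char   *buf;
            size_t  len;
            ssize_t result;
            void  (*done)(struct disk_req *);  /* completion callback */
        };

        /* Runs in a disk thread; the read() may block without
         * stalling the main thread. */
        static void *disk_worker(void *arg)
        {
            struct disk_req *r = arg;
            r->result = read(r->fd, r->buf, r->len);
            r->done(r);
            return NULL;
        }

        /* caller:  pthread_create(&tid, NULL, disk_worker, req); */

In practice you'd feed a fixed pool of such threads from a queue
rather than pthread_create() one per request.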

> > Don't need large amounts of file descriptors per process. Each
> > process is only handling 1 request at a time, so if
> > it's using more than 15 file descriptors I'd be
> > surprised.
>
> With threads that's still a problem, since you'd only have the one
> process. Still, we're pushing through 3M TCP hits/day (+7M UDP hits)
> per cache box, and we're still only using 2000 fd's maximum. There was
> a bug, which we removed, that prevented squid from killing dead/stalled
> connections. Once fixed, that's no longer a problem. With separate
> sub-threads, connections could probably be closed more safely than with
> the current model, saving even more on FD usage.

True.
 
[ ... ]
> > On a 'slow' start where you need to stat each file, you can
> > fire off one process per disk. MUCH faster startups.
>
> Do a transaction based disk index log and this issue completely
> disappears. It's about 50 lines of code on top of the current squid
> code base.

There is still a mild issue here w.r.t. damaged caches / machine
crashes. Fsck(8) doing wild things can cause a little havoc here.
 
> > Some of the above could be summarised by "the kernel scheduler
> > automagically handles request load balancing for you".
>
> Threads scheduling is WAY more efficient.

Hmm. This depends a bit, I think. Since many thread schedulers do FIFO,
scheduling is fairly simple, and yes, you'll win a bit. If you mean
context switching rather than scheduling, I seem to recall the Solaris
kernel thread switching times being the same as or higher than the
process context switch times. (whip out lmbench, I guess, and measure
it).
 
> > Some memory management wins in being able to alloc() a bunch
> > of private memory, and then exit() the process, having
> > the parent re-start you. I.e. guaranteed return of
> > memory to the system.
>
> Forking will EAT memory by the bucket load. Each sub-process needs
> its own register set, stack, volatile memory set, etc, etc.

The base usage is about 24K per process on my box. So 200 * 24K ==
~5 megs of RAM == 1 or 2%.
 
> > Possibility of instant startup. i.e. mmap() the index, and
> > start using it immediately.
>
> That's a good idea. Problem is keeping it in sync. What data format
> do you think is going to be robust and fast both for on disk and in-memory
> use? The current suck format is fine. A binary format would be better but
> there's the problem with all those variable sized URLs. :(

Depends on whether we could reliably mmap() it at the same address
each time.. :)

If yes, then just use a normal memory image. If no, it gets a little
harder. All the pointers need to be relative to the start of the
mmap.

Format as such isn't a problem. Just alloc everything out of the
mmap()'ed memory, URL and all. No dramas, in theory. :)
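For the 'no' case (a different map address each run), the
relative-pointer trick looks something like this (a sketch; the macro
and struct names are mine):

        /* All 'pointers' inside the mapped index are offsets from
         * the base of the mapping, so the index works no matter
         * where mmap() places it. */
        static char *index_base;        /* set once, after mmap() */

        #define OFF(p)  ((unsigned long)((char *)(p) - index_base))
        #define PTR(o)  ((void *)(index_base + (o)))

        struct store_entry {
            unsigned long next;         /* offset of next entry  */
            unsigned long url;          /* offset of URL string  */
            /* ... */
        };

        /* e.g. following a link (offset 0 == end of chain):
         *     struct store_entry *e = PTR(head);
         *     const char *u = PTR(e->url);
         */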
 
> > Disadvantages:
> > Slightly more complicated memory management. Need to make sure
> > shared memory is allocated out of the mmap() pool.
>
> Threads avoid this completely. All memory is automatically shared
> if you so wish.

Yes and no. You do get advantages from true private resources. I'll
grant that threads do avoid some of the complexities, but they do it
by trading away some safety.
 
> > Locking. In theory, this should be minimal. Only operations
> > that change the index should need locking. There's
> > also a little issue here with passing control if the
> > lock you want is locked by someone else. Linux has
> > sched_yield(), and I guess
> > select(0, NULL, NULL, NULL, {0, 11000} ) would work
> > for most other things.
>
> I'd like to avoid all locking, and I believe that's 100% possible.
> Locking involves waiting on semaphores which can mean lots of time wasted
> waiting to be rescheduled to a two-bit operation.

Well, your normal lock should be something like:

        while (test_and_set(&var))  /* spin while someone else holds it */
                yield();            /* give up the CPU in the meantime  */
        ++counter;                  /* critical section: a few instructions */
        reset(&var);                /* release the lock */

(test_and_set() and reset() are pseudocode primitives here;
test_and_set() returns the previous value, hence the loop.)

Now that code segment shouldn't be more than a few instructions long,
and most of the time you'll have just woken from sleeping, so the
yield() should almost _never_ be called. Not a big issue AFAIK.
 
> Each thread needs to have enough state from which to work from, and
> perhaps even have a link count for multiple threads serving the same URL
> while it's being gotten. Conceivably there would be two types of threads.
>
> - A thread type to grab the object from the source - can do FTP, HTTP,
> etc, etc.
> - A thread type to serve a user. These threads can either transfer
> an object that's currently in transit by READING the status of the
> object get thread, or can simply serve an object from disk.
>
> No locking would be required. Each thread that's piggy-backed from
> a GET thread can poll on the GET thread's incoming socket to wake themselves
> up when more data arrives. Alternatively you could do thread mutex stuff
> here, but I'd like to avoid that.

You still need locking. What about expiring objects? Updating LRU
lists? Hashes? Unless you can do all of these as atomic operations, or
confine all these operations to one thread, you'll need write locking.

The real question is: Can you do without read locking? I suspect the
answer here is yes, and that's a much more important thing to get
right.
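i.e. index lookups go lockless, and only index mutations serialise on
a lock. A sketch of what I mean, assuming pthreads (the index names
and sizes here are hypothetical):

        #include <pthread.h>

        #define HASH_SIZE 7951

        struct idx_entry {
            struct idx_entry *next;
            /* ... url, file number, etc ... */
        };

        static struct idx_entry *hash_head[HASH_SIZE];
        static pthread_mutex_t index_lock = PTHREAD_MUTEX_INITIALIZER;

        /* Writers serialise on one mutex. */
        void index_insert(struct idx_entry *e, unsigned bucket)
        {
            pthread_mutex_lock(&index_lock);
            e->next = hash_head[bucket];  /* fill in the entry first... */
            hash_head[bucket] = e;        /* ...then the one-word link  */
            pthread_mutex_unlock(&index_lock);
        }

        /* Readers walk hash_head[bucket] with no lock at all -- only
         * safe if every update leaves the chain consistent at each
         * step, as the ordered stores above attempt to. */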

[ .. ]

> > Lots more processes. Expect a couple of hundred processes for
> > a busy cache I'd guess..
>
> That cannot be done. For us, we could have up to 1000 processes to
> service our current load. The kernel run queue would be enormous and all
> that context switching will absolutely KILL most RISC based CPU's.

Well, as far as I can see, our load is about 50 - 75% of yours. With
(say) 1000 requests 'in progress', the number awake at any one point
is much lower (say 30 - 100?), which is a much more manageable run
queue.

I'm not 100% sold on fork()'ing per se; it's just more portable, which
should increase the code's reliability/usability in the longer term.
 
> > Almost certainly need to allocate the mmap() for a fixed size
> > (at least, if you want it portable)
>
> Threads are portable and require no mmap().

When you say 'threads', do you mean kernel threads, or user mode
threads?

The only implementations of kernel threads that I'm aware of are for
Solaris and Linux.

Usermode threads can't give you what you want. In particular, when one
thread blocks on disk, they all block.

> > Slightly stiffer porting requirements. (now requires a working
> > mmap() implementation, and a decent VM
> > system. i.e. real copy-on-write)
>
> Use pthreads(). Quite portable. Any OS worth its salt will have
> pthreads implemented on top of a good kernel thread underlying layer.

That seems to imply there are currently only 2 implementations of Unix
worth their salt, AFAIK. ;-)
 
> > Possibly use more memory? This is a maybe. squid NOVM would be
> > much more viable now that you can run the disk
> > accesses concurrently.
>
> With threads you'd use less memory because you'd avoid all the
> callback overheads and buffering. Lots of processes will chew memory like
> nothing on this planet.

A few quick tests (main() { int i; for (i = 0; i < 7; ++i) fork();
sleep(60); }) and measuring memory before and after seem to suggest
fairly minimal memory per process: 24 - 32K, which is larger than a
usermode thread @ 4K, but not too bad.
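Fleshed out, the test was just:

        /* Fork 2^7 == 128 processes, then compare free/ps output
         * before and during the sleep. */
        #include <unistd.h>

        int main(void)
        {
            int i;
            for (i = 0; i < 7; ++i)
                fork();        /* each pass doubles the process count */
            sleep(60);         /* time enough to go and measure */
            return 0;
        }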

[ .. ]
> > I had a quick start at hacking the code to do it, and was struck by
> > how much code I was deleting ("remove these 4 functions, replace
> > with 1 function that's a fraction of the size.. ").
> >
> > Comments? Anyone else interested in coding this up?
>
> I'd love to as a project. I suppose I could do it in my free time.
> I'd need to have a bit of free rein over a split development tree
> with Duane's help so the mods can be folded back easily.
>
> I estimate about 3 months of programming effort would be required to
> get this to fly, be reliable, tested and portable and documented.

Laugh. I was hoping more for a few days to get something up, a week or
so of debugging, and voila! Unfortunately, your estimate is much more
realistic.

Michael.