Re: apache style squid?

From: Stewart Forster <slf@dont-contact.us>
Date: Tue, 07 Oct 1997 17:56:58 +1000


>
> Has anyone given any thought to building an apache style squid?

        Yep. Lots of thought. I'd like to see it done as threads though.
>
> i.e.
> mmap() big chunk of memory for cache index.
> fork() off a bunch of children.

        Forking is VERY bad: LOTS of swap used, lots of context to switch.
Bad news all over.

        Using threads means no mmap() and no fork()'ing. Very efficient.

> children do:
> accept(), and then process request in a single threaded
> fashion, locking if they need to access the shared
> memory holding the cache index.
> loop.
>
> One of the children does:
> sleep on recvfrom(), and handle ICP requests.

        I'd have one thread doing the ICP handling like you say, but it should
also be the thread responsible for controlling the cache index. It does the
recvfrom() and accept() via a poll() or select() (as is done now).

        When an accept comes in, the appropriate context is created for a
sub-thread to EXCLUSIVELY control that connection, without any subsequent
need for locking until that thread has completed.

        We want to have NO locking going on at all. Just a master thread
firing off completely independent sub-threads as fast as it can (sketched
below). Firing off threads would be about 10% of the total work, so I'm
guessing that this sort of thing could scale to 10 CPUs happily.
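        Roughly, that master loop might look something like the sketch below.
All the names are invented for illustration (handle_request() and handle_icp()
are stand-ins), and detached threads are a simplification of the
signal-on-completion scheme I describe further down:

#include <poll.h>
#include <pthread.h>
#include <sys/socket.h>
#include <unistd.h>

extern void handle_request(int fd);     /* runs one request to completion */
extern void handle_icp(int fd);         /* one recvfrom() + reply */

static void *request_thread(void *arg)
{
    int fd = (int)(long)arg;
    handle_request(fd);                 /* free to block on disk or network */
    close(fd);
    return NULL;
}

void master_loop(int http_fd, int icp_fd)
{
    struct pollfd pfds[2];
    pfds[0].fd = http_fd; pfds[0].events = POLLIN;
    pfds[1].fd = icp_fd;  pfds[1].events = POLLIN;

    for (;;) {
        if (poll(pfds, 2, -1) < 0)
            continue;
        if (pfds[0].revents & POLLIN) {
            int conn = accept(http_fd, NULL, NULL);
            if (conn >= 0) {
                pthread_t tid;
                pthread_attr_t attr;
                pthread_attr_init(&attr);
                /* detached: master fires and forgets, no joining */
                pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
                if (pthread_create(&tid, &attr, request_thread,
                                   (void *)(long)conn) != 0)
                    close(conn);        /* couldn't spawn; drop it */
                pthread_attr_destroy(&attr);
            }
        }
        if (pfds[1].revents & POLLIN)
            handle_icp(icp_fd);         /* ICP handled by the master itself */
    }
}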

> At startup, you can also fork off a child to read in the cache index.

        The cache index needs to be a transaction log too. I've already
modified our squid to do this. There IS no validation stage or FAST or SLOW
startups. The index is just sucked in as fast as it can be read. This should
be done by the index/accept/ICP controlling master thread.
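        To make it concrete, a transaction-style log could be as simple as
the sketch below (a hypothetical record format, not the actual patch): every
add or delete is appended as one record, and startup is a single sequential
replay with no validation pass.

#include <stdio.h>

enum { OP_ADD = 'A', OP_DEL = 'D' };

/* Append one transaction as a single line: "A <fileno> <url>" or "D <fileno>". */
static void log_transaction(FILE *log, int op, int fileno, const char *url)
{
    if (op == OP_ADD)
        fprintf(log, "A %d %s\n", fileno, url);
    else
        fprintf(log, "D %d\n", fileno);
    fflush(log);                        /* keep the log crash-consistent */
}

/* Replay the log at startup; later records override earlier ones. */
static void replay_log(FILE *log,
                       void (*add)(int fileno, const char *url),
                       void (*del)(int fileno))
{
    char line[4096];
    char url[4000];
    int fileno;

    while (fgets(line, sizeof line, log)) {
        if (sscanf(line, "A %d %3999s", &fileno, url) == 2)
            add(fileno, url);
        else if (sscanf(line, "D %d", &fileno) == 1)
            del(fileno);
    }
}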

> Advantages:
> No *&^&*^*&^&* blocking on disk! If one of the children does a
> slow operation, then it leaves all the other request
> processes running full speed.

        Still applies.

> More efficient disk use. You can now queue multiple requests
> into the disk subsystem (because you don't need to
> wait until one completes). This means that if you have
> RAID'ed disks, or multiple disks, you actually get
> higher throughput out of them.

        Still applies. In fact with our threaded code we've seen a 10x
speedup due to better disk utilisation and non-blocking disk reads. We don't
have full threads yet, just threaded read/write/open/close/unlink.
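        For flavour, a threaded open() along those lines might look like the
sketch below (hypothetical code, not our actual patch): the blocking syscall
runs in a worker thread, so the main thread never stalls on disk.

#include <fcntl.h>
#include <pthread.h>
#include <sys/types.h>

struct aio_open_req {
    const char *path;
    int flags;
    mode_t mode;
    int result;                         /* the fd, or -1 on error */
    int done;                           /* set by the worker when finished */
    pthread_mutex_t lock;
    pthread_cond_t cond;
};

static void *open_worker(void *arg)
{
    struct aio_open_req *req = arg;
    int fd = open(req->path, req->flags, req->mode); /* may block on disk */

    pthread_mutex_lock(&req->lock);
    req->result = fd;
    req->done = 1;
    pthread_cond_signal(&req->cond);
    pthread_mutex_unlock(&req->lock);
    return NULL;
}

/* Caller fires this off, carries on with other work, and later waits on
 * req->cond (or polls req->done) to collect the fd. */
int aio_open_start(struct aio_open_req *req, pthread_t *tid)
{
    pthread_mutex_init(&req->lock, NULL);
    pthread_cond_init(&req->cond, NULL);
    req->done = 0;
    return pthread_create(tid, NULL, open_worker, req);
}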

> MUCH simpler code. Don't need all the zillions of callbacks
> etc. You can write 'slow' code without needing to split
> it into separate functions.

        Same applies. No callbacks required. Just create the context and
then the thread can stall all it wants.

> Smaller code. see above.

        Ditto.

> Don't need large amounts of file descriptors per process. Each
> process is only handling 1 request at a time, so if
> it's using more than 15 file descriptors I'd be
> surprised.

        FDs are still a problem with threads, since you'd only have one
process. Still, we're pushing through 3M TCP hits/day (+7M UDP hits) per
cache box and we're still only using 2000 FDs maximum. There was a bug,
since removed, which prevented squid from killing dead/stalled connections;
once fixed, that's no longer a problem. With separate sub-threads,
connections could probably be closed more safely than with the current
model, saving even more on FD usage.

> Much less polling. The ICP listener is in a separate process,
> and can sleep on recvfrom() waiting for a packet.
> Ditto children listening for replies.

        Same applies.

> On a 'slow' start where you need to stat each file, you can
> fire off one process per disk. MUCH faster startups.

        Do a transaction-based disk index log (as sketched above) and this
issue completely disappears. It's about 50 lines of code on top of the
current squid code base.

> On a fast start, you can have a process dedicated to reading
> the log. again, much faster startup because it can
> occur at something close to disk speed, without
> significantly slowing down other request processing.

        Irrelevant.

> Some of the above could be summarised by "the kernel scheduler
> automagically handles request load balancing for you".

        Thread scheduling is WAY more efficient.

> Some memory management wins in being able to alloc() a bunch
> of private memory, and then exit() the process, having
> the parent re-start you. I.e. guaranteed return of
> memory to the system.

        Forking will EAT memory by the bucket load. Each sub-process needs
its own register set, stack, volatile memory set, etc, etc.

> Don't need as much buffering when you sleep on read()'s,
> write()'s et al. i.e. lighter malloc()
> load. i.e. lower memory fragmentation in theory. :)

        True, we don't need as much buffering. A good malloc() library solves
most memory fragmentation issues.

> Request processing doesn't stop dead when the logs need to be
> rotated. (a 3 million line log takes a while to dump
> to disk!)

        Same applies.

> Possibility of instant startup. i.e. mmap() the index, and
> start using it immediately.

        That's a good idea. The problem is keeping it in sync. What data
format do you think is going to be robust and fast for both on-disk and
in-memory use? The current suck format is fine. A binary format would be
better, but there's the problem with all those variable-sized URLs. :(
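        One conceivable answer (a sketch only, nothing tested): a fixed-size
binary header per entry with a length-prefixed URL trailing it, so the index
can still be sucked in sequentially at disk speed despite the variable-sized
URLs. The field names below are invented for illustration.

#include <stdint.h>
#include <stdio.h>

struct index_rec {
    uint32_t fileno;                    /* which swap file holds the object */
    uint32_t size;                      /* object size in bytes */
    uint32_t timestamp;                 /* stored / last-modified time */
    uint16_t url_len;                   /* length of the URL that follows */
};

/* Read one record: fixed header, then url_len bytes of URL.
 * Returns 1 on success, 0 on EOF or a corrupt record. */
static int read_rec(FILE *fp, struct index_rec *rec, char *url, size_t max)
{
    if (fread(rec, sizeof *rec, 1, fp) != 1)
        return 0;
    if (rec->url_len >= max)
        return 0;                       /* corrupt record; stop the replay */
    if (fread(url, 1, rec->url_len, fp) != rec->url_len)
        return 0;
    url[rec->url_len] = '\0';
    return 1;
}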

> Disadvantages:
> Slightly more complicated memory management. Need to make sure
> shared memory is allocated out of the mmap() pool.

        Threads avoid this completely. All memory is automatically shared
if you so wish.

> Locking. In theory, this should be minimal. Only operations
> that change the index should need locking. There's
> also a little issue here with passing control if the
> lock you want is locked by someone else. Linux has
> sched_yield(), and I guess
> select(0, NULL, NULL, NULL, {0, 11000} ) would work
> for most other things.

        I'd like to avoid all locking, and I believe that's 100% possible.
Locking involves waiting on semaphores, which can mean lots of time wasted
waiting to be rescheduled just to perform a trivial operation.

        Each thread needs to have enough state to work from, and perhaps
even a link count for multiple threads serving the same URL while it's being
fetched. Conceivably there would be two types of threads:

        - A thread type to grab the object from the source - can do FTP, HTTP,
          etc, etc.
        - A thread type to serve a user. These threads can either transfer
          an object that's currently in transit by READING the status of the
          object get thread, or can simply serve an object from disk.

        No locking would be required. Each thread that's piggy-backed off
a GET thread can poll on the GET thread's incoming socket to wake itself
up when more data arrives. Alternatively you could do thread mutex stuff
here, but I'd like to avoid that.

        When the GET thread completes, the object is converted to an on-disk
type, and the residual user-serving threads will complete 'cos the incoming
GET thread's socket will be closed and the poll() on that socket will return
the appropriate error which can be caught. Once all the readers are done,
the master thread can free up that context. The master thread will know
when a thread is done 'cos it gets signalled in much the same way as when
a child process dies and signals its parent.
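        A rough sketch of that piggy-back wakeup (all names invented here;
in_transit_obj and send_available_bytes() are stand-ins): the reader polls
the GET thread's incoming socket purely as a wakeup, serves whatever new
bytes have landed in the shared object, and treats POLLHUP/error as "the
fetch has finished".

#include <poll.h>

struct in_transit_obj {
    int upstream_fd;                    /* the GET thread's incoming socket */
    /* ... shared buffer, bytes-received counter, link count, etc. ... */
};

/* Pushes any not-yet-sent bytes from the shared object to the client;
 * returns -1 if the client has gone away. */
extern int send_available_bytes(struct in_transit_obj *obj, int client_fd);

void serve_from_transit(struct in_transit_obj *obj, int client_fd)
{
    struct pollfd pfd;
    pfd.fd = obj->upstream_fd;
    pfd.events = POLLIN;

    for (;;) {
        /* Wake whenever the GET thread's socket shows activity... */
        if (poll(&pfd, 1, -1) < 0)
            break;
        /* ...then push whatever new data has landed in the object. */
        if (send_available_bytes(obj, client_fd) < 0)
            break;                      /* client went away */
        if (pfd.revents & (POLLHUP | POLLERR | POLLNVAL))
            break;                      /* GET finished: socket closed */
    }
    /* any remainder is now served from the completed on-disk object */
}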

> Lots more processes. Expect a couple of hundred processes for
> a busy cache I'd guess..

        That cannot be done. For us, we could need up to 1000 processes to
service our current load. The kernel run queue would be enormous, and all
that context switching would absolutely KILL most RISC-based CPUs.

> Almost certainly need to allocate the mmap() for a fixed size
> (at least, if you want it portable)

        Threads are portable and require no mmap().

> Slightly stiffer porting requirements. (now requires a working
> mmap() implementation, and a decent VM
> system. i.e. real copy-on-write)

        Use pthreads. Quite portable. Any OS worth its salt will have
pthreads implemented on top of a good underlying kernel-thread layer.

> Possibly use more memory? This is a maybe. squid NOVM would be
> much more viable now that you can run the disk
> accesses concurrently.

        With threads you'd use less memory because you'd avoid all the
callback overheads and buffering. Lots of processes will chew memory like
nothing on this planet.

> Uses a little more global sockets/file descriptors (each
> process has a log handle open, each has an ICP request
> socket open etc).

        These are all shared under threads; it's just that each thread only
pays attention to its own FDs.

> I had a quick start at hacking the code to do it, and was struck by
> how much code I was deleting ("remove these 4 functions, replace
> with 1 function that's a fraction of the size.. ").
>
> Comments? Anyone else interested in coding this up?

        I'd love to, as a project. I suppose I could do it in my free time.
I'd need to have a bit of free rein over a split development tree, with
Duane's help, so the mods can be folded back easily.

        I estimate about 3 months of programming effort would be required to
get this to fly: reliable, tested, portable, and documented.

        Cheers,

                Stew.
