Re: Squid performance wish-list

From: Michael O'Reilly <michael@dont-contact.us>
Date: 21 Aug 1998 16:05:04 +0800

Stewart Forster <slf@connect.com.au> writes:
>
> *) Priority queues in select() loop:
[ .. ]
> Half-closed connections
>
> It would seem that squid spends possibly a lot of time rechecking
> for activity
> on sockets that may not need it. For example, if we've just sent a request
> to a server, we probably won't expect a reply back immediately so that server
> socket should be exempted from a poll check until say 3 times around the
> poll loop.

This is probably a bad example. You want to respond quickly to fast
machines, and slowly to slow machines. What might be more useful is to
work out something like "if it wasn't ready last time, it's probably
not ready now, so skip it once; if it's still not ready, skip it
twice", and so on.

The idea being that 80% of your activity occurs on 20% of your
connections, so you want to keep checking active ones, but slow down
the checking on sleeping ones.
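
Something like this, say (untested; the per-fd skip counters are
invented, not real Squid fields, and it assumes fd < FD_SETSIZE):

    /* Poll-skipping sketch: back off on fds that keep coming up idle. */
    #include <sys/select.h>

    static struct {
        int skip_count;   /* how many polls to skip while it stays idle */
        int skip_left;    /* skips remaining before we look at it again */
    } backoff[FD_SETSIZE];

    /* Should this fd go into the next select()/poll() set at all? */
    int
    should_poll(int fd)
    {
        if (backoff[fd].skip_left > 0) {
            backoff[fd].skip_left--;
            return 0;
        }
        return 1;
    }

    /* Called after the poll with whether the fd was actually ready. */
    void
    note_result(int fd, int was_ready)
    {
        if (was_ready) {
            backoff[fd].skip_count = 0;        /* active: check every time */
        } else {
            if (backoff[fd].skip_count < 8)    /* cap the backoff */
                backoff[fd].skip_count++;
            backoff[fd].skip_left = backoff[fd].skip_count;
        }
    }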
 
> Idle connections (while not polled explicitly) still have to be scanned each
> time through the select() loop to see if they have a handler attached. This
> seems like a waste. We should only be putting sockets onto the select list
> if they are on a list of possible sockets, not scanning each time through to
> find which ones.
>
> I predict that with an efficient select() loop mechanism about 10% CPU could
> be saved.

The huge one here is incremental updates of the select/poll
structures. It's pretty silly to have nice pretty functions that all
the handler updates go through, but then have every call to comm_select()
search the entire list to find which ones have handlers.

Adding a couple of global fd_set's (pollfd's) which are updated every
time a handler is added/deleted, and then having comm_select do
        fd_set tmp_read = global_read;
is a hell of a lot faster than running a loop over every fd_table
entry. (i.e. a straight memory copy instead of rebuilding it from
scratch).
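
i.e. something like this (illustrative only; the function names here
are made up, not the real comm.c ones):

    #include <sys/select.h>
    #include <sys/time.h>

    typedef void PF(int fd, void *data);

    /* Kept up to date incrementally by the handler add/delete path,
     * never rebuilt by walking fd_table. */
    static fd_set global_read;
    static fd_set global_write;

    void
    commSetReadHandler(int fd, PF *handler, void *data)
    {
        if (handler)
            FD_SET(fd, &global_read);
        else
            FD_CLR(fd, &global_read);
        /* ... stash handler/data in fd_table[fd] exactly as now ... */
    }

    int
    comm_select_sketch(int maxfd, struct timeval *tv)
    {
        /* straight struct copies, no per-fd loop */
        fd_set tmp_read = global_read;
        fd_set tmp_write = global_write;

        return select(maxfd + 1, &tmp_read, &tmp_write, NULL, tv);
    }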
 
> *) Capping number of sockets handled per select() loop.
>
> With a large select loop, it may be possible that it takes a long time to
> get to the end and call the next eventRun(). This can hinder squid's cleanup
> and maintenance mechanisms. Perhaps breaking up LARGE select chunks into a
> few smaller ones would be good.

What are you trying to optimize for here? This sounds like a latency
optimization. It would probably be a better idea to have the squid
cleanup/maintenance in a separate thread or the like. Even just
having the wakeups in a separate thread, and having the comm_select()
execution loop poll a global variable, would do it.

Or better yet, just make the cleanup/maintain routines notice how much
work is actually needed, rather than assuming they'll be called a
minimum of X times per time period.
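
For the wakeup-in-a-thread version, something along these lines
(untested; assumes pthreads, and the names are invented):

    #include <pthread.h>
    #include <signal.h>
    #include <unistd.h>

    /* Set by the timer thread, checked and cleared by the main loop. */
    static volatile sig_atomic_t maintenance_due = 0;

    static void *
    maintenance_timer(void *unused)
    {
        for (;;) {
            sleep(1);               /* whatever period the cleanup wants */
            maintenance_due = 1;    /* wakeup only; no work done here */
        }
        return NULL;
    }

    /* started once at init:
     *     pthread_t tid;
     *     pthread_create(&tid, NULL, maintenance_timer, NULL);
     *
     * and in the comm_select() execution loop:
     *     if (maintenance_due) {
     *         maintenance_due = 0;
     *         do_cleanup_and_maintenance();   -- still the main thread
     *     }
     */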
 
> *) Moving ICP query responder/receiver to a sub-thread.
>
> This will be a BIG win by removing the ICP poll out of the main thread and
> the associated processing that goes with it. This function has two
> operations.

Yes, this is long overdue. I seem to recall there is some headache
here w.r.t walking the global structures. Doesn't some locking need to
be introduced or something?

Hmm. You might be able to play fast and loose by versioning the global
structures, and having the ICP thread check the version often
(i.e. every time it would dereference a pointer, it checks the
version, and if it's changed, restarts).
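
i.e. roughly (untested; store_version and store_lookup() are just
stand-ins for whatever the real shared structures and walk are):

    /* Main thread bumps store_version on every mutation of the shared
     * structures.  The ICP thread snapshots it, does its walk, and
     * retries from scratch if it changed underneath it. */
    extern const void *store_lookup(const char *key);   /* stand-in */

    static volatile unsigned long store_version = 0;

    int
    icp_lookup_hit(const char *key)
    {
        unsigned long v;
        const void *e;

        do {
            v = store_version;          /* snapshot before dereferencing */
            e = store_lookup(key);      /* walk the shared structures */
        } while (v != store_version);   /* changed mid-walk: retry */

        return e != NULL;
    }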
 
> 1) Sending ICP queries and receiving replies
>
> 2) Responding to ICP queries
>
> By breaking this out the parent thread won't need to worry about the ICP
> socket and can just poll (similar to the way it currently does), but it
> only needs to do a pointer lookup instead of an actual poll().
>
> This could gain another 10%+ CPU based on our current ICP handling loads.
> This load would also translate well to another CPU.

Am I reading this right? What you'd have is a child thread sleeping
on select or recvfrom or something, and every time it gets a packet,
it stuffs it in a queue, and sets a global var. Then in comm_select,
in the main execution loop, add something like
        if (icp_queue)
                run_icp_queue();
So all the processing occurs in the parent thread, but it doesn't need
to keep polling the FDs because the child will sleep on them (and
possibly receive the packet as well)?

This should dramatically reduce the number of system calls
(automatically reducing CPU usage).
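
A sketch of the child-thread side (untested; the queue node and names
are invented, and icp_queue itself doubles as the global var the main
loop tests):

    #include <pthread.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    typedef struct icp_msg {
        struct icp_msg *next;
        struct sockaddr_in from;
        socklen_t fromlen;
        ssize_t len;
        char buf[4096];
    } icp_msg;

    static icp_msg *volatile icp_queue = NULL;   /* tested in comm_select() */
    static pthread_mutex_t icp_lock = PTHREAD_MUTEX_INITIALIZER;

    void *
    icp_reader(void *arg)
    {
        int fd = *(int *) arg;

        for (;;) {
            icp_msg *m = calloc(1, sizeof(*m));
            if (m == NULL)
                continue;
            m->fromlen = sizeof(m->from);
            m->len = recvfrom(fd, m->buf, sizeof(m->buf), 0,
                              (struct sockaddr *) &m->from, &m->fromlen);
            if (m->len < 0) {
                free(m);
                continue;
            }
            pthread_mutex_lock(&icp_lock);
            m->next = icp_queue;        /* push; the drain can reverse if it cares */
            icp_queue = m;
            pthread_mutex_unlock(&icp_lock);
        }
        return NULL;
    }

    /* run_icp_queue() in the parent: take the lock, detach the whole
     * list, drop the lock, then process the packets exactly as the
     * current in-line ICP code does. */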

> *) Inlining DNS lookups.
>
> I have suitable code for building/sending/receiving/decoding of
> packets sent to a DNS server. It would just need to have a DNS socket
> open full-time to send DNS requests out/receive DNS replies and do socket
> waits on that. Alternatively this operation could be pushed into a single
> DNS thread that handles the multiplexing operation of multiple DNS
> requests.

Is there a major win to come from this?

> *) Squid FS.
>
> There's the CNFS filesystem but that has problems because objects will get
> tossed even though they may still be needed.
>
> I've done some research on this. Squid would work well with a 4K
> frag/ 8K chunk filesystem. Basically we want to write a stripped
> down version of UFS that allows indexing by inode alone. I have a
> design here at Connect that works out to an average of 2.5 disk
> accesses per new object pulled in, and 1.7 disk accesses per cache
> hit, based on a sample taken from our live caches.
>
> Compare this to an average of approx 7 disk accesses per new object write
> with UFS, and an average of 3.5 disk accesses per cache hit.
>
> This means about a 250% gain can be had from a customised UFS filesystem.

How much disk cache are you running against that?
How much do you care about portability? (would you require running on
a raw device?)
 
[...]
> *) Threading squid completely (!)
>
> I'll leave this open.

Grin. I suspect this one would be total-rewrite material. If you're
going to fully thread, then you'd get rid of all the call-back cruft
that's around at the moment, shrink the code size by a factor of 3,
and allow for full use of multi-processor machines.
 
Michael.