Re: Squid performance wish-list

From: Stewart Forster <slf@dont-contact.us>
Date: Mon, 24 Aug 1998 11:46:42 +1000

> Stewart Forster <slf@connect.com.au> writes:
> >
> > *) Priority queues in select() loop:
> [ .. ]
> > Half-closed connections
> >
> > It would seem that squid spends possibly a lot of time rechecking
> > for activity
> > on sockets that may not need it. For example, if we've just sent a request
> > to a server, we probably won't expect a reply back immediately so that server
> > socket should be exempted from a poll check until say 3 times around the
> > poll loop.
>
> This is probably a bad example. You want to respond quickly to fast
> machines, and slowly to slow machines. What might be more useful is to
> work out something like "If it wasn't ready last time, it's probably
> not ready now, so skip it once. If it's still not ready, skip it
> twice". etc etc.

Yes, it probably was a bad example, but you're right: my intent was more the
respond-quickly-to-fast, slowly-to-slow behaviour, something like the HTTP
poll rates we have now.

> The idea being that 80% of your activity occurs on 20% of your
> connections, so you want to keep checking active ones, but slow down
> the checking on sleeping ones.

Yep, that's the idea.
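
To make that skip-count idea concrete, here's a rough sketch (the structure
and function names are illustrative only, not Squid's actual fde fields):

    /*
     * Illustrative sketch only -- not Squid's actual fde fields, just a
     * way of showing the "skip it once, then twice" backoff on FDs that
     * keep polling not-ready.
     */
    typedef struct {
        int fd;
        int skip;           /* rounds left to skip this fd           */
        int backoff;        /* grows while the fd keeps being idle   */
    } poll_state;

    static int
    should_poll(poll_state *p)
    {
        if (p->skip > 0) {
            p->skip--;
            return 0;       /* leave it out of this round's poll set */
        }
        return 1;
    }

    static void
    note_result(poll_state *p, int was_ready)
    {
        if (was_ready) {
            p->backoff = 0;             /* busy fd: poll every round */
        } else {
            p->backoff = p->backoff ? p->backoff * 2 : 1;
            if (p->backoff > 8)
                p->backoff = 8;  /* cap so idle fds still get checked */
            p->skip = p->backoff;
        }
    }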

> > Idle connections (while not polled explicitly) still have to be scanned each
> > time through the select() loop to see if they have a handler attached. This
> > seems like a waste. We should only be putting sockets onto the select list
> > if they are on a list of possible sockets, not scanning each time through to
> > find which ones.
> >
> > I predict that with an efficient select() loop mechanism about 10% CPU could
> > be saved.
>
> The huge one here is incremental updates of the select/poll
> structures. It's pretty silly to have nice pretty functions that all
> the handler updates go thru, but then every call to comm_select()
> searching the entire list to find which ones have handlers.
>
> Adding a couple of global fd_set's (pollfd's) which are updated every
> time a handler is added/deleted, and then having comm_select do
> fd_set tmp_read = global_read;
> is a hell of a lot faster than running a loop over every fd_table
> entry. (i.e. a straight memory copy instead of rebuilding it from
> scratch).

You got it. Perhaps even have a doubly linked list of nodes that can be
indexed by FD number, so entries can easily be moved between a set of queues
whenever required. The poll loop then simply traverses the list of
possibly-ready FDs and builds the poll array from it. There can be multiple
such lists, and on every select loop each list moves up one level towards
the top list (the top list being the one the poll array is built from).
This sort of list manipulation and merging is very fast and simple, and it
would provide an easy way to stall slow FDs for a bit, or possibly even act
as a rate-limiting device.

That was my idea for priority queues, at any rate. commSetSelect should then
take an extra argument for the priority level, so that when a callback is
registered the FD gets put onto the appropriate list.
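
Something like the following sketch is what I'm picturing. To be clear, the
priority argument is the proposed extension and the list structures are
invented for illustration; none of this is lifted from comm.c:

    #include <stddef.h>
    #include <sys/select.h>             /* FD_SETSIZE */

    #define PRIORITY_LEVELS 3

    typedef struct fd_node {
        int fd;
        int level;                  /* which priority list this fd is on */
        struct fd_node *prev;
        struct fd_node *next;
    } fd_node;

    static fd_node fd_nodes[FD_SETSIZE];            /* indexed by fd     */
    static fd_node *level_head[PRIORITY_LEVELS];    /* 0 = polled now    */

    static void
    fd_list_unlink(fd_node *n)
    {
        if (n->prev)
            n->prev->next = n->next;
        else if (level_head[n->level] == n)
            level_head[n->level] = n->next;
        if (n->next)
            n->next->prev = n->prev;
        n->prev = n->next = NULL;
    }

    static void
    fd_list_push(fd_node *n, int level)
    {
        n->level = level;
        n->prev = NULL;
        n->next = level_head[level];
        if (n->next)
            n->next->prev = n;
        level_head[level] = n;
    }

    /* Registering a callback puts the fd on the requested list. */
    void
    commSetSelectPrio(int fd, int priority /* ... handler args ... */)
    {
        fd_nodes[fd].fd = fd;
        fd_list_unlink(&fd_nodes[fd]);
        fd_list_push(&fd_nodes[fd], priority);
    }

    /*
     * Once per select loop, every list moves up one level, so stalled
     * fds eventually reach list 0, which the poll array is built from.
     */
    void
    comm_promote_levels(void)
    {
        int i;
        for (i = 1; i < PRIORITY_LEVELS; i++) {
            fd_node *n = level_head[i];
            level_head[i] = NULL;
            while (n) {
                fd_node *next = n->next;
                fd_list_push(n, i - 1);
                n = next;
            }
        }
    }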

> > *) Capping number of sockets handled per select() loop.
> >
> > With a large select loop, it may be possible that it takes a long time to
> > get to the end and call the next eventRun(). This can hinder squid's cleanup
> > and maintenance mechanisms. Perhaps breaking up LARGE select chunks into a
> > few smaller ones would be good.
>
> What are you trying to optimize for here? This sounds like a latency
> optimization. It would probably be a better idea to have the squid
> cleanup/maintenance in a seperate thread or such like. Even just
> having the wakeups in a seperate thread, and have the comm_select()
> execution loop poll the global variable.

Yes, latency optimisation is what I was after here. A thread would be good
but I'm now concerned about the amount of locking that would require.

An alternative is, as Andres said, to have the select loop be restartable
and check the time of the next event a little more often.

How about this possibility: have the eventRun stuff called by the main
select loop instead of by the main.c loop. While scanning through the list
of FDs checking for activity, do a time check every N FDs and call eventRun
when needed.
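
Roughly like this sketch; I'm assuming helpers along the lines of event.c's
eventNextTime() and eventRun(), and the check interval is arbitrary:

    /*
     * Check the clock every N descriptors during the scan rather than
     * only once per pass through main().  Helper names follow what I
     * believe is in event.c/main.c; exact names may differ.
     */
    #define FDS_PER_TIMECHECK 64

    static void
    comm_check_fds(int maxfd)
    {
        int fd;
        for (fd = 0; fd <= maxfd; fd++) {
            /* ... test fd for readiness and dispatch its handler ... */
            if (fd % FDS_PER_TIMECHECK == 0) {
                getCurrentTime();
                if (eventNextTime() <= 0)       /* an event is due */
                    eventRun();
            }
        }
    }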

> Or better yet, just make the cleanup/maintain routines notice how much
> work is actually needed, rather than assuming they'll be called a
> minimum of X times per time period.

This is another possibility, but I'd then be concerned about the event code
starting to hog the CPU and making general operation appear to stall for
periods.

> > *) Moving ICP query responder/receiver to a sub-thread.
> >
> > This will be a BIG win by removing the ICP poll out of the main thread and
> > the associated processing that goes with it. This function has two
> > operations.
>
> Yes, this is long overdue. I seem to recall there is some headache
> here w.r.t walking the global structures. Doesn't some locking need to
> be introduced or something?
>
> Hmm. You might be able to play fast and loose by versioning the global
> structures, and having the ICP thread check the version often
> (i.e. every time it would dereference a pointer, it checks the
> version, and if it's changed, restarts).

I'm glad that most people seem to agree on this. Andres is concerned about
DoS attacks, and in some ways I agree. It's no good accepting ICP too fast
if the main thread can't handle the load. There would definitely still
need to be some interaction between the two threads to perform some rate
limiting; what form that would take I haven't thought about yet.

> > 1) Sending ICP queries and receiving replies
> >
> > 2) Responding to ICP queries
> >
> > By breaking this out the parent thread won't need to worry about the ICP
> > socket and can just poll (similar to the way it currently does), but it
> > only needs to do a pointer lookup instead of an actual poll().
> >
> > This could gain another 10%+ CPU based on our current ICP handling loads.
> > This load would also translate well to another CPU.
>
> Am I reading this right? What you've have is a child thread sleeping
> on select or recvfrom or something, and every time it gets a packet,
> it stuffs it in a queue, and sets a global var. Then in comm_select,
> in the main execution loop, add something like
> if (icp_queue)
> run_icp_queue();

For parsing ICP replies from other caches that this cache has sent ICP queries
to, then yes.

The thread can completely handle ICP queries from other caches and replying
back to them by itself. Again, this operation may need some rate limiting.

Whether the parent handles ICP querying entirely and a thread handles ICP
replying, or there is a thread for each, or a single thread handles both,
is open for experimentation.
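
A pthreads-based sketch of the reply-queue side, with a bounded queue
standing in as a crude rate limit (the names here are made up, not current
Squid code):

    #include <pthread.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/socket.h>

    #define ICP_QUEUE_MAX 256
    #define ICP_PKT_MAX   2048

    typedef struct icp_pkt {
        struct sockaddr_storage from;
        socklen_t fromlen;
        char buf[ICP_PKT_MAX];
        int len;
        struct icp_pkt *next;
    } icp_pkt;

    static pthread_mutex_t icp_lock = PTHREAD_MUTEX_INITIALIZER;
    static icp_pkt *icp_queue_head = NULL;
    static icp_pkt *icp_queue_tail = NULL;
    static volatile int icp_queue_len = 0;    /* main loop polls this */

    /* Child thread: blocks in recvfrom() on the ICP socket and queues
     * raw packets; dropping when full keeps the parent from being
     * swamped (poor man's rate limiting). */
    static void *
    icp_reader_thread(void *arg)
    {
        int icp_sock = *(int *) arg;
        for (;;) {
            icp_pkt *p = malloc(sizeof(*p));
            p->fromlen = sizeof(p->from);
            p->len = recvfrom(icp_sock, p->buf, ICP_PKT_MAX, 0,
                (struct sockaddr *) &p->from, &p->fromlen);
            if (p->len <= 0) {
                free(p);
                continue;
            }
            pthread_mutex_lock(&icp_lock);
            if (icp_queue_len >= ICP_QUEUE_MAX) {
                free(p);                /* full: drop the packet */
            } else {
                p->next = NULL;
                if (icp_queue_tail)
                    icp_queue_tail->next = p;
                else
                    icp_queue_head = p;
                icp_queue_tail = p;
                icp_queue_len++;
            }
            pthread_mutex_unlock(&icp_lock);
        }
        return NULL;
    }

    /* Called from comm_select()'s loop as:
     *     if (icp_queue_len) run_icp_queue();                        */
    void
    run_icp_queue(void)
    {
        icp_pkt *p;
        pthread_mutex_lock(&icp_lock);
        p = icp_queue_head;
        icp_queue_head = icp_queue_tail = NULL;
        icp_queue_len = 0;
        pthread_mutex_unlock(&icp_lock);
        while (p) {
            icp_pkt *next = p->next;
            /* icpHandleUdp-style processing of p->buf goes here */
            free(p);
            p = next;
        }
    }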

> So all the processing occurs in the parent thread, but it doesn't need
> to keep polling the FD's because the child will sleep on it (and
> possible receive the packet as well)?
>
> This should dramatically reduce the number of system calls
> (automatically reducing CPU usage).

That's the goal.

> > *) Inlining DNS lookups.
> >
> > I have suitable code for building/sending/receiving/decoding of
> > packets sent to a DNS server. It would just need to have a DNS socket
> > open full-time to send DNS requests out/receive DNS requests and do socket
> > waits on that. Alternately this operation could be pushed into a single
> > DNS thread that handles the multiplexing of multiple DNS
> > requests.
>
> Is there a major win to come from this?

I think Andres answered this one well.
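
For what it's worth, what I have in mind is just the standard DNS wire
format sent over a permanently open UDP socket to the resolver, with the
replies picked up by the normal select/poll loop. A rough sketch of the
query-building side (names are illustrative):

    #include <string.h>

    /* Encode "www.example.com" as DNS labels: 3www7example3com0 */
    static int
    dns_encode_name(const char *name, unsigned char *out)
    {
        int n = 0;
        while (*name) {
            const char *dot = strchr(name, '.');
            int len = dot ? (int) (dot - name) : (int) strlen(name);
            out[n++] = (unsigned char) len;
            memcpy(out + n, name, len);
            n += len;
            name += len;
            if (*name == '.')
                name++;
        }
        out[n++] = 0;
        return n;
    }

    /* Build a minimal A-record query; returns the packet length. */
    static int
    dns_build_query(unsigned short id, const char *name, unsigned char *pkt)
    {
        int n = 12;                         /* header is 12 bytes    */
        memset(pkt, 0, 12);
        pkt[0] = id >> 8;
        pkt[1] = id & 0xff;
        pkt[2] = 0x01;                      /* RD: recursion desired */
        pkt[5] = 0x01;                      /* QDCOUNT = 1           */
        n += dns_encode_name(name, pkt + n);
        pkt[n++] = 0x00; pkt[n++] = 0x01;   /* QTYPE  = A            */
        pkt[n++] = 0x00; pkt[n++] = 0x01;   /* QCLASS = IN           */
        return n;
    }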

> > *) Squid FS.
> >
> > There's the CNFS filesystem but that has problems because objects will get
> > tossed even though they may still be needed.
> >
> > I've done some research on this. Squid would work well with a 4K
> > frag/ 8K chunk filesystem. Basically we want to write a stripped
> > down version of UFS that allows indexing by inode alone. I have a
> > design here at Connect that works out to an average of 2.5 disk
> > accesses per new object pulled in, and 1.7 disk accesses per cache
> > hit, based on a sample taken from our live caches.
> >
> > Compare this to an average of approx. 7 disk accesses with UFS per new object
> > write, and an average of 3.5 disk accesses per cache hit.
> >
> > This means about a 250% gain can be had from a customised UFS filesystem.
>
> How much disk cache are you running against that?
> How much do you care about portability? (would you require running on
> a raw device?)

I haven't actually run up a cache against it yet; I've designed it and
theoretically run some numbers through the design. My intention is to get
some working code up this week.
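
For the curious, here's a very rough sketch of the sort of on-disk layout
I've been playing with: a stripped-down, UFS-like store where an object is
located purely by its inode number. Sizes and field names are illustrative
only, not the actual design:

    #include <stdint.h>

    #define SFS_FRAG_SIZE   4096    /* 4K fragments                  */
    #define SFS_BLOCK_SIZE  8192    /* 8K chunks (2 frags)           */
    #define SFS_NDIRECT     12      /* direct block pointers/inode   */

    typedef struct sfs_inode {
        uint32_t ino;                   /* inode number == object key */
        uint32_t size;                  /* object size in bytes       */
        uint32_t direct[SFS_NDIRECT];   /* direct 8K block addresses  */
        uint32_t indirect;              /* single indirect block      */
        uint32_t mtime;
    } sfs_inode;

    /*
     * Because inodes live at a fixed, computable offset, a hit needs
     * no directory lookup: seek straight to the inode, then to its
     * data blocks.
     */
    static uint64_t
    sfs_inode_offset(uint32_t ino, uint64_t inode_area_start)
    {
        return inode_area_start + (uint64_t) ino * sizeof(sfs_inode);
    }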

        Stew.

-- 
Stewart Forster (Snr. Development Engineer)
connect.com.au pty ltd, Level 9, 114 Albert Rd, Sth Melbourne, VIC 3205, Aust.
Email: slf@connect.com.au   Phone: +61 3 9251-3684   Fax: +61 3 9251-3666