CPU profiling internally in squid

From: Andres Kroonmaa <andre@dont-contact.us>
Date: Fri, 22 Sep 2000 11:48:22 +0200

 I've put together some profiling tools into squid, and I'd
 like to through it out for your review.
 The patch is clumsy, like my coding skills, and I had some
 trouble with integrating it nicely.
 It compiled ok when I patched and recompiled in already compiled
 tree, but it seems to fail now due to different approach needed
 to make both src/ and lib/ happy.
 This is something I'm not good at.

 Description.
 Patch adds new src file, profiling.c where profiling probes are
 implemented. include/profiling.h defines a macro, that can be
 inserted anywhere in the code, and be used to fire off and stop
 a measurement. Precision is ca 1 cpu clock tick. Works best for
 Intel (and only pentiums up) (this is where I'm on), I know it
 should work with small fixes on alpha, and for sparc some code
 needs to be added.

 I've added probes in several places in the source, to have some
 results. These can be seen at:
  http://www.online.ee/~andre/squid/profil/
  http://www.online.ee/~andre/squid/profil/sample-longterm.html
 and patch itself:
  http://www.online.ee/~andre/squid/profil/prof.diff

 Patch adds new menu to cachemgr, and keeps history stats for
 last 1,5,30 secs and 1,5,30 minutes.

 probes are measuring delta ticks between invocations, and
 accumulate this time into a structure. Every 1 seconds, from
 Event management, summary is generated for each historical
 period. Time is measured between event runs, 1sec, and time
 accumulated by probes is compared to it, resulting percentage
 of CPU use. Best, average and worst cases are also noted. In
 general, I believe this allows us to have a look at what is
 eating CPU inside squid. CPU clock timer moves on globally
 for the whole system, so all non-squid time is unaccounted
 and may be attributed wrongly to some probes, upping worst
 case dramatically.

 This patch seems stable. It has some functional problems: it
 cannot handle overlapping probes of same name (it would get timing
 wrong), it tries to measure "idle" time between different probes,
 (ie time when no probes in running state) and it doesn't workout
 well. Also, because several probes I've put into code do nest or
 overlap, totals are also off.
 Average added overhead seems to be about 650 ticks per probecall,
 which can be optimised somewhat, at a cost of smaller convenience.

 One more thing I wanted to ask you gurus. I've implemented probe
 matching by way of string pointers. To start a probe, you give a
 const char string as a parameter to a probe, it records the pointer
 to it, and subsequently compares not string itself, but the pointers.
 This allows for fast check, and the string itself is used as a
 description. Seems also to work ok. I understand that it could
 brake if probe with the same name is started in one module, and
 stopped in another. Then const char strings would have different
 pointers? This could result in the probes with the same name to
 be seen as different. This should also be immediately seen in
 cachemgr output. Or is such approach acceptable, generally?

 Please review the sample output, and the patch itself, and comment.
 Also, hints on how to integrate it properly, or (re)design hints
 are very welcome.

PS. alpha timing macro is there, but I have no acces to alpha right
 now, so I cannot test. It would be helpful, if someone with skills
 to get it compiled could test it. Also, if other cpu-types can be
 added, would be very nice. All that is needed is a func or macro
 that is able to read out CPU onchip tick-counter, if supported.
 Actually, profiling code doesn't know or care in what units it
 is measuring. the only thing it assumes is that counter between
 sample is incrementing uniformly. It could be actually microsecs,
 nanosec, or whatever. For some CPU's that have 32bit counters
 (alpha, isn't it weird that 64bit cpu has 32bit counters?), it
 may wrap in few seconds. Currently, no code for handling that.

PPS. preliminary results hints that squid spends considerable amount
 of time in preparing FD's for poll (setting up pfds.events), near
 10% of total CPU time under about 50 reqs/sec load.
 Also, ACL matching seems comsuming alot, also approaching 10%.
 The rest is open for debate.

------------------------------------
 Andres Kroonmaa <andre@online.ee>
 Delfi Online
 Tel: 6501 731, Fax: 6501 708
 Pärnu mnt. 158, Tallinn,
 11317 Estonia
Received on Fri Sep 22 2000 - 03:51:41 MDT

This archive was generated by hypermail pre-2.1.9 : Tue Dec 09 2003 - 16:12:38 MST