Re: CPU profiling internally in squid

From: Andres Kroonmaa <andre@dont-contact.us>
Date: Fri, 29 Sep 2000 11:56:53 +0200

On 29 Sep 2000, at 8:18, Henrik Nordstrom <hno@hem.passagen.se> wrote:

> And what makes it different from simply using gprof to get automatic
> profiling of all functions?

 hmm, what makes it similar? gprof is useless too often. It works totally
 differently. The time between a function's start and return is not measured
 at all. What is measured is how many times any given function is called
 during each timeslice given to the process. If a function is called during
 a given timeslice, that function is attributed that timeslice. If many
 functions are called during the same timeslice, they share that timeslice
 between them. Timeslice granularity is 10ms on a typical system. If one
 function is called 80 times during a timeslice and another one 20 times,
 the first is attributed 8ms and the other 2ms, even though the second may
 have taken 9ms of real time instead of 2ms.
 Two functions that are called equally often will show up with equal CPU
 usage, even if one takes 10 times longer to complete than the other. The
 only requirement is that both are present equally in every timeslice.
 This is purely statistical, and doesn't show you at all how much time is
 _really_ spent in any function, only approximations. If you want to look
 closer, gprof isn't good enough.
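 To make the arithmetic concrete, here is a toy C snippet. It is not gprof
 code, just the call-count attribution model described above, with the
 made-up numbers from my example:

    /* Toy model: one 10ms timeslice is split by call count alone,
     * regardless of how long each call really took. */
    #include <stdio.h>

    int main(void)
    {
        const double timeslice_ms = 10.0;
        const int calls_a = 80, calls_b = 20;   /* calls seen in this slice */
        double share_a = timeslice_ms * calls_a / (calls_a + calls_b);
        double share_b = timeslice_ms * calls_b / (calls_a + calls_b);
        /* prints 8.0 and 2.0, even if B really used 9ms of the slice */
        printf("A gets %.1f ms, B gets %.1f ms\n", share_a, share_b);
        return 0;
    }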

 Besides, unless you use profiled system libs, time spent in syscalls is
 left unaccounted for, and all of it is attributed to the function calling
 the syscall. Given a function that calls read+write, you won't be able to
 tell which takes more time: read or write.
 This general profiling can be a very useful tool, but it is quite a bit more
 complex and has a lot more overhead. You don't run a gprof'ed squid on your
 production caches, do you?
 To get gprof output, you have to stop the process, and you only get total
 summaries. You can't see what's eating CPU in squid _now_, as it happens;
 you can't see what the CPU usage distribution was 1 sec ago or 5 sec ago,
 or how it changes dynamically over time. When you have a sudden spike in
 CPU usage, you can't go and see what the hell is eating it.
 You can't measure how much time preparing the poll FDs takes in comm_select
 compared to all the other stuff.
 You can't use profiling with async-io enabled; gprof simply breaks.
 Without async-io you are profiling a totally different Squid.
 I've always had dramatic problems with gprof. Most of the time you don't
 get any prof output; gprof segfaults, or squid under profiling segfaults.

 What this patch gives you is a way to place specific probes exactly where
 you want them, not everywhere. It measures the absolute real time of every
 single run of a section of code. You are not limited to functions; you can
 measure pieces of code within functions (for loops, say). Averaging is done
 by summing these absolute real times and counts, and comparing that to the
 absolute real time between averaging runs. You get precise averaged CPU
 usage, not some weird statistical approximation.
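 To give a rough idea of the mechanism, here is a simplified sketch. It is
 not the actual code from the patch; the names prof_probe, PROF_START,
 PROF_STOP and get_tick() are made up for illustration (get_tick() would
 read the CPU's tick counter, see the RDTSC snippet further down):

    /* Each probe sums the raw CPU tick counter around a section of code
     * and counts how many times that section ran. */
    typedef struct {
        const char *name;            /* description, matched by pointer */
        unsigned long long ticks;    /* accumulated delta ticks */
        unsigned long long count;    /* completed runs */
        unsigned long long start;    /* tick value at PROF_START */
    } prof_probe;

    #define PROF_START(p)  ((p)->start = get_tick())
    #define PROF_STOP(p)   ((p)->ticks += get_tick() - (p)->start, (p)->count++)

    /* At every averaging run:
     *   cpu_pct = 100.0 * p->ticks / (tick_now - tick_at_last_run);
     * then ticks/count are folded into the history periods and reset. */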

 I've written it partly to see what it could be good for. I'm specifically
 happy that it measures CPU ticks and not system ticks. I'm happy that it
 can measure sections of code that take under 500 CPU clock ticks to run
 AND compare the relative times of different implementations of such code.
 You can actually optimise code by changing small bits, recompiling and
 seeing whether the change makes sense. Whatever.
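 For the curious, the tick source on a Pentium-class CPU is the on-chip
 cycle counter read with the RDTSC instruction. A minimal GCC-style reader
 (the macro in the patch may well look different) is something like:

    /* Read the Pentium time-stamp counter with i386 GCC inline
     * assembly; "=A" places the 64-bit result from EDX:EAX into val. */
    static inline unsigned long long
    get_tick(void)
    {
        unsigned long long val;
        __asm__ __volatile__("rdtsc" : "=A" (val));
        return val;
    }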

 I've got somewhat surprising (to me) results. I was expecting to see memory
 copying functions add up to some notable CPU time (my CPU has a 100MHz RAM
 bus vs an 800MHz core). I was also expecting storeGet to take measurable
 time to find an object among 2.5M. I was also worried that too frequent
 malloc/free wastes CPU. Instead I've found that poll FD preparation is the
 main CPU hog, that comm_read takes 4 times the time comm_write takes, and
 that ACL matching is the next CPU hog (aclCheckFast isn't that fast,
 taking 7-10% of total CPU on the system).

 What makes it really different is that you can run production caches with
 it forever; when you find a function that eats too much CPU, go modify the
 source, place probes closer to the suspect sections, optimise them, and
 recompile. I think it's much easier to use.

> Andres Kroonmaa wrote:
> >
> > Disregard the previous patch. I've managed to solve my problems
> > and the updated patch is located at:
> > http://www.online.ee/~andre/squid/profil/
> > It integrates cleanly with 2.3S4-hno-CVS, adds a configure option
> > --enable-cpu-profiling, and pretty much works ;)
> >
> > I've not heard any comments on it, neither good nor bad. Please comment.
> >
> > On 22 Sep 2000, at 11:48, Andres Kroonmaa <andre@online.ee> wrote:
> >
> > >
> > > I've put together some profiling tools for squid, and I'd
> > > like to throw them out for your review.
> > > The patch is clumsy, like my coding skills, and I had some
> > > trouble integrating it nicely.
> > > It compiled ok when I patched and recompiled in an already compiled
> > > tree, but it seems to fail now due to the different approach needed
> > > to make both src/ and lib/ happy.
> > > This is something I'm not good at.
> > >
> > > Description.
> > > The patch adds a new src file, profiling.c, where the profiling probes
> > > are implemented. include/profiling.h defines a macro that can be
> > > inserted anywhere in the code and used to start and stop a
> > > measurement. Precision is ca. 1 CPU clock tick. It works best on
> > > Intel (Pentium and up only, which is what I'm on); I know it
> > > should work with small fixes on alpha, and for sparc some code
> > > needs to be added.
> > >
> > > I've added probes in several places in the source, to have some
> > > results. These can be seen at:
> > > http://www.online.ee/~andre/squid/profil/
> > > http://www.online.ee/~andre/squid/profil/sample-longterm.html
> > > and patch itself:
> > > http://www.online.ee/~andre/squid/profil/prof.diff
> > >
> > > The patch adds a new menu to cachemgr, and keeps history stats for
> > > the last 1, 5 and 30 secs and 1, 5 and 30 minutes.
> > >
> > > Probes measure delta ticks between invocations and accumulate
> > > this time into a structure. Every second, from event
> > > management, a summary is generated for each historical
> > > period. The time between event runs (1 sec) is measured, and the
> > > time accumulated by the probes is compared to it, giving a
> > > percentage of CPU use. Best, average and worst cases are also
> > > noted. In general, I believe this allows us to have a look at what
> > > is eating CPU inside squid. The CPU clock counter runs globally
> > > for the whole system, so all non-squid time is unaccounted for
> > > and may be attributed wrongly to some probes, upping the worst
> > > case dramatically.
> > >
> > > This patch seems stable. It has some functional problems: it
> > > cannot handle overlapping probes of the same name (it would get the
> > > timing wrong), and it tries to measure "idle" time between different
> > > probes (ie time when no probe is in a running state), which doesn't
> > > work out well. Also, because several probes I've put into the code
> > > nest or overlap, the totals are also off.
> > > The average added overhead seems to be about 650 ticks per probe call,
> > > which can be optimised somewhat, at the cost of some convenience.
> > >
> > > One more thing I wanted to ask you gurus. I've implemented probe
> > > matching by way of string pointers. To start a probe, you give a
> > > const char string as a parameter; the probe records the pointer to
> > > it and subsequently compares not the string itself, but the pointers.
> > > This allows for a fast check, and the string itself is used as the
> > > description. It seems to work ok. I understand that it could
> > > break if a probe with the same name is started in one module and
> > > stopped in another. The const char strings would then have different
> > > pointers? That could cause probes with the same name to be seen as
> > > different. It should also be immediately visible in the cachemgr
> > > output. Or is such an approach acceptable, generally?
> > >
> > > Please review the sample output, and the patch itself, and comment.
> > > Also, hints on how to integrate it properly, or (re)design hints
> > > are very welcome.
> > >
> > > PS. The alpha timing macro is there, but I have no access to an alpha
> > > right now, so I cannot test it. It would be helpful if someone with
> > > the skills to get it compiled could test it. Also, it would be very
> > > nice if other CPU types could be added. All that is needed is a func
> > > or macro that reads out the CPU's on-chip tick counter, if supported.
> > > Actually, the profiling code doesn't know or care in what units it
> > > is measuring; the only thing it assumes is that the counter
> > > increments uniformly between samples. It could actually be microsecs,
> > > nanosecs, or whatever. For some CPUs that have 32bit counters
> > > (alpha; isn't it weird that a 64bit CPU has a 32bit counter?), it
> > > may wrap in a few seconds. Currently there is no code for handling that.
> > >
> > > PPS. Preliminary results hint that squid spends a considerable amount
> > > of time preparing FDs for poll (setting up pfds.events), near
> > > 10% of total CPU time under a load of about 50 reqs/sec.
> > > Also, ACL matching seems to consume a lot, also approaching 10%.
> > > The rest is open for debate.
> >
> > ------------------------------------
> > Andres Kroonmaa <andre@online.ee>
> > Delfi Online
> > Tel: 6501 731, Fax: 6501 708
> > Pärnu mnt. 158, Tallinn,
> > 11317 Estonia
>

------------------------------------
 Andres Kroonmaa <andre@online.ee>
 Delfi Online
 Tel: 6501 731, Fax: 6501 708
 Pärnu mnt. 158, Tallinn,
 11317 Estonia
Received on Fri Sep 29 2000 - 03:59:53 MDT
