Re: more profiling

From: Andres Kroonmaa <andre@dont-contact.us>
Date: Tue, 19 Sep 2006 19:23:01 +0300

On 19 Sep 2006 at 21:10, Adrian Chadd wrote:
> On Tue, Sep 19, 2006, Gonzalo Arana wrote:
>
> > Is hires profiling *that* heavy? I've used it in my production squids
> > (when I've used squid3) and the overhead was negligible.
>
> It doesn't seem to be that heavy.

hires profiling was designed for the lightest possible
overhead. I added special probes to measure its own
overhead, run regularly from events (PROF_OVERHEAD), and
on Adrian's 2.6 GHz CPU it shows around 100 clock ticks
per probe on average.

> > There is a comment in profiling.h claiming that rdtsc (for x86 arch)
> > stalls CPU pipes. That's not what Intel documentation says (page 213
> > -numbered as 4-209- of the Intel Architecture Software Developer
> > Manual, volume 2b, Instruction Reference N-Z).
> >
> > So, it should be harmless to profile as much code as possible, am I right?
>
> That's what I'm thinking! Things like perfsuite seem to do a pretty good job
> of it without requiring re-compilation as well.

Well, this is a somewhat mixed issue. Intel's documented
usage of rdtsc (at the time I coded this) required a
cpuid+rdtsc pair. Cpuid flushes all prefetch pipes and
serializes execution; that is what they meant by stalled
pipes. In my implementation I tried both and found that
including cpuid added unnecessary overhead - it does
indeed stall the pipes. On NetBurst Intel cores that can
become quite notable overhead. So I dropped it and use
the plain rdtsc instruction alone. This, on the other
hand, by definition introduces some precision error due
to possible out-of-order execution on superscalar CPUs,
but I went with the assumption that as long as the probe
start and probe stop are similar pieces of code, the
added time uncertainty largely cancels out, since we are
measuring the time *between* two invocations.

I notice that someone has added an rdtsc macro for the
Windows platform and went with the documented cpuid+rdtsc
pair. My suggestion would be to omit the cpuid: it gains
nothing in precision, because stalling the pipes adds
more overhead than the error introduced by omitting it.

> > We could build something like gprof call graph (with some
> > limitations). Adding this shouldn't be *that* difficult, right?
> >
> > Is there interest in improving the profiling code this way? (i.e.:
> > somewhat automated probe collection & adding call graph support).
>
> It'd be a pretty interesting experiment. gprof seems good enough
> to obtain call graph information (and call graph information only)
> and I'd rather we put our efforts towards fixing what we can find
> and porting over the remaining stuff from 2.6 into 3. We really
> need to concentrate on fixing up -3 rather than adding shinier things.
> Yet :)

How would you go about adding call graphs without adding
too much overhead? I think it would be hard to beat gprof
on that one.

------------------------------------
 Andres Kroonmaa
 Elion
Received on Tue Sep 19 2006 - 10:21:46 MDT

This archive was generated by hypermail pre-2.1.9 : Sun Oct 01 2006 - 12:00:06 MDT