Re: more profiling

From: Gonzalo Arana <gonzalo.arana@dont-contact.us>
Date: Tue, 19 Sep 2006 14:12:36 -0300

On 9/19/06, Andres Kroonmaa <andre@online.ee> wrote:
> On 19 Sep 2006 at 21:10, Adrian Chadd wrote:
> > On Tue, Sep 19, 2006, Gonzalo Arana wrote:
> >
> > > Is hires profiling *that* heavy? I've used it in my production squids
> > > (while I've used squid3) and the overhead was negligible.
> >
> > It doesn't seem to be that heavy.
>
> hires profiling was designed for the lightest possible
> overhead. I added special probes to measure its own
> overhead, run regularly in events (PROF_OVERHEAD), and it
> shows around 100 clock ticks on average on Adrian's
> 2.6GHz CPU.
>
> > > There is a comment in profiling.h claiming that rdtsc (for the x86 arch)
> > > stalls the CPU pipes. That's not what the Intel documentation says (page 213
> > > -numbered as 4-209- of the Intel Architecture Software Developer's
> > > Manual, volume 2B, Instruction Set Reference N-Z).
> > >
> > > So, it should be harmless to profile as much code as possible, am I right?
> >
> > That's what I'm thinking! Things like perfsuite seem to do a pretty good job
> > of it without requiring re-compilation as well.
>
> Well, this is a somewhat mixed issue. Intel documented
> usage of rdtsc (at the time I coded this) with a
> requirement of a cpuid+rdtsc pair. Cpuid flushes all

I guess that's because the time stamp counter may have different values on
different CPUs. Am I right?

> prefetch pipes and serializes execution, and that is what
> they meant by stalled pipes. In my implementation I
> played around with both and found that including cpuid
> added unnecessary overhead - it indeed stalls the pipes.

So, the timestamp counters seem to be in sync across multiple processors.

> For NetBurst Intel cores that can become quite noticeable
> overhead. So I excluded it and use the plain rdtsc
> instruction alone. This, on the other hand, by definition
> causes some precision error due to possible out-of-order
> execution on superscalar CPUs, but I went on with the
> assumption that as long as probe start and probe stop are
> similar pieces of code, the added time uncertainty largely
> cancels out, since we are measuring time *between* two
> invocations.

Sounds reasonable to me: both the start and the stop could have (on
average) the same offset error, so they would cancel each other out. I
just wonder whether branch prediction doesn't give us some bias here.
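
Just to check that I understand the scheme, here is a rough sketch of a
symmetric rdtsc probe pair (my own illustration with made-up names, not
the actual profiling.h code): since start and stop execute the same
instruction sequence, the constant part of the skew subtracts out in
(stop - start), and calling the two back to back with nothing in between
gives exactly the PROF_OVERHEAD-style self-measurement.

  #include <stdint.h>

  /* read the CPU time stamp counter without a serializing cpuid */
  static inline uint64_t probe_rdtsc(void)
  {
      uint32_t lo, hi;
      __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
      return ((uint64_t)hi << 32) | lo;
  }

  /* hypothetical per-probe accumulator, just for illustration */
  struct probe {
      uint64_t start;
      uint64_t total;
      uint64_t count;
  };

  static inline void probe_start(struct probe *p)
  {
      p->start = probe_rdtsc();
  }

  static inline void probe_stop(struct probe *p)
  {
      /* same code path as probe_start, so any constant measurement
       * offset cancels in the difference */
      p->total += probe_rdtsc() - p->start;
      p->count++;
  }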

> > > We could build something like gprof's call graph (with some
> > > limitations). Adding this shouldn't be *that* difficult, right?
> > >
> > > Is there interest in improving the profiling code this way? (i.e.:
> > > somewhat automated probe collection & adding call graph support).
> >
> > It'd be a pretty interesting experiment. gprof seems good enough
> > to obtain call graph information (and call graph information only)
> > and I'd rather we put our efforts towards fixing what we can find
> > and porting over the remaining stuff from 2.6 into 3. We really
> > need to concentrate on fixing up -3 rather than adding shinier things.
> > Yet :)
>
> How would you go on with adding call graphs without
> adding too much overhead? I think it would be hard to
> beat gprof on that one.

I'm sorry, by 'adding call graph' I meant 'adding some basic call tree
information' (I used the wrong words, sorry). We could have a
matrix of profile information: M[caller][callee]. If you want deeper
levels, just add a new dimension to the 'profile matrix':
M[grandfather][father][callee].
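
Something along these lines is what I have in mind (just a sketch with
made-up probe names and a fixed-size table, under the assumption that
each instrumented function gets its own probe id):

  #include <stdint.h>

  /* made-up probe ids, one per instrumented function */
  enum probe_id { PROBE_NONE, PROBE_comm_select, PROBE_storeGet,
                  PROBE_clientWrite, PROBE_MAX };

  /* M[caller][callee]: ticks spent in callee when entered from caller */
  static uint64_t prof_matrix[PROBE_MAX][PROBE_MAX];
  static enum probe_id current = PROBE_NONE;  /* probe we are inside now */

  /* entering a probed function: remember who called us */
  static inline enum probe_id graph_enter(enum probe_id callee)
  {
      enum probe_id caller = current;
      current = callee;
      return caller;
  }

  /* leaving: charge the measured ticks to the (caller, callee) cell */
  static inline void graph_leave(enum probe_id caller, enum probe_id callee,
                                 uint64_t ticks)
  {
      prof_matrix[caller][callee] += ticks;
      current = caller;
  }

Going one level deeper would just mean keeping the previous two ids
around and indexing a three-dimensional array instead.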

Anyway, if we trust and rely on gprof's call tree, there is no point in
doing any of this.
Just as a note: why don't we trust gprof's profile information, yet we
do trust its call graph?

Regards,

-- 
Gonzalo A. Arana
Received on Tue Sep 19 2006 - 11:12:40 MDT