Re: more profiling from andre@dont-contact.us on 2006-09-18 (squid-dev)

From: <andre@dont-contact.us>
Date: Mon, 18 Sep 2006 19:46:12 +0300

On 15 Sep 2006 at 10:45, Henrik Nordstrom wrote:

> fre 2006-09-15 klockan 16:25 +0800 skrev Adrian Chadd:
> > I've done some more profiling. Same test setup, current squid3 source.
> >
> > perfmon output:
> >
> > --------------------------------------------------------------------------------
> > Samples Self % Total % Function
> >
> > 18915 26.09% 26.09% MemPool::clean(long)
> > 9266 12.78% 38.86% MemPool::getStats(MemPoolStats*)
>
> Thats scary indeed.. My guess is that it would be worth the effort
> adding a little header to each object indicating their home chunk,
> avoiding the need for this rather heavy garbage collect.
>
> Why getStats is as heavy is a little mystery, but I suppose the clean
> call got inlined or something.. or maybe the chunk size is simply poorly
> selected which bites both. I also wonder a bit why it got called as
> often.
>
> Do you have any very large pools in this workload?
>
> Would also be interesting to compare to a benchmark with the non-pooled
> allocator. Simply start Squid with MEMPOOLS=0 in the environment.

Hi guys,

What are you trying here? Are you seriously trying again to use gprof for
profiling squid? What you see isn't right. Gprof can't get you right
insight into squid performance issues.

It is accounting _excessive_ amount of cpu time to malloc routines, simply
because its the most often called func, runs way faster than best gprof
timing precision, and gprof has to use _assumptions_ to weight cpu usage.
And it _assumes_ that if a func is called 100 times more often than any other,
it has to gather the most cpu time overall. Also, gprof adds 100 times more
overhead than what it is trying to measure.

Chunked malloc get/push have been performance tuned to limits, and unless
broken sometime, consume way less cpu than any next best functional subroutine.
The only issue to be solved was adding intelligence to clear xcalloced buffers
only when really needed, else mem bw overhead is unavoidable.
The mere fact that gprof shows such high cpu utilisation for malloc put/push
is telling me its misleading.

http://devel.squid-cache.org/old_projects.html#cpuProfiling
http://www.online.ee/~andre/squid/profil/

I've put quite some effort into this toolset to do exactly profiling.
If used right, it produces real data with CPU clock tick precision,
usable for code optimizations.
It is usable with optimized compiles, is lightweight enough to be enabled
on production caches.

Remember it? It was discussed here quite alot, and was included into squid3.

Simply starting Squid with MEMPOOLS=0 in the environment will show you
false results and give false sense of chunked allocator heaviness. There is
only one way to profile squid under load - use *very* lightweight probes with very
high precision. Some OS'es give you kernel tracing, but its also heavier
than specific compiled in probes.

Try the hires profiling stuff. I hope you'll like it.
I'd on other hand would like to actually see it being used on purpose, for once ;)
Received on Mon Sep 18 2006 - 11:16:59 MDT

This archive was generated by hypermail pre-2.1.9 : Sun Oct 01 2006 - 12:00:06 MDT