oprofile traces (or "you're doing what with the l2 cache?")

From: Adrian Chadd <adrian@dont-contact.us>
Date: Mon, 24 Dec 2007 13:10:12 +0900

I've been fiddling with oprofile some more, and here's some interesting stuff.
Comments/questions/etc. are welcome.

Here's a series of traces from my PIII-633 Celeron test box. Faster boxes produce
different-looking traces, but the main culprits are the same.

CPU: PIII, speed 634.478 MHz (estimated)
Counted IFU_MEM_STALL events (cycles instruction fetch pipe is stalled) with a unit mask of 0x00 (No unit mask) count 7500
samples  %       image name      symbol name
270985   3.6905  libc-2.3.6.so   vfprintf
175671   2.3924  libc-2.3.6.so   re_search_internal
161464   2.1990  libc-2.3.6.so   _int_malloc
111337   1.5163  squid           comm_select

This is "where the instruction pipeline seems to stall the most", and I believe
that translates to "because its waiting for data to come along to satisfy
what the instructions need."

Compare that to:

CPU: PIII, speed 634.478 MHz (estimated)
Counted L2_LD events (number of L2 data loads) with a unit mask of 0x0f (All cache states) count 10000
samples  %        image name      symbol name
19592    25.1315  libc-2.3.6.so   _int_malloc
10603    13.6009  libc-2.3.6.so   memcpy
 3051     3.9136  libc-2.3.6.so   malloc_consolidate
 2201     2.8233  squid           memPoolAlloc
 1856     2.3808  squid           httpHeaderIdByName

Now, a large part of that _int_malloc is because I'm malloc'ing twice for every
new string rather than once - but I'm also not malloc'ing at all for strings
that are cloned (from, say, HTTP headers), and quite a bit of that is going on now
in my non-disk tests:

(from mgr:strings):

Create: 50227639
Dup: 33111752
Copy-On-Modify: 0
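
To make those Dup numbers concrete, here's a minimal sketch (mine, not Squid's
actual String code) of the refcounted-buffer idea: cloning a header string
becomes a refcount bump rather than a malloc+memcpy, and keeping the count in
the same allocation as the bytes removes the second malloc per string.

#include <stdlib.h>
#include <string.h>

typedef struct {
    int refcount;
    size_t len;
    char data[];                /* flexible array member: one malloc, not two */
} refstr_t;

static refstr_t *refstr_create(const char *s, size_t len)
{
    refstr_t *r = malloc(sizeof(*r) + len + 1);
    if (!r)
        return NULL;
    r->refcount = 1;
    r->len = len;
    memcpy(r->data, s, len);
    r->data[len] = '\0';
    return r;
}

static refstr_t *refstr_dup(refstr_t *r)
{
    r->refcount++;              /* "Dup" is now a counter bump, no allocation */
    return r;
}

static void refstr_release(refstr_t *r)
{
    if (--r->refcount == 0)
        free(r);
}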

Next: what about L2 data stores?

CPU: PIII, speed 634.478 MHz (estimated)
Counted L2_ST events (number of L2 data stores) with a unit mask of 0x0f (All cache states) count 10000
samples  %        image name      symbol name
8890     29.3206  libc-2.3.6.so   memcpy
2342      7.7243  libc-2.3.6.so   _int_malloc
1183      3.9017  libc-2.3.6.so   vfprintf

Well, we still see the same culprits chewing CPU:

* vfprintf
* memcpy
* malloc

Now, memcpy comes up in two main places:
    908    6.6099  squid          memBufGrow
   1032    7.5126  squid          httpHeaderEntryPackInto
   4737   34.4835  squid          httpProcessReplyHeader
   6165   44.8788  squid          storeAppend
  13737   13.6146  libc-2.3.6.so  memcpy

The big two users are in http.c - the reply header copying and the path into the
storage manager.
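
A rough sketch of one way to get rid of the storeAppend/httpProcessReplyHeader
copies (assumed names, not Squid's store API; it reuses the refstr_t sketch
above): queue a reference to the just-read reply buffer instead of memcpy()ing
the bytes into store nodes, and let the buffer go when the last holder
releases it.

#include <stdlib.h>

typedef struct store_node {
    refstr_t *buf;              /* refcounted read buffer, from the sketch above */
    size_t offset, len;         /* the slice of buf that belongs to this object */
    struct store_node *next;
} store_node_t;

/* Append by reference: a refcount bump instead of a data copy. */
static store_node_t *store_append_ref(store_node_t **link, refstr_t *buf,
                                      size_t offset, size_t len)
{
    store_node_t *n = malloc(sizeof(*n));
    if (!n)
        return NULL;            /* error handling elided in this sketch */
    n->buf = refstr_dup(buf);
    n->offset = offset;
    n->len = len;
    n->next = NULL;
    *link = n;                  /* link points at the tail's next pointer */
    return n;
}

The copy doesn't vanish for free, of course - readers then have to cope with
data spread across the original read buffers - but it takes both the memcpy
and the matching node allocation out of the hot path.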

Next: malloc, which goes via xmalloc:

    634    3.2385  squid  storeSwapTLVAdd
   1252    6.3953  squid  storeSwapMetaPack
   2748   14.0369  squid  memReallocBuf
  14214   72.6056  squid  stmemAppend
     35    0.0347  squid  xmalloc

Since there's plenty of stmem churn - especially when Squid is proxy-only, as in my
workload - I'm guessing this is the MEM_MEM_NODE allocation.
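
For what it's worth, the fixed-size pool idea (roughly what memPools is there
for; the names below are mine) looks like this - recycling same-sized nodes on
a free list turns the alloc/free churn into pointer pushes and pops, so
_int_malloc only shows up while the pool is still cold:

#include <stdlib.h>

typedef struct pool_item {
    struct pool_item *next;
} pool_item_t;

typedef struct {
    size_t obj_size;            /* e.g. the size of a stmem node; must be
                                 * at least sizeof(pool_item_t) */
    pool_item_t *free_list;
} pool_t;

static void *pool_alloc(pool_t *p)
{
    if (p->free_list) {
        pool_item_t *i = p->free_list;
        p->free_list = i->next; /* warm path: pop a recycled node */
        return i;
    }
    return malloc(p->obj_size); /* cold path: fall back to malloc */
}

static void pool_free(pool_t *p, void *obj)
{
    pool_item_t *i = obj;
    i->next = p->free_list;     /* don't free(); keep the node for reuse */
    p->free_list = i;
}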

vfprintf and friends are a bit convoluted to track down, but basically:

* lots of library functions end up there - like inet_ntoa()
* it's in the critical path - httpHeaderPutStrf() and urlCanonical() are two
  good examples (a sketch of sidestepping it follows below)
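
As an example of what "eliminating vfprintf where possible" can look like,
here's a sketch (hypothetical helpers, not Squid's API) of packing a numeric
header with memcpy and a hand-rolled integer formatter instead of going
through the printf machinery:

#include <string.h>

/* Write a non-negative long as decimal digits into dst, return the length. */
static size_t ltoa_dec(char *dst, long v)
{
    char tmp[24];
    size_t n = 0;
    do {
        tmp[n++] = (char)('0' + v % 10);
        v /= 10;
    } while (v > 0);
    for (size_t i = 0; i < n; i++)      /* digits came out reversed */
        dst[i] = tmp[n - 1 - i];
    return n;
}

/* Append "Name: <value>\r\n" to dst, return the number of bytes written.
 * The caller guarantees dst is big enough (as a header packer can). */
static size_t put_header_int(char *dst, const char *name, long value)
{
    size_t len = strlen(name);
    memcpy(dst, name, len);
    memcpy(dst + len, ": ", 2);
    len += 2;
    len += ltoa_dec(dst + len, value);
    memcpy(dst + len, "\r\n", 2);
    return len + 2;
}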

So, summarising:

* vfprintf, malloc and memcpy are three ideal candidates to try to eliminate where possible;
* but in a lot of cases (now!) there's no single glaringly inefficient routine - it's
  systemic throughout Squid;
* of course, eliminating things which show up with high percentages won't have -that-
  much of an impact (right now). Some of the L2 loads via memcpy, for example, are priming
  the cache with data that's about to be used elsewhere "real soon now", so eliminating
  the memcpy will just shift the stalling elsewhere; it will reduce the amount of memory
  bandwidth being used, though!

Adrian

-- 
- Xenion - http://www.xenion.com.au/ - VPS Hosting - Commercial Squid Support -