chunked_mempools - feedback needed

From: Andres Kroonmaa <andre@dont-contact.us>
Date: Fri, 27 Apr 2001 16:51:09 +0200

 I think I've reached the point with chunked pools where I'm not
 quite able to improve them further. I think they are stable and
 useful. But they are in some ways quite different from the current
 mempools, and that needs to be discussed.

 I'd like you to try it, comment on it, suggest any changes and
 improvements, and decide whether it is OK for HEAD.

 ------
 Let me recall what it's all about and describe what I did.

 The initial idea for chunking came from the fact that malloc adds
 some memory overhead for every allocation, typically in the range of
 16 bytes. It has to, both to satisfy one-size-fits-all alignment and
 to be efficient with variable-sized allocations. The original memPool
 added its own overhead by keeping a list of pointers to idle or
 released objects, which costs one more pointer per object.

 Chunks are essentially arrays of objects managed through memPool
 handles, allocated in relatively large blocks and then maintained by
 the pool itself. This avoids the malloc() overhead. To avoid
 memPool's own overhead, idle or free items are not kept on a separate
 list; instead the free objects themselves become nodes in a linked
 list of free items. Thus freelist housekeeping is stuffed into the
 objects themselves. The minimum requirement is one pointer per
 object, which limits the minimum allocatable size for a memPool.
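
 To illustrate the idea (a rough sketch only, not the actual memPool
 code; the names are made up):

    /* Free objects double as freelist nodes: the first word of a
     * free object holds the pointer to the next free object. */
    typedef struct _free_node {
        struct _free_node *next;
    } free_node;

    typedef struct _chunk {
        void *objCache;     /* head of this chunk's local freelist */
        char *objects;      /* obj_size * capacity bytes of storage */
        /* ... in-use count, last-reference time, etc ... */
    } chunk;

    /* push a released object onto its chunk's local freelist */
    static void
    chunk_push_free(chunk *c, void *obj)
    {
        ((free_node *) obj)->next = c->objCache;
        c->objCache = obj;
    }

 This only works if the object size is at least sizeof(void *), which
 is exactly the minimum-size limit mentioned above.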

 The very purpose of chunking is to reduce useless overhead and to
 reduce the memory footprint of squid. Chunked pools are quite good at
 the former; the latter is somewhat more questionable.

 The original memPool was able to release an idle item at any time if
 needed, typically driven by the memory_pools_limit config tag to
 limit the amount of ram kept in idle pools. With chunking this is not
 possible, as memory can be returned only in whole chunks. If a chunk
 has even a single item in use, the whole chunk cannot be freed, even
 if it is never used again.

 For chunked pools, the idle memory limit cannot be strict; it can
 only be a wish, which must be allowed to be violated. To be memory
 efficient, chunks must be kept as full as possible, to increase the
 probability that idle chunks become empty and freeable.
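
 For reference, the config knobs involved look like this in squid.conf
 (the values here are just an example):

    # with chunked pools this becomes a target, not a hard limit
    memory_pools on
    memory_pools_limit 5 MB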

 The efficiency of chunked pools depends on the memory usage pattern.
 I can imagine a "chunk-busting" allocation sequence, in which a lot
 of chunks are allocated but then left nearly empty for a long time,
 e.g. allocating a lot of short-lived objects interleaved with a few
 long-lived ones in such a way that each long-lived object ends up in
 a separate chunk. After most short-lived objects are released, all
 chunks are almost empty but not releasable, giving pretty high
 overhead. This is unavoidable, but in practice it happens very
 rarely.

 But the point is that this *can* happen. With the original memPool it
 can't.

 An important issue with chunked pools is therefore the handling of
 freeable-chunk fragmentation. An algorithm for selecting the chunk
 used for the next allocation, and for limiting chunk sizes, is needed
 to handle that well.
 The efficiency of chunked pools is statistical: in most cases they
 have less ram overhead than the original mempools, but in unfortunate
 scenarios they can add more overhead than they save.
 Fortunately, chunks can be made large enough to force the malloc
 library to allocate them from the mmapped area. This allows idle
 parts of chunks to be paged out of the way by the OS VM.
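
 To illustrate the sizing idea (glibc-style behaviour assumed; the
 numbers are only a guess, not what the patch actually uses):

    #include <malloc.h>   /* mallopt, M_MMAP_THRESHOLD (glibc) */

    /* Pick the chunk size at or above the malloc mmap threshold so
     * each chunk lives in its own mapping and its idle pages can be
     * paged out without fragmenting the heap.  128 KB is the usual
     * glibc default threshold. */
    #define MEM_CHUNK_SIZE (256 * 1024)

    static void
    tune_malloc_for_chunks(void)
    {
        /* optionally lower the threshold explicitly instead of
         * relying on the default */
        mallopt(M_MMAP_THRESHOLD, 128 * 1024);
    }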

 When freeing with chunked mempools, the freed item has to be put onto
 a freelist chain close to the other items of its chunk, so we need to
 search for and find the chunk that the released object belongs to.
 Performance of chunking was initially relatively poor, because for
 each released object up to all chunks in a pool were searched. This
 overhead was largest for long-lived objects and nearly unnoticeable
 for very short-lived objects, because the last-used chunk was moved
 to the top of the search chain.

 To further improve performance, several changes have been made.
 Allocated chunks are put into a splay tree, and a compare function is
 used to compare a released object against a chunk, telling whether
 the object belongs to that chunk. Chunk search efficiency increased.
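
 The compare is basically a range check on the object's address,
 roughly like this (a sketch; names and fields are assumed):

    /* Decide whether pointer p falls inside chunk c's object array.
     * Returns <0, 0 or >0 so it can drive a search by address. */
    static int
    mem_obj_chunk_cmp(const void *p, const chunk *c, size_t chunk_bytes)
    {
        const char *obj   = (const char *) p;
        const char *start = c->objects;
        const char *end   = start + chunk_bytes;  /* one past the end */

        if (obj < start)
            return -1;      /* object lies below this chunk */
        if (obj >= end)
            return 1;       /* object lies above this chunk */
        return 0;           /* object belongs to this chunk */
    }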

 Then a cache of free objects was added.
 Normally, when a new chunk is created on demand, its local freelist
 is populated with all of its objects as a linked list. When objects
 are allocated from a chunk, they are taken from the head of that
 list; when an object is released, it is put back on top of that list.
 To avoid the chunk search for every released object, an additional
 per-pool freelist cache was added.

 Now when an object is released, it is not put back into its owning
 chunk, but instead just pushed on top of the pool's freelist cache.
 The next allocation is made from that cache, or, if the cache is
 empty, a suitable chunk is searched for or created.

 So, a free object can be in one of two states:
  1) back home in the chunk it belongs to (the chunk's local freelist)
  2) in flight, somewhere in the per-pool freelist cache.
 Both kinds of freelist use the objects themselves as linked-list
 nodes, so an object can be on only one of the lists at a time.
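
 Roughly, the alloc/free fast paths then look like this (again only a
 sketch; pool_get_from_chunk is an assumed helper, not real code):

    typedef struct _MemPool {
        void *freeCache;    /* per-pool cache of released objects */
        /* ... chunk list, splay tree, counters ... */
    } MemPool;

    /* free: just push onto the pool-level cache, no chunk search */
    static void
    pool_free_obj(MemPool *pool, void *obj)
    {
        ((free_node *) obj)->next = pool->freeCache;
        pool->freeCache = obj;
    }

    /* alloc: serve from the cache if possible, otherwise fall back
     * to finding (or creating) a chunk with local free objects */
    static void *
    pool_alloc_obj(MemPool *pool)
    {
        if (pool->freeCache) {
            void *obj = pool->freeCache;
            pool->freeCache = ((free_node *) obj)->next;
            return obj;
        }
        return pool_get_from_chunk(pool);
    }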

 Without periodic cleanup, most or all objects of the pool would end
 up in the freelist cache. This makes it impossible to know or select
 chunks that are empty and releasable.
 Therefore this per-pool freelist cache is periodically emptied by
 placing every object back into its home chunk: for each object in the
 free cache a splay tree search is made, and the object is put onto
 that chunk's freelist and removed from the cache.
 After that, all chunks are sorted by an algorithm that chooses in
 what order further allocations will use chunks. Empty chunks are
 considered for release: if over the idle pool limits, immediately; if
 within the pool's idle limits, only if unreferenced for quite some
 time.

 Although this seems like a very time-consuming task, it is run from a
 squid event only every 15 seconds, and its overall impact is nearly
 undetectable.
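
 In rough pseudocode the periodic pass does something like this (a
 sketch; the search, sort and release helpers are assumed):

    /* run from a squid event every ~15 seconds */
    static void
    pool_clean(MemPool *pool, time_t now)
    {
        /* 1. return every cached free object to its home chunk */
        while (pool->freeCache) {
            void *obj = pool->freeCache;
            chunk *c;
            pool->freeCache = ((free_node *) obj)->next;
            c = pool_find_chunk(pool, obj);  /* splay search, assumed */
            chunk_push_free(c, obj);
        }

        /* 2. reorder chunks for further allocations, then release
         *    empty chunks: immediately if over the idle limit,
         *    otherwise only when unreferenced for a long time */
        pool_sort_chunks(pool);
        pool_release_empty_chunks(pool, now);
    }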

 As said above, the order in which chunks are selected for the next
 alloc can influence how well the chunks are utilised. I've tried a
 few algorithms.
 Initially I preferred the chunks that reside lower in ram (comparing
 pointers). This gave squid's memory usage a general tendency to get
 compacted into the lower heap, reducing the probability of holes in
 the heap and thus helping to keep the process size smaller. Then I
 tried to prefer the least utilised chunks, to exploit locality (a new
 chunk gets most hits), but this eventually caused all chunks to reach
 equal utilisation and didn't allow any chunk to become empty.
 Currently it prefers the chunks that are most filled. This is to
 maximise chunk utilisation.
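
 The current ordering rule can be written as a simple comparator, e.g.
 for sorting an array of chunk pointers with qsort (a sketch; the
 inuse_count field is assumed):

    /* put the most-filled (but not full) chunks first: allocating
     * from them maximises per-chunk utilisation and gives lightly
     * used chunks a chance to drain empty */
    static int
    chunk_cmp_most_filled(const void *a, const void *b)
    {
        const chunk *ca = *(const chunk *const *) a;
        const chunk *cb = *(const chunk *const *) b;

        return cb->inuse_count - ca->inuse_count;   /* fuller first */
    }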

 I haven't tested chunked pools under high real-life loads, only with
 Cidera feeding one proxy (about 5-20 reqs/sec, http only). With this
 load, chunked pools behave very well: idle memory stays within 2-4M,
 with the cache holding about 2.3M objects and a process size of about
 192M.

 More mods.
 While chasing maximum performance it occurred to me that we
 needlessly updated memMeters upon every single alloc/free. Quite a
 few Meters were updated, adding to the cpu load. I changed that so
 that only a single per-pool counter is incremented on alloc and one
 on free. Then, when memMeters are needed, and after every thousand or
 so requests have been gathered, the counters are flushed into the
 memMeters. This avoids quite a lot of needless calculations.
 The only drawback is that the precision of "high" usage is reduced
 from exact to "average". The average comes from the lazy updating of
 the Meters: I count frees and allocs over a long time, and the
 difference is added to the Meters, so sudden and short spikes of
 "high" usage can be missed.
 If that is acceptable, then this saves quite a lot of cpu.
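
 The lazy accounting boils down to two raw per-pool counters and an
 occasional flush, roughly like this (a sketch with assumed field
 names, not the actual memMeter code):

    typedef struct _pool_counters {
        long alloc_calls;   /* bumped on every alloc */
        long free_calls;    /* bumped on every free  */
    } pool_counters;

    typedef struct _pool_meter {
        long level;         /* current number of objects in use */
        long hwater;        /* high-water mark of the above */
    } pool_meter;

    /* called rarely, e.g. when cachemgr asks for stats or after about
     * a thousand requests; short spikes between flushes are missed */
    static void
    pool_flush_meters(pool_meter *m, pool_counters *cnt)
    {
        long delta = cnt->alloc_calls - cnt->free_calls;

        m->level += delta;
        if (m->level > m->hwater)
            m->hwater = m->level;

        cnt->alloc_calls = cnt->free_calls = 0;
    }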

 Also, the memPools main module is now under lib/. Under src/ we have
 only the cachemgr and event handlers. So we can now use memPools in
 modules under lib/ as well.

------------------------------------
 Andres Kroonmaa <andre@online.ee>
 CTO, Delfi Online
 Tel: 6501 731, Fax: 6501 708
 Pärnu mnt. 158, Tallinn,
 11317 Estonia