[PATCH] rock fixes and improvements

From: Alex Rousskov <rousskov_at_measurement-factory.com>
Date: Sat, 26 Apr 2014 12:09:23 -0600

Hello,

    The attached patch contains several fixes and improvements to the
shared memory and rock disk caching code, some of them quite important.
Here is a summary (detailed change descriptions are quoted at the end
of this message):

* Bug fixes:

Avoid "FATAL: Squid has attempted to read data from memory that is not
present" crashes. Improve related code.

Lifted the 16777216-slot limit from rock cache_dirs and shared memory
caches. Caches larger than 256GB (assuming the default 16KB cache_dir
slot-size) require this fix to use disk space beyond 256GB. Also fixed
the rock disk space waste warning.

Restored Squid's ability to cache (in memory) when no disk caches are
configured. This ability was lost during r12662 "Bug 3686: cache_dir
max-size default fails", but other bugs hid the problem.

Allow HITs on entries backed by a shared memory cache only.

Make sure Squid dumps core and not just promises one when memory
management goes wrong.

* Significant RAM usage reduction:

Significantly reduced Large Rock (and, to a lesser extent, shared memory)
RAM requirements by not allocating 40 (and 12, respectively) bytes of
unused RAM per cache slot.

Stop wasting 96 bytes of RAM per slot for high-offset slots in large shared
caches with more than 16777216 slots. For example, a StoreMap for a 1TB
shared cache with default 16KB slot sizes (67108864 slots) occupied
about 6.5GB of RAM. After this change, the same information is stored in
about 2.0GB because unused anchors are not stored.

* Other improvements:

Document counter-intuitive round-robin cache_dir selection; decrease its
bias.

Report IpcIo file name with errors and warnings to inform admin which
cache_dir needs troubleshooting or tuning.

These changes currently reside in our bag9 branch and have received some
lab and production testing. I propose to merge bag9 (r13323) into trunk
because it contains valuable commit messages and because merging will
save time. https://code.launchpad.net/~measurement-factory/squid/bag9

The attached patch was hand-edited because we undid trunk's
Vector-related changes to make bag9 usable in production. If the
proposed changes are approved, we will resync bag9 with trunk before
merging to regain those Vector changes.

Cheers,

Alex.

> ------------------------------------------------------------
> revno: 13323
> committer: Alex Rousskov <rousskov_at_measurement-factory.com>
> branch nick: bag9
> timestamp: Sat 2014-04-26 11:30:33 -0600
> message:
> Document counter-intuitive round-robin cache_dir selection bias; decrease it.
>
> Many squid.confs have at least two groups of cache_dir lines. For example,
> rare "large" objects go to larger/slower HDDs while popular "small" objects go
> to smaller/faster SSDs:
>
> # HDDs
> cache_dir rock /hdd1 ... min-size=large
> cache_dir rock /hdd2 ... min-size=large
> cache_dir rock /hdd3 ... min-size=large
> # SSDs
> cache_dir rock /ssd1 ... max-size=large-1
> cache_dir rock /ssd2 ... max-size=large-1
> cache_dir rock /ssd3 ... max-size=large-1
> # rock store does not support least-load yet
> store_dir_select_algorithm round-robin
>
> Since round-robin selects the first suitable disk during a sequential scan,
> the probability of selecting /hdd1 (/ssd1) is higher than that of the other
> HDDs (SSDs). Consider a large object that needs an HDD: /hdd1 is selected
> whenever the scan starts at /ssd1, /ssd2, /ssd3, or /hdd1, while /hdd2 is
> selected only when the scan starts at /hdd2 itself! The documentation now
> warns against the above cache_dir configuration approach and suggests
> interleaving cache_dirs:
>
> cache_dir rock /hdd1 ... min-size=large
> cache_dir rock /ssd1 ... max-size=large-1
> cache_dir rock /hdd2 ... min-size=large
> cache_dir rock /ssd2 ... max-size=large-1
> cache_dir rock /hdd3 ... min-size=large
> cache_dir rock /ssd3 ... max-size=large-1
> store_dir_select_algorithm round-robin
>
> Squid's historic round-robin implementation made this natural bias even
> worse because it made the starting point of the sequential scan the last
> selected dir. In the first configuration example above, that boosted the
> probability that the scan would start at one of the SSDs because smaller
> objects are more popular and, hence, their dirs are selected more often. With
> the starting point usually at an SSD, even more _large_ objects were sent to
> /hdd1 compared to other HDDs! The code change avoids this artificial boost
> (but the cache_dir lines should still be interleaved to avoid the natural
> round-robin bias discussed earlier).
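>
> To illustrate the difference, here is a minimal standalone sketch of the
> selection logic; the DirInfo type and canStore() below are made up for
> illustration and are not the actual Squid code:
>
> #include <cstddef>
> #include <cstdint>
> #include <vector>
>
> struct DirInfo {
>     int64_t minSize;
>     int64_t maxSize; // -1 means "no upper limit"
>     bool canStore(const int64_t objSize) const {
>         return objSize >= minSize && (maxSize < 0 || objSize <= maxSize);
>     }
> };
>
> // Unbiased-start round-robin: rotate the starting point on every call,
> // independently of which dir (if any) ends up being selected. The old
> // code instead restarted the next scan at the last *selected* dir.
> int selectRoundRobin(const std::vector<DirInfo> &dirs, const int64_t objSize)
> {
>     if (dirs.empty())
>         return -1;
>     static size_t start = 0; // advances by one per selection attempt
>     const size_t offset = start++ % dirs.size();
>     for (size_t i = 0; i < dirs.size(); ++i) {
>         const size_t dirn = (offset + i) % dirs.size();
>         if (dirs[dirn].canStore(objSize))
>             return static_cast<int>(dirn); // first suitable dir wins
>     }
>     return -1; // no suitable cache_dir
> }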

> ------------------------------------------------------------
> revno: 13321
> committer: Alex Rousskov <rousskov_at_measurement-factory.com>
> branch nick: bag9
> timestamp: Mon 2014-04-21 12:09:06 -0600
> message:
> Stop wasting 96 bytes of RAM per slot for high-offset slots in large shared caches
> with more than 16777216 slots.
>
> Ipc::StoreMap was using the same structure for all db slots. However, slots at
> offsets exceeding SwapFilenMax (16777215) could not contain store entry
> anchors and the anchor part of the structure was wasting RAM for those slots.
> This change splits a single array of StoreMapSlots into two arrays, one
> storing StoreMapAnchors and one storing StoreMapSlices. The anchors array is
> shorter for caches with more than 16777216 slots.
>
> For example, a StoreMap for a 1TB shared cache with default 16KB slot sizes
> (67108864 slots) occupied about 6.5GB of RAM. After this change, the same
> information is stored in about 2.0GB because unused anchors are not stored.
>
> 32-bit environments were wasting 72 (instead of 96) bytes per high-offset slot.
>
>
> Also simplified Ipc::StoreMap API by removing its StoreMapWithExtras part.
> The added complexity caused bugs and was not worth saving a few extra lines of
> callers' code. With the StoreMap storage array split in two, the extras may
> belong to each part (although the current code only adds extras to slices),
> further complicating the WithExtras part of the StoreMap API. These extras
> are now stored in dedicated shared memory segments (*_ex.shm).
>
> Added Ipc::Mem::Segment::Name() function to standardize segment name
> formation. TODO: Attempt to convert shm_new/old API to use SBuf instead of
> char* to simplify callers, most of which have to form Segment IDs by
> concatenating strings.
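>
> A rough sketch of the layout change (the types and sizes below are
> illustrative approximations, not the real Ipc::StoreMap code):
>
> #include <algorithm>
> #include <cstdint>
> #include <vector>
>
> static const int64_t SwapFilenMax = 16777215; // largest possible anchor index
>
> struct Anchor { char meta[96]; }; // per-entry metadata; ~96 bytes on 64-bit builds
> struct Slice { int32_t next; int32_t size; }; // per-slot chaining info, ~8 bytes
>
> // Before: one array of slots, each carrying both parts, even for slots
> // whose offsets exceed SwapFilenMax and so can never anchor an entry.
> struct OldSlot { Anchor anchor; Slice slice; };
>
> // After: two arrays; the anchors array is capped at the per-store entry limit.
> struct NewMap {
>     std::vector<Anchor> anchors;
>     std::vector<Slice> slices;
>     explicit NewMap(const int64_t slotCount):
>         anchors(static_cast<size_t>(std::min(slotCount, SwapFilenMax + 1))),
>         slices(static_cast<size_t>(slotCount)) {}
> };
>
> // Matching the figures above for 67108864 slots:
> //   before: 67108864 * (96 + 8) bytes ~= 6.5GB
> //   after:  16777216 * 96 + 67108864 * 8 bytes ~= 2.0GB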

> ------------------------------------------------------------
> revno: 13320
> committer: Alex Rousskov <rousskov_at_measurement-factory.com>
> branch nick: bag9
> timestamp: Sat 2014-04-19 17:05:51 -0600
> message:
> Allow HITs on entries backed by a shared memory cache only.
>
> A typo in r12501.1.59 "Do not become a store_client for entries that are not
> backed by Store" prevented such entries from being used for HITs and possibly
> even purged them from the memory cache.
> ------------------------------------------------------------
> revno: 13319
> committer: Alex Rousskov <rousskov_at_measurement-factory.com>
> branch nick: bag9
> timestamp: Fri 2014-04-18 10:57:01 -0600
> message:
> Restored Squid's ability to cache (in memory) when no disk caches are configured,
> which was lost during r12662 "Bug 3686: cache_dir max-size default fails".
>
> The bug was hidden until the memory cache code started calling
> StoreEntry::checkCachable() in branch r13315, exposing the entry size check to
> a broken limit.
>
>
> r12662 converted store_maxobjsize from a sometimes-present disk-only limit to
> an always-set Squid-global limit. However, the store_maxobjsize value was only
> calculated when parsing cache_dirs. A config without cache_dirs would leave
> store_maxobjsize at -1, triggering the "StoreEntry::checkCachable: NO: too big"
> prohibition for all responses.
>
> This change moves the store_maxobjsize calculation from the parser to
> storeConfigure(), where some other Store globals are computed after
> squid.conf parsing.
>
> Also honored the memory cache size limit (just in case cache_mem is smaller than
> maximum_object_size_in_memory) and removed leftover checks for
> store_maxobjsize being set (it should always be set, at least to zero).
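>
> A simplified sketch of the post-parsing computation; the config structure and
> function below are illustrative, not the actual Squid code:
>
> #include <algorithm>
> #include <cstdint>
> #include <vector>
>
> int64_t store_maxobjsize = -1; // the Squid-global limit discussed above
>
> struct ConfigSketch {
>     int64_t cacheMemBytes;            // cache_mem
>     int64_t maxObjectSizeInMemory;    // maximum_object_size_in_memory
>     std::vector<int64_t> dirMaxSizes; // per-cache_dir max-size values
> };
>
> // Called after squid.conf parsing (storeConfigure()-style), so the limit is
> // well defined even when no cache_dirs are configured.
> void computeStoreMaxObjSize(const ConfigSketch &cfg)
> {
>     // the memory cache can hold objects up to the smaller of these two limits
>     int64_t limit = std::min(cfg.maxObjectSizeInMemory, cfg.cacheMemBytes);
>     // any cache_dir may raise the global limit further
>     for (const int64_t dirMax : cfg.dirMaxSizes)
>         limit = std::max(limit, dirMax);
>     store_maxobjsize = std::max<int64_t>(limit, 0); // always set, at least to zero
> }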

> ------------------------------------------------------------
> revno: 13318
> committer: Alex Rousskov <rousskov_at_measurement-factory.com>
> branch nick: bag9
> timestamp: Wed 2014-04-09 22:56:25 -0600
> message:
> Significantly reduced Large Rock (and slightly shared memory) RAM requirements
> by not allocating 40 (and 12) bytes of unused RAM per cache slot.
>
> Rock: Stale code inherited from the Small Rock implementation allocated 40
> bytes of unused memory per rock db slot (2.5GB of RAM for a 1TB disk with 16KB
> db slots). The current (Large Rock) code stores the same kind of information
> (and more) in a different location (Ipc::StoreMap).
>
> Shared memory: Similarly, the Large Rock-based shared memory cache allocated
> 12 bytes of unused memory per shared memory cache slot (3.8MB of RAM for a
> 10GB shared RAM cache with 32KB slots).
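>
> For reference, the arithmetic behind those figures (approximate, using
> power-of-2 units):
>
>   1TB disk / 16KB slots = 67,108,864 slots; 67,108,864 * 40 bytes ~= 2.5GB
>   10GB RAM / 32KB slots =    327,680 slots;    327,680 * 12 bytes ~= 3.8MB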

> ------------------------------------------------------------
> revno: 13317
> committer: Alex Rousskov <rousskov_at_measurement-factory.com>
> branch nick: bag9
> timestamp: Sun 2014-04-06 15:02:18 -0600
> message:
> Lifted 16777216 slot limit from rock cache_dirs and shared memory caches.
> Also fixed rock disk space waste warning.
>
> Rock store and shared memory cache code used the entry limit and the slot limit
> as if the two were the same. In large caches, the number of entry slots may exceed
> the number of entries supported by Squid (16777216 entries per store). This
> is especially likely with large caches storing large entries (many slots
> per entry at the maximum number of entries) or with large caches using small
> slot sizes.
>
> AFAICT, the bug led to the "tail" of cache storage being unused and cache
> entries being purged even though there was still space to store them. Also,
> Squid was allocating smaller shared memory tables (and temporary cache index
> build tables) than actually needed for the configured cache sizes.
>
> Note that this change does not remove the per-store 16777216 entry limit.
>
>
> The old [small] rock warning about wasted disk space assumed that a cache
> entry occupies exactly one slot. The updated warnings do not.
>
>
> Also fixed and simplified db scanning condition during cache index build.
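>
> An illustrative sketch of keeping the two limits separate (the names below
> are made up; the real logic lives in the rock and shared memory cache code):
>
> #include <algorithm>
> #include <cstdint>
>
> static const int64_t EntryLimitPerStore = 16777216; // still capped per store
>
> struct CacheGeometry {
>     int64_t slotLimit;  // how many db slots fit on the device
>     int64_t entryLimit; // how many distinct entries the store may index
> };
>
> CacheGeometry computeGeometry(const int64_t capacityBytes, const int64_t slotSize)
> {
>     CacheGeometry geo;
>     geo.slotLimit = capacityBytes / slotSize; // may exceed 16777216
>     geo.entryLimit = std::min(geo.slotLimit, EntryLimitPerStore);
>     return geo;
> }
>
> // Sizing the slot map by entryLimit (instead of slotLimit) is what left the
> // "tail" of large caches unused before this change.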

> ------------------------------------------------------------
> revno: 13316
> committer: Alex Rousskov <rousskov_at_measurement-factory.com>
> branch nick: bag9
> timestamp: Fri 2014-04-04 10:10:40 -0600
> message:
> Report IpcIo file name with errors and warnings
> to inform admin which cache_dir needs troubleshooting or tuning.
>
> Some level-0/1 messages did not mention the disker ID or file name at all, while
> others relied on the disker process ID, which is too low-level a detail from most
> admins' point of view.
>
> Also improved the consistency of error and warning messages.
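>
> For example, a message built along these lines (a hypothetical sketch; dbName
> and the helper below are illustrative, not the actual IpcIoFile code):
>
> #include <iostream>
> #include <string>
>
> // Mention the cache_dir db file name in every level-0/1 message so the admin
> // immediately knows which cache_dir needs troubleshooting or tuning.
> void reportIoProblem(const std::string &dbName, const std::string &what)
> {
>     std::cerr << "WARNING: " << dbName << " " << what << std::endl;
> }
>
> // e.g., reportIoProblem("/hdd1/rock", "is delaying I/O requests");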

> ------------------------------------------------------------
> revno: 13315
> committer: Alex Rousskov <rousskov_at_measurement-factory.com>
> branch nick: bag9
> timestamp: Tue 2014-03-18 22:04:52 -0600
> message:
> Avoid "FATAL: Squid has attempted to read data from memory that is not present"
> crashes. Improve related code.
>
> Shared memory caching code was not checking whether the response is generally
> cachable and, hence, tried to store responses that Squid had already refused to
> cache (e.g., release-requested entries). The shared memory cache should also
> check that the response is contiguous because the disk store may not check
> that (e.g., when disk caching is disabled).
>
> The problem was exacerbated by the broken entry dump code accompanying the
> FATAL message. The Splay tree iterator evidently skips a portion of
> the tree. I added visitor pattern code to work around that, but did not try
> to fix the Splay iterator itself because, in part, the whole iterator
> design felt inappropriate (storing and flattening the tree before iterating
> it?) for its intended purpose. I could not check quickly enough where the
> broken iterator is used besides mem_hdr::debugDump(), so more bugs like this
> are possible.
>
> Optimized StoreEntry::checkCachable() a little and removed a wrong TODO comment:
> the disk-only mayStartSwapOut() should not accumulate "general" cachability checks,
> which are used by RAM caches as well.
>
> Added more mem_hdr debugging and polished method descriptions.
>
> Fixed NullStoreEntry::mayStartSwapout() spelling/case. It should override
> StoreEntry::mayStartSwapOut().
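>
> A simplified sketch of the added admission logic; the EntrySketch fields below
> stand in for the real StoreEntry checks (only checkCachable() is an actual
> StoreEntry method):
>
> struct EntrySketch {
>     bool generallyCachable; // result of StoreEntry::checkCachable()-style checks
>     bool contiguousData;    // no gaps in the in-memory object data
> };
>
> // Decide whether the shared memory cache may keep a copy of this entry.
> bool mayCacheInSharedMemory(const EntrySketch &e)
> {
>     if (!e.generallyCachable)
>         return false; // e.g., release-requested entries must not be cached
>     if (!e.contiguousData)
>         return false; // disk store may skip this when disk caching is disabled
>     return true;
> }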

> ------------------------------------------------------------
> revno: 13314
> committer: Alex Rousskov <rousskov_at_measurement-factory.com>
> branch nick: bag9
> timestamp: Tue 2014-03-18 16:49:20 -0600
> message:
> Make sure Squid dumps core and not just promises one
> when memory management goes wrong.
