Re: Res: Res: [squid-users] squid 3.2.0.5 smp scaling issues from david_at

From: <david_at_lang.hm>
Date: Mon, 25 Apr 2011 16:28:03 -0700 (PDT)

On Mon, 25 Apr 2011, Marcos wrote:

> thanks for your answer David.
>
> i'm seeing a lot of features being included in squid 3.x, but it's getting
> slower as new features are added.

that's unfortunately fairly normal.

> i think squid 3.2 with 1 worker should be as fast as 2.7, but it's getting
> slower and hungrier.

that's one major problem, but the fact that the ACL matching isn't scaling
with more workers I think is what's killing us.
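
For context, the rule evaluation described further down in the quoted thread (rules tested top-down until one matches, each line tested left to right until a condition fails) can be sketched roughly as below. This is only an illustration under simplifying assumptions, not Squid's code: `struct rule`, `check_access` and the `tests_done` counter are hypothetical names, and each rule is reduced to a single source and destination IPv4 address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified rule: one source and one destination address. */
struct rule {
    uint32_t src;
    uint32_t dst;
    int allow;          /* 1 = allow, 0 = deny */
};

/*
 * Walk the rules top-down and return the action of the first rule whose
 * conditions all match. Conditions are tested left to right, so a source
 * mismatch skips the destination test for that rule. If no rule matches,
 * the result is deny (the "deny all" at the bottom of the ruleset).
 * If tests_done is non-NULL it receives the number of comparisons made.
 */
int check_access(const struct rule *rules, size_t n,
                 uint32_t src, uint32_t dst, size_t *tests_done)
{
    size_t tests = 0;
    int verdict = 0;    /* implicit deny-all */

    assert(rules != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        tests++;
        if (rules[i].src != src)
            continue;   /* first condition failed, skip the rest of the line */
        tests++;
        if (rules[i].dst != dst)
            continue;
        verdict = rules[i].allow;
        break;          /* first matching line wins */
    }

    if (tests_done != NULL)
        *tests_done = tests;
    return verdict;
}
```

In this model the cost of a request grows with the position of the matching rule, which is consistent with the observation quoted below that moving the matching ACL to the top of the ruleset made the other ACLs effectively free.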

1 3.2 worker is ~1/3 the speed of 2.7, but with the easy availability of 8+
real cores (not hyperthreaded 'fake' cores), you should still be able to
get ~3x the performance of 2.7 by using 3.2.

unfortunately that's not what's happening, and we end up topping out
around 1/2-2/3 the performance of 2.7
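
The gap between ideal and observed scaling can be put into numbers with a small sketch like the one below. The throughput constants are the requests/sec figures from the 14 Apr test run quoted further down. The function names are made up for illustration, and "ideal" here assumes each worker adds its full single-worker rate, which is an assumption rather than anything Squid promises.

```c
#include <assert.h>
#include <stdio.h>

/* Throughput figures (requests/sec) from the 14 Apr test run quoted below. */
#define RPS_27            4800.0
#define RPS_32_1_WORKER   1300.0
#define RPS_32_5_WORKERS  2800.0

/* Ideal throughput if every worker contributed its full single-worker rate. */
double ideal_rps(double one_worker_rps, int workers)
{
    assert(workers > 0);
    return one_worker_rps * workers;
}

/* Fraction of the ideal linear throughput that was actually measured. */
double scaling_efficiency(double measured_rps, double one_worker_rps, int workers)
{
    return measured_rps / ideal_rps(one_worker_rps, workers);
}

/* Print the comparison; intended to be called from main(). */
void print_scaling_summary(void)
{
    printf("3.2, 1 worker vs 2.7:   %.2f\n", RPS_32_1_WORKER / RPS_27);
    printf("3.2, 5 workers vs 2.7:  %.2f\n", RPS_32_5_WORKERS / RPS_27);
    printf("ideal 8 workers (rps):  %.0f\n", ideal_rps(RPS_32_1_WORKER, 8));
    printf("5-worker efficiency:    %.2f\n",
           scaling_efficiency(RPS_32_5_WORKERS, RPS_32_1_WORKER, 5));
}
```

With those figures, linear scaling of the measured single-worker rate to 8 workers would give 10400 requests/sec, a bit over 2x the 2.7 figure, while the measured 5-worker run reaches only about 43% of its ideal.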

David Lang

>
> Marcos
>
>
> ----- Original message ----
> From: "david_at_lang.hm" <david_at_lang.hm>
> To: Marcos <mczueira_at_yahoo.com.br>
> Cc: Amos Jeffries <squid3_at_treenet.co.nz>; squid-users_at_squid-cache.org;
> squid-dev_at_squid-cache.org
> Sent: Friday, 22 April 2011 15:10:44
> Subject: Re: Res: [squid-users] squid 3.2.0.5 smp scaling issues
>
> ping, I haven't seen a response to this additional information that I sent out
> last week.
>
> squid 3.1 and 3.2 are a significant regression in performance from squid 2.7 or
> 3.0
>
> David Lang
>
> On Thu, 14 Apr 2011, david_at_lang.hm wrote:
>
>> Subject: Re: Res: [squid-users] squid 3.2.0.5 smp scaling issues
>>
>> Ok, I finally got a chance to test 2.7STABLE9
>>
>> it performs about the same as squid 3.0, possibly a little better.
>>
>> with my somewhat stripped down config (smaller regex patterns, replacing CIDR
>> blocks and names that would need to be looked up in /etc/hosts with individual
>> IP addresses)
>>
>> 2.7 gives ~4800 requests/sec
>> 3.0 gives ~4600 requests/sec
>> 3.2.0.6 with 1 worker gives ~1300 requests/sec
>> 3.2.0.6 with 5 workers gives ~2800 requests/sec
>>
>> the numbers for 3.0 are slightly better than what I was getting with the full
>> ruleset, but the numbers for 3.2.0.6 are pretty much exactly what I got from the
>> last round of tests (with either the full or simplified ruleset)
>>
>> so 3.1 and 3.2 are a very significant regression from 2.7 or 3.0, and the
>> ability to use multiple worker processes in 3.2 doesn't make up for this.
>>
>> the time taken seems to almost all be in the ACL evaluation, as eliminating all
>> the ACLs takes 1 worker with 3.2 up to 4200 requests/sec.
>>
>> one theory is that even though I have IPv6 disabled on this build, the added
>> space and more expensive checks needed to compare IPv6 addresses instead of IPv4
>> addresses accounts for the single worker drop of ~66%. that seems rather
>> expensive, even though there are 293 http_access lines (and one of them uses
>> external file contents in its acls, so it's a total of ~2400 source/destination
>> pairs, however due to the ability to shortcut the comparison the number of tests
>> that need to be done should be <400)
>>
>>
>>
>> In addition, there seems to be some sort of locking between the multiple worker
>> processes in 3.2 when checking the ACLs, as the test with almost no ACLs scales
>> close to 100% per worker while with the ACLs it scales much more slowly, and
>> above 4-5 workers actually drops off dramatically (to the point where with 8
>> workers the throughput is down to about what you get with 1-2 workers). I don't
>> see any conceptual reason why the ACL checks of the different worker threads
>> should impact each other in any way, let alone in a way that limits scalability
>> to ~4 workers before adding more workers is a net loss.
>>
>> David Lang
>>
>>
>>> On Wed, 13 Apr 2011, Marcos wrote:
>>>
>>>> Hi David,
>>>>
>>>> could you run and publish your benchmark with squid 2.7?
>>>> i'd like to know if there is any regression between 2.7 and the 3.x series.
>>>>
>>>> thanks.
>>>>
>>>> Marcos
>>>>
>>>>
>>>> ----- Original message ----
>>>> From: "david_at_lang.hm" <david_at_lang.hm>
>>>> To: Amos Jeffries <squid3_at_treenet.co.nz>
>>>> Cc: squid-users_at_squid-cache.org; squid-dev_at_squid-cache.org
>>>> Sent: Saturday, 9 April 2011 12:56:12
>>>> Subject: Re: [squid-users] squid 3.2.0.5 smp scaling issues
>>>>
>>>> On Sat, 9 Apr 2011, Amos Jeffries wrote:
>>>>
>>>>> On 09/04/11 14:27, david_at_lang.hm wrote:
>>>>>> A couple more things about the ACLs used in my test
>>>>>>
>>>>>> all of them are allow ACLs (no deny rules to worry about precedence of)
>>>>>> except for a deny-all at the bottom
>>>>>>
>>>>>> the ACL line that permits the test source to the test destination has
>>>>>> zero overlap with the rest of the rules
>>>>>>
>>>>>> every rule has an IP based restriction (even the ones with url_regex are
>>>>>> source -> URL regex)
>>>>>>
>>>>>> I moved the ACL that allows my test from the bottom of the ruleset to
>>>>>> the top and the resulting performance numbers were up as if the other
>>>>>> ACLs didn't exist. As such it is very clear that 3.2 is evaluating every
>>>>>> rule.
>>>>>>
>>>>>> I changed one of the url_regex rules to just match one line rather than
>>>>>> a file containing 307 lines to see if that made a difference, and it
>>>>>> made no significant difference. So this indicates to me that it's not
>>>>>> having to fully evaluate every rule (it's able to skip doing the regex
>>>>>> if the IP match doesn't work)
>>>>>>
>>>>>> I then changed all the acl lines that used hostnames to have IP
>>>>>> addresses in them, and this also made no significant difference
>>>>>>
>>>>>> I then changed all subnet matches to single IP address (just nuked /##
>>>>>> throughout the config file) and this also made no significant difference.
>>>>>>
>>>>>
>>>>> Squid has always worked this way. It will *test* every rule from the top down
>>>>> to the one that matches. Also testing each line left-to-right until one fails or
>>>>> the whole line matches.
>>>>>
>>>>>>
>>>>>> so why are the address matches so expensive?
>>>>>>
>>>>>
>>>>> 3.0 and older IP address is a 32-bit comparison.
>>>>> 3.1 and newer IP address is a 128-bit comparison with memcmp().
>>>>>
>>>>> If something like a word-wise comparison can be implemented faster than
>>>>> memcmp() we would welcome it.
>>>>
>>>> I wonder if there should be a different version that's used when IPv6 is
>>>> disabled. this is a pretty large hit.
>>>>
>>>> if the data is aligned properly, on a 64 bit system this should still only be 2
>>>> compares. do you do any alignment on the data now?
>>>>
>>>>>> and as noted in the e-mail below, why do these checks not scale nicely
>>>>>> with the number of worker processes? If they did, the fact that one 3.2
>>>>>> process is about 1/3 the speed of a 3.0 process in checking the acls
>>>>>> wouldn't matter nearly as much when it's so easy to get an 8+ core system.
>>>>>>
>>>>>
>>>>> There you have the unknown.
>>>>
>>>> I think this is a fairly critical thing to figure out.
>>
>
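
On the memcmp() point quoted above: a word-wise equality check on a 128-bit address, along the lines of the "2 compares on a 64 bit system" idea, might look something like the sketch below. It rests on several assumptions: `struct addr128` and both function names are hypothetical and are not Squid's address class, it tests equality only (no ordering), and whether it actually beats memcmp() on a given compiler and CPU would have to be measured.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 128-bit address holder; not Squid's actual class. */
struct addr128 {
    unsigned char bytes[16];
};

/* Byte-wise equality via memcmp(), as described for 3.1 and newer. */
int addr_equal_memcmp(const struct addr128 *a, const struct addr128 *b)
{
    assert(a != NULL && b != NULL);
    return memcmp(a->bytes, b->bytes, sizeof a->bytes) == 0;
}

/*
 * Word-wise equality: load each address as two 64-bit words and do two
 * compares. Copying into locals with memcpy() sidesteps alignment and
 * aliasing problems; with a fixed size of 8, compilers commonly turn
 * these copies into plain loads.
 */
int addr_equal_words(const struct addr128 *a, const struct addr128 *b)
{
    uint64_t a_hi, a_lo, b_hi, b_lo;

    assert(a != NULL && b != NULL);
    memcpy(&a_hi, a->bytes, 8);
    memcpy(&a_lo, a->bytes + 8, 8);
    memcpy(&b_hi, b->bytes, 8);
    memcpy(&b_lo, b->bytes + 8, 8);
    return a_hi == b_hi && a_lo == b_lo;
}
```

Both functions should agree on every input, so the memcmp() version can serve as a reference when benchmarking the word-wise one.
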
Received on Mon Apr 25 2011 - 23:28:04 MDT
