Re: Squid-3.2 status update from Amos Jeffries on 2012-07-04 (squid-dev)

From: Amos Jeffries <squid3_at_treenet.co.nz>
Date: Thu, 05 Jul 2012 11:34:46 +1200

On 05.07.2012 10:00, Alex Rousskov wrote:
> On 06/27/2012 03:12 AM, Amos Jeffries wrote:
>
>> A quick review of the other major bugs shows that each will take some
>> large design and code changes to implement a proper fix or even a
>> workaround.
>>
>>
>> Are there any objections to ignoring these bugs when considering a 3.2
>> stable release:
>
> Our definition of a "stable release" has two criteria:
>
> 1. "Meant for production caches."
>
> 2. "begin when all known major bugs have been fixed [for 14 days]."
>
> Criterion #1 should probably be interpreted as "Squid Project considers
> the version suitable for production deployment". If you think we are
> there, I have no objections -- I do not have enough information to say
> whether enough users will be satisfied with current v3.2 code in
> production today. Perhaps this is something we should ask on squid-users
> after we close all bugs that we think should be closed?
>
> As for Criterion #2, your question means that either we stop considering
> those bugs as major OR we change criterion #2. IMHO, we should adjust
> that criterion so that we do not have to play these games where we mark
> something as a major bug but then decide that in the interest of a
> speedier "stable" designation we are going to "ignore" it.
>
> An adjusted initialization criterion could be phrased as
>
> 2'. "begin when #1 is satisfied for at least 14 days"
>
>
> This gives us enough flexibility to release [what we consider
> suitable-for-production] code that might have major bugs in some
> environments. I added "at least" because otherwise we may have to
> release v3.3 as stable 14 days after v3.2 is marked stable :-). In
> practice, the version should have "enough improvements" to warrant its
> numbering and its release but I do not want to digress into that
> discussion.
>
>
>
>> 3124 - Cache manager stops responding when multiple workers used
>> ** requires implementing non-blocking IPC packets between workers and
>> coordinator.
>
> Has this been discussed somewhere? IPC communication is already
> non-blocking so I suspect some other issue is at play here. The specific
> examples of mgr commands in the bug report (userhash, sourcehash,
> client_list, and netdb) seem non-essential in most environments
> and, hence, do not justify the "major" designation, but perhaps they
> indicate some major implementation problem that must be fixed.
>

UNIX sockets apparently guarantee the write() is blocked until the
recipient process has read() the packet, meaning each IPC packet is
blocked behind whatever longer AsyncCall or delay the recipient has
going on. Last I looked, the coordinator handling function also called
component handler functions synchronously for them to create the
response IPC packet.

AFAIK this is waiting on the Subscription and generic (immediate-ACK)
IPC packets, which will free up the coordinator and workers for other
async operations even if a large process is underway.
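
To make the immediate-ACK idea concrete, here is a rough sketch. It is
Python for brevity, not Squid code, and every name in it (Coordinator,
receive, run_one, the handlers map) is made up for illustration. The
point is only that the coordinator acknowledges a request as soon as it
arrives and runs the component handler later from its event loop, so the
sender is not held up by a slow handler.

```python
from collections import deque


class Coordinator:
    """Hypothetical coordinator: ACK each request at once, handle it later."""

    def __init__(self):
        self.pending = deque()  # requests accepted but not yet handled
        self.outbox = []        # packets "sent" back to workers

    def receive(self, worker_id, request):
        # Acknowledge immediately; the worker can go back to other async work.
        self.outbox.append((worker_id, "ACK", request))
        self.pending.append((worker_id, request))

    def run_one(self, handlers):
        # Called from the event loop; handles at most one queued request.
        if not self.pending:
            return False
        worker_id, request = self.pending.popleft()
        self.outbox.append((worker_id, "RESPONSE", handlers[request]()))
        return True
```

A worker asking for client_list would see the ACK straight away and the
RESPONSE packet only after run_one() has been scheduled.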

>
>> 3389 - Auto-reconnect for tcp access_log
>> ** requires asynchronous handling of log opening and blocking Squid
>> operation
>
> Since we have stable file-based logging, this bug does not have to block
> a "stable" designation if TCP logging is declared "experimental". You
> already have a patch that addresses 90% of the core problem for those
> who care.
>
> If you do not want to mark TCP logging as experimental and highlight
> this shortcoming, then the bug ought to be fixed IMHO because there is
> consensus that accurate logging is critical for many deployments.
>
>
>> 3478 - Host verify catching dynamic CDN hosted sites
>> ** requires designing a CONNECT and bump handling mechanism
>
> I am not an expert on this, but it feels like we are trying to enforce a
> [good] rule ignored by the [bad] real world, especially in interception
> environments. As a result, Squid lies and scares admins for no good
> reason (in most cases). We will not win this battle.
>
> I suggest that the "host_verify_strict off" behavior is adjusted to
> cause no harm, even if some malicious requests will get through.
>

It does that now. The "no harm" means we can't re-write the request
headers to something that we are not sure about and that would actively
cause problems if we got it wrong.
The current state is that Squid goes DIRECT instead of through peers,
which breaks interception+cluster setups.

I can open that up again, but it will mean updating the CVE to indicate
2nd-stage proxies are still vulnerable.

> If you do not want to do that, please add a [fast] ACL so that admins
> are not stuck without a solution and can whitelist bad (or all) sites.
>
>
> That said, the bug report itself does not explicitly say that something
> is _seriously_ broken, does it? I bet the cache.log messages are
> excessive on any busy site with a diverse user population, but we can
> rate-limit these messages and downgrade the severity of the bug while
> waiting for a real use case where these new checks break things (despite
> host_verify_strict being off).
>

cache_peer relay is almost completely "disabled" for some major sites.
Everything else works well.
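
On the rate-limiting of cache.log messages suggested above, a minimal
sketch of what that could look like. Again Python rather than Squid
code, with hypothetical names (LogRateLimiter, allow, suppressed) and
arbitrary limits; it allows a fixed number of messages per time window
and counts the rest so they can be summarised later.

```python
import time


class LogRateLimiter:
    """Hypothetical fixed-window limiter for repeated log messages."""

    def __init__(self, max_messages, interval, clock=time.monotonic):
        self.max_messages = max_messages  # messages allowed per window
        self.interval = interval          # window length in seconds
        self.clock = clock                # injectable for testing
        self.window_start = clock()
        self.count = 0
        self.suppressed = 0               # messages dropped so far

    def allow(self):
        now = self.clock()
        if now - self.window_start >= self.interval:
            # A new window begins: reset the per-window counter.
            self.window_start = now
            self.count = 0
        if self.count < self.max_messages:
            self.count += 1
            return True
        self.suppressed += 1
        return False
```

The logging call site would wrap each host-verify warning in
"if limiter.allow():" and could report the suppressed count
periodically.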

>
>> 3517 - Workers ldap digest
>> ** requires SMP atomic access support for all user credentials
>
> This is not a blocker IMO. SMP has several known limitations, complex
> authentication schemes being one of them. This does not affect stability
> of supported SMP configurations.
>

Okay, thank you.

>
>> Which would leave us with only these to locate (any takers?) :
>>
>> 3551 - store_rebuild.cc:116: "store_errors == 0" assertion
>
> It would be nice to figure this one out, at least for ufs, because many
> folks will try ufs with SMP and there is clearly some kind of corruption
> problem there. I assigned the bug to self for now.
>
> However, if I cannot reproduce it, I will not be able to make much
> progress. Please note that the original reporter moved on to rock store
> and does not consider this bug to be affecting him any more (per comment
> #10).
>
>
>> 3556 - assertion failed: comm.cc:1093: "isOpen(fd)"
>
> I recommend adding a guard for the comm_close() call in the Connection
> destructor to avoid the call for !isOpen(fd) orphan connections. And
> print the value of isOpen() in the BUG message.
>

Aha.
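
For the record, the shape of the guard recommended above, as an
illustrative sketch only. It is Python, not the actual C++ Connection
class; open_fds, is_open(), comm_close() and release() here are
stand-ins I made up to mirror the names in the bug report, with
release() playing the part of the destructor.

```python
open_fds = set()  # hypothetical stand-in for the descriptor table


def is_open(fd):
    return fd in open_fds


def comm_close(fd):
    # Mirrors the assertion from the bug report: closing a non-open fd is fatal.
    assert is_open(fd), "isOpen(fd)"
    open_fds.discard(fd)


class Connection:
    def __init__(self, fd):
        self.fd = fd

    def release(self):
        # Only call comm_close() for descriptors that are still open.
        if self.fd >= 0 and is_open(self.fd):
            comm_close(self.fd)
        elif self.fd >= 0:
            # Orphan connection: report it, including the isOpen() value.
            print(f"BUG: orphan connection fd={self.fd} isOpen={is_open(self.fd)}")
        self.fd = -1
```

With the guard in place an orphan connection produces a BUG message
instead of tripping the assertion.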

>
>> 3562 - StoreEntry::kickProducer Segmentation fault
>
> I suspect Squid is corrupting its own memory somewhere so this specific
> core dump cannot be trusted. This might even be the same problem as bug
> 3551 above. This could be considered a blocker at least until we know
> more, I guess.
>

Thank you.

Amos
Received on Wed Jul 04 2012 - 23:34:49 MDT