Re: [squid-users] ICAP protocol error

From: Eliezer Croitoru <eliezer_at_ngtech.co.il>
Date: Thu, 13 Jun 2013 23:30:19 +0300

Hey,

Since you are using it only for filtering, it seems to me you are
barely using the machines' CPU with only 4 instances.
You could use much more of the CPU on the same machine with SMP support
(and even without it).
I won't tell you to "try" and "experiment" on your clients, since that
would be rude, but because it's only filtering with no cache involved,
you could easily and smoothly run just another instance of Squid 3.3.5
alongside to test the problem.
My advice would be to try 3.3, just to let these monsters make good use
of their CPUs.
If I'm not wrong, each machine could handle more CPU, more connections,
etc. than it is handling right now.
Again, you will need to think about it and plan the migration: SMP
sometimes doesn't work straight out of the box, especially in a
scenario like yours with very loaded servers.
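
As a sketch, enabling SMP in 3.2+/3.3 can be as small as this in
squid.conf (the worker count here is only an example, not a
recommendation for your boxes):

  # run one kid process per core you want Squid to use
  workers 8
  # optionally pin each worker to its own core
  cpu_affinity_map process_numbers=1,2,3,4,5,6,7,8 cores=1,2,3,4,5,6,7,8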

Another issue is the network interface, which can slow things down.
If you can dedicate one interface to the ICAP service connections
alone, I would go for it.
Also, if you can use more than one interface, as in bonding/teaming or
Fibre Channel, I believe some of the network issues would simply no
longer apply to your case.
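
A rough sketch of bonding on RHEL 5 (device names, mode and addresses
are placeholders, adjust to your network):

  # /etc/modprobe.conf
  alias bond0 bonding
  options bond0 mode=balance-rr miimon=100

  # /etc/sysconfig/network-scripts/ifcfg-bond0
  DEVICE=bond0
  IPADDR=10.122.125.10
  NETMASK=255.255.255.0
  ONBOOT=yes
  BOOTPROTO=none

  # /etc/sysconfig/network-scripts/ifcfg-eth0 (and the same for eth1)
  DEVICE=eth0
  MASTER=bond0
  SLAVE=yes
  ONBOOT=yes
  BOOTPROTO=none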

If you can probe the ICAP service with a simple script, it will give
you a better indication of whether the fault is a Squid 3.1 problem or
an overloaded ICAP service.
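
A minimal sketch of such a probe in Python 2 (the host, port and
service name are taken from the squid.conf posted below; the "ok"/"err"
output and nagios-style exit codes are just one convention):

  #!/usr/bin/env python
  # icap_probe.py -- probe an ICAP service with an OPTIONS request
  # exit codes follow the nagios convention: 0 = ok, 2 = critical
  import socket
  import sys

  HOST = "10.122.125.48"
  PORT = 1344
  SERVICE = "wwreqmod"

  request = ("OPTIONS icap://%s:%d/%s ICAP/1.0\r\n"
             "Host: %s:%d\r\n"
             "\r\n") % (HOST, PORT, SERVICE, HOST, PORT)

  ok = False
  s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  s.settimeout(5)
  try:
      try:
          s.connect((HOST, PORT))
          s.sendall(request)
          reply = s.recv(4096)
          # a healthy service answers with an "ICAP/1.0 200 OK" status line
          ok = reply.startswith("ICAP/1.0 200")
      except socket.error:  # covers timeouts as well
          ok = False
  finally:
      s.close()

  if ok:
      print "ok"
      sys.exit(0)
  else:
      print "err"
      sys.exit(2)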

You can use tcpdump to capture one of the many ICAP REQMOD requests and
write a small nagios-like script (such as the sketch above) that prints
"ok" or "err" and reports the result into MRTG or any other tool.
This way you can pinpoint and aim at the right target (Squid or the
ICAP service).
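
Capturing the matching traffic for such a check is a one-liner (the
interface name here is an assumption, adjust to your box):

  tcpdump -i eth0 -s 0 -w icap-reqmod.pcap host 10.122.125.48 and port 1344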

This plugin can also be used if you run Nagios:
http://exchange.nagios.org/directory/Plugins/Anti-Virus/check_icap.pl/details

What monitoring system are you using? Nagios? Zabbix? Munin? Icinga? PRTG?

Thanks,
Eliezer

On 6/13/2013 5:22 PM, guest01 wrote:
> Hi,
>
> Thanks for your answers.
>
> At the moment we have 4 "monster" servers and no indication of any
> performance issues (we have extensive Munin monitoring):
>
> TCP-states: http://prntscr.com/19qle2
> CPU: http://prntscr.com/19qltm
> Load: http://prntscr.com/19qlwe
> Vmstat: http://prntscr.com/19qm3v
> Bandwidth: http://prntscr.com/19qmc4
>
> We have 4 squid instances per server and 4 servers, handling
> altogether approx 2000 rps without hard-disk caching. Half of them are
> doing Kerberos authentication and the other half are doing LDAP
> authentication. Content scanning is done by a couple (6 at the moment)
> of Webwasher appliances. These are my cache settings per instance:
> # cache specific settings
> cache_replacement_policy heap LFUDA
> cache_mem 1600 MB
> memory_replacement_policy heap LFUDA
> maximum_object_size_in_memory 2048 KB
> memory_pools off
> cache_swap_low 85
> cache_swap_high 90
>
> My plan is to adjust a couple of ICAP timers and increase the ICAP
> debugging level to 93,4 or 93,5.
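> (If I understand the debug sections correctly, that would be
> something like this in squid.conf: debug_options ALL,1 93,4 )
>
> I also found these messages: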
> 2013/06/13 03:49:42| essential ICAP service is down after an options
> fetch failure: icap://10.122.125.48:1344/wwreqmod [down,!opt]
> 2013/06/13 11:09:33.530| essential ICAP service is suspended:
> icap://10.122.125.48:1344/wwreqmod [down,susp,fail11]
>
> What does down,!opt or down,susp,fail11 mean?
>
> thanks!
> Peter
>
>
>
> On Thu, Jun 13, 2013 at 2:41 AM, Eliezer Croitoru <eliezer_at_ngtech.co.il> wrote:
>> Hey,
>>
>> There was a bug related to load on a server.
>> Your server is a monster!!
>> As far as I can tell, Squid 3.1.12 cannot even use the amount of CPU
>> you have on this machine, unless you have a couple of clever tricks up
>> your sleeve (routing, marking, etc.).
>>
>> To make sure what the problem is, I would also recommend verifying the
>> load on the server in terms of open and half-open sessions/connections
>> to Squid and to the ICAP service/server.
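>> For example, to count connections per TCP state (assuming net-tools
>> is installed; the second line looks only at the ICAP port):
>> netstat -ant | awk 'NR>2 {print $6}' | sort | uniq -c
>> netstat -ant | grep ':1344 ' | awk '{print $6}' | sort | uniq -c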
>> Are you using this Squid server for filtering only, or also as a cache?
>> If so, what is the cache size?
>>
>> The answers to the above questions can help us assess your situation
>> and verify that the culprit is a specific bug which, from my testing
>> on 3.3.5, doesn't exist anymore.
>> If you are up for the task of verifying the load on the server, I'd
>> say there is a 90% chance it's that bug.
>> What I had was a problem where, once Squid went over 900 RPS, the ICAP
>> service would go into a mode in which it stopped responding to
>> requests (and showed the mentioned screen).
>> This bug was tested on a very slow machine compared to yours.
>> On a monster like yours the effect I tested might not appear with the
>> same side effects, i.e. a "denial of service", but rather as an
>> "interruption of service" which your monster recovers from very
>> quickly.
>>
>> I'm here if you need any assistance,
>> Eliezer
>>
>>
>> On 6/12/2013 4:57 PM, guest01 wrote:
>>>
>>> Hi guys,
>>>
>>> We are currently using Squid 3.1.12 (old, I know) on RHEL 5.8 64bit
>>> (HP ProLiant DL380 G7 with 16 CPU and 28GB RAM)
>>> Squid Cache: Version 3.1.12
>>> configure options: '--enable-ssl' '--enable-icap-client'
>>> '--sysconfdir=/etc/squid' '--enable-async-io' '--enable-snmp'
>>> '--enable-poll' '--with-maxfd=32768' '--enable-storeio=aufs'
>>> '--enable-removal-policies=heap,lru' '--enable-epoll'
>>> '--disable-ident-lookups' '--enable-truncate'
>>> '--with-logdir=/var/log/squid' '--with-pidfile=/var/run/squid.pid'
>>> '--with-default-user=squid' '--prefix=/opt/squid' '--enable-auth=basic
>>> digest ntlm negotiate'
>>> '-enable-negotiate-auth-helpers=squid_kerb_auth'
>>> --with-squid=/home/squid/squid-3.1.12 --enable-ltdl-convenience
>>>
>>> As ICAP server, we are using McAfee Webwasher 6.9 (old too, I know).
>>> Up until recently we hardly had any problems with this environment.
>>> Squid does authentication via Kerberos and passes the username to the
>>> Webwasher, which does an LDAP lookup to find the user's groups and
>>> assigns a policy based on group membership.
>>> We have multiple Squids and multiple Webwashers behind a hardware
>>> load balancer, approx 15k users.
>>>
>>> For the past couple of weeks we have been getting an ICAP server
>>> error message almost daily, similar to:
>>> http://support.kaspersky.com/2723
>>> Unfortunately, I cannot figure out why. I blame the Webwasher, but I
>>> am not 100% sure.
>>>
>>> This is my ICAP configuration:
>>> #ICAP
>>> icap_enable on
>>> icap_send_client_ip on
>>> icap_send_client_username on
>>> icap_preview_enable on
>>> icap_preview_size 30
>>> icap_uses_indirect_client off
>>> icap_persistent_connections on
>>> icap_client_username_encode on
>>> icap_client_username_header X-Authenticated-User
>>> icap_service service_req reqmod_precache bypass=0 icap://10.122.125.48:1344/wwreqmod
>>> adaptation_access service_req deny favicon
>>> adaptation_access service_req deny to_localhost
>>> adaptation_access service_req deny from_localnet
>>> adaptation_access service_req deny whitelist
>>> adaptation_access service_req deny dst_whitelist
>>> adaptation_access service_req deny icap_bypass_src
>>> adaptation_access service_req deny icap_bypass_dst
>>> adaptation_access service_req allow all
>>> icap_service service_resp respmod_precache bypass=0 icap://10.122.125.48:1344/wwrespmod
>>> adaptation_access service_resp deny favicon
>>> adaptation_access service_resp deny to_localhost
>>> adaptation_access service_resp deny from_localnet
>>> adaptation_access service_resp deny whitelist
>>> adaptation_access service_resp deny dst_whitelist
>>> adaptation_access service_resp deny icap_bypass_src
>>> adaptation_access service_resp deny icap_bypass_dst
>>> adaptation_access service_resp allow all
>>>
>>> Could an upgrade (either to 3.2 or to 3.3) solve this problem? (There
>>> are more ICAP options available in recent Squid versions.)
>>> Unfortunately, this is a rather complex organisational process, which
>>> is why I have not done it yet.
>>> I do have a test machine, but this ICAP error is not reproducible
>>> there, only in production. Server load and IO throughput are OK;
>>> there is nothing suspicious on the server. I recently activated ICAP
>>> debug option 93 and found the following messages:
>>> 2013/06/12 15:32:15| suspending ICAP service for too many failures
>>> 2013/06/12 15:32:15| essential ICAP service is suspended:
>>> icap://10.122.125.48:1344/wwrespmod [down,susp,fail11]
>>> 2013/06/12 15:35:15| essential ICAP service is up:
>>> icap://10.122.125.48:1344/wwreqmod [up]
>>> 2013/06/12 15:35:15| essential ICAP service is up:
>>> icap://10.122.125.48:1344/wwrespmod [up]
>>> I don't know why this check failed, but it usually does not occur
>>> when clients are getting the ICAP protocol error page.
>>>
>>> Another possibility would be the ICAP bypass, but our ICAP server is
>>> doing anti-malware checking, which is why I don't want to activate
>>> that feature.
>>>
>>> Does anybody have other ideas?
>>>
>>> Thanks!
>>> Peter
>>>
>>