Re: "concurrency" attribute and questions.

From: Amos Jeffries <squid3_at_treenet.co.nz>
Date: Mon, 6 Apr 2009 14:57:17 +1200 (NZST)

> Amos,
> Thanks for your responses, they make things clearer for me, so now I
> can ask better questions :) What I'd like to do is have my PERL
> helper fork as necessary, rather than starting up children=50 or
> children=100, or N external_acl_type "instances". That is not
> efficient and is sized off an indeterminate number of users: 50 or
> 100 may not be enough, or at times too many.
>
> What settings in the squid.conf line tell Squid the external helper
> will fork to handle subsequent objects? Below is my current line:
> external_acl_type eXhelperI children=1 %LOGIN %METHOD %{Host}
> /usr/lib/squid/eXhelper.pl

There are none. Squid knows only about the stdin/stdout/stderr pipes for
data to/from the helper, and how many concurrent requests it may send
down them (the concurrency= option).
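
If the goal is a single helper process handling many lookups at once,
the relevant knob is concurrency= rather than children=. Adapting your
line (the value 10 is illustrative):

  external_acl_type eXhelperI children=1 concurrency=10 %LOGIN %METHOD %{Host} /usr/lib/squid/eXhelper.pl

With that, Squid keeps one helper process but will have up to 10 lookups
outstanding to it at any moment, each tagged with an ID so the replies
can be matched up.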

>
> since I set "children=1" only one "eXhelper.pl" starts up with Squid,
> with the idea in mind that "eXhelper" forks children processes as
> necessary. I'm still trying to determine what state information Squid
> passes to the external helper besides the %LOGIN/%METHOD... [ below
> you mentioned an ID token, are you referring to the %LOGIN ID token?
> Or something else? ].

For url_rewrite helper:
  http://wiki.squid-cache.org/Features/Redirectors

For external_acl_type helper:
  %ID %user_format

  Where %user_format is the exact same list of %X parameters you place
  in squid.conf before the helper program name.

For auth helpers it depends on the helper scheme.
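
To make the external_acl_type case concrete: with the format list from
your squid.conf line (%LOGIN %METHOD %{Host}) and concurrency enabled,
each line arriving on stdin looks something like this (values
illustrative):

  0 jdoe GET www.foxnews.com

and the helper replies with the same ID at the front:

  0 OK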

>
> I understand that Squid forks eXhelper.pl, which means Squid owns the
> ppid (parent process ID) of "eXhelper.pl". Ideally I'd like to have
> this single child then fork subprocesses too. Currently I'm uncertain
> what input trigger (or signal), if any, exists to have the single
> external helper fork a subprocess to check the object, and how to
> uniquely ensure the "OK" or "ERR" goes back to the calling "ID token".

That all must be part of the helper scope. My knowledge of perl is
read-only with limited adaptations, but AFAIK the common while(<STDIN>)
loop can be combined with fork() so that each request line is handled by
its own child process while the parent keeps reading.
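
Purely as a sketch of that fork-per-request pattern (names illustrative;
check_acl() is a stand-in for your postgresql lookup, and this assumes
concurrency is enabled so every line carries an ID):

  #!/usr/bin/perl
  use strict;
  use warnings;

  $| = 1;                  # flush each reply line to Squid immediately
  $SIG{CHLD} = 'IGNORE';   # let the kernel reap finished children

  while (my $line = <STDIN>) {
      chomp $line;
      # With concurrency=N each line is: <ID> <user> <method> <host>
      my ($id, $user, $method, $host) = split ' ', $line, 4;

      my $pid = fork();
      die "fork failed: $!" unless defined $pid;
      next if $pid;        # parent: loop back and read the next line

      # Child: do the slow lookup, write exactly one reply line with
      # the same ID at the front, then exit.
      my $ok = check_acl($user, $method, $host);
      print $ok ? "$id OK\n" : "$id ERR\n";
      exit 0;
  }

  # Hypothetical stand-in for the real database check.
  sub check_acl {
      my ($user, $method, $host) = @_;
      return 1;    # query postgresql here; 1 => OK, 0 => ERR
  }

Each reply is a single short write, so the children should be able to
share the parent's stdout pipe safely, as long as they never print
anything else to it.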

>
> Thank you again - I did write some comments below.
>
> On Sun, Apr 5, 2009 at 6:05 AM, Amos Jeffries <squid3_at_treenet.co.nz>
> wrote:
>> louis gonzales wrote:
>>>
>>> List,
>>> 1) for the "concurrency" attribute does this simply indicate how many
>>> items in a batch will be sent to the external helper?
>>
>> Sort of. The strict definition of 'batch' does not apply. Better to
>> think of
>> it as a window max-size.
>
> Louis: So should I have the PERL helper "buffer" the data passed to
> it, rather than "reading line by line" - if "buffer" what are the
> "start and end" identifiers?

<shrug> Just read line by line; the OS pipe buffers hold anything your
helper has not yet pulled in to start processing. Though with real
internal parallelism you should not get any major delay down the pipes.

>
>>
>> So from 0 to N-concurrency items will be passed straight to the helper
>> before squid starts waiting for their replies to free-up the slots.
>
> Louis: If objects 0-(jth) belong to a specific user's request and
> (jth)-Nth belong to a different user's request, assuming concurrency
> is set to N, how does one differentiate in the external helper which
> set belongs to whom? (I'm using the %LOGIN parameter, so I know which
> userID, as authenticated by ldap, is making the request.) In other
> words, after I've determined "OK" for the 0-(jth) and "ERR" for the
> (jth)-Nth, the specific instance of the helper will need to return two
> different values. Basically my helper checks each of the Squid-passed
> object (URL/%LOGIN) pairs against the ACLs in the postgresql database.
> My use case guarantees the only end-user application will be a web
> browser, so with that assumption, when the end user opens
> www.foxnews.com, for instance, there is a multitude of objects. My
> specific question is: when Squid goes to retrieve all of these objects
> for the requesting user, a) with concurrency set high enough, does
> Squid send all of these objects to the same external helper instance
> and await a single "OK" or "ERR"? And b) with concurrency off, does
> Squid send one object to one external helper instance, awaiting "OK"
> or "ERR" for each?

(a) any load is sent to the next available helper slot,
 starting with helper #1; when that's full it starts on helper #2, etc.

So the helper stats show the requests-sent count highest for #1,
decreasing down the list; the last few children only get used under
peak load, or never if over-provisioned.

(b) one-to-one *always*, whether or not concurrency is used.
 One test sent to the helper == one response expected back.
(Forget the batching idea completely.)

Any variation from (b) causes undefined behavior in Squid (usually a
hung client connection or an assertion failure).

>
>>
>>>
>>> 1.1) assuming concurrency is set to "6" for example, and let's assume
>>> a user's browser session sends out "7" actual URL's through the proxy
>>> request - does this mean "6" will go to the first instance of the
>>> external helper, and the "7th" will go to a second instance of the
>>> helper?
>>
>> 1-6 will go straight through, probably with the IDs 0->5.
>> #7 may or may not go straight through, depending on whether one of
>> the first 6 was finished at that time.
>
> Louis: is it ever possible with concurrency enabled, that objects from
> two different users will enter into a single external helper instance?
>

Yes, almost guaranteed to happen.

Also certain is that one user's requests will be interleaved randomly
across time, without any fixed order.

>>
>>>
>>> 1.1.1) Assuming the 6 from the first part of the batch return "OK" and
>>> the 7th returns "ERR", will the user's browser session, render the 6
>>> and not render the 7th?
>>
>> Depends entirely on how the ERR/OK results are used in squid.conf.
>>
>> (you might be denying on OK or allowing on ERR).
>
>>
>>> More importantly, how does Squid know that
>>> the two batches - one of 6, and one with 1, for the 7 total, know that
>>> all 7 came from the same browser session?
>>
>> There is no such thing as a browser session to Squid.
>>
>> Each is a separate object. These 7 MAY happen to be coming from the
>> same IP, but may be different software for all Squid cares, or may
>> come from more than one IP completely.
>
> Louis: right, but Squid obviously has to know which IP the request
> came from in order to serve the page(s), so when the external helper
> produces the "OK" or "ERR", certainly those will trace back the path
> from which they came to the correct requesting application (browser
> or other).

They will track back down the same TCP-pipe the request came in on.
But once the link extends to the Internet there is no guarantee that:
 * the IP at the other end is a web browser (robots, other proxies,
manual requests)
 * the TCP-pipe itself was not using request pipelining (pipelining
proxy, some browsers)
 * the TCP-pipe does not serve multiple clients (collapsed forwarding
proxy)
 * the IP Squid identifies with the pipe is the client's (NAT,
interception)

>>>
>>> What I have currently:
>>> - openldap with postgresql, used for my "user database", which permits
>>> me to use the "auth_param squid_ldap_auth" module to authenticate my
>>> users with.
>>> - a postgresql database storing my acl's for the given user database
>>>
>>> Process:
>>> Step1: user authenticates through squid_ldap_auth
>>> Step2: the user requested URL(and obviously all images, content, ...)
>>> get passed to the external helper
>>> Step3: external helper checks those URL's against the database for the
>>> specific user and then determines "OK" or "ERR"
>>>
>>> Issue1:
>>> How to have the user requested URL(and all images, content, ...) get
>>> passed as a batch/bundle, to a single external helper instance, so I
>>> can collectively determine "OK" or "ERR"
>>>
>>> Any ideas? Is the "concurrency" attribute to declare a maximum number
>>> of "requests" that go to a single external helper instance?
>>
>> number of *parallel* requests the helper can process. Most helpers
>> shipped
>> with Squid are non-parallel (concurrency=1).
>
>>
>>> So if I
>>> set concurrency to 15, should I have the external helper read count++
>>> while STDIN lines come in, until no more, then I know I have X number
>>> in a batch/bundle?
>>
>> Depends on the language your helper is coded in. As long as it can
>> process
>> 15 lines of input in parallel without mixing anything up.
>
> Louis: PERL. I asked above, should I be "buffering" the objects/data
> sent from Squid to the external helper or, reading line by line? If
> buffer, how do I identify "start and end" of unique object request
> data?

The stdin pipe can be read as a set of \n-delimited lines. Each line is
a unique object test.

>
>>
>> Looks like a perl helper; they can do parallel just fine with no
>> special reads needed. But it must handle the extra ID token at the
>> start of the line properly.
>
> Louis: what is the ID token, if not the %LOGIN and %{HOST} information?

The integer channel number for the helper slot (a helper with
concurrency=N will see IDs 0, 1, 2, ..., N-1 arriving), with up to N
\n-delimited lines available on stdin for reading in a burst.
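
So with concurrency=4, the helper might read in one burst (values
illustrative):

  0 jdoe GET www.foxnews.com
  1 jdoe GET images.foxnews.com
  2 asmith GET www.example.com

and may write the replies back in whatever order the lookups finish:

  2 OK
  0 OK
  1 ERR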

>
>>
>>>
>>> Obviously there is no way to predetermine how many URL's/URI's will
>>> need to be checked against the database, so if I set concurrency to
>>> 1024, "presuming to be high enough" that no single request will max it
>>> out, then I can just count++ and when the external helper is done
>>> counting STDIN readlines, I can process to determine "OK" or "ERR" for
>>> that specific request?
>>
>> additional point to this:
>> the ttl=N option will cache the OK/ERR result for that lookup for N
>> seconds. This can greatly reduce the number of tests passed to the
>> helper even further.
> Louis: Thanks! This is good to know!!! :)
>
>>
>>>
>>> Issue2:
>>> I'd like to just have a single external helper instance start up, that
>>> can fork() and deal with each URL/URI request, however, I'm not sure
>>> Squid in its current incarnation passes enough information OR doesn't
>>> permit specific enough passback (from the helper) information, to make
>>> this happen.
>>
>> Squid passes an ID for each line of input. As long as the result goes
>> back out the stdout of the helper that Squid itself forked, with that
>> ID at the front, Squid does not care about the order of responses.
>
> Louis: does this mean Squid is in a "waitpid()" mode for the pending
> external helper that was forked? Is it using some named pipe?
>

Squid does asynchronous non-blocking reads from the child helper's pipe
FD using poll/epoll/select etc.; it is just another input socket as far
as Squid is concerned.
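
One practical consequence for the helper side: each reply line must
actually be flushed down the pipe when it is written, or Squid's read
will sit waiting on output stuck in the helper's stdio buffer. In perl
that is the usual:

  $| = 1;    # autoflush STDOUT, as in the sketch above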

>>
>> You will need to make sure your parallel children's stdout/stderr
>> write to your parent helper's stdout/stderr. But it should be
>> possible.
>>

Amos