Re: [squid-users] transparent squid + clamav + https

From: Amos Jeffries <squid3_at_treenet.co.nz>
Date: Wed, 17 Mar 2010 03:53:01 +0000

On Wed, 17 Mar 2010 05:24:38 +0200, Henrik K <hege_at_hege.li> wrote:
> On Tue, Mar 16, 2010 at 08:58:27PM +0100, Henrik Nordström wrote:
>> Mon 2010-03-15 at 18:47 +0200, Henrik K wrote:
>>
>> > If you don't want this limitation, you can use HAVP. It scans the file
>> > while it's being transferred to client, while keeping small part of it
>> > buffered (in case of virus, it is not transferred so client can't open
>> > incomplete file). It's as close to transparent as you can get.
>>
>> That's also one of the three modes supported by c-icap clamav service.
>
> Such comment can only be made when one doesn't understand what HAVP does.
> It is NOT the same thing.
>
> http://www.server-side.de/documentation.htm
>
> While one can speculate about the usefulness of scanning huge files at HTTP
> level, HAVP with mandatory locking does it much more efficiently.
>
> C-icap will only call the scanner after a file is completely received,
> resulting in additional wait and a load spike.
>
> HAVP starts scanning the file immediately as it is received from the server
> and gradually unlocked. When c-icap has just started scanning the file, HAVP
> has already scanned most (if not all) of it and is sending final bytes to
> client. If a virus had happened to be found, HAVP would have already stopped
> the unnecessary download without wasting time on the whole file. This also
> works on ZIP files as it first tries to download the header at end of the
> file using Range request.

So HAVP is designed specifically to send client scanned parts of the file
before the entire thing is checked?
That explains something that I was wondering about...

Consider this scenario which I have seen in the wild:

Background: Clients visit website A and fetch a large PDF document.
Unknown to the website author, the server has been infected and PDF is one
of the file types which get a macro virus appended. The CRM system records in
a database the upload and change times for efficient if-modified responses.
The server admin is alerted and runs a virus scan, cleaning the files some
time later. The CRM database gets omitted from the update, so the recorded
change times do not reflect the cleanup.

Imagining that HAVP was in use in a proxy between this site and a user...

During the infected period imaginary-HAVP scans the documents and sends a
large "clean" prefix to all visitors, BUT... aborts when the appended
infection is detected. The browser is lucky enough to notice the file is
incomplete and retries later with a range request for the missing bit.

 a) during the infected period the fetched ranges will never succeed.

 b) after the infection is cleaned up the file will pass through
imaginary-HAVP and client will get a truncated version. With complete-file
being indicated.

This is where the problem comes in. Being a macro infection, one of the
changes to the file was that the virus inserted some undetectable jump code
at the beginning to go with the virus at the end.

We are left with the situation where intermediary proxies are holding
corrupted files (the first part being the original infected prefix with the
jump, followed by the terminal bytes of the cleaned file). The server is left
with a pristine and working file. New visitors loading it will be fine, as
will analysts coming along later.

However, for clients visiting through one of the proxies which cached the
file meanwhile ... One of two things will happen, depending on the file
viewer used:
 1) dumb viewer will try to run the random part of file (now text!) where
virus inserted itself as binary code and crash.
 2) smart viewer will notice the missing/corrupt macro (it's past the end
of file maybe) and display the file without running it. However, even then
there is a discrepancy in file prefix and some of the content appears
corrupted.
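
To make the splice concrete, here is a toy sketch in Python. It is only an
illustration of the scenario above, not anything HAVP or Squid actually does:
the byte strings, the JUMP/VIRUS markers, the 30-byte cut-off and the helper
name resume_with_range() are all made up for the example. It assumes the
cached prefix is shorter than the cleaned file and that the validator did not
change, so the resume is accepted.

```python
# Toy model of the corrupted-resume scenario. All values are hypothetical.
JUMP = b"<JUMP>"      # stand-in for the jump code placed at the start
VIRUS = b"<VIRUS>"    # stand-in for the macro virus appended at the end

clean = b"%PDF-1.4 " + b"A" * 40 + b" %%EOF"   # file after the cleanup
infected = JUMP + clean + VIRUS                # file during the infection


def resume_with_range(cached_prefix: bytes, current_body: bytes) -> bytes:
    """Complete a partial download by fetching only the missing byte range.

    Models a Range request for bytes len(cached_prefix) onward, answered
    from whatever the server holds now.
    """
    return cached_prefix + current_body[len(cached_prefix):]


# During the infected period the scanner passes on a "clean" prefix and
# aborts later; assume 30 bytes had already reached the client or cache.
delivered = infected[:30]

# After the cleanup the stale validator still matches, so the resume works
# and the tail is taken from the cleaned file.
spliced = resume_with_range(delivered, clean)

print(len(spliced) == len(clean))   # same length as the real file
print(spliced == clean)             # but not the same content
```

The spliced result has the right length and a valid-looking end, yet starts
with the jump code and is missing a few bytes in the middle, which is the
mismatch a viewer then trips over.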

This type of traffic is the #1 reason for buffering until fully processed.
I do like the idea of incremental scanning as it arrives though. That will
at least reduce the delays to very little more than the total receiving
time.
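
A rough sketch of what I mean by scanning incrementally while still buffering
until fully processed. This is a toy, assuming a single fixed byte signature;
IncrementalScanner, fetch_and_scan() and the SIGNATURE value are hypothetical
names for illustration and have nothing to do with the real clamav or c-icap
interfaces.

```python
from typing import Iterable, List, Optional

SIGNATURE = b"EVIL"  # hypothetical signature, for illustration only


class IncrementalScanner:
    """Toy scanner that is fed chunks as they arrive from the server."""

    def __init__(self, signature: bytes) -> None:
        self.signature = signature
        self.tail = b""        # last bytes seen, to catch cross-chunk matches
        self.infected = False

    def feed(self, chunk: bytes) -> None:
        window = self.tail + chunk
        if self.signature in window:
            self.infected = True
        keep = len(self.signature) - 1
        self.tail = window[-keep:] if keep else b""


def fetch_and_scan(chunks: Iterable[bytes]) -> Optional[bytes]:
    """Scan while receiving, but release nothing until the whole body passed.

    Returns the complete body if clean, or None if the signature was seen.
    """
    scanner = IncrementalScanner(SIGNATURE)
    held: List[bytes] = []
    for chunk in chunks:
        scanner.feed(chunk)
        if scanner.infected:
            return None        # abort early; the client never saw a byte
        held.append(chunk)
    return b"".join(held)
```

The scan cost is spread over the transfer, so the extra delay after the last
byte arrives is small, and an aborted transfer leaves no partial object behind
for a later range request to complete.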

Amos
Received on Wed Mar 17 2010 - 03:53:06 MDT
