Re: storage manager ideas

From: Adrian Chadd <adrian@dont-contact.us>
Date: Thu, 1 Feb 2007 07:33:42 +0800

On Wed, Jan 31, 2007, Alex Rousskov wrote:

> I agree that read-parse-dump-write-read-parse-dump-write sequence is
> inefficient for message headers and other metadata. IIRC, we talked
> about optimizing this at least since 1998. Stored binary metadata has
> its drawbacks, but overall it is probably a win for a
> performance-sensitive proxy.

:)

There's one less parse pass in Squid-2 HEAD now - the client side will
wait for the first storeClientCopy() to complete and then just snaffle
the headers from the store client -> mem object -> reply. If the headers
aren't there then they won't ever be. It's still a straight copy of all
the reply information but it's better than the re-parse path and the
semantics lend themselves to ref-counting stuff later.

The next optimisation will be "pass pre-built reply struct in before
the first storeAppend" which will clone the reply data and stuff it into
the store. This step, however, does presuppose the reply status/header data
isn't stored away in the data stream so I'll have to do quite a bit of
work at the same time to make this eventuate.

> For many other problems and changes you are talking about, I would try
> reworking MemBuf or providing a similar object that would allow
> higher-level code to "copy" and "concatenate" chunks of memory without
> the actual copy taking place. Visualize a MemBuf with an offset and
> subsize fields. Now add support for a chain of such buffers that looks
> like a single buffer for higher-level code.

Henrik and I already have this. I wrote a replacement http request and
header parser which works on refcounted buffers. The buffers aren't fixed
size (so I call realloc where required) but are refcounted. The strings
are an 'extent' on top of a refcounted buffer.

The refcounted buffer pretty much looks like a non-refcounted MemBuf.
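
Roughly, the shape of it is something like the sketch below. The names
(rbuf_t, rstr_t and friends) are made up for illustration and are not the
actual code; error handling is just abort(). The extent stores an offset
rather than a pointer, so it stays valid when the buffer is realloc'ed.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical names throughout; not the actual Squid structures. */
typedef struct {
    char *data;     /* heap storage, grown with realloc() */
    size_t len;     /* bytes in use */
    size_t cap;     /* bytes allocated */
    int refcount;   /* number of holders */
} rbuf_t;

/* A string is an 'extent': an (offset, length) window onto a buffer. */
typedef struct {
    rbuf_t *buf;    /* backing buffer; the extent holds one reference */
    size_t off;     /* start offset within buf->data */
    size_t len;     /* length of the extent */
} rstr_t;

rbuf_t *rbuf_new(size_t cap) {
    rbuf_t *b = malloc(sizeof(*b));
    if (!b) abort();
    b->cap = cap ? cap : 1;
    b->data = malloc(b->cap);
    if (!b->data) abort();
    b->len = 0;
    b->refcount = 1;
    return b;
}

rbuf_t *rbuf_ref(rbuf_t *b) { b->refcount++; return b; }

void rbuf_unref(rbuf_t *b) {
    if (--b->refcount == 0) { free(b->data); free(b); }
}

/* Append n bytes, doubling the allocation as needed. */
void rbuf_append(rbuf_t *b, const char *src, size_t n) {
    if (b->len + n > b->cap) {
        size_t ncap = b->cap;
        while (ncap < b->len + n) ncap *= 2;
        char *p = realloc(b->data, ncap);
        if (!p) abort();
        b->data = p;
        b->cap = ncap;
    }
    memcpy(b->data + b->len, src, n);
    b->len += n;
}

/* Create an extent over [off, off+len) without copying any bytes. */
rstr_t rstr_make(rbuf_t *b, size_t off, size_t len) {
    assert(off <= b->len && len <= b->len - off);
    rstr_t s = { rbuf_ref(b), off, len };
    return s;
}

/* Resolve the extent to a pointer; only valid until the next append. */
const char *rstr_ptr(const rstr_t *s) { return s->buf->data + s->off; }

void rstr_free(rstr_t *s) { rbuf_unref(s->buf); s->buf = NULL; }
```

So parsing "Content-Length: 42\r\n" out of a read buffer would give a name
extent of (0, 14) and a value extent of (16, 2), each holding a reference
on the buffer, with no bytes copied.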

What's missing is the concept of a buffer chain, so concatenate is cheap.
There are a few libraries which implement this stuff (eg vstr) which
I'm using for inspiration. We don't need anything -that- complicated
to get much better performance.

The store will just contain a chain of 0 or more extents with whatever
backing buffers are required. This way you can pull nifty tricks such as
reading the chunked TE'ed data from the server connection and store
the non-chunked extents in the store without having to do the copy tricks
Henrik does. It remains to be seen how optimal that is - modern OSes
-really like- page-aligned buffers for things so I'll benchmark how things
perform once it's written to see whether there's a substantial difference.
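
As a rough illustration of the chunked case (a sketch only, with made-up
names; refcounting on the backing buffer is left out, and chunk extensions
and trailers aren't handled), the de-chunker just records where each chunk
payload sits in the read buffer and appends that extent to the chain:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical types for illustration; not Squid's store structures. */
typedef struct extent {
    const char *base;       /* backing buffer (refcounting omitted) */
    size_t off, len;        /* the slice of base this extent covers */
    struct extent *next;    /* next extent in the chain */
} extent_t;

typedef struct {
    extent_t *head, *tail;
    size_t total;           /* sum of all extent lengths */
} chain_t;

/* Concatenation is a pointer update; no payload bytes are copied. */
void chain_append(chain_t *c, const char *base, size_t off, size_t len) {
    extent_t *e = malloc(sizeof(*e));
    if (!e) abort();
    e->base = base; e->off = off; e->len = len; e->next = NULL;
    if (c->tail) c->tail->next = e; else c->head = e;
    c->tail = e;
    c->total += len;
}

void chain_free(chain_t *c) {
    extent_t *e = c->head;
    while (e) { extent_t *n = e->next; free(e); e = n; }
    c->head = c->tail = NULL;
    c->total = 0;
}

/* Walk a complete chunked body in buf[0..n) and append one extent per
 * chunk payload. Returns 0 on success, -1 on malformed input. */
int dechunk_to_chain(chain_t *c, const char *buf, size_t n) {
    size_t pos = 0;
    for (;;) {
        size_t size = 0, digits = 0;
        while (pos < n) {               /* hex chunk-size */
            int ch = (unsigned char)buf[pos], v;
            if (ch >= '0' && ch <= '9') v = ch - '0';
            else if (ch >= 'a' && ch <= 'f') v = ch - 'a' + 10;
            else if (ch >= 'A' && ch <= 'F') v = ch - 'A' + 10;
            else break;
            if (digits == 2 * sizeof(size_t)) return -1;  /* would overflow */
            size = size * 16 + (size_t)v;
            digits++; pos++;
        }
        if (digits == 0 || n - pos < 2) return -1;
        if (buf[pos] != '\r' || buf[pos + 1] != '\n') return -1;
        pos += 2;
        if (size == 0) return 0;        /* last-chunk */
        if (size > n - pos || n - pos - size < 2) return -1;
        if (buf[pos + size] != '\r' || buf[pos + size + 1] != '\n') return -1;
        chain_append(c, buf, pos, size);
        pos += size + 2;
    }
}

/* Flatten the chain, e.g. just before a write(); stops when dst is full. */
size_t chain_copy_out(const chain_t *c, char *dst, size_t cap) {
    size_t done = 0;
    for (const extent_t *e = c->head; e && e->len <= cap - done; e = e->next) {
        memcpy(dst + done, e->base + e->off, e->len);
        done += e->len;
    }
    return done;
}
```

Fed "4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n" this gives a chain of two
extents, (3, 4) and (12, 5), totalling 9 bytes. The payload only gets
copied if and when something needs it contiguous.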

(My gut feeling is "yes, there will be" but to be honest, doing the above
will probably give us a huge performance and code cleanliness boost over
squid-2 and squid-3 that'll be worth it as an intermediary step.)

> As an added bonus, it may be possible to avoid a lot of the copying when
> parsing headers because string-based headers would be able to refer to
> portions of the original I/O buffer.

Already done. :) And yes, it's pretty damned fast.

> Proper reference counting of true/allocated buffers would be required to
> keep overall memory consumption comparable to current Squid, of course.

> Finally, I am not sure I agree regarding storage decision making time.
> An optimized storage system (the interesting case) would probably
> buffer/merge small chunks and would probably not store object chunks
> sequentially on disk, so the issue of the total object size becomes
> unimportant.

It's only important when deciding how to write it all to disk. If you delay
the layout decision you can do interesting packing tricks. You want to
pack with some leaning towards temporal locality, for example, so your reads
can read >1 object back at once. You can interleave small and large objects
on disk so your disk access algorithm has a chance of being able to do both
at a reasonable clip rather than suffering from starvation issues.
(The last hasn't been tested out, but I've got a feeling it'd be an issue
from my benchmarking.)
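
One simple form of the packing idea, sketched below, is to append objects
to an in-memory stripe in arrival order and write the stripe out in one go
when it fills, so objects that arrived together sit together on disk. This
is only an illustration: the names and the 4096-byte stripe size are
assumptions, and the "flush" is a counter standing in for the real write.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define STRIPE_SIZE 4096    /* assumed size, picked for illustration */

/* Hypothetical stripe: objects are packed back to back in arrival order. */
typedef struct {
    char data[STRIPE_SIZE];
    size_t used;            /* bytes filled so far */
    unsigned nobjects;      /* objects in the current stripe */
    unsigned flushes;       /* stripe writes issued so far */
} stripe_t;

/* Stand-in for the real disk write: one IO for the whole stripe. */
void stripe_flush(stripe_t *s) {
    if (s->used == 0) return;
    s->flushes++;
    s->used = 0;
    s->nobjects = 0;
}

/* Pack one object; returns its offset within the stripe, or -1 if the
 * object is too large for this path. Flushes first when it won't fit. */
long stripe_add(stripe_t *s, const void *obj, size_t len) {
    if (len > STRIPE_SIZE) return -1;
    if (len > STRIPE_SIZE - s->used) stripe_flush(s);
    size_t off = s->used;
    memcpy(s->data + off, obj, len);
    s->used += len;
    s->nobjects++;
    return (long)off;
}
```

Reading a stripe back then pulls in every object packed alongside the one
you wanted for the cost of a single transaction.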

I did some unofficial benchmarking when fiddling with COSS about a year ago
and found that random IO hit a transactions-per-second ceiling long before
throughput became a limiting factor. I was seeing 200-300 tps on these test
SATA NCQ disks (and probably more with SCSI) for transactions of up to a
couple hundred kilobytes. This isn't new (there are plenty of papers which
reference this, which I hope to find again and put on the new squid website)
but it shows there's huge room for improvement wrt small object sizes.
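
To put rough numbers on that (a back-of-envelope sketch; the 10KB object
size is an assumption picked for illustration, not something I measured):
if the disk does a fixed number of transactions per second regardless of
size, the object rate scales with how many whole objects each transaction
carries.

```c
#include <assert.h>

/* Objects per second on a transaction-bound disk, assuming every
 * transaction is filled with whole objects of the same size.
 * Simplification: objects larger than one transaction are not modelled. */
unsigned long objects_per_sec(unsigned long tps,
                              unsigned long txn_bytes,
                              unsigned long obj_bytes) {
    assert(obj_bytes > 0 && obj_bytes <= txn_bytes);
    unsigned long per_txn = txn_bytes / obj_bytes;  /* whole objects per IO */
    return tps * per_txn;
}
```

At 250 tps, one 10KB object per transaction is 250 objects/sec (about
2.5MB/sec). Packing twenty of them into a 200KB transaction is 5000
objects/sec from the same disk, if the tps figure holds at that size.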

> Needless to say, I believe this work should be done in Squid3.1 code
> base or later :-/.

People are running squid-2 and want to keep running it for now. I think the
best thing for my work is to get it into squid-2 so people stay interested in
Squid and so I have a stable platform to do my development with. Once it's done
and tested we can sit back, look at the pluses and minuses, then extract it
all out and shoehorn it into squid-3.

This requires Squid-3 to be stable by then. :) If it's stable and ready for
production then I'm really all for it. I really do want to take advantage of
C++ constructs here (where appropriate!) to enforce data type semantics.
Heck, refcounting buffers would be a cinch in C++.
Unfortunately Squid-3 is worse than Squid-2 in the "does a hell of a lot of
a hell of a little" problem - try profiling Squid-2 or Squid-3 and see if
you can find a single area or two that would give a big performance boost.
I've fixed most of them.. :/

Adrian
Received on Wed Jan 31 2007 - 16:27:51 MST
