Re: [code] [for discussion] map-trie from Francesco on 2014-06-16 (squid-dev)

From: Francesco <gkinkie_at_gmail.com>
Date: Mon, 16 Jun 2014 19:40:43 +0200

On 14 Jun 2014, at 18:34, Alex Rousskov <rousskov_at_measurement-factory.com> wrote:

> On 06/14/2014 12:39 AM, Kinkie wrote:
>> On Fri, Jun 13, 2014 at 4:50 PM, Alex Rousskov
>> <rousskov_at_measurement-factory.com> wrote:
>>> On 06/11/2014 04:43 PM, Kinkie wrote:
>>>> level-compact-trie: the mean time is 11 sec; all runs take between
>>>> 10.782 and 11.354 secs; 18 Mb of core used
>>>
>>>> full-trie: mean is 7.5 secs +- 0.2secs; 85 Mb of core.
>>>
>>>> splay-based: mean time is 16.3sec; all runs take between 16.193 and
>>>> 16.427 secs; 14 Mb of core
>>>
>>> How about std::map? Have you considered using a standard class for this
>>> purpose?
>>
>> std::map doesn't offer efficient prefix matching, but sure, I'll try
>> to prepare a test run on the same data to establish a baseline.
>
> Is explicit support to prefix matching actually required though? I am
> thinking of using std::map::lower_bound or similar to find the vicinity
> of a possible prefix match. For example, when the map has a b.c domain
> stored, and you are searching for a.b.c, the lower_bound method returns
> a "pointer" to the stored b.c node. After that, a simple prefix test
> between the two strings can return the final match/mismatch answer.
>
> Needless to say, all domain names ought to be stored/compared in the
> reverse order of their labels. For example, a.b.c is stored internally
> as c.b.a. It just feels more natural for me to render it as a.b.c in a
> human-oriented text like this email.
>
> And if we need to support prefixes that do not end at domain label
> boundaries (i.e., *foo.bar should match zzzfoo.bar), then we just treat
> each character as a "label".

More data: I haven't yet tested out your solution, but yet another round of TrieNode (it's easier), which uses a variable-length std::vector + an offset to compact the TrieNode to the actual range that's needed for that one node. The results are very interesting: This implementation uses the least memory of the trie-based solutes (16.5Mb for the test data set) and performance is only 12% worse than full-fledged trie (8.5sec +- 0.3sec over 10 tests).

More to come.

Kinkie
Received on Mon Jun 16 2014 - 17:40:55 MDT

This archive was generated by hypermail 2.2.0 : Mon Jun 16 2014 - 12:00:11 MDT