Before we go much further, please let me explain my use case and
considerations.
My context is the need for a more capable, highly performant shadow memory
representation for a reduced-memory alternative approach to taint analysis. The
lower memory requirement and richer taint identity come from the range-based
encoding, which consumes less space than a bitvector shadow and so leaves more
room for identity storage. The scalability requirement is rather wild: we only
ever need to insert. We don't really need to look things up, only test for a
single special key upon insertion, which is trivial, requiring only a single
CPU test instruction.
The way I see the next-generation dynamic binary instrumentation library &
runtime operating is this: rather than use a bitmap index, which needs O(N)
space and has fast update times but poor identity storage, we can use an index
plus some careful optimizations that reduce the number of "cache misses" by
taking advantage of context. These optimizations exploit statistical
characteristics of typical benign code rather than anything about the data
structures themselves.
By using a pre-allocated trie, we can use the first three bytes of any given
address as the lookup key, and retrieve/update/insert into that table in just a
single instruction. That table fits easily in main memory, and its entries are
keys into any range-offset encoding data structure. Between insertions, the
algorithm just compares the first three bytes against the previous access. If
that test succeeds, only the variables that were last updated are incremented;
i.e. the range-offset encoding is updated (one more instruction). That makes
2-3 instructions. Statistically speaking, the majority of memory that gets
touched is fairly contiguous, both as source and as destination operands. On
modern hardware, we may even get away with keeping several recent lookup
contexts in CPU registers, since there are quite a few available to work with,
reducing the penalties even further.
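To make that concrete, here is a rough sketch of the fast path as I picture it
(illustrative only: the names, the 4-byte slot width, and the decision to skip
a within-prefix contiguity/identity check are all placeholders, not a fixed
design):

#include <cstdint>
#include <vector>

struct RangeOffset {            // hypothetical range-offset encoding cell
    std::uint32_t base;         // start of the tainted range
    std::uint32_t length;       // bytes covered so far
    std::uint32_t identity;     // taint identity payload
};

struct ShadowIndex {
    // Pre-allocated trie level: one slot per 3-byte address prefix.
    std::vector<std::uint32_t> slots =
        std::vector<std::uint32_t>(1u << 24, 0);
    std::vector<RangeOffset> ranges{RangeOffset{}};    // slot 0 reserved as "empty"

    std::uint32_t last_prefix = 0xFFFFFFFFu;           // context kept hot, ideally in registers
    std::uint32_t last_range  = 0;

    void on_memory_access(std::uint32_t addr, std::uint32_t identity) {
        std::uint32_t prefix = addr >> 8;              // first three bytes of the address
        if (prefix == last_prefix && last_range != 0) {
            // Fast path: same region as the previous access, so just extend the
            // last range (one compare + one increment); any finer-grained
            // contiguity bookkeeping is deliberately left lazy here.
            ranges[last_range].length += 1;
            return;
        }
        std::uint32_t& slot = slots[prefix];           // slower path: table lookup
        if (slot == 0) {                               // first time this prefix is seen
            slot = static_cast<std::uint32_t>(ranges.size());
            ranges.push_back(RangeOffset{addr, 1, identity});
        }
        last_prefix = prefix;                          // refresh the cached context
        last_range  = slot;
    }
};

The point is that the common, contiguous case touches only the cached context,
and the table itself is hit only when the prefix changes.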
I'm curious about using C++'s new relaxed memory model capabilities, with
fences and whatnot, and possibly some ring buffers to maintain copies of
subsegments of said trie, with a dedicated maintainer thread that merges those
on each flush from the respective worker threads. So total core overhead is
something like *N+C additional*, where N is the number of cores the target
process requires and C is the one (or however many) additional maintainer
threads that merge between the workers and the coherent data structure. Hell,
if it works out well, let each worker have its own trie. Overall, though, this
is a data structure that is very sensitive to concurrent operation: target
threads may race with themselves, and beyond that, we don't want analysis code
to race either with its "brethren" or with the application itself.
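Roughly what I'm picturing for the worker/maintainer split is one
single-producer/single-consumer ring per worker, drained by the maintainer
thread with acquire/release ordering; the Update record, the ring capacity, and
the merge callback below are placeholders, not a committed design:

#include <atomic>
#include <array>
#include <cstddef>
#include <cstdint>

struct Update {                      // hypothetical record of one shadow update
    std::uint32_t addr;
    std::uint32_t identity;
};

template <std::size_t N = 4096>      // power-of-two capacity assumed
class WorkerRing {
    std::array<Update, N> buf{};
    std::atomic<std::size_t> head{0};   // written by the worker only
    std::atomic<std::size_t> tail{0};   // written by the maintainer only

public:
    bool push(const Update& u) {                          // worker thread
        std::size_t h = head.load(std::memory_order_relaxed);
        std::size_t t = tail.load(std::memory_order_acquire);
        if (h - t == N) return false;                     // ring full: caller picks a policy
        buf[h & (N - 1)] = u;
        head.store(h + 1, std::memory_order_release);     // publish the slot
        return true;
    }

    template <typename Merge>
    void drain(Merge&& merge) {                           // maintainer thread
        std::size_t t = tail.load(std::memory_order_relaxed);
        std::size_t h = head.load(std::memory_order_acquire);
        while (t != h) {
            merge(buf[t & (N - 1)]);                      // fold into the coherent trie
            ++t;
        }
        tail.store(t, std::memory_order_release);         // free the consumed slots
    }
};

Each worker only ever writes head and the maintainer only ever writes tail, so
the common path needs nothing stronger than that release/acquire pair on
publication.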
So, really, wait-free red-black trees were the fastest thing I could think of
(in a concurrent context) that would also afford correctness and still be able
to store range-value encodings. But I'm open to anything that operates in a
concurrent context and satisfies the requirements.
Now, back to everything you just said:
@rebalancing generates a lot of cache line invalidation traffic
Do you think a lazy rebalancing strategy, possibly with a dedicated worker
thread, could be applied?
@nedtries provide a size_t key index algorithms. You can build the
rest from that.
I looked over the page you linked in further depth. The graphs are good, but
I'm concerned about the applicability of nedtries because the metrics aren't
the ones I care about. An insertion has to be performed once per memory
instruction, and I've read and heard that for most software roughly 33% of
executed instructions are memory operations. So I have to focus on the number
of instructions per insertion; at that rate, every instruction saved per
insertion is worth roughly a third of an instruction per instruction executed.
Update is really just insertion. Nothing is ever deleted, only updated.
@Bitwise tries are hugely superior to red black trees for close and
exact find lookups
"Bitwise" - I don't want to fall back to using a bitvector. Those have
already be applied in academia, and are publicly available. That's where I
think everyone falls short; you might get ultimate speed there, but with
good concurrent design and hardware primitive usage you can vastly reduce
*perceived latency*. Also, identity diversity is a moot topic with
bitvector shadow memory.
@I think you may need to scale your estimates...
Well, *immediate* tree rebalancing should only ever occur in the worst case,
which is two target application threads contending for the same memory at
nearly the same time. Also, every tree operates on colocated data: regionally
colocated within 255 bytes of one another, to be exact (on 32-bit hardware).
In fact, most tree insertion operations can be made very lazy: don't update
any of the other nodes of the tree, just keep inserting, even if numerous
conflicting addresses aren't immediately resolved. I think the identity
storage scheme can be proven associative and commutative, which greatly
facilitates both concurrency and overall correctness. Treated this way,
rebalancing becomes a rather infrequent operation.
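As a trivial illustration of that associativity/commutativity point: if the
identity scheme can be modelled as a label set merged by bitwise OR (an
assumption about the scheme on my part, not a given), then buffered updates can
be folded in any order and still converge to the same shadow state:

#include <cassert>
#include <cstdint>

using Identity = std::uint64_t;                 // hypothetical taint-label bitmask

inline Identity merge(Identity a, Identity b) { // associative and commutative
    return a | b;
}

int main() {
    Identity a = 0x1, b = 0x2, c = 0x4;
    assert(merge(merge(a, b), c) == merge(a, merge(b, c)));  // associativity
    assert(merge(a, b) == merge(b, a));                      // commutativity
    // Hence a maintainer thread can apply buffered updates lazily, in whatever
    // order they drain from the per-worker buffers, without affecting the result.
}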
On Thu, Mar 5, 2015 at 7:30 PM, Niall Douglas wrote:
On 5 Mar 2015 at 13:49, Kenneth Adam Miller wrote:
Well, my requirements are that I store range-offset encodings and be able to access them for any reason as fast as possible. So, what I was thinking about doing was storing a range-value encoding as a tree, but it could be any kind of self balancing tree. The thing with red-black trees is, I know how to make them wait-free by reading and implementing a whitepaper that UTDallas published.
The rebalancing generates a lot of cache line invalidation traffic, and ruins concurrent modify performance. Academic theory often misses the reality.
Does your nedtree cover the ability to store range value encodings? Because my thinking is, with a trie, you can get the first 3 bytes of an integer without giving up too much space (it requires only slightly less than 32 megabytes contiguously stored). Each entry in that 32 megabytes can be a pointer to a red-black tree node, wherein when you do an insertion, you revert back to the wait-free red-black tree implementation. So, you can still do range-value encodings, but it will also be far, far faster than if you didn't have this caching to accelerate it.
nedtries provide a size_t key index algorithms. You can build the rest from that.
Bitwise tries are hugely superior to red black trees for close and exact find lookups, but inferior for closest find lookups. You can see some graphs at http://www.nedprod.com/programs/portable/nedtries/.
I estimate that it requires at most about 11 instructions in the absolute worst case scenario to do an insertion, typically 7 instructions on the usual "cache miss", and an amortized 2 instructions for contiguous memory insertions. It is very highly scalable, and can be managed flexibly with various concurrency runtime patterns.
I think you may need to scale your estimates by about 10x. A cache line miss costs about 250 CPU cycles alone, and any kind of tree algorithm spends most of its time waiting on cache lines.
This is why bitwise tries are much faster than red black trees on out of order CPUs: they execute out of order much better. On an in order core like Intel Atom bitwise tries are extremely slow, and they're not great on even a Cortex A15 ARM either.
nedtries cost about 100 CPU cycles per insert/find/remove. That is exceptionally well balanced; most algorithms are much slower at, say, inserts than finds (e.g. hash tables). If you equally do finding, inserting and removing, and your item count is less than 10,000, bitwise tries are unbeatable.
Niall
-- ned Productions Limited Consulting http://www.nedproductions.biz/ http://ie.linkedin.com/in/nialldouglas/