B-tree Path Hints (github.com/tidwall)
282 points by eatonphil on July 30, 2021 | 62 comments


In the mid '80s to '90s, there was a lot of interest in "query responsive" data structures. There still is, I'm sure, but that was when I studied them and I've since moved on.

Of these, the Fibonacci and pairing heaps were the best-known products, I reckon.

They essentially move the "active" parts of the data structure up toward the top of the tree. The use case is different (being heaps), but I wouldn't be surprised if that line of R&D has continued, or if there's a huge gap in performance that can be reclaimed with "intelligent" guesses like this one.

I'm rambling and conjecturing, but maybe someday branch prediction will get a huge boost to its already considerable smarts and we'll see another explosion in practical CPU throughput, with a little help from algorithms and data structures, of course.


My personal favorite of these is the Splay Tree[1]. I've never used it in production, but it's really simple to implement and reason about. Though my main experience was trying (and failing) to prove amortized bounds on the operations.

[1] https://en.wikipedia.org/wiki/Splay_tree


If anything, the incredibly convoluted amortized analysis that goes into proving bounds on the cost of splay tree operations is perhaps more iconic than the actual splay tree data structure.


Tarjan and Sleator are legends.


Does that mean any access to these data structures needs an exclusive lock? One advantage to the "path hints" approach described in the article is that the path hints can be stored separately from the btree itself. The author writes:

> For single-threaded programs, it’s possible to use one shared path hint per B-tree for the life of the program.

> For multi-threaded programs, I find it best to use one path hint per B-tree, per thread.

> For server-client programs, one path hint per B-tree, per client should suffice.

Seems likely that multiple threads/clients would have different access patterns anyway, so the hints might actually be better this way, in addition to being more concurrent.

edit: also, this approach feels a lot lighter-weight to me than mutating the tree: I can imagine the mutations involving frequent allocation/deallocation, whereas this hint is just a vector that grows to the maximum depth and stays there.
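
To make that last point concrete, here's roughly what I picture the hint looking like (the layout is my assumption, not the library's actual struct): a handful of per-level indexes, so giving each goroutine or client its own copy costs almost nothing.

  // Sketch only: the real PathHint layout is an assumption here.
  // One slot index per tree level; the whole thing is a few dozen
  // bytes, so "one hint per B-tree, per thread" is essentially free.
  const maxHeight = 8 // deep enough for very large trees at high fanout

  type PathHint struct {
      depth int            // how many levels have a recorded index
      idx   [maxHeight]int // index to try first at each level
  }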


Had the same thought. Making every access RW seems like a cure worse than the disease in a multithreaded situation.


> mid 80s


There are ways to make it work. The simplest would be to just not do any splaying most of the time. No splaying means no locking is necessary, just an atomic read to check if anything changed. Apparently it's actually good for performance[1] to splay only occasionally instead of on every access. Another would be to use atomics to do lockless updates. Probably best to only do partial splaying, or only occasionally like above, but it's doable[2].

If the data is actually stored in an array and the tree structure is actually just a map that stores the "nodes" as a tuple (left, element, right) in another array, where left, element and right are indexes into the backing array, then you could just have one such map per thread and only have to use synchronization if anything is added or deleted. Or even have purely thread-local addition/deletion if that is sufficient and no locking at all is desirable. Essentially a multi-splay tree.

Storing nodes as elements in an array is probably a good idea for performance in general, even in the single-threaded case. It means you don't have to allocate each element individually and that things are stored in a contiguous block of memory. The elements are not going to be in the correct order since they're fixed in place in the backing array, but they're still going to be much more local than if they're allocated individually. Allocating from an arena/bump allocator would have similar advantages though.
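
A rough sketch of the layout I mean (all names made up), with indexes instead of pointers:

  // Nodes live in one flat slice and "pointers" are just indexes into
  // it; -1 means no child. The elements themselves can sit in a separate
  // shared backing array, so several per-thread maps can reference the
  // same data without owning it.
  const nilIdx = -1

  type node struct {
      left, right int // indexes into tree.nodes, or nilIdx
      elem        int // index into the shared element array
  }

  type tree struct {
      nodes []node
      root  int
  }

  func (t *tree) alloc(elem int) int {
      t.nodes = append(t.nodes, node{left: nilIdx, right: nilIdx, elem: elem})
      return len(t.nodes) - 1
  }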

You could also store everything in a single array, replacing the node's element index with the data itself and then use atomics again. This is how I've seen it done in the wild, just without the atomics. Really depends on what you're doing, though. If the individual items are small because they mostly store pointers into some other heap allocated data or just a bunch of numbers, so that the whole map+element structure could fit in the cache, then using multiple maps probably gives better performance.

Can't find any references for this because I just came up with the idea of using multiple maps (not the array storage itself) over a single backing array. Multi-splay trees[3] look close in principle, but they use a reference tree instead of a flat array. It's probably not substantial enough of a difference to really call it a different data structure though.

Edit: Another idea would be a delayed-queue splay tree, where some kind of reference to the accessed node gets added to a queue, and once the queue reaches a certain size, or some other condition is met, the tree gets locked and all the splaying is done in a single batch. Some operations could even be optimized away if computing that is less costly than doing the operations themselves. In that case it might actually be useful even in the single-threaded case. Somewhat related to the randomized splaying above, and probably has been done before, though.

[1]: http://www14.in.tum.de/personen/albers/papers/ipl02.pdf

[2]: https://digitalscholarship.unlv.edu/cgi/viewcontent.cgi?arti...

[3]: https://www.cs.cmu.edu/~jonderry/multi-splay.pdf


Those techniques sound possible but the path hint approach seems so much simpler to apply than batching splay updates.

Particularly with epoch GC: When there's heavy reading and rare writes, I like using epoch GC so readers never block. Writes then need to create a whole new map. I suppose one could use something like the delayed queue splay tree you're describing, publishing a new map occasionally, and ensuring this doesn't race with other updates. But it'd be a pain to get right when you could just use path hints instead. Is there an overwhelming benefit I'm missing to justify this complexity and the expense of full map rebuilds?


> we'll see another explosion in practical CPU throughput, combined with a little help from algs and data structures

The learned index[1] is basically a variation of this sort of "hint" which relies on getting the right answer most of the time, but from learning the distribution function as a model instead of relying on the programmer to supply a heuristic.

(from that paper)

> The key idea is that a model can learn the sort order or structure of lookup keys and use this signal to effectively predict the position or existence of records

I haven't seen anything mix this sort of work with an Eytzinger traversal, but there are always diminishing returns for these.
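
As a toy illustration of the idea (not the paper's actual models): fit some model from key to predicted position, then finish with a bounded local search around the prediction. Here the linear model and its error bound are simply assumed.

  // Toy learned-index lookup over a sorted slice: a linear model guesses
  // the position, and a short scan around the guess finishes the job.
  // slope/intercept come from fitting the key distribution offline, and
  // maxErr is assumed to bound the model's error (real learned indexes
  // track this bound per segment).
  func learnedLookup(keys []uint64, key uint64, slope, intercept float64, maxErr int) (int, bool) {
      guess := int(slope*float64(key) + intercept)
      lo, hi := guess-maxErr, guess+maxErr
      if lo < 0 {
          lo = 0
      }
      if hi > len(keys)-1 {
          hi = len(keys) - 1
      }
      for i := lo; i <= hi; i++ { // local correction step
          if keys[i] == key {
              return i, true
          }
      }
      return 0, false
  }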

[1] - https://arxiv.org/abs/1712.01208


> fibbonaci


It's a cool idea, but you have to keep in mind that binary searches are not optimal for b-tree nodes unless the node is really big. If you have a smaller node, < 32 or 64 elements, a simple linear search is faster. The reason has something to do with processor cache, branch mis-prediction, and pipelining.

A linear search can also be optimized to use SIMD, which is a big win.

If you really want to use a path hint, you could just say "start my linear search at the point where I left off last time", and if the first element is too big, wrap around to the beginning. But that might not buy you much.
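
Roughly this, as a sketch (and as you say, probably not a big win):

  // Linear scan of a small sorted node that starts where the previous
  // search ended. If the hinted slot is already past the key (or out of
  // range), restart from the front. Returns the first index with
  // keys[i] >= key, plus whether it's an exact match.
  func linearFromHint(keys []uint64, key uint64, hint *int) (int, bool) {
      start := *hint
      if start < 0 || start >= len(keys) || keys[start] > key {
          start = 0 // wrap around to the beginning
      }
      for i := start; i < len(keys); i++ {
          if keys[i] >= key {
              *hint = i
              return i, keys[i] == key
          }
      }
      *hint = len(keys)
      return len(keys), false
  }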


A fast binary search intended for B-tree key array searches does not have branches at all [1]. A B-tree interior node's key array is usually sized to fit in one cache line, so there isn't a difference in cache access pattern.

With 8-byte keys and the usual 64-byte cache line size, you end up with 8 keys per interior node's key array. That's a pretty bad range for the "simple linear search" most people would write, which is going to be dominated by the expected branch mispredict cost. The idea of using a hint to guess a starting position for either linear or binary search has the same issue. You need branches to capitalize on the best case vs worst case spread, and the mispredict cost will dominate for smaller arrays like this.

If you care about speed for this case, use a fast branchless binary search or a vectorized linear search (the latter will have an edge). An additional minor consideration for B-tree traversal is the latency of vector vs scalar loads since you don't know the load address (except for the root node) until you've finished searching the parent node's key array. That is not necessarily true for vectorized searches of small arrays in general, where the vector load addresses may be key independent (like the B-tree root node is here) and so in that case the few extra cycles of L1 load-to-use latency for vector loads (and higher load latencies in general) can often be hidden.
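
For reference, the branchless shape looks roughly like this (a Go sketch; whether the update actually lowers to a conditional move is up to the compiler, and the C version typically forces it with arithmetic):

  // Binary search for the first key >= target where the loop always runs
  // ceil(log2(n)) iterations; the only data-dependent decision is the
  // base adjustment, which a compiler can turn into a conditional move.
  func lowerBound(keys []uint64, target uint64) int {
      base, n := 0, len(keys)
      for n > 1 {
          half := n / 2
          if keys[base+half-1] < target {
              base += half
          }
          n -= half
      }
      if n == 1 && keys[base] < target {
          base++
      }
      return base
  }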

[1] By B-tree I am referring to in-memory B+ trees. A disk-based B+ tree would usually be built around page-granularity nodes. In that regime a branchless binary search for integer/float keys (with software prefetch to populate the cache ahead of the data dependency chains) that finishes off with a vectorized linear search is likely going to be the winner. The B-tree slack interval (M/2, M] helps since it guarantees a fully unrolled binary search with lg(M) steps won't waste any steps. You can still support "underflowing" B-tree nodes with the same branchless code path.


> It's a cool idea, but you have to keep in mind that binary searches are not optimal for b-tree nodes unless the node is really big. If you have a smaller node, < 32 or 64 elements, a simple linear search is faster. The reason has something to do with processor cache, branch mis-prediction, and pipelining.

I've heard that—and iirc Rust's BTreeMap type uses linear search—but does this depend on the type of key? Easy for me to see how that might be true for a BTreeMap<u64, V> but not a BTreeMap<String, V>. In the latter case, there's a pointer to chase for each comparison (and more/variable data to process; SIMD over multiple keys doesn't seem likely).


Yes, true. If your comparisons involve pointer-chasing or string comparison or anything complex, then it's not clear which strategy is faster. I always write the simplest possible code first, which is usually a linear search, benchmark it, and then try to beat the benchmark if it's warranted.


Isn't the binary search here taken over the whole tree, spanning multiple nodes? Each time we follow a pointer we are discarding a huge swath of the tree.

In a single node, a linear search is probably better, yes.


The something is data dependencies.


I agree... the whole point of a b-tree is to already subdivide the dataset into smaller, manageable sets. A binary search won't be much help on those slices if the tree is constructed properly.


As I understand it, the idea is that a path hint is a suggestion on which node to scan first, not a suggestion on how to binary search an individual node.


No, it'll still do a scan all the way from the root node down to the leaf nodes. The path is simply saying which element to start the binary search from, in each of the nodes.
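
As a sketch of that per-node step (the idea only, not the post's actual code): check the hinted slot first, and only fall back to the binary search when the guess is wrong, updating the hint as you go.

  // Search one node's sorted key array with a hint for this level. If the
  // hinted slot still brackets the key, that's the answer after one or two
  // comparisons; otherwise fall back to a normal binary search and record
  // the new position for next time. Returns the first index with
  // keys[i] >= key, which then selects the child (or item) to follow.
  func searchNode(keys []uint64, key uint64, hint *int) int {
      if i := *hint; i >= 0 && i < len(keys) {
          if keys[i] >= key && (i == 0 || keys[i-1] < key) {
              return i // hint was right; skipped the binary search
          }
      }
      lo, hi := 0, len(keys)
      for lo < hi {
          mid := (lo + hi) / 2
          if keys[mid] < key {
              lo = mid + 1
          } else {
              hi = mid
          }
      }
      *hint = lo // remember where this level's search landed
      return lo
  }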


I love little optimizations like this.

What I'm reading is that the search remembers (in the caller-held path hint) the most recently found index at each level, and if the next search happens to be for the same value or one very close, the binary search goes much faster. This only helps in situations where searches correlated in time tend to have nearby key values, but that's probably very common.


I'm not qualified to truly understand this, but when Lex first interviewed Jim Keller, Jim basically said regarding processor design -- 'yeah, if you just guess the same result as last time, you get a 50% speed increase.'

First Interview (where that poorly paraphrased quote resides): https://www.youtube.com/watch?v=Nb2tebYAaOA&t=13s

Second Interview: https://www.youtube.com/watch?v=G4hL5Om4IJ4


The weather prediction algorithm: the weather will be the same as yesterday. It's only wrong on transitions, which makes it very useful when you have runs of the same state. After that you implement a Markov chain [1].

https://en.wikipedia.org/wiki/Markov_chain


The C++ STL has something similar: insert operations accept an iterator as a hint:

https://en.cppreference.com/w/cpp/container/map/insert

In common implementations iterators are just node pointers, which is enough because nodes have parent pointers, so you can recover the whole path.

In extreme cases like ordered insertion this effectively shaves off a log(n) factor.


I really like that the hint is optional.

You don't always want the hint, because if you're jumping d steps ahead you actually need to touch about 2 log(d) nodes: log(d) to walk up and log(d) to walk back down. This means any time you're moving more than d = sqrt(n) steps away, you're better off just walking down from the root instead.

For example, if n=1,000,000 and the hot section is wider than d=1000, you're better off going from the root all the time. Of course, if you access items that are directly adjacent you should still use in-order iteration for those (but I see that as a ++ operator, not a separate lookup).


A similar idea is finger search (https://en.wikipedia.org/wiki/Finger_search), where you can turn a traversal of a tree from O(log(number of items in the tree)) into O(log(number of items between the finger and the key)).


A related thing: exponential search. It's the sorted-array equivalent.
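
For anyone unfamiliar, a sketch (assuming the target is at or after the starting position; a full finger search would also gallop backwards):

  // Exponential (galloping) search: from a known position, double the
  // step until the target is bracketed, then binary search the bracket.
  // Cost is O(log d) in the distance d moved, not O(log n) in the array
  // size. Assumes 0 <= start < len(keys) and the answer isn't before start.
  func gallopFrom(keys []uint64, start int, key uint64) int {
      lo, step := start, 1
      hi := lo + step
      for hi < len(keys) && keys[hi] < key {
          lo, step = hi, step*2
          hi = lo + step
      }
      if hi > len(keys) {
          hi = len(keys)
      }
      for lo < hi { // ordinary binary search inside [lo, hi)
          mid := (lo + hi) / 2
          if keys[mid] < key {
              lo = mid + 1
          } else {
              hi = mid
          }
      }
      return lo
  }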


I was going to link the same thing except I would say he is talking about exactly that without knowing the academic name for it.


I think you would find the splay tree provides similar properties, potentially in an even more elegant way.

For applications I have used it for, this technique results in the hottest data eventually being condensed into a smaller number of contiguous storage blocks.

The power of this technique is that it amortizes the optimization of the tree across all accesses (reads), not just inserts & deletions (or other special-purpose re-balancing operations).


I don't think this is about hot-data optimization as such though, but about very short-term local clustering of keys.

I.e. it's not that a:b:c and a:b:d are globally accessed more often than x:y:z, but that after a:b:c is accessed it's more likely that the next access is a:b:d than x:y:z.


It's also that they are different data structures used historically for different things, no?

It'd be interesting to see a high-fanout splay tree; I'm not even 100% sure how it'd work. Do you splay just the one node up to the root and have to split/combine on each splay? Or do you only splay the first node in a "chunk" up? Or the whole "chunk"? Is it better to do a B+ type of thing rather than a B-tree type of thing?


In theory, this could provide a dramatic performance uplift if the splay is done along the lines of an aggregate value function over the node contents, and the contents of the node are selected such that their aggregate value function is well-defined & ordered relative to other nodes. I've got an implementation I might try this with. There are a ton of directions you could take this in depending on specifically what you are trying to optimize for.

Looking at this from a mechanical-sympathy perspective: if you are unable to batch tree operations for whatever reason, increasing fanout is a good intermediate answer, since you can make your storage system do more effective work per I/O. You can't ever read or write in units smaller than one storage block, so maybe shoving more logical items into each block is where you need to look.

Practically, this is probably not feasible in a wide number of use cases. GUID/UUID keys (non-linear) would break this scheme in my mind.


The tree structure is invariant across the balancing/search algorithms. Maybe you can do the splay on loads and then rebalance. The question is whether this ends up being more costly over time. Now I wonder if there can be a hybrid splay with distinct sub-roots.


I love splay trees, skew heaps, and other such data structures that get very little love.

But I’ve actually tried to use splay trees in production code. The performance benefits versus a simple Patricia tree seemed totally absent in actual practice even for things like very frequent adjacent queries. The structure is neat but paying attention to memory layout and cache line size is far, far more important.


I use splay trees for their indirect utility. That is, the splay tree is a trivial way to calculate minimal, incremental updates to an append-only log. With most (all?) other b-tree structures, you would need to rewrite an unsustainable (or otherwise unpredictable) amount of tree data to the append-only log each time it is modified. The magic is that the splay operation redefines the root node every time you access the tree, so you just have to keep track of all the nodes you visited in each transaction batch and then write that sub-tree to the end of the log.

>paying attention to memory layout and cache line size is far, far more important.

Totally agree. For scenarios where I know I am going to have fewer than ~100k of something (assume an array of structs), I typically just do a linear scan over memory. If it's a bunch of objects in memory or a cold access to storage, the trees become far more practical.


Hey, I'm really curious about how that works. Could you go into a bit more detail into that?

From what I can gather from your comment, it seems that if you are given a splay tree and a batch of transactions, i.e.

  GET 1
  SET 10 "hello"
  GET 15
You would write all the visited nodes, in order? But splaying involves a lot of tree operations and rotations too. I don't get how that works.


> But splaying involves a lot of tree operations and rotations too. I don't get how that works.

Correct, but the whole idea is that over time the same subset of hot nodes is going to trickle to the top of the tree, so you are just rewriting node offsets in your temporary list as you process the batch. A node may be logically modified 100x per batch, and only written to storage once.
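
If it helps, the bookkeeping amounts to roughly this (very hand-wavy sketch; node serialization and log offsets are elided):

  type splayNode struct {
      key         uint64
      left, right *splayNode
  }

  type logWriter struct {
      dirty map[*splayNode]bool // nodes touched by splaying this batch
      log   []byte              // the append-only log
  }

  // Called for every node visited/rotated while processing the batch.
  func (w *logWriter) touched(n *splayNode) { w.dirty[n] = true }

  // At the end of the batch each dirty node is appended exactly once,
  // no matter how many times the splaying moved it around in memory.
  func (w *logWriter) commit() {
      for n := range w.dirty {
          w.log = append(w.log, encode(n)...)
      }
      // A real log would finish the batch with a footer naming the new
      // root's offset so readers know where the tree now starts.
      w.dirty = map[*splayNode]bool{}
  }

  // Placeholder: a real encoder would write the key and the children's
  // log offsets so the subtree can be reconstructed from the log.
  func encode(n *splayNode) []byte { return nil }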


One of the things that surprised me was that linear search on optimally laid out data was frequently faster than binary search with frequent pointer chasing.


It feels to me like this is just a kludge added to deal with the lack of a stateful iterator on top of the tree. In this use case, the author indicates that it is about exploiting locality in a request path. Imagine that the tree offered an iterator that maintained a stack of its path and that thing had a seek method. You can optimize that thing to only go back up to the root if it needs to. Stateful iterators make for a nice pairing with btrees. These other hints are papering over the lack of that abstraction as far as I can tell.
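
Something along these lines, as a toy sketch (B+-tree-ish layout assumed, sort.Search from the standard library; the fast path only re-descends from the root when the cursor's current leaf can't contain the key, and a fancier version would pop only as many stack frames as necessary):

  // Internal nodes hold separator keys with len(children) == len(keys)+1,
  // where each separator is the smallest key of the child to its right;
  // leaves hold the keys themselves. The cursor keeps the whole descent
  // path so a nearby Seek doesn't have to start from scratch.
  type bnode struct {
      keys     []uint64
      children []*bnode // empty for leaves
  }

  type frame struct {
      node *bnode
      idx  int
  }

  type cursor struct {
      root *bnode
      path []frame // the descent path; last frame is the current leaf
  }

  func (c *cursor) Seek(key uint64) bool {
      if len(c.path) > 0 {
          leaf := c.path[len(c.path)-1].node
          if len(leaf.keys) > 0 && key >= leaf.keys[0] && key <= leaf.keys[len(leaf.keys)-1] {
              // Fast path: if the key exists at all it's in the leaf the
              // cursor is already parked on, so skip the root descent.
              i := sort.Search(len(leaf.keys), func(i int) bool { return leaf.keys[i] >= key })
              c.path[len(c.path)-1].idx = i
              return leaf.keys[i] == key
          }
      }
      // Slow path: rebuild the stack by descending from the root.
      if c.root == nil {
          return false
      }
      c.path = c.path[:0]
      cur := c.root
      for len(cur.children) > 0 {
          i := sort.Search(len(cur.keys), func(i int) bool { return cur.keys[i] > key })
          c.path = append(c.path, frame{node: cur, idx: i})
          cur = cur.children[i]
      }
      i := sort.Search(len(cur.keys), func(i int) bool { return cur.keys[i] >= key })
      c.path = append(c.path, frame{node: cur, idx: i})
      return i < len(cur.keys) && cur.keys[i] == key
  }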


With a stateful iterator, you might need to worry about your cached pointers being invalidated by mutations from other callers.


Not to sound like a member of the Rust Evangelism StrikeForce, but Rust is able to offer stateful iterators over its `BTree` exactly by preventing mutation during iteration.


Doesn't that mean that iteration requires a write lock? That sounds bad for lots of applications.


The write-lock is checked entirely at compile time, so there's no performance overhead.


It sounds like that will prevent threads from mutating the tree in the (potential) presence of an iterator in some other part of the tree, which is prohibitively expensive for many applications.

Am I missing some clever trick?


No, that is correct.

There are ways to avoid locks, but the stdlib implementation doesn't do it. There are probably a bunch of crates that implement concurrent BTrees, though.


I really enjoyed this post and hope I get more like it in the future:

- It gave a high level, concise background overview of the base concept (B-trees) complete with diagrams

- It talked from first principles about how things worked, without really any jargon

- It explained simply a clever optimization

Thanks for posting!


I'm sort of surprised that the gains are so high. I'd tend to expect the cost of a pointer dereference in a large data structure to be bigger than the comparisons, even with the attendant branch mispredictions.

EDIT: Oh, duh, if the path hints are useful then cache locality means that the data you're looking for is already nearby. If the hint works the cost of the lookup will actually be dominated by the comparisons.


I'm not sure this is useful for a traditional database. There are lots of costs that are orders of magnitude more important.

This is a CPU optimization; in a DB, however, the major costs are 1) pulling blocks from disk and 2) transferring results over a network.

Even if you assume nodes are cached in memory, a B-Tree node for a database probably fits in L1 cache: SQLite's page size is 4096 bytes by default vs an 8KB+ L1 cache. Shouldn't you pay 100x the cost just to load the next node from memory, if you have a large tree? In theory that's <1% faster overall.

I'm curious whether the 3x improvement is overall, or just from the binary search. (According to the benchmark it is overall 2x faster; very curious.)

Also curious about the size of the tree in the benchmarks and the spread factor (i.e. how many children each B-Tree node has). I couldn't find this info in the repo. This could affect the % of time you can find the next node already in a cache.


I found the answers: the tree uses 256 children by default, implying a ~16KB node size on a 64-bit architecture. The size of the tree in the benchmark is 1M nodes, so a ~16GB tree. Pretty hefty tree! :)

It's the sequential insert/get benchmarks that benefit the most from path hints. You'd only have one hot path down the tree in this setup (the rightmost path). Is it possible the CPU is caching multiple non-contiguous blocks (i.e. the hot path)? That would explain why memory access isn't as much of a factor as I originally thought.

In other words, this technique may be useful in DBs! Disk and memory are both heavily cached by the CPU, so CPU optimizations can help for sequential/repeated local accesses.


It's 1M items, not nodes. Each item is a Go interface{}, about 16 bytes each. The benchmarks actually show that the tree is around 36MB.


Another B-tree-related post I appreciated recently: Evolution of tree data structures for indexing - https://news.ycombinator.com/item?id=27963753

I'd be interested in more posts on LSM trees.


An interesting idea would be to imitate filesystems and provide an actual "current working node" concept to the application. The current working node is a hint which contains a pointer to a B-tree node. This could be the leaf node where some item was previously found, or a node one level above the leaf.

When the search for an item fails from the current working node, it falls back on earnestly searching from the root.

If a node is split or merged by insertions or deletions, then if any current working node pointers to it exist, they get invalidated or updated somehow. (The B-Tree object could carry a small dynamic set of current working node items which it can update when the tree mutates.)
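
As a sketch of the fast path (invalidation bookkeeping elided, all names made up):

  // "Current working node" sketch: remember the leaf where the last item
  // was found; if the next key falls inside that leaf's key range, search
  // only the leaf and skip the walk from the root. Splits and merges must
  // clear or update cwn (not shown). descend stands in for the normal
  // root-to-leaf search.
  type leafNode struct {
      keys []uint64
  }

  type cwnTree struct {
      descend func(key uint64) *leafNode // ordinary search from the root
      cwn     *leafNode                  // current working node, may be nil
  }

  func (t *cwnTree) find(key uint64) (*leafNode, int, bool) {
      l := t.cwn
      if l == nil || len(l.keys) == 0 || key < l.keys[0] || key > l.keys[len(l.keys)-1] {
          l = t.descend(key) // miss: fall back to an honest root search
          t.cwn = l
      }
      for i, k := range l.keys { // leaf-local search
          if k >= key {
              return l, i, k == key
          }
      }
      return l, len(l.keys), false
  }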


Wow! Obvious in hindsight!

Do any databases use this or similar tricks?


BuntDB [0] from @tidwall uses this package as a backing data structure. And BuntDB is in turn used by Tile38 [1].

[0] https://github.com/tidwall/buntdb [1] https://github.com/tidwall/tile38


Yes. Even better, a good database will avoid traversing the blocks and path when the key is close enough to the previous lookup's hint, making all but the first lookup in a nearly-consecutive sequence take O(1) instead of O(log N), where N is the size of the tree; runs of n nearly-consecutive operations then take O(n + log N) instead of O(n log N).


The "path hint" sounds a lot like a cursor. Since a lot of b-tree traversal is ordered, knowing the last position in the tree is often valuable information.


In case others are unclear, since it doesn't seem to be explicitly stated in the docs: you update the path hint to be the path of the most recently found item. I think this is implied when he says "My path may be wrong, and if so please provide me with the correct path so I get it right the next time.", but you can see where `hint` is set in the implementation as well.


I think OP is using the wrong data structure, and his fix is bringing him closer to the ideal solution. You need a hash-like function that returns a search range to direct the binary search on the B-tree. This lets you add logic for how you want data to be searched in the future. In the base case it would work the same as the B-tree hint.


I love this. In general, I've found you can do amazing things to speed up common algorithms by tailoring them to your use case. It usually feels hacky, but I've found it to be very effective.


Back in the 1990s, I did something similar with SQL and it worked wonders.


I wonder how this would compare to a B-Tree where you've explicitly loaded large branches of the tree into CPU cache and perform multiple comparisons in a single CPU cycle.


I love b trees :)



