
And how do you remove the character leaving the window on the left? I don’t think this solution works the way you think it does…


Yes, I wrote that too quickly; you would have to XOR characters in and out instead. I doubt it is worth it vs. a popcnt check at that point.
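For concreteness, here is a minimal sketch of the XOR-in/XOR-out idea being discussed (my own code, not TFA's; it assumes a lowercase-ASCII alphabet so each character maps to a single bit):

```python
def first_unique_window(s: str, n: int) -> int:
    """Return the index just past the first run of n pairwise-distinct
    characters, or -1 if there is none."""
    if len(s) < n:
        return -1
    mask = 0
    for c in s[:n]:                       # XOR the first window in
        mask ^= 1 << (ord(c) - ord('a'))
    if bin(mask).count("1") == n:         # popcount == n  <=>  all distinct
        return n
    for i in range(n, len(s)):
        mask ^= 1 << (ord(s[i - n]) - ord('a'))   # XOR the leaving char out
        mask ^= 1 << (ord(s[i]) - ord('a'))       # XOR the entering char in
        if bin(mask).count("1") == n:
            return i + 1
    return -1
```

The mask tracks parity: a bit is set iff its character appears an odd number of times in the window. popcount == n forces n distinct characters with odd counts, and with only n characters total that means each appears exactly once.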

It doesn’t change the fact that the problem is incremental, and most importantly, you can early-exit a window as soon as you discover a single non-unique character. You don't have to wait until the end of the window. They don’t implement this for the hash set, for example.

Since most (n > 1) windows are not unique (more accurately, the probability of a window being unique decreases as the window size approaches the number of possible characters), early exit from windows will likely beat bit munging, except for small windows, until you get into SIMD solutions.
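To put a rough number on that claim: for i.i.d. uniform characters, the probability of an all-unique window is the usual birthday-problem expression (a quick sketch, assuming a 26-letter alphabet by default; the function name is my own):

```python
from math import perm

def p_unique(n: int, k: int = 26) -> float:
    # Probability that n i.i.d. uniform draws from a k-letter alphabet
    # are pairwise distinct: k * (k-1) * ... * (k-n+1) / k**n.
    return perm(k, n) / k ** n
```

For example, p_unique(4) is roughly 0.79, while p_unique(14) is under 2%, so at the larger window size the early exit fires in the vast majority of windows.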


Bit operations are fast, branches are slow.


Well, yes, but most of these branches can be if-converted into conditional moves anyway if you want.

So they become data dependencies rather than control dependencies. There are still some control ones.

Beyond that, let me be super clear: Imagine the following six versions:

1. One window at a time processing. No early exit, no bit munging.

2. One window at a time processing. No early exit, bit munging.

3. One window at a time processing. You don't use direct bit munging, but you early exit the window when you hit the non-unique character, and skip the window forward to the first instance of that non-unique character.

I.e., given "hmma<something>" with n = 4, you early-exit at the second m and skip processing forward to just after the first m (since no window starting before that can be unique: they would all contain the double m).

4. One window at a time processing, bit munging, same otherwise as #3

5. Sliding window processing, no bit munging

6. Sliding window processing, bit munging.

The speedup from versions 1-2 to versions 3-6 is much greater than that from 5 to 6 or from 3 to 4.

You could probably make 3 pretty darn competitive with 6, and 4 could probably beat 6 with SIMD (i.e., processing multiple windows in parallel really fast, possibly wasting work, vs. guaranteeing you only do the minimum work to find a unique window).
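A sketch of what version 3 could look like (my own Python, hypothetical names; plain comparisons, early exit, and the skip-forward described above):

```python
def first_unique_window_skip(s: str, n: int) -> int:
    """Scan each window left to right; on the first duplicate, early-exit
    and restart just past the first copy of the duplicated character."""
    start = 0
    while start + n <= len(s):
        seen = {}                   # char -> index within this window scan
        dup_at = -1
        for i in range(start, start + n):
            c = s[i]
            if c in seen:
                dup_at = seen[c]    # index of the duplicate's first copy
                break
            seen[c] = i
        if dup_at < 0:
            return start + n        # all n characters distinct
        start = dup_at + 1          # no window starting <= dup_at can be unique
    return -1
```

On the "hmma" example with n = 4 this bails at the second m and restarts the window just past the first m, exactly as described.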


I don't think you are correct. Certainly not with a window size of 4. For very large window sizes, the early return will be a big win. But you are really underrating how expensive the hashmap will be vs. "bit munging".

You could roll your own perfect hash, but you would end up with a solution that looks almost identical to TFA's. If you use a stdlib hash set, you are going to be chasing pointers all over memory to do a single insert/lookup. Dereferencing a single pointer that isn't in cache costs 50-100 clock cycles on a modern system. By the time you do one insert, TFA will have XOR'd at least 32 chars into its bitmask. And your cache won't be as nice as in TFA, slowing you down even more.

Given the two scenarios: 1) bit munging, no early return; 2) hash lookup, early return:

I would guess scenario 1 wins until the window size gets above ~256. And obviously, we can add an early return to scenario 1 to make it unquestionably the fastest.
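A sketch of what "scenario 1 plus an early return" could look like (my own code and naming, assuming lowercase ASCII): a membership bitmask per window, bailing on the first duplicate and skipping the start past that duplicate's first copy:

```python
def first_unique_window_mask_early(s: str, n: int) -> int:
    """Bitmask membership test per window, with early exit on the first
    duplicate and a skip-forward past its first occurrence."""
    start = 0
    while start + n <= len(s):
        mask = 0
        for i in range(start, start + n):
            bit = 1 << (ord(s[i]) - ord('a'))
            if mask & bit:                       # duplicate: early exit
                start = s.index(s[i], start) + 1  # skip past its first copy
                break
            mask |= bit
        else:
            return start + n                     # all n characters distinct
    return -1
```

Unlike the XOR/popcnt version, this never evaluates a full window once a duplicate is seen, at the cost of a branch per character.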



