SSE SIMD instructions should be unencumbered by patents at this point. So it's worthwhile to "bite the bullet" rather than do something like -mgeneral-regs-only. SIMD opcodes frequently give me 10x or even 54x speedups for the functions I write.
I know this isn't Rust, but if anyone needs a more readable explanation of what the SIMD opcodes actually mean (i.e. more readable than the Intel manual) then check out the cosmopolitan codebase. I implemented all of them in ANSI C and called it ANSITRINSICS, just so I didn't have to use Intel's ugly intrinsics API and wouldn't have to feel guilty about muh portability. It's well-tested and fuzzed against the real thing, so I hope it can serve as excellent documentation for anyone who wants to understand the great mystery of vector instructions: https://github.com/jart/cosmopolitan/tree/master/libc/intrin
A good example is PUNPCKLBW, which is such an important instruction, but a name like that is very intimidating to someone who hasn't rote-memorized its meaning. I think ANSI C does a much better job explaining what these instructions do than the pictures in the Intel manual do. https://github.com/jart/cosmopolitan/blob/48645651983ca46a31... Particularly when it comes to ones like PMADDUBSW, which is without a doubt the greatest of the SSE instructions. https://github.com/jart/cosmopolitan/blob/48645651983ca46a31...
In many cases, if you compile these ansitrinsics with -ftree-vectorize, you'll notice that the compiler recognizes the ANSI C boilerplate and compiles it down into the underlying SIMD instruction! I think GCC and Clang should be hard-coded to recognize all of them. That way we can write truly portable code that's fast too.
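To make that concrete, here's a rough plain-C stand-in for PUNPCKLBW (interleave the low eight bytes of two 16-byte vectors). This is my own minimal sketch for illustration, not the actual ANSITRINSICS code; the function name and signature are made up, so see the linked libc/intrin directory for the real, tested versions.

    #include <stdint.h>

    /* Sketch of PUNPCKLBW semantics: take the low 8 bytes of a and the
       low 8 bytes of b and interleave them into a 16-byte result, i.e.
       a0,b0,a1,b1,...,a7,b7.  (Hypothetical name and signature.) */
    static void punpcklbw_demo(uint8_t c[16], const uint8_t a[16],
                               const uint8_t b[16]) {
      uint8_t t[16];
      int i;
      for (i = 0; i < 8; ++i) {
        t[2 * i + 0] = a[i]; /* even output lanes from the first operand */
        t[2 * i + 1] = b[i]; /* odd output lanes from the second operand */
      }
      for (i = 0; i < 16; ++i) c[i] = t[i]; /* copy out so c may alias a or b */
    }

A loop like that is the kind of boilerplate a vectorizer can pattern-match back into the single instruction.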
Trying to understand what SIMD intrinsics offer in a 'libc' that SWAR couldn't do? For absolutely massive inputs to strlen, etc., sure, but IIRC the average strlen is something like 10 or less.
In the case of strlen, SSE goes a few cycles slower for n<8, gains a slight advantage over SWAR for n>8, and starts going ~twice as fast for n>255. Benchmarks: https://justine.lol/swarsimd.txt The difference is probably more pronounced in functions that do block copying like https://github.com/jart/cosmopolitan/blob/master/libc/str/st... (which actually uses the ANSITRINSICS library). For more complicated functions like this: https://github.com/jart/cosmopolitan/blob/b3838173ecbb67fae5... SIMD has the advantage that figuring out how to do the SWAR arithmetic correctly might be more difficult than figuring out how to use the hardware API. It goes without saying that the most compelling benefits of SSE are the instructions like PMADDUBSW which can be used to great effect for signal decimation https://github.com/jart/cosmopolitan/blob/b3838173ecbb67fae5... assuming you use The Perfect Kernel {-1, -3, 3, 17, 17, 3, -3, -1} http://www.johncostella.com/magic/ or PADDSW which can mix audio samples real fast doing saturating arithmetic. You can't SWAR that. Another great example is carryless multiplication instructions which literally make CRC32 computation go 54x faster.
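For reference, the SWAR side of that comparison is usually built on the classic "does this word contain a zero byte" trick. A rough word-at-a-time strlen sketch (mine, for illustration, not cosmopolitan's implementation) looks like:

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* SWAR strlen sketch.  A 64-bit word w contains a zero byte iff
       (w - 0x0101..01) & ~w & 0x8080..80 is nonzero. */
    static size_t swar_strlen(const char *s) {
      const char *p = s;
      uint64_t w;
      while ((uintptr_t)p & 7) {          /* go byte-at-a-time until aligned */
        if (!*p) return (size_t)(p - s);
        ++p;
      }
      for (;;) {
        memcpy(&w, p, 8);                 /* aligned load, never crosses a page */
        if ((w - 0x0101010101010101ull) & ~w & 0x8080808080808080ull) break;
        p += 8;
      }
      while (*p) ++p;                     /* pin down which byte was the NUL */
      return (size_t)(p - s);
    }

The saturating and multiply-accumulate instructions in the parent have no comparably cheap trick like this, which is why they're the compelling part of SSE.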
What's the general process for proving to yourself that your hand-coded vectorization produces the same output and is faster than a more straightforward implementation?
Seems like if you have such an implementation then it can be used for multiple tasks: running on weird platforms, checking to see if autovectorization is good enough for your current use case, and automated testing of the optimized routines against a slower, simpler version; this verification process is just one more thing that would be able to take advantage of it. Yet it would seem that this generally doesn't exist or at least there's no standard way to do it, since they propose adding a standard compiler flag to disable the optimization but currently need to creatively work around the lack of one.
Do people generally assume it's going to be faster and just jump into the vectorized stuff and treat that as the gold standard with no plan for reversing or revisiting that decision later? Seems like a good bet in the short term, but maybe a bad bet in the long term.
Or is this all secretly a plot by chip manufacturers to create lock-in by writing fast but platform specific libraries?
> What's the general process for proving to yourself that your hand-coded vectorization produces the same output and is faster than a more straightforward implementation?
> Yet it would seem that this generally doesn't exist or at least there's no standard way to do it
In the context of hand-writing explicitly vectorized code, it's not that hard to make sure there's a way to disable the vectorized path. The issue in the OP is that there's no standard way to do that for everything everywhere. While having such a standard way to do it would be useful for testing and would reduce some friction in the process, it's not a Huge Deal.
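As a sketch of what that looks like in practice (the names and the DISABLE_SIMD switch are made up for illustration): a compile-time escape hatch plus a randomized differential test against the simple version.

    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdlib.h>
    #if defined(__SSE2__) && !defined(DISABLE_SIMD)
    #include <emmintrin.h>
    #endif

    /* Reference implementation: obviously correct, easy to review. */
    static size_t count_byte_ref(const uint8_t *p, size_t n, uint8_t b) {
      size_t k = 0;
      for (size_t i = 0; i < n; ++i) k += (p[i] == b);
      return k;
    }

    /* Optimized path; building with -DDISABLE_SIMD forces the reference. */
    static size_t count_byte(const uint8_t *p, size_t n, uint8_t b) {
    #if defined(__SSE2__) && !defined(DISABLE_SIMD)
      size_t k = 0, i = 0;
      __m128i needle = _mm_set1_epi8((char)b);
      for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(p + i));
        unsigned m = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(v, needle));
        k += (size_t)__builtin_popcount(m);   /* GCC/Clang builtin */
      }
      for (; i < n; ++i) k += (p[i] == b);    /* scalar tail */
      return k;
    #else
      return count_byte_ref(p, n, b);
    #endif
    }

    int main(void) {
      static uint8_t buf[4096];
      for (int t = 0; t < 100000; ++t) {      /* randomized differential test */
        size_t n = (size_t)rand() % sizeof(buf);
        for (size_t i = 0; i < n; ++i) buf[i] = (uint8_t)rand();
        uint8_t b = (uint8_t)rand();
        assert(count_byte(buf, n, b) == count_byte_ref(buf, n, b));
      }
      return 0;
    }

Build it twice, once plain and once with -DDISABLE_SIMD, and the reference path doubles as both the fallback and the test oracle.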
> Do people generally assume it's going to be faster and just jump into the vectorized stuff and treat that as the gold standard with no plan for reversing or revisiting that decision later? Seems like a good bet in the short term, but maybe a bad bet in the long term.
No? Vectorization usually requires some kind of platform or ISA extension specific stuff. So if you have it, it's usually conditional and there's usually a "slow" path.
In many cases, the perf benefits of vectorization are really really really clear.
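For what the "conditional with a slow path" pattern typically looks like, here's a minimal sketch using GCC/Clang's __builtin_cpu_supports for runtime detection; the function names are invented for the example.

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Portable slow path: always correct, always available. */
    static size_t sum_bytes_scalar(const uint8_t *p, size_t n) {
      size_t s = 0;
      for (size_t i = 0; i < n; ++i) s += p[i];
      return s;
    }

    /* Fast path compiled with AVX2 codegen even if the file isn't built
       with -mavx2 (GCC/Clang "target" attribute). */
    __attribute__((target("avx2")))
    static size_t sum_bytes_avx2(const uint8_t *p, size_t n) {
      size_t i = 0, s = 0;
      __m256i acc = _mm256_setzero_si256();
      const __m256i zero = _mm256_setzero_si256();
      for (; i + 32 <= n; i += 32) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(p + i));
        acc = _mm256_add_epi64(acc, _mm256_sad_epu8(v, zero)); /* 4 partial sums */
      }
      uint64_t lane[4];
      _mm256_storeu_si256((__m256i *)lane, acc);
      s = lane[0] + lane[1] + lane[2] + lane[3];
      for (; i < n; ++i) s += p[i];                             /* scalar tail */
      return s;
    }

    size_t sum_bytes(const uint8_t *p, size_t n) {
      if (__builtin_cpu_supports("avx2"))   /* runtime check, scalar fallback */
        return sum_bytes_avx2(p, n);
      return sum_bytes_scalar(p, n);
    }

Real libraries usually hoist the feature check into an ifunc or a function pointer set once at startup rather than branching per call.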
> Or is this all secretly a plot by chip manufacturers to create lock-in by writing fast but platform specific libraries?
No? There do exist various SIMD abstractions that present a common interface to vector instructions across many targets. In cases where I can use one (once std::simd is a thing in Rust), I will definitely do so. But the SIMD abstractions generally won't expose everything, just the things the targets have in common.
The RVT project does a really fascinating mix of groundwork and plumbing to make it useful. I like the idea of being able to turn randomized testing into verification. Having different backends for cargo-verify and the program annotations is also a nice way of unifying the various projects while still allowing innovation to happen in a modular manner.