RISC-V Vector Extension for Integer Workloads: An Informal Gap Analysis

janwas · on Nov 9, 2024

Great writeup! One comment: PSHUFB does seem important, rather than just "nice to have".

> This should be mostly mitigated because the base V extension only has the full lane-crossing-gather, hence a lot of software will use lane-crossing operations and force vendors to implement it competitively.

Is that actually what's happening? From the discussion you link (https://github.com/llvm/llvm-project/pull/104574), it seems like the X280 behavior is currently still widespread. Do we know when the competitive perf might materialize?

camel-cdr · on Nov 9, 2024

It's just the X280 and some of the open-source long-vector designs that have 1-cycle-per-element vrgather, AFAIK. The X280 is more of an accelerator than an actual end-user core. All other application-class cores I know of have a "fast" LMUL=1 vrgather.

The linked discussion was about LMUL>1, which doesn't matter because you can unroll it into multiple LMUL=1 vrgathers if you don't cross lanes. That even uses fewer vector registers. LMUL>1 vrgather should imo only be used if you need to actually cross vector register lanes.
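To make the unrolling argument concrete, here is a rough scalar model in C, not RVV intrinsics. It assumes 8-bit elements and indices, and the names `gather_model` and `gather_unrolled` are made up for illustration. As long as no index leaves its own register, one group-wide gather and `lmul` independent single-register gathers give the same result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of vrgather.vv with 8-bit elements over n elements
   (n plays the role of VLMAX): out-of-range indices produce 0.
   dst must not overlap src or idx. */
static void gather_model(uint8_t *dst, const uint8_t *src,
                         const uint8_t *idx, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = idx[i] < n ? src[idx[i]] : 0;
}

/* Same shuffle as `lmul` separate single-register gathers of `per_reg`
   elements each. Only valid if every index stays in its own register. */
static void gather_unrolled(uint8_t *dst, const uint8_t *src,
                            const uint8_t *idx, size_t lmul, size_t per_reg) {
    uint8_t local[256];
    /* 8-bit indices can address at most 256 elements */
    assert(lmul * per_reg <= sizeof local);
    for (size_t r = 0; r < lmul; r++) {
        size_t base = r * per_reg;
        for (size_t i = 0; i < per_reg; i++) {
            assert(idx[base + i] >= base && idx[base + i] < base + per_reg);
            local[i] = (uint8_t)(idx[base + i] - base); /* register-relative */
        }
        gather_model(dst + base, src + base, local, per_reg);
    }
}
```

With `per_reg = 16` and `lmul = 4` this models VLEN=128 at LMUL=4 versus four LMUL=1 gathers. In real code the register-relative indices would normally be prepared once rather than recomputed per register.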

I think a fast 128-bit lane shuffle is certainly non-negotiable for application-class processors, but I'm not sure a dedicated instruction is needed for them, because a fast LMUL=1 vrgather is similarly important there. You won't see many processors with VLEN larger than 512 running regular software from the binary ecosystem, and if they do, they'd better execute an LMUL=1 shuffle quickly, especially if it's within 128/256/512-bit lanes, whatever your internal execution unit width is.

janwas · on Nov 9, 2024

Good to know. There is more to shuffle instructions than just their speed, though. Arm TBX leaves out-of-bounds elements unchanged, and x86 PSHUFB zeros a result byte when bit 7 of its index is set and implicitly ignores index bits 4..6. Some SW does make use of that, so it seems unfortunate that RVV doesn't get either of those without extra effort.
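For reference, the two behaviours can be written down as scalar models. This is a sketch in C of the 128-bit PSHUFB and the single-register TBX semantics, not vendor code; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* x86 PSHUFB, 128-bit form: index bit 7 set gives 0, otherwise only the
   low 4 index bits select a table byte (bits 4..6 are ignored). */
static void pshufb_model(uint8_t dst[16], const uint8_t table[16],
                         const uint8_t idx[16]) {
    assert(dst != table && dst != idx); /* the model writes dst in place */
    for (int i = 0; i < 16; i++)
        dst[i] = (idx[i] & 0x80) ? 0 : table[idx[i] & 0x0F];
}

/* Arm TBX with one 16-byte table register: an out-of-range index leaves
   the destination byte unchanged. */
static void tbx_model(uint8_t dst[16], const uint8_t table[16],
                      const uint8_t idx[16]) {
    assert(dst != table && dst != idx);
    for (int i = 0; i < 16; i++)
        if (idx[i] < 16)
            dst[i] = table[idx[i]];
}
```

Note that an index such as 0x13 selects `table[3]` in the PSHUFB model but counts as out of range for TBX.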

camel-cdr · on Nov 9, 2024

It's probably close enough: vrgather zeros out-of-bounds elements and has a masked variant that leaves masked-off elements undisturbed.
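As a rough scalar sketch of that behaviour in C (8-bit elements, and assuming the mask-undisturbed policy is selected, since under mask-agnostic the masked-off elements are not guaranteed to be preserved); `vrgather_masked_model` is a made-up name:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a masked vrgather.vv with 8-bit elements under the
   mask-undisturbed policy. Active elements gather, with indices >= vlmax
   reading as 0; masked-off elements keep their old dst value. */
static void vrgather_masked_model(uint8_t *dst, const uint8_t *src,
                                  const uint8_t *idx, const bool *mask,
                                  size_t vl, size_t vlmax) {
    assert(vl <= vlmax && dst != src);
    for (size_t i = 0; i < vl; i++)
        if (mask[i])
            dst[i] = idx[i] < vlmax ? src[idx[i]] : 0;
}
```

A TBX-like result would come from setting `mask[i]` to `idx[i] < 16`, while the unmasked zeroing only roughly approximates PSHUFB's bit-7 behaviour.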

janwas · on Nov 10, 2024

Right, though the out-of-bounds threshold depends on VLMAX, which makes it less generally useful than a known 128-bit limit. It also requires at least one extra instruction to get a mask.
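A small scalar sketch in C of why the threshold matters. In the RVV spec the zeroing threshold for vrgather is VLMAX, so the same index byte can be zeroed on one machine and in range on another. The model below assumes 8-bit elements and a single 16-byte table held in `src[0..15]`; `pshufb_via_vrgather` is a hypothetical helper, and its extra AND stands in for the additional instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Unmasked vrgather.vv model, 8-bit elements: the zeroing threshold is
   vlmax, which varies with VLEN and LMUL. */
static void vrgather_model(uint8_t *dst, const uint8_t *src,
                           const uint8_t *idx, size_t vlmax) {
    assert(dst != src && dst != idx);
    for (size_t i = 0; i < vlmax; i++)
        dst[i] = idx[i] < vlmax ? src[idx[i]] : 0;
}

/* Hypothetical PSHUFB-style lookup in a 16-byte table held in src[0..15].
   The extra AND keeps bit 7 and the low 4 bits, so in-range lookups stay
   inside the table; zeroing still relies on 0x80 >= vlmax. */
static void pshufb_via_vrgather(uint8_t *dst, const uint8_t *src,
                                const uint8_t *idx, size_t vlmax) {
    uint8_t fixed[128];
    assert(vlmax >= 16 && vlmax <= sizeof fixed);
    for (size_t i = 0; i < vlmax; i++)
        fixed[i] = idx[i] & 0x8F;
    vrgather_model(dst, src, fixed, vlmax);
}
```

With index 0x13, the plain gather yields 0 when `vlmax` is 16 but `src[19]` when `vlmax` is 32; only the version with the extra AND gives the PSHUFB answer `src[3]` in both cases. Once `vlmax` exceeds 128, even index 0x80 is in range and the trick stops working.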