Show HN: 26x speedup on BitNet sparse ops with AVX-512 and 2-bit encoding

I've been optimizing ternary operations for BitNet b1.58 and found significant overhead in the current implementation.

I wrote a dependency-free C kernel (sparse-ternary-fma) using 2-bit encoding and AVX-512 instructions.

Benchmarks on Intel Xeon (N=4096):

Throughput (Dense): 2.38x faster (8.21 GFLOPS vs 3.45 GFLOPS for the AVX2 baseline)

Throughput (Sparse, 80% zeros): 26.12x faster (23.25 GFLOPS vs 0.89 GFLOPS for the scalar baseline)

Memory: 4x denser (2 bits per weight vs the standard 8 bits)

This approach packs 4 trits per byte and leverages sparsity-aware FMA to skip zero-valued weights, which is critical for 1.58-bit quantization efficiency.
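To make the idea concrete, here is a minimal scalar sketch of 2-bit packing and a zero-skipping dot product. It is not the code from the PR: the code mapping (00 = 0, 01 = +1, 10 = -1), the function names pack_trits and ternary_dot, and the per-byte skip are all assumptions for illustration, and the real kernel does this with AVX-512 rather than a scalar loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed 2-bit codes (hypothetical; the actual kernel may map them differently). */
enum { TRIT_ZERO = 0x0, TRIT_POS = 0x1, TRIT_NEG = 0x2 };

/* Pack n ternary weights (-1, 0, +1) into (n + 3) / 4 bytes, 4 trits per byte.
 * Weight i goes into bits [2*(i%4), 2*(i%4)+1] of byte i/4. */
static void pack_trits(const int8_t *w, size_t n, uint8_t *out)
{
    for (size_t i = 0; i < (n + 3) / 4; i++)
        out[i] = 0;
    for (size_t i = 0; i < n; i++) {
        assert(w[i] >= -1 && w[i] <= 1);
        uint8_t code = (w[i] > 0) ? TRIT_POS : (w[i] < 0) ? TRIT_NEG : TRIT_ZERO;
        out[i / 4] |= (uint8_t)(code << (2 * (i % 4)));
    }
}

/* Dot product of packed ternary weights with float activations.
 * A byte equal to 0 means all four of its weights are zero, so it is skipped
 * without touching the activations. No multiplies are needed: +1 adds, -1 subtracts. */
static float ternary_dot(const uint8_t *packed, const float *x, size_t n)
{
    float acc = 0.0f;
    for (size_t b = 0; b < (n + 3) / 4; b++) {
        uint8_t byte = packed[b];
        if (byte == 0)
            continue;
        for (size_t k = 0; k < 4 && 4 * b + k < n; k++) {
            unsigned code = (byte >> (2 * k)) & 0x3u;
            if (code == TRIT_POS)
                acc += x[4 * b + k];
            else if (code == TRIT_NEG)
                acc -= x[4 * b + k];
        }
    }
    return acc;
}
```

With this layout, 4096 weights fit in 1024 bytes instead of 4096 at 8 bits each, which is where the 4x figure comes from. A vectorized version would presumably make the skip decision per register-width block rather than per byte, but that detail is a guess on my part.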

A PR is pending on the Microsoft BitNet repo. The code is open source here: https://github.com/microsoft/BitNet/pull/365