I would love to see an IXIA or Spirent line-rate RFC 2544 test with IMIX packet sizes. Most servers and OS stacks cannot do line rate at 100 Gbit/s today without a ton of tweaking. Netflix has a few blogs on it. In my current company we moved our NAS to 2x100G, but the build servers at 10G and 25G (Ubuntu) cannot max a link in their normal operation (which includes pulling 4 GB VMs). There is some low-hanging fruit here tweaking-wise we can do, but...
The NAS is TrueNAS with SSDs and all the ZFS goodness, ZIL, etc. I can build hand-crafted iperf streams between it and another 100G server that hit line rate (including overhead).
The reason for the 100G is that there are racks and racks of build servers that pull down their images, so in aggregate we can hit 100G across a fully non-blocking Clos fabric.
You're kind of mixing different things here. Endpoints (like Netflix, or your NAS and build servers) have completely different characteristics than switches/routers. Likewise, interrupt-driven in-kernel networking is very different from, and roughly 10x slower than, polling-based kernel bypass. Doing 10G routing per core using polling was demonstrated years ago, so 100G really isn't an extraordinary performance claim.
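To illustrate the polling model: kernel-bypass stacks basically have a core spin on a descriptor ring in shared memory instead of sleeping on interrupts. A minimal sketch, with the ring and packet types made up for the example (not any particular framework's API):

    #include <stdatomic.h>
    #include <stdint.h>

    /* Hypothetical RX descriptor ring, standing in for a NIC ring mapped
       into user space by a kernel-bypass driver. */
    #define RING_SIZE 256

    struct pkt_desc {
        uint32_t len;
        uint8_t  data[2048];
    };

    struct rx_ring {
        _Atomic uint32_t head;          /* advanced by the "NIC" (producer) */
        uint32_t         tail;          /* owned by the polling core        */
        struct pkt_desc  slots[RING_SIZE];
    };

    /* Busy-poll loop: no syscalls, no interrupts, just spinning on memory.
       Interrupt-driven kernel I/O instead sleeps until the NIC signals it. */
    static void poll_loop(struct rx_ring *ring)
    {
        uint64_t bytes = 0;

        for (;;) {
            uint32_t head = atomic_load_explicit(&ring->head, memory_order_acquire);

            while (ring->tail != head) {
                struct pkt_desc *d = &ring->slots[ring->tail % RING_SIZE];
                bytes += d->len;        /* stand-in for actual forwarding work */
                ring->tail++;
            }
            /* A real driver might pause/yield here when the ring stays empty. */
        }
    }

The point is structural: the hot path never leaves user space, so the per-packet cost is a few memory accesses instead of an interrupt plus a syscall.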
> The reason for the 100G is that there are racks and racks of build servers that pull down their images, so in aggregate we can hit 100G across a fully non-blocking Clos fabric.
It may not pan out in practice, but in theory isn't that the perfect use case for P2P image distribution?
As an alternative, back in the mid-2000s there was a multicast system image distribution tool called flamethrowerd. It was used to netboot clusters of Linux & Mac machines. Probably abandoned, but still available for download.[1] I wonder what happened to that tool, or if there are any good modern replacements in common use. (It was a single purpose UDP multicast file distribution tool.)
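No idea what flamethrowerd looked like internally, but the basic mechanism is just UDP multicast. A minimal sender sketch of the idea (the group address, port, and chunk size are arbitrary picks for the example, not the tool's actual protocol):

    /* Minimal UDP multicast sender sketch: streams a file to a multicast
       group in fixed-size datagrams. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <file>\n", argv[0]);
            return 1;
        }

        FILE *f = fopen(argv[1], "rb");
        if (!f) { perror("fopen"); return 1; }

        int s = socket(AF_INET, SOCK_DGRAM, 0);
        if (s < 0) { perror("socket"); return 1; }

        unsigned char ttl = 8;                     /* keep it inside the site */
        setsockopt(s, IPPROTO_IP, IP_MULTICAST_TTL, &ttl, sizeof(ttl));

        struct sockaddr_in group = {0};
        group.sin_family = AF_INET;
        group.sin_port = htons(9000);              /* arbitrary port */
        inet_pton(AF_INET, "239.1.2.3", &group.sin_addr);  /* arbitrary group */

        char buf[1400];                            /* stay under a typical MTU */
        size_t n;
        while ((n = fread(buf, 1, sizeof(buf), f)) > 0) {
            if (sendto(s, buf, n, 0, (struct sockaddr *)&group, sizeof(group)) < 0) {
                perror("sendto");
                break;
            }
            /* A real tool adds sequencing, pacing, and retransmit/FEC on top. */
        }

        fclose(f);
        close(s);
        return 0;
    }

The win is the same as with flamethrowerd: one transmission serves every listener on the fabric, so aggregate demand stops scaling with the number of build servers.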
This is the first thing I noticed! I guess the reason is maybe just that the creator used to live in Sweden (as indicated in the comments below), but I would like to believe this is some inside joke among the Swiss workers. As a Swede, I find it hilarious. :)
edit: Looking at the older threads linked above I found this:
Luxembourg is .lu, but its language code is 'lb'. It causes all kinds of funny bugs when people set their language code to 'lu' and end up with things like date strings automatically translated into Luba-Katanga, a very melodic and completely different language spoken in the Democratic Republic of the Congo.
Interesting, thanks for sharing. It's always nice to see Lua in the wild.
This headline immediately reminded me of this[0] talk from CCC 2018 comparing different high-level languages for writing userspace network drivers.
We demonstrated 1 Tb/s back in 2017 on a high-end dual-socket Broadwell server, and performance has increased steadily since then (thanks to both HW and SW).
It is hard to compare without knowing packet sizes (64-byte? IMIX? 1500-byte?), CPU type, number of cores, and workload (L2 patch? L2 switching with MAC learning and 1M entries? IPv6 forwarding with 500K routes? etc.).
This is why we have the FD.io CSIT project hosted by the Linux Foundation, where we can compare lots of interesting, automated, reproducible benchmarking scenarios in a controlled environment on various HW platforms: https://docs.fd.io/csit/master/report/
C probably won't give you much of a difference. Assembly, carefully tweaked, might - the kind of stuff like Quake issuing an FDIV every 16 pixels in its software renderer, where the assembly intimately matches various aspects of the microarchitecture.
Despite years of propaganda, C is not well matched to the CPUs currently in use (quite the opposite in places), and the typical optimizations don't necessarily work when dealing with the external I/O that you need to do in a switch.
Essentially, even if you write in C, reaching higher speeds will involve using "assembly masquerading as C" rather than depending on compiler optimizations.
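To make "assembly masquerading as C" concrete: in practice it tends to mean compiler builtins and hints that encode microarchitectural knowledge rather than portable C. A hedged sketch (the prefetch distance and the checksum loop are invented for illustration; real values get tuned per CPU model):

    #include <stddef.h>
    #include <stdint.h>

    /* The C syntax here is incidental; the real content is cache and branch
       behaviour encoded via builtins. PREFETCH_DIST is an illustrative value. */
    #define PREFETCH_DIST 8

    struct pkt {
        uint8_t *data;
        uint16_t len;
    };

    static uint64_t checksum_batch(struct pkt *restrict pkts, size_t n)
    {
        uint64_t acc = 0;

        for (size_t i = 0; i < n; i++) {
            /* Pull a future packet's data into cache while working on this one. */
            if (i + PREFETCH_DIST < n)
                __builtin_prefetch(pkts[i + PREFETCH_DIST].data, 0 /* read */, 1);

            /* Tell the compiler which way the branch almost always goes. */
            if (__builtin_expect(pkts[i].len >= 14, 1)) {
                const uint8_t *p = pkts[i].data;
                for (uint16_t j = 0; j < pkts[i].len; j++)
                    acc += p[j];
            }
        }
        return acc;
    }

None of this is something the optimizer finds on its own; you are steering the hardware through C spelling, which is the point being made above.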
Also, Snabb uses LuaJIT, which already generates quite tight code, so the performance gap that I suspect some imagine just isn't that wide.
>C is not well matched to CPUs
+
>depending on compiler optimizations
+
>Snabb uses LuaJIT .. quite tight code
==
You can write great C-based systems and avoid assembly if you a) know, always, what your compiler is doing and b) know, always, what your CPU is doing...
My point was that the same optimizations that made C fast break down when you need to take into account the often intricate dance between CPU caches, memory, the I/O bus, etc. - so unless you go into CPU-model-specific assembly tweaks, just using C might not bring you as much benefit.
Is it possible to get better? Yes. Would it count as "normal C"? I would say not really (if we say yes, then CL code on SBCL with huge custom VOPs counts).
I'd be interested to learn about significantly complex, nontrivial systems of, say, 100K to 1M LOC scale that require reasoning about every single instruction from the perspective of every other instruction in order for the system to work.
You do not do that. Instead you optimize the short, critical part.
Of course it does not apply to everything - you need a few hotspots - but it is quite common: audio/video codecs, scientific computation, games, crypto... And even networking.
I think the answer here is that LuaJIT is fast, and that well-written native programs would still be faster, not that C "isn't well matched to CPUs". Modern optimization is more about memory access patterns than anything else, with SIMD and concurrency beyond that. Focusing on assembly is really not the apex it used to be. For starters, CPUs have multiple integer and floating-point units, and they get scheduled in an out-of-order CPU. Out-of-order execution is as much about keeping the various units busy as it is about doing loads as soon as possible to avoid stalling.
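As a small illustration of the memory-access-pattern point: the same trivial loop over two data layouts behaves very differently for the cache. Field names and sizes here are invented for the example:

    #include <stddef.h>
    #include <stdint.h>

    /* Array-of-structs: each iteration drags a whole record through the
       cache just to read one 2-byte field. */
    struct flow_aos {
        uint8_t  dst_mac[6];
        uint8_t  src_mac[6];
        uint32_t hash;
        uint64_t last_seen;
        uint16_t len;
        uint8_t  pad[40];
    };

    static uint64_t sum_len_aos(const struct flow_aos *flows, size_t n)
    {
        uint64_t total = 0;
        for (size_t i = 0; i < n; i++)
            total += flows[i].len;        /* strided access, poor cache use */
        return total;
    }

    /* Struct-of-arrays: the hot field is contiguous, so the loop streams
       through memory and vectorizes easily. */
    struct flow_soa {
        uint16_t *len;
        uint64_t *last_seen;
        size_t    count;
    };

    static uint64_t sum_len_soa(const struct flow_soa *flows)
    {
        uint64_t total = 0;
        for (size_t i = 0; i < flows->count; i++)
            total += flows->len[i];       /* sequential access, cache-friendly */
        return total;
    }

Both loops are "plain C"; the performance difference comes entirely from how the data is laid out, not from dropping to assembly.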
I think if you are going to claim that C or C derivatives aren't actually fast and the idea that they are is due to "propaganda" then you should back that up with something concrete, because it goes against a lot of established expertise.
Currently that user is almost always the administrator (root). BPF is more safe than secure, if that distinction makes some sense.
So it's great because the eBPF verifier makes sure that whatever the user loads doesn't blow up the kernel immediately; it's very restricted (no infinite loops, so not Turing complete, can only call predefined helper functions, etc.). But it still gets JITed into machine code, and still has access to a vast trove of kernel gadgets and data. So it's like a power tool that was designed to be almost impossible to cut your own limbs off with.
But it's still a power tool; it's still not something you give to any random passer-by.
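For reference, the restricted dialect being described looks roughly like this: a toy XDP program (IPv4-only pass filter) that the verifier only accepts because every packet access is bounds-checked. Build details (clang -target bpf, libbpf headers) are assumed:

    /* Toy XDP program: pass IPv4, drop everything else. Shown only to
       illustrate the restricted dialect the verifier accepts. */
    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <bpf/bpf_endian.h>
    #include <bpf/bpf_helpers.h>

    SEC("xdp")
    int ipv4_only(struct xdp_md *ctx)
    {
        void *data     = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;
        struct ethhdr *eth = data;

        /* The verifier rejects the program unless this bounds check exists. */
        if ((void *)(eth + 1) > data_end)
            return XDP_DROP;

        if (eth->h_proto != bpf_htons(ETH_P_IP))
            return XDP_DROP;

        return XDP_PASS;
    }

    char LICENSE[] SEC("license") = "GPL";

Harmless-looking, but once loaded it runs as JITed machine code on every packet, which is exactly the power-tool property being discussed.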
Original BPF was safe by design (no verifier needed) but slow and far more restrictive, and even then there were exploits over the years. When eBPF and then eBPF JIT became a thing it was entirely predictable what would happen.