The way I see it, Forth is an architecture-independent assembly language, optimized for ease of bootstrapping on new hardware at the cost of efficiency.
With a threaded-code interpreter, yes, Forth's overhead is considerable, but with a native-code Forth compiler, performance is generally only slightly worse than C.
This is presumably because modern optimising C compilers are incredibly sophisticated, whereas modern Forth engines tend to be essentially one-man projects, even the commercial ones.
> performance is generally only slightly worse than C
I've built a couple of projects in both C and Forth, usually prototyping in Forth and then, once I had a good understanding of the problem space, writing the final version in C. Typically the C version outperformed the Forth version by 3:1 or better, and I would not have known how to bridge that gap.
Well-written C is quite performant, and Forth's overhead (relatively small, but it is there) comes down to the threaded interpreter executing a JMP instruction at least once per Forth word; that alone is probably enough to explain a good chunk of the gap. FWIW, my experience with Forth is limited to one implementation and a whole bunch of applications (after being introduced to the language by a veritable Forth wizard in 1985 or so; hi, LS). Even code from experienced Forth programmers was no match for the C I wrote with my then six years of experience. Nowadays with far larger caches Forth might do better, I haven't really worked with it for years.
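To make the per-word dispatch cost concrete, here's a minimal sketch of a threaded-code inner interpreter in portable C. Everything in it (the `VM` struct, the `w_*` words, `run_threaded`, `demo_thread`) is a hypothetical name made up for illustration, not taken from any real Forth. It also uses an indirect function call per word rather than the bare JMP a real assembly NEXT routine would use, so treat it as an approximation of the technique rather than how any particular engine does it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature VM; names and layout are illustrative only. */
typedef struct VM VM;
typedef void (*Word)(VM *);

/* A cell in the thread is either a word to execute or an inline literal. */
typedef union Cell {
    Word word;
    long literal;
} Cell;

struct VM {
    long stack[16];
    int sp;           /* number of items currently on the data stack */
    const Cell *ip;   /* next cell in the thread */
    int running;
};

static void push(VM *vm, long v) { assert(vm->sp < 16); vm->stack[vm->sp++] = v; }
static long pop(VM *vm)          { assert(vm->sp > 0);  return vm->stack[--vm->sp]; }

/* Primitive words. w_lit consumes the following cell as a literal. */
static void w_lit(VM *vm)  { push(vm, (vm->ip++)->literal); }
static void w_dup(VM *vm)  { long a = pop(vm); push(vm, a); push(vm, a); }
static void w_mul(VM *vm)  { long b = pop(vm); long a = pop(vm); push(vm, a * b); }
static void w_add(VM *vm)  { long b = pop(vm); long a = pop(vm); push(vm, a + b); }
static void w_exit(VM *vm) { vm->running = 0; }

/* Inner interpreter ("NEXT"): one indirect branch per word executed. */
long run_threaded(const Cell *thread) {
    VM vm = { .sp = 0, .ip = thread, .running = 1 };
    while (vm.running) {
        Word w = (vm.ip++)->word;  /* fetch the next word from the thread */
        w(&vm);                    /* dispatch: the per-word overhead */
    }
    return pop(&vm);
}

/* Thread for the Forth phrase: 6 DUP * 1 + */
const Cell demo_thread[] = {
    { .word = w_lit }, { .literal = 6 },
    { .word = w_dup }, { .word = w_mul },
    { .word = w_lit }, { .literal = 1 },
    { .word = w_add }, { .word = w_exit },
};

/* Roughly what a native-code compiler can reduce the same phrase to. */
long run_native(void) {
    long x = 6;
    return x * x + 1;
}
```

Calling `run_threaded(demo_thread)` walks eight cells and makes six indirect calls plus the stack traffic, while `run_native()` is something an optimising C compiler can fold to a constant. That difference, repeated for every word, is the kind of gap I was talking about.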
With respect, you've ignored the point I was making. There exist several Forth engines with native-code compilation, for instance VFX Forth, SwiftForth, and iForth.
> Typically the C version outperformed the Forth version by 3:1 or better, and I would not have known how to bridge that gap.
With a threaded-code Forth interpreter I'd expect the C version to outperform it by something closer to 5:1, so 3:1 doesn't sound too bad. The only way to close the gap is with good-quality native-code compilation.
> Nowadays with far larger caches Forth might do better, I haven't really worked with it for years.
It's interesting how advances in CPU architecture change the relative performance of the different threading strategies. This has been nicely studied by the gforth folks. [0][1] Threaded-code interpreters still easily lose to optimising native-code compilers though, [2] and I expect they always will.
More on how Forth collides with low-level CPU matters: [3][4][5]