Hi, love the article. You mention in the article that a hardware mechanism for t...

Sesse__ · 2025-02-17T18:38:14 1739817494

Intel PT is indeed useful (although very, very slow compared to regular sampling profiling), but there are hardly any CPUs that actually implement PTWRITE. (IIRC there's some obscure Xeon or something?)

Typically you get a cycle count every six branches, give or take.

yosefk · 2025-02-17T18:46:32 1739817992

Sampling profilers are indeed very low-overhead, but they can't help debug tail latency, for which tracing profilers are indispensable:

https://yosefk.com/blog/profiling-in-production-with-functio...

https://danluu.com/perf-tracing/
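
To make the tail-latency point concrete, here is a toy sketch in Python. The workload, the function names (`parse`, `handle`, `lock_wait`) and the millisecond figures are all made up for illustration, and `aggregate_share` only models what an idealized sampling profiler converges to (time-weighted shares across all requests); it is not how any real profiler is implemented.

```python
from collections import Counter

def make_requests():
    # Hypothetical workload: each request is a list of (function, milliseconds).
    # 999 fast requests, plus one slow request that stalls in lock_wait.
    requests = [[("parse", 1), ("handle", 1)] for _ in range(999)]
    requests.append([("parse", 1), ("handle", 1), ("lock_wait", 50)])
    return requests

def aggregate_share(requests):
    # Fraction of total time per function, summed over all requests.
    # This is roughly the picture an aggregate sampling profile gives you.
    total = Counter()
    for spans in requests:
        for name, ms in spans:
            total[name] += ms
    grand_total = sum(total.values())
    return {name: ms / grand_total for name, ms in total.items()}

def slowest_request(requests):
    # A per-request trace lets you pick out the single worst request.
    return max(requests, key=lambda spans: sum(ms for _, ms in spans))

if __name__ == "__main__":
    reqs = make_requests()
    share = aggregate_share(reqs)
    print(f"lock_wait share of all time: {share['lock_wait']:.1%}")
    worst = slowest_request(reqs)
    worst_total = sum(ms for _, ms in worst)
    print(f"lock_wait share of slowest request: {dict(worst)['lock_wait'] / worst_total:.1%}")
```

In this made-up workload `lock_wait` is about 2.4% of aggregate time, so it looks negligible in a flat profile, yet it is about 96% of the one request that users actually notice.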

Regarding the slowdown: magic-trace reports 2-10% slowdowns, which IMO is actually fine even for production (unless this adds up to a huge dollar cost, which for most people it won't), since in return you are actually able to debug the rare slowdowns that are the worst part of your user experience.

However, the hardware feature that I propose (https://yosefk.com/blog/profiling-in-production-with-functio...) would likely have lower overhead, since it relies on software issuing tracing instructions, e.g. at each function entry & exit (rather than at every control flow change), and it could be selective in various ways (e.g. exclude short functions without loops, and/or configure the hardware to ignore short calls). BTW, maybe you can do this with Intel Processor Trace too; I'm just not really familiar with it.
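
As a rough software analogue of the idea above (not the proposed hardware, and not how Intel PT works), here is a minimal Python sketch of entry/exit tracing that drops short calls. The `Tracer` class, its `min_duration_ns` threshold and the injectable `clock` are all hypothetical names chosen for illustration.

```python
import time
from functools import wraps

class Tracer:
    """Records (event, function name, timestamp) at function entry and exit."""

    def __init__(self, min_duration_ns=0, clock=time.perf_counter_ns):
        self.min_duration_ns = min_duration_ns  # calls shorter than this are dropped
        self.clock = clock                      # injectable so tests can be deterministic
        self.events = []

    def trace(self, fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            mark = len(self.events)  # position of this call's "enter" event
            start = self.clock()
            self.events.append(("enter", fn.__name__, start))
            try:
                return fn(*args, **kwargs)
            finally:
                end = self.clock()
                if end - start < self.min_duration_ns:
                    # Short call: discard its entry event and anything nested in it.
                    del self.events[mark:]
                else:
                    self.events.append(("exit", fn.__name__, end))
        return wrapper

if __name__ == "__main__":
    tracer = Tracer(min_duration_ns=1_000_000)  # ignore calls under 1 ms

    @tracer.trace
    def quick():
        return 1

    @tracer.trace
    def slow():
        time.sleep(0.01)
        return quick() + 1

    slow()
    for event in tracer.events:
        print(event)
```

Filtering on exit is just the simplest thing to do in software; deciding statically which functions to instrument at all (e.g. skipping short functions without loops) would avoid paying for the entry event in the first place.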

yosefk · 2025-02-17T18:40:24 1739817624

I discuss Intel Processor Trace in the writeup where I propose my much simpler hardware support for tracing: https://yosefk.com/blog/profiling-in-production-with-functio...

Like I said there, I'm frankly shocked that CPUs haven't all raced to implement similar features, that magic-trace, which is built on top of Intel Processor Trace, isn't used more widely, and that developers aren't insisting on running under magic-trace in production and requiring deployment on Intel servers for that purpose.

The extension I propose is much simpler, and seems similar to what PTWRITE would do if it were the only feature in Intel Processor Trace. I have a lot of experience in chip architecture, and I believe that every CPU maker and every chip maker can support this easily, much more so than full feature parity with Intel Processor Trace. I hope they will!

nwlieb · 2025-02-17T18:57:34 1739818654

One concern with PTWRITE is that it is somewhat "slow," at least according to this: https://community.intel.com/t5/Processors/Intel-Processor-Tr...

I wonder if this is a general issue relating to memory ordering or out-of-order execution, or whether this can be implemented more efficiently in a different extension.

Thank you for the linked article! Agreed on the huge potential for using these tools in production. The community could definitely benefit (even indirectly) by pushing for this kind of instruction set more widely.