If I remember correctly, local-dynamic pays off when you access multiple thread-local variables from the same module: their offsets relative to each other are constant, so a single lookup of the module's TLS base can serve all of them. Hidden visibility should allow the compiler to switch from global-dynamic to local-dynamic automatically. It's been a while since I looked at all of this.
Also note that these sequences are highly architecture-dependent; both the absolute costs and the cost differences will vary, e.g. on ARM or ARM64.
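To make the visibility point concrete, here is a rough sketch of the kind of code where it could matter. The names `hits`, `misses` and `record` are made up for illustration, and it assumes GCC or Clang on an ELF target, built as position-independent code for a shared library. Whether the compiler actually picks local-dynamic here depends on the compiler, version and platform.

```cpp
// Sketch only: hypothetical names, GCC/Clang attribute syntax, ELF target assumed.
#include <cstdint>

// Hidden visibility says these symbols are not exported from the shared
// object, so the compiler may treat them as module-local. That is the
// precondition for using local-dynamic instead of global-dynamic under -fPIC.
__attribute__((visibility("hidden"))) thread_local std::uint64_t hits = 0;
__attribute__((visibility("hidden"))) thread_local std::uint64_t misses = 0;

// Both variables are touched in one function. Under local-dynamic, one
// module-base lookup can be reused, with each variable at a fixed offset.
std::uint64_t record(bool hit) {
    if (hit) {
        ++hits;
    } else {
        ++misses;
    }
    return hits + misses;
}
```

One way to check what you actually get is to compile with something like `-O2 -fPIC -S` and count the `__tls_get_addr` calls in `record`. GCC and Clang also accept `-ftls-model=local-dynamic` and the `tls_model` attribute if you want to request the model explicitly rather than rely on visibility.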
I updated the post with the point on visibility. In my tests, inspired by https://lobste.rs/s/b5dnjh/0_0_0_c_thread_local_storage_perf... , clang improves codegen thanks to hidden visibility but g++ does not, and that comment says clang doesn't do it on all platforms.
I stand by the broad conclusion that thread_locals without constructors, linked into an executable rather than a shared library, are the fastest and most performance-portable by far, but the visibility point is well worth mentioning.
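For readers who want to see the "without constructors" distinction, a minimal sketch follows. `plain_counter`, `Tracker` and the two `bump_*` functions are hypothetical names, not code from the post. The assumption is the usual GCC/Clang behavior: a constant-initialized thread_local can be read directly, while one with a non-trivial constructor typically goes through a lazy-initialization check or wrapper on access.

```cpp
// Sketch only: hypothetical names, illustrating the two kinds of thread_local.
#include <string>

// Constant-initialized and trivially destructible: no per-thread
// construction is needed, so an access can be a plain TLS load/store.
thread_local int plain_counter = 0;

// Non-trivial constructor and destructor (because of std::string): each
// thread must construct its own copy before first use, which usually adds
// an initialization check on the access path.
struct Tracker {
    Tracker() : label("tracker"), count(0) {}
    std::string label;
    int count;
};
thread_local Tracker tracker;

int bump_plain() { return ++plain_counter; }
int bump_tracker() { return ++tracker.count; }
```

Comparing the generated assembly for `bump_plain` and `bump_tracker`, once in an executable and once with `-fPIC -shared`, should show both effects discussed here, though the exact sequences will differ by compiler and architecture.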