In a simple benchmark, creating a thread and waiting on it in a loop (pthread_create + pthread_join) takes about 0.03 ms (30,000 ns) per iteration, so running a bunch of constructors from a bunch of libraries could actually take a non-negligible amount of time.
I do agree that thread-local read speed should matter a lot more, but it'd really suck if which thread usage patterns count as "acceptable performance" varied dramatically depending on what libraries you (or something else) happened to load for unrelated reasons. (Though maybe similar overheads would show up from other sources too, so this wouldn't be the most important slowdown, idk.)
If you make use of dynamic linking namespaces and some sort of runtime, you can make it impossible to enter specific libraries in certain threads and so avoid that cost. If your application is large enough that you can't keep track of its deps, you may want something like that anyway, assuming you're not willing to split it into multiple processes.