But with virtual memory it wouldn't be much of a cost at all (on 64-bit systems, that is; things are sadder on 32-bit, if one cares about those). It's just some kernel-internal bookkeeping to ensure that future memory page allocations don't overlap this one.
dlopen is a requirement for importing native libraries in non-compiled languages; and, regardless, I as a library author don't get to choose whether users will avoid using dlopen and so have to assume worst-case.
Virtual memory still costs something like 0.2% of the mapped address space in page table entries (8 bytes per 4 KB page). 1 GB of virtual address space per thread is 2MB of real RAM cost per thread. And there's absolutely no need for that kind of space use -- the thread-local variable can just be a pointer to a heap-allocated large object.
In addition to the ways that page table entries can be avoided, the system can use large pages for all the areas you aren't using yet, cutting the overhead to 4KB.
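For reference, the 2 MB and 4 KB figures above can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes x86-64 numbers not spelled out in the thread (8-byte entries, 4 KiB base pages, 2 MiB large pages) and counts only last-level entries for a fully populated region; `pte_bytes` and `print_overhead` are made-up names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed x86-64 figure (not stated in the thread): 8 bytes per entry. */
enum { PTE_BYTES = 8 };

/* Bytes of last-level page-table entries needed to map `span` bytes
   with pages of `page_size` bytes, if every page were populated. */
uint64_t pte_bytes(uint64_t span, uint64_t page_size)
{
    assert(page_size != 0);
    return (span / page_size) * PTE_BYTES;
}

/* Prints the entry cost of a 1 GiB region for both page sizes. */
void print_overhead(void)
{
    uint64_t gib = 1ULL << 30;
    printf("4 KiB pages: %llu bytes\n",
           (unsigned long long)pte_bytes(gib, 4096));
    printf("2 MiB pages: %llu bytes\n",
           (unsigned long long)pte_bytes(gib, 2ULL << 20));
}
```

The ratio 8 / 4096 is where the roughly 0.2% comes from. Whether those entries are ever actually allocated for untouched memory is a separate question.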
Yeah, a gigabyte is most likely extremely overkill indeed, a megabyte or so would be plenty; though the goal would be to get threadlocals to be able to be as arbitrarily large as non-initial-exec threadlocals so it wouldn't break anything ever.
I don't know how the kernel manages it internally, but there's no need for PROT_NONE preallocated virtual memory to be mapped to actual CPU-accessible pages at least; and `mmap(NULL, 1ULL<<46, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0)` takes ~4 microseconds to map 64 terabytes of virtual memory so it's definitely not 0.002x overhead. (perhaps the overhead amount changes depending on how close to a page level the size is, but it shouldn't be too much regardless)
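A rough way to repeat that measurement, as a sketch rather than a proper benchmark: time a single PROT_NONE `mmap` with `clock_gettime`. The helper name `time_reserve_ns` is made up here, and one call gives only a ballpark number.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <time.h>

/* Reserve `len` bytes of PROT_NONE address space, time only the mmap
   call, then unmap again. Returns elapsed nanoseconds, or -1 if the
   mapping could not be created. */
long long time_reserve_ns(size_t len)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    void *p = mmap(NULL, len, PROT_NONE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    if (p == MAP_FAILED)
        return -1;

    int rc = munmap(p, len);
    assert(rc == 0);
    (void)rc; /* keeps the variable used when NDEBUG is defined */

    return (t1.tv_sec - t0.tv_sec) * 1000000000LL
         + (t1.tv_nsec - t0.tv_nsec);
}
```

Calling `time_reserve_ns((size_t)1 << 46)` corresponds to the 64-terabyte case; it may return -1 under an address-space limit or on systems with a smaller user address space.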
This'd essentially be turning the preallocated TLS space into a memory allocation arena (and you could actually even just choose to provide an alloc+free interface for programs to dynamically allocate fs-relative offsets to use for custom threadlocals?).
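To make the alloc+free idea concrete, here is a purely hypothetical sketch of an offset allocator: it hands out offsets into a fixed-size per-thread reservation, and each thread would add the offset to its own base. Nothing here is an existing libc or kernel interface; the names `tls_off_alloc` and `tls_off_free`, the 1 MiB arena size, and the free-list limit are all assumptions, and the code is not thread-safe.

```c
#include <assert.h>
#include <stddef.h>

#define ARENA_SIZE ((size_t)1 << 20) /* assumed: 1 MiB reserved per thread */
#define MAX_FREE   64                /* assumed cap on remembered free spans */

struct span { size_t off, len; };

static struct span free_list[MAX_FREE];
static size_t n_free;
static size_t bump; /* first offset that has never been handed out */

/* Returns an offset into the reserved area, or (size_t)-1 when it is
   full. `align` must be a power of two. */
size_t tls_off_alloc(size_t len, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);

    /* First fit: reuse a freed span that is big enough and aligned.
       The span is reused whole; no splitting in this sketch. */
    for (size_t i = 0; i < n_free; i++) {
        if (free_list[i].len >= len && (free_list[i].off & (align - 1)) == 0) {
            size_t off = free_list[i].off;
            free_list[i] = free_list[--n_free];
            return off;
        }
    }

    /* Otherwise bump-allocate from the never-used tail. */
    size_t off = (bump + align - 1) & ~(align - 1);
    if (off > ARENA_SIZE || len > ARENA_SIZE - off)
        return (size_t)-1;
    bump = off + len;
    return off;
}

/* Remembers a span for reuse. If the list is full the span is simply
   leaked; a real allocator would coalesce neighbours instead. */
void tls_off_free(size_t off, size_t len)
{
    if (n_free < MAX_FREE)
        free_list[n_free++] = (struct span){ off, len };
}
```

Since the offsets are process-wide while the backing memory is per-thread, a real version would also need locking and a way to run initializers in already-running threads.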
(then there's the general problematicness of virtual memory; such PROT_NONE never-touched memory still counts towards virtual memory usage, which is annoying; browsers/Java/etc already suffer from this, but it'd be rather ugly for literally all processes to have it. I'd quite like a memory usage counter that includes all memory that is or ever was writable, but not PROT_NONE never-touched memory; i.e. how much memory the process can eventually require without explicitly requesting more via syscalls, but afaik such a counter just doesn't exist, or at least isn't a standard-displayed thing)
> how much memory the process can eventually require without explicitly requesting more via syscalls
This concept is called "commit charge". Windows MM models it explicitly. Linux ought to as well. I agree it's a more useful concept than just address space allocated.
Interesting! Some searching later, looks like htop's DATA/M_DRS counter (i.e. second-to-last number in /proc/<PID>/statm) appears to count something related-ish; i.e. doesn't count a PROT_NONE mmap, but does count a PROT_READ|PROT_WRITE untouched one; nothing in statm appears to count untouched writable MAP_SHARED, though (potentially?-)shared mappings do get complicated in general.
Some more experimentation later, it seems to be more like just counting PROT_WRITE+MAP_PRIVATE mappings or so; i.e. mprotect(PROT_NONE)ing (or even just PROT_READ) a writable region results in it not being counted, even if the region was modified and thus must actually be persisted. So it can actually get meaningfully lower than RSS. :/
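The experiment is easy to repeat. A minimal Linux-only sketch, with made-up helper names: read the sixth field of `/proc/self/statm` before and after creating an untouched anonymous private mapping and compare. It parses with `read` and `strtol` so that the measurement itself shouldn't need to grow the heap.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns the sixth field ("data") of /proc/self/statm, in pages,
   or -1 if the file cannot be read or parsed. */
long statm_data_pages(void)
{
    char buf[256];
    int fd = open("/proc/self/statm", O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t n = read(fd, buf, sizeof buf - 1);
    close(fd);
    if (n <= 0)
        return -1;
    buf[n] = '\0';

    char *s = buf;
    long v = -1;
    for (int i = 0; i < 6; i++) { /* skip to and parse field 6 */
        char *end;
        v = strtol(s, &end, 10);
        if (end == s)
            return -1;
        s = end;
    }
    return v;
}

/* Change in the data counter (in pages) caused by mapping `len`
   untouched anonymous private bytes with protection `prot`. */
long data_delta_for(int prot, size_t len)
{
    long before = statm_data_pages();
    void *p = mmap(NULL, len, prot, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    assert(p != MAP_FAILED);
    long after = statm_data_pages();
    munmap(p, len);
    return after - before;
}
```

Comparing `data_delta_for(PROT_NONE, len)` with `data_delta_for(PROT_READ | PROT_WRITE, len)` shows the difference described above; inserting an `mprotect` call before the second read would be the way to check the PROT_READ case.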