But with virtual memory it wouldn't be much of a cost at all (..on 64-bit system...

loeg · 2025-02-17T19:16:07 1739819767

Virtual memory still costs, you know, something like 0.2% of virtual memory space in page table entries. 1 GB of VMA per thread is 2MB of real RAM cost per thread. And there's absolutely no need for that kind of space use -- the thread-local variable can just be a pointer to a heap-allocated large object.

immibis · 2025-02-17T21:45:35 1739828735

Only if mapped. If not mapped, there's no need for page table entries.

Dylan16807 · 2025-02-19T08:49:39 1739954979

In addition to the ways that page table entries can be avoided, the system can use large pages for all the areas you aren't using yet, cutting the overhead to 4KB.
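The figures in this subthread can be checked with a few lines of arithmetic. This is a rough sketch that assumes x86-64 with 8-byte page table entries and counts only the last-level entries; the upper levels add roughly another 1/512 on top. The name `leaf_pte_bytes` is made up for illustration.

```python
PTE_BYTES = 8  # assumption: one page table entry is 8 bytes (x86-64)

def leaf_pte_bytes(mapped_bytes, page_bytes):
    """Bytes of last-level page table entries needed to map `mapped_bytes`."""
    pages = -(-mapped_bytes // page_bytes)  # ceiling division
    return pages * PTE_BYTES

GIB = 1 << 30
small = leaf_pte_bytes(GIB, 4 << 10)  # 1 GiB mapped with 4 KiB pages
large = leaf_pte_bytes(GIB, 2 << 20)  # 1 GiB mapped with 2 MiB pages

print(small, small / GIB)  # 2097152 0.001953125 (2 MiB, about 0.2%)
print(large)               # 4096 (4 KiB)
```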

dzaima · 2025-02-17T19:51:53 1739821913

Yeah, a gigabyte is most likely extreme overkill; a megabyte or so would be plenty. The goal, though, would be for these threadlocals to be able to be as arbitrarily large as non-initial-exec threadlocals, so it wouldn't break anything ever.

I don't know how the kernel manages it internally, but there's no need for PROT_NONE preallocated virtual memory to be mapped to actual CPU-accessible pages at least; and `mmap(NULL, 1ULL<<46, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0)` takes ~4 microseconds to map 64 terabytes of virtual memory so it's definitely not 0.002x overhead. (perhaps the overhead amount changes depending on how close to a page level the size is, but it shouldn't be too much regardless)
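For anyone who wants to repeat that measurement, here is a minimal sketch using Python's `mmap` module on Linux. The helper name `time_reserve` is hypothetical, and the interpreter's own call overhead will inflate the number well above what the same call costs from C, so treat the result as an upper bound rather than a benchmark.

```python
import mmap
import time

PROT_NONE = 0  # value of PROT_NONE on Linux, defined locally for clarity

def time_reserve(size):
    """Reserve `size` bytes of inaccessible address space.

    Returns the seconds spent creating the mapping (Python overhead included).
    """
    start = time.perf_counter()
    m = mmap.mmap(-1, size,
                  flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
                  prot=PROT_NONE)
    elapsed = time.perf_counter() - start
    m.close()  # unmap again; nothing was ever touched
    return elapsed
```

Calling `time_reserve(1 << 46)` corresponds to the 64-terabyte reservation above; whether a mapping that large succeeds depends on the machine's address space size and resource limits.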

This'd essentially be turning the preallocated TLS space into a memory allocation arena (and you could actually even just choose to provide an alloc+free interface for programs to dynamically allocate fs-relative offsets to use for custom threadlocals?).

(then there's the general problematicness of virtual memory; such PROT_NONE never-touched memory still counts towards virtual memory usage, which is annoying; browsers/Java/etc already suffer from this, but it'd be rather ugly for literally all processes to have such. I'd quite like a memory usage counter that includes all memory that is or ever was writable, but not PROT_NONE never-touched memory; i.e. how much memory the process can eventually require without explicitly requesting more via syscalls, but afaik such a counter just doesn't exist, or at least isn't a standard-displayed thing)

quotemstr · 2025-02-18T16:16:31 1739895391

> how much memory the process can eventually require without explicitly requesting more via syscalls

This concept is called "commit charge". Windows MM models it explicitly. Linux ought to as well. I agree it's a more useful concept than just address space allocated.

dzaima · 2025-02-18T18:13:48 1739902428

Interesting! Some searching later, it looks like htop's DATA/M_DRS counter (i.e. the second-to-last number in /proc/<PID>/statm) counts something related-ish: it doesn't count a PROT_NONE mmap, but does count an untouched PROT_READ|PROT_WRITE one. Nothing in statm appears to count untouched writable MAP_SHARED, though; then again, (potentially?-)shared mappings do get complicated in general.

fweimer · 2025-02-18T19:24:18 1739906658

Is MADV_FREE memory charged? It contributes to classic RSS, but can be discarded by the kernel if it deems that beneficial.

dzaima · 2025-02-18T19:38:02 1739907482

Some more experimentation later, it seems to be more like just counting PROT_WRITE+MAP_PRIVATE mappings or so; i.e. mprotect(PROT_NONE)ing (or even just PROT_READ) a writable region results in it not being counted, even if the region was modified and thus must actually be persisted. So it can actually get meaningfully lower than RSS. :/

MADV_FREE doesn't affect anything afaict.
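The counter discussed above can be read directly, which makes these experiments easy to repeat. A minimal sketch, assuming Linux and the seven-field layout of `/proc/<PID>/statm` described in proc(5); the function names `parse_statm` and `data_bytes` are made up for illustration.

```python
import os

# Field order of /proc/<PID>/statm per proc(5); all values are in pages.
STATM_FIELDS = ("size", "resident", "shared", "text", "lib", "data", "dt")

def parse_statm(text):
    """Turn the single line of a statm file into a {field: pages} dict."""
    values = [int(v) for v in text.split()]
    return dict(zip(STATM_FIELDS, values))

def data_bytes(pid="self"):
    """Return the 'data' counter (second-to-last statm field) in bytes."""
    with open(f"/proc/{pid}/statm") as f:
        fields = parse_statm(f.read())
    return fields["data"] * os.sysconf("SC_PAGE_SIZE")
```

Calling `data_bytes()` before and after an `mmap` or `mprotect` call shows whether that mapping is included in the counter.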