
Multiprocessing is great. But then every process keeps its own copy of hundreds of gigabytes of stuff. May be okay, depending on how many processes you spawn.

If the bulk of the data is immutable (or at least never mutated), it can be safely shared though, via shared memory.



> every process keeps its own copy of hundreds of gigabytes of stuff. May be okay, depending on how many processes you spawn

That depends on how you're using multiprocessing. If you're using the "spawn" start method (which was made the default on MacOS a few years ago[1], unfortunately), then every worker starts a fresh Python interpreter, re-imports your program, and does indeed keep its own copy of anything not explicitly shared.

However, the "fork" and "forkserver" start methods make everything created in Python before your multiprocessing.Pool/Process/concurrent.futures.ProcessPoolExecutor accessible for "free" in the child processes (really: via fork(2)'s copy-on-write semantics), without any added memory overhead. "fork" is the default start method on everything other than MacOS/Windows[2].

I find that those differing defaults are responsible for a lot of FUD around memory management regarding multiprocessing (some of which can be found in these comments!); folks who are watching memory while using multiprocessing on MacOS or Windows observe massively different memory consumption behavior than folks on Linux/BSD (which includes folks validating in Docker on MacOS/Windows). There's an additional source of FUD among folks who used Python on MacOS before the default was changed from "fork" to "spawn" and who assume the prior behavior still exists when it does not.

This sometimes results in the humorously counterintuitive situation of someone testing some Python code in Docker on MacOS/Windows observing far better performance inside Docker (and its accompanying virtual machine) than they observe when running that same code natively directly on the host operating system.

If you're on MacOS (not Windows) and wish to use the "fork" or "forkserver" behaviors of multiprocessing for memory sharing, run "export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES" in your shell before starting Python (setting it from inside Python via os.environ or os.putenv() will not work, since the Objective-C runtime reads it at process startup), and then call "multiprocessing.set_start_method("fork", force=True)" in your entry point. Per the linked GitHub issue below, this can occasionally cause issues, but in my experience it rarely if ever does.

1. https://github.com/python/cpython/issues/77906

2. https://docs.python.org/3/library/multiprocessing.html#conte...


Is what you're describing only true of the "Framework" Python build on MacOS? It sounds like that's the case from a quick read of the issue you linked. I would say that people should basically never use the "Framework" Python on MacOS. (There's some insanity IIRC where matplotlib wants you to use the Framework build? But that's matplotlib)


> Is what you're describing only true of the "Framework" Python build on MacOS?

No. This behavior is present on any Python 3.8 or greater running on MacOS, enforced via "platform == darwin" runtime check: https://github.com/python/cpython/pull/13626/files#diff-6836...

You can check the default process-start method of your Python's multiprocessing by running this command: "python -c 'import multiprocessing; print(multiprocessing.get_start_method())'"




