The first thing you should do is profile the code (py-spy is my preferred option) and see if there are any obvious hotspots. Then I'd actually look at the code and understand its structure. For example, are you making lots of unnecessary copies of data? Are you recomputing something expensive that you could store instead? (functools.cache is one line and can make things much faster at the cost of memory, as in the sketch below.)
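A minimal sketch of that caching idea, assuming you've found a slow, deterministic function in your profile; `expensive_lookup` here is a hypothetical stand-in for whatever your hotspot turns out to be:

```python
from functools import cache


@cache
def expensive_lookup(key: str) -> float:
    # Placeholder for some slow but deterministic computation.
    return sum(ord(c) ** 2 for c in key) ** 0.5


expensive_lookup("same-key")  # computed once
expensive_lookup("same-key")  # served from the cache, essentially free
```

The trade-off is that every distinct result stays in memory for the life of the process, so it only pays off when the same inputs recur.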
Once you've done that, you should be familiar enough with the code to know which bits are worth using multiprocessing on (i.e. the large embarrassingly parallel bits), which, if they are a significant part of your runtime, should scale near-linearly.
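As a hedged sketch of what that usually looks like, assuming the work items are independent and CPU-bound; `process_item` and `items` are placeholders for whatever your parallel section actually does:

```python
from multiprocessing import Pool


def process_item(x: int) -> int:
    # Independent, CPU-bound work with no shared state between items.
    return x * x


if __name__ == "__main__":
    items = range(1_000)
    with Pool() as pool:  # defaults to one worker process per CPU core
        results = pool.map(process_item, items)
    print(sum(results))
```

If the per-item work is too small, the pickling and process overhead can eat the gains, so chunk the work or only parallelise the genuinely expensive loop.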
The other thing to check is which libraries you are using (and what your dependencies are using). numpy now ships with OpenBLAS (though MKL may be faster for your use case), and sometimes you can get large speedups by choosing a different library, or by making sure the optimised, compiled code paths are actually built and being used.
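A quick way to confirm which BLAS backend your numpy build is actually linked against (OpenBLAS vs MKL vs a slow reference BLAS), assuming numpy is installed:

```python
import numpy as np

# Prints the BLAS/LAPACK libraries numpy was built against.
np.show_config()
```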