Limit numpy thread usage for Transformation classes #2950
Conversation
This is an interesting addition. I might be commenting prematurely, but here are initial comments:
- I assume that under normal circumstances, multithreading is beneficial, i.e., whenever someone just runs MDA on a multicore machine without thinking about parallelization. Under these conditions we would not want to limit the threads, would we? Is there a sensible way by which we can make the thread limiting optional?
- tests
- docs
- changes
Did you check the performance impact of limiting threads?
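One way to address the "make thread limiting optional" point is to only touch the thread pools when a limit is explicitly requested. A minimal sketch, assuming `threadpoolctl` provides `threadpool_limits` (the helper name `thread_limit` is hypothetical, not PR code): with `max_threads=None` nothing changes, so a user on a multicore machine keeps numpy's default multithreading.

```python
from contextlib import nullcontext

try:
    from threadpoolctl import threadpool_limits
except ImportError:  # keep the sketch runnable without threadpoolctl
    threadpool_limits = None


def thread_limit(max_threads=None):
    """Return a context manager limiting BLAS/OpenMP threads only when asked."""
    if max_threads is None or threadpool_limits is None:
        return nullcontext()  # no-op: leave the thread pools alone
    return threadpool_limits(limits=max_threads)


# Usage: any BLAS-backed numpy call inside the block runs with at most 1 thread.
with thread_limit(1):
    pass  # e.g. np.dot(positions, rotmat)
```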
echoing @orbeckst's comment.
I realise this is linked to MDAnalysis/pmda#144, but it might be good to have an issue on MDAnalysis itself. In part because it would be good to go through all the use cases and see where benchmarks vary between users (I wouldn't be surprised if you end up getting a lot of variance on whether or not enabling multithreading speeds/slows things down for a given use case).
.travis.yml (outdated)

```diff
@@ -30,7 +30,7 @@ env:
   - SETUP_CMD="${PYTEST_FLAGS}"
   - BUILD_CMD="pip install -e package/ && (cd testsuite/ && python setup.py build)"
   - CONDA_MIN_DEPENDENCIES="mmtf-python biopython networkx cython matplotlib scipy griddataformats hypothesis gsd codecov"
-  - CONDA_DEPENDENCIES="${CONDA_MIN_DEPENDENCIES} seaborn>=0.7.0 clustalw=2.1 netcdf4 scikit-learn joblib>=0.12 chemfiles tqdm>=4.43.0 tidynamics>=1.0.0 rdkit>=2020.03.1 h5py"
+  - CONDA_DEPENDENCIES="${CONDA_MIN_DEPENDENCIES} seaborn>=0.7.0 clustalw=2.1 netcdf4 scikit-learn joblib>=0.12 chemfiles tqdm>=4.43.0 tidynamics>=1.0.0 rdkit>=2020.03.1 h5py threadpoolctl"
```
threadpoolctl isn't a core dependency, so as it is, it will fail when running on minimal dependencies.
I am afraid that with the decorator (the current implementation), we cannot change the thread limit at runtime (e.g. passing a …).

As for the impact on performance, I can reproduce the results (MDAnalysis/pmda#144) with another 6-core CPU. Sorry for the rough data, but you can see it is mainly the hyperthreading that limits the performance. I will create an MDA issue when I have time.
The decorator is pretty, but if it does not provide enough flexibility, just use the context manager:

```python
class RotateTransformation:
    def __init__(self, rotmat, max_threads=None):
        self.rotmat = rotmat
        self.n_thread = max_threads  # possibly needs some logic here to "do the best thing"

    def __call__(self, ts):
        with threadpool_limits(self.n_thread):
            ts.positions[:] = np.dot(ts.positions, self.rotmat)
        return ts
```

That still looks reasonable to me. Or is there a deeper reason for preferring the function decorator? The bigger question is what the default for …
Sorry for the radio silence... was pretty busy last week. I think we can port this function into PMDA if we decide serial code just goes with its default setting. And in PMDA, say if the user is requesting all the cores, then we limit the …
It looks like a worthwhile tuning knob to include in the MDAnalysis transformations. Then PMDA (and anyone else) can just use the optional …

https://github.com/joblib/threadpoolctl/blob/32037cf43a61909282b5d07e6b21d9621fa03e25/threadpoolctl.py#L154 says that …

Including …
Do you mean that we then don't have threadpoolctl in MDA? Is this because the user would have to limit the threads when they create the transformations, i.e., they need to know that they will use PMDA?
I can see that it's easier to just apply the limits to everything in PMDA. I don't think that it will be easy to figure out if there's any oversubscription of the CPU, at least not when someone is using distributed. For multiprocessing you could try to find out how many cores are available but this starts getting complicated, I feel.
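On the "how many cores are available" point above: a short sketch of why this gets complicated. `os.cpu_count()` reports every hardware thread, while `os.sched_getaffinity` (Linux-only) reflects the CPUs the process is actually allowed to use, e.g. under `taskset` or a batch scheduler; neither tells you whether sibling workers already occupy those cores.

```python
import os

# All hardware threads visible to the OS (includes hyperthreads).
total = os.cpu_count() or 1

try:
    # CPUs this process may actually run on (respects affinity masks).
    usable = len(os.sched_getaffinity(0))
except AttributeError:  # macOS/Windows have no sched_getaffinity
    usable = total
```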
Can you explain more? You mean so that you can decorate …
@yuxuanzhuang in your performance comparison #2950 (comment), what is the default when you don't limit any threads? That number is missing from the comparison. If the default runs 12 threads (or enough to always reduce performance), then I better understand your point that you want to limit to 1 thread by default. Did you check with …
It turns out there's an issue with this decoration approach that doesn't seem to be easily solvable: the thread limit is applied globally, instead of only to the decorated function. Here is an example: https://gist.github.com/yuxuanzhuang/05b0da16a51f567e54f7f3f22591e316
The default is 12 threads (or whatever that computer has, including hyperthreads). I think for …
@yuxuanzhuang apologies, I think I'm missing part of the conversation here 😅 From your test, it seems that the context manager method might work here right? Would that be suitable or is there a barrier to using it?
You are right, the context manager method should work. I was thinking that with that approach you have to add the code snippet everywhere it is needed.
I'll admit I'm not super versed in the transformation code, but it seems like pretty much everything is a class with both an …
The transformations are any callable that takes and returns a timestep. In practice, callables are not always picklable, hence the use of classes. Since all the transformations we will ship will be such classes, they may as well pack some features and wrap the transformation logic with the core-count one.
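The picklability point above can be shown in a few lines. This is a generic illustration (the `Identity` class is a made-up example, not an MDAnalysis transformation): a bare lambda cannot be pickled, which matters once a transformation must travel to a worker process, while an instance of a module-level class with `__call__` round-trips fine.

```python
import pickle

# A closure/lambda transformation: cannot be serialized by pickle.
identity_lambda = lambda ts: ts


class Identity:
    """Module-level callable class: picklable, unlike the lambda."""

    def __call__(self, ts):
        return ts


try:
    pickle.dumps(identity_lambda)
    lambda_picklable = True
except Exception:  # pickle raises PicklingError for lambdas
    lambda_picklable = False

# The class instance survives a pickle round trip.
restored = pickle.loads(pickle.dumps(Identity()))
```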
EDIT: The post below was written under the assumption that …

Given that OpenMP (and therefore the BLAS implementations in numpy) default to the maximum number of threads and therefore hurt MDAnalysis performance, I would say we …
By the way, …
It turns out the speed-limiting step is not …

EDIT: …
So if I read your gist correctly, then the profiling shows that …
However, the big problem is the center-of-mass calculation in …

Is this run on 12 real cores, or are you using hyperthreading?
EDIT: The table shows the "Per hit" time in µs. Also note the huge amount of time it takes to create new arrays or just do an element-wise operation. Does the latter include array copies, i.e., …

EDIT 2: updated the table with n=6 values from https://gist.github.com/yuxuanzhuang/17ed6def63b08248db59f2c44e3e0419 and note that it is a 6-core CPU.
Judging from @yuxuanzhuang's benchmarks, oversubscribing cores for OpenMP hurts performance in unexpected ways, even for code where there's no obvious parallelization going on (…).
It is also odd that oversubscription appears to affect the performance of the actual read step in the XDRReader (see https://gist.github.com/yuxuanzhuang/4918cd1b5d8d62de79eab9df40de4bb7#gistcomment-3488210), but only when transformations are added.
Related gist: …

The tests were done with …
This fix looks good to me given the problem.
Thanks! Let me know if I have missed anything.
@lilyminium do we use the maintainer/conda/environment.yml for anything / should we update it too?
A couple of small typos, a test, and a question re: documenting thread limitations.
```
To define a new Transformation, :class:`TransformationBase`
has to be subclassed.
``max_threads`` will be set to ``None`` in default,
```
Suggested change:

```diff
-``max_threads`` will be set to ``None`` in default,
+``max_threads`` will be set to ``None`` by default,
```
```
the environment variable :envvar:`OMP_NUM_THREADS`
(see the `OpenMP specification for OMP_NUM_THREADS <https://www.openmp.org/spec-html/5.0/openmpse50.html>`_)
are used.
``parallelizable`` will be set to ``True`` in default.
```
Suggested change:

```diff
-``parallelizable`` will be set to ``True`` in default.
+``parallelizable`` will be set to ``True`` by default.
```
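The documented fallback to `OMP_NUM_THREADS` can be sketched as follows. The helper name `resolve_max_threads` is hypothetical, not the PR's actual code: an explicit argument wins, otherwise the environment variable is consulted, otherwise the thread pools are left untouched (`None`).

```python
import os


def resolve_max_threads(max_threads=None):
    """Explicit argument wins; otherwise consult OMP_NUM_THREADS; else None."""
    if max_threads is not None:
        return max_threads
    env = os.environ.get("OMP_NUM_THREADS")
    return int(env) if env else None
```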
You addressed all my comments but @IAlibay raised important points so please address these.
There's one RDKit test failing: https://github.com/MDAnalysis/mdanalysis/pull/2950/checks?check_run_id=2297717464
I assume that this one has nothing to do with this PR?

I am listed as the Assignee. However, @IAlibay, when you're happy with the PR I will not object to you merging. Or ping me when a final look-over is needed and I'll happily do the merge.
I'll restart CI; we really need to make @cbouy's PRs our next priority for 2.0.
lgtm, I'll let you have a final look/merge @orbeckst
In the interest of progressing towards 2.0, I'll go ahead with the squash-merge.
Thanks.
> On 4/10/21 at 03:33, Irfan Alibay wrote:
> In the interest of progressing towards 2.0, I'll go ahead with the squash-merge.
Thanks for the review!
Fixes #2996 and MDAnalysis/pmda#144

Changes made in this Pull Request:

- `TransformationBase` class for handling thread limiting.
- `TransformationBase` has a `parallelizable` attribute to check if it can be used in the parallel analysis (split-apply-combine approach).

PR Checklist