Limit numpy thread usage for Transformation classes #2950
Conversation
This is an interesting addition. I might be commenting prematurely, but here are initial comments:
- I assume that under normal circumstances, multithreading is beneficial, i.e., whenever someone just runs MDA on a multicore machine without thinking about parallelization. Under these conditions we would not want to limit the threads, would we? Is there a sensible way by which we can make the thread limiting optional?
- tests
- docs
- changes
Did you check the performance impact of limiting threads?
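One way to address the "make thread limiting optional" point is to only touch the thread pools when a limit is explicitly requested. A minimal sketch, assuming `threadpoolctl` provides `threadpool_limits` (the helper name `thread_limit` is hypothetical, not PR code): with `max_threads=None` nothing changes, so a user on a multicore machine keeps numpy's default multithreading.

```python
from contextlib import nullcontext

try:
    from threadpoolctl import threadpool_limits
except ImportError:  # keep the sketch runnable without threadpoolctl
    threadpool_limits = None


def thread_limit(max_threads=None):
    """Return a context manager limiting BLAS/OpenMP threads only when asked."""
    if max_threads is None or threadpool_limits is None:
        return nullcontext()  # no-op: leave the thread pools alone
    return threadpool_limits(limits=max_threads)


# Usage: any BLAS-backed numpy call inside the block runs with at most 1 thread.
with thread_limit(1):
    pass  # e.g. np.dot(positions, rotmat)
```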
echoing @orbeckst's comment.
I realise this is linked to MDAnalysis/pmda#144, but it might be good to have an issue on MDAnalysis itself. In part because it would be good to go through all the use cases and see where benchmarks vary between users (I wouldn't be surprised if you end up getting a lot of variance on whether or not enabling multithreading speeds/slows things down for a given use case).
.travis.yml (outdated)

```diff
@@ -30,7 +30,7 @@ env:
   - SETUP_CMD="${PYTEST_FLAGS}"
   - BUILD_CMD="pip install -e package/ && (cd testsuite/ && python setup.py build)"
   - CONDA_MIN_DEPENDENCIES="mmtf-python biopython networkx cython matplotlib scipy griddataformats hypothesis gsd codecov"
-  - CONDA_DEPENDENCIES="${CONDA_MIN_DEPENDENCIES} seaborn>=0.7.0 clustalw=2.1 netcdf4 scikit-learn joblib>=0.12 chemfiles tqdm>=4.43.0 tidynamics>=1.0.0 rdkit>=2020.03.1 h5py"
+  - CONDA_DEPENDENCIES="${CONDA_MIN_DEPENDENCIES} seaborn>=0.7.0 clustalw=2.1 netcdf4 scikit-learn joblib>=0.12 chemfiles tqdm>=4.43.0 tidynamics>=1.0.0 rdkit>=2020.03.1 h5py threadpoolctl"
```
threadpoolctl isn't a core dependency, so as it is, it will fail when running on minimal dependencies.
I am afraid that with the decorator (the current implementation), we cannot change the thread limit at runtime (e.g. passing a …).

As for the impact on performance, I can reproduce the results (MDAnalysis/pmda#144) with another 6-core CPU. Sorry for the rough data, but you can see it is mainly the hyperthreading that limits the performance. I will create an MDA issue when I have time.
The decorator is pretty, but if it does not provide enough flexibility, just use the context manager:

```python
class RotateTransformation:
    def __init__(self, rotmat, max_threads=None):
        self.rotmat = rotmat
        self.n_thread = max_threads  # possibly needs some logic here to "do the best thing"

    def __call__(self, ts):
        with threadpool_limits(self.n_thread):
            ts.positions[:] = np.dot(ts.positions, self.rotmat)
        return ts
```

That still looks reasonable to me. Or is there a deeper reason for preferring the function decorator? The bigger question is what the default for …
Sorry for the radio silence... was pretty busy last week. I think we can port this function into PMDA if we decide serial code just goes with its default setting. And in PMDA, say if the user is requesting all the cores, then we limit the …
It looks like a worthwhile tuning knob to include in the MDAnalysis transformations. Then PMDA (and anyone else) can just use the optional …

https://github.com/joblib/threadpoolctl/blob/32037cf43a61909282b5d07e6b21d9621fa03e25/threadpoolctl.py#L154 says that …

Including …
Do you mean that we then don't have threadpoolctl in MDA? Is this because the user would have to limit the threads when they create the transformations, i.e., they need to know that they will use PMDA?
I can see that it's easier to just apply the limits to everything in PMDA. I don't think that it will be easy to figure out if there's any oversubscription of the CPU, at least not when someone is using distributed. For multiprocessing you could try to find out how many cores are available but this starts getting complicated, I feel.
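On the "how many cores are available" point above: a short sketch of why this gets complicated. `os.cpu_count()` reports every hardware thread, while `os.sched_getaffinity` (Linux-only) reflects the CPUs the process is actually allowed to use, e.g. under `taskset` or a batch scheduler; neither tells you whether sibling workers already occupy those cores.

```python
import os

# All hardware threads visible to the OS (includes hyperthreads).
total = os.cpu_count() or 1

try:
    # CPUs this process may actually run on (respects affinity masks).
    usable = len(os.sched_getaffinity(0))
except AttributeError:  # macOS/Windows have no sched_getaffinity
    usable = total
```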
Can you explain more? You mean so that you can decorate …
@yuxuanzhuang in your performance comparison #2950 (comment), what is the default when you don't limit any threads? That number is missing from the comparison. If the default runs 12 threads (or enough to always reduce performance), then I better understand your point that you want to limit to 1 thread by default. Did you check with …
It turns out there's an issue with this decoration approach that doesn't seem to be easily solvable: the thread limit is applied globally, instead of only to the decorated function. Here is an example: https://gist.github.com/yuxuanzhuang/05b0da16a51f567e54f7f3f22591e316
The default is 12 threads (or whatever that computer has, including hyperthreads). I think for …
@yuxuanzhuang apologies, I think I'm missing part of the conversation here 😅 From your test, it seems that the context manager method might work here right? Would that be suitable or is there a barrier to using it?
You are right, the context manager method should work. I was thinking that with that approach you have to add the code snippet everywhere it is needed.
I'll admit I'm not super versed in the transformation code, but it seems like pretty much everything is a class with both an …
The transformations are any callable that takes and returns a timestep. In practice, callables are not always picklable, hence the use of classes. Since all the transformations we will ship will be such classes, they may as well pack some features and wrap the transformation logic with the core-count one.
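The picklability point above can be shown in a few lines. This is a generic illustration (the `Identity` class is a made-up example, not an MDAnalysis transformation): a bare lambda cannot be pickled, which matters once a transformation must travel to a worker process, while an instance of a module-level class with `__call__` round-trips fine.

```python
import pickle

# A closure/lambda transformation: cannot be serialized by pickle.
identity_lambda = lambda ts: ts


class Identity:
    """Module-level callable class: picklable, unlike the lambda."""

    def __call__(self, ts):
        return ts


try:
    pickle.dumps(identity_lambda)
    lambda_picklable = True
except Exception:  # pickle raises PicklingError for lambdas
    lambda_picklable = False

# The class instance survives a pickle round trip.
restored = pickle.loads(pickle.dumps(Identity()))
```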
EDIT: The post below was written under the assumption that …

Given that OpenMP (and therefore the BLAS implementations in numpy) default to the maximum number of threads and therefore hurt MDAnalysis performance, I would say we …
By the way, …
It turns out the speed-limiting step is not …

EDIT: …
So if I read your gist correctly, then the profiling shows that …
However, the big problem is the center-of-mass calculation in …

Is this run on 12 real cores, or are you using hyperthreading?
EDIT: The table shows the "Per hit" time in µs. Also note the huge amount of time it takes to create new arrays or just do an element-wise operation. Does the latter include array copies, i.e., …

EDIT 2: updated the table with n=6 values from https://gist.github.com/yuxuanzhuang/17ed6def63b08248db59f2c44e3e0419 and note that it is a 6-core CPU.
Judging from @yuxuanzhuang's benchmarks, oversubscribing cores for OpenMP hurts performance in unexpected ways, even for code where there's no obvious parallelization going on (…).
It is also odd that oversubscription appears to affect the performance of the actual read step in the XDRReader (see https://gist.github.com/yuxuanzhuang/4918cd1b5d8d62de79eab9df40de4bb7#gistcomment-3488210), but only when transformations are added.
Related gist: …

The tests were done with …
This fix looks good to me given the problem.
Thanks! Let me know if I have missed anything.
@lilyminium do we use the maintainer/conda/environment.yml for anything / should we update it too?
A couple of small typos, a test, and a question re: documenting thread limitations.
```
To define a new Transformation, :class:`TransformationBase`
has to be subclassed.
``max_threads`` will be set to ``None`` in default,
```
Suggested change:

```diff
-``max_threads`` will be set to ``None`` in default,
+``max_threads`` will be set to ``None`` by default,
```
```
the environment variable :envvar:`OMP_NUM_THREADS`
(see the `OpenMP specification for OMP_NUM_THREADS <https://www.openmp.org/spec-html/5.0/openmpse50.html>`_)
are used.
``parallelizable`` will be set to ``True`` in default.
```
Suggested change:

```diff
-``parallelizable`` will be set to ``True`` in default.
+``parallelizable`` will be set to ``True`` by default.
```
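The documented fallback to `OMP_NUM_THREADS` can be sketched as follows. The helper name `resolve_max_threads` is hypothetical, not the PR's actual code: an explicit argument wins, otherwise the environment variable is consulted, otherwise the thread pools are left untouched (`None`).

```python
import os


def resolve_max_threads(max_threads=None):
    """Explicit argument wins; otherwise consult OMP_NUM_THREADS; else None."""
    if max_threads is not None:
        return max_threads
    env = os.environ.get("OMP_NUM_THREADS")
    return int(env) if env else None
```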
You addressed all my comments but @IAlibay raised important points so please address these.
There's one RDKit test failing: https://github.com/MDAnalysis/mdanalysis/pull/2950/checks?check_run_id=2297717464
I assume that this one has nothing to do with this PR?

I am listed as the Assignee. However, @IAlibay, when you're happy with the PR I will not object to you merging. Or ping me when a final look-over is needed and I'll happily do the merge.
I'll restart CI; we really need to make @cbouy's PRs our next priority for 2.0.
lgtm, I'll let you have a final look/merge @orbeckst
In the interest of progressing towards 2.0, I'll go ahead with the squash-merge.
Thanks.
> On 4/10/21 at 03:33, Irfan Alibay wrote:
> In the interest of progressing towards 2.0, I'll go ahead with the squash-merge.
Thanks for the review!
Fixes #2996 and MDAnalysis/pmda#144

Changes made in this Pull Request:

- `TransformationBase` class for handling thread limiting.
- `TransformationBase` has a `parallelizable` attribute to check if it can be used in the parallel analysis (split-apply-combine approach).

PR Checklist