Refactor encore `conformational_distance_matrix` #1145

kain88-de · 2016-12-31T14:08:16Z

Fixes #1144, #1114

Changes made in this Pull Request:

refactor conformational_distance_matrix to use joblib for parallel processing
allows debugging the parallel called functions
rename ncores to n_jobs to have scikit-learn semantics
n_jobs=-1 uses all available cores
general clean ups
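
As a rough illustration of the `n_jobs` semantics listed above (not the code in this PR), the sketch below computes pairwise distances with `joblib.Parallel`. The names `distance` and `pairwise_distances` are hypothetical, the threading backend is chosen only to keep the sketch easy to run as a plain script, and the serial fallback when joblib is not installed is an assumption of the sketch:

```python
import itertools
import math

try:
    from joblib import Parallel, delayed
except ImportError:
    # joblib is a third-party package; fall back to a serial loop without it.
    Parallel = delayed = None


def distance(a, b):
    """Hypothetical stand-in for a conformational distance such as RMSD."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pairwise_distances(frames, n_jobs=1):
    """Return {(i, j): distance} for all pairs i < j.

    n_jobs follows the scikit-learn convention: 1 runs serially,
    -1 uses as many workers as there are cores.
    """
    pairs = list(itertools.combinations(range(len(frames)), 2))
    if Parallel is None:
        values = [distance(frames[i], frames[j]) for i, j in pairs]
    else:
        values = Parallel(n_jobs=n_jobs, backend="threading")(
            delayed(distance)(frames[i], frames[j]) for i, j in pairs)
    return dict(zip(pairs, values))


frames = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
serial = pairwise_distances(frames, n_jobs=1)
parallel = pairwise_distances(frames, n_jobs=-1)
print(serial == parallel)
```

With `n_jobs=1` joblib runs the calls in the calling process, which is what makes stepping through the called function with a debugger practical.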

PR Checklist

Tests?
Docs?
~~- [ ] CHANGELOG updated?~~
Issue raised/referenced?

kain88-de · 2016-12-31T14:12:21Z

@mtiberti and @wouterboomsma could you have a look over this when you have time?

kain88-de · 2017-01-05T12:26:14Z

Yes, it finally builds, so this is ready for review. (Removing all that code also gives a big bump to coverage.)

richardjgowers · 2017-01-05T14:24:29Z

package/MDAnalysis/analysis/encore/utils.py

-        self.stdout.write(str(self))
-        self.stdout.flush()
-
-
 def trm_indeces(a, b):


Should be indices not indeces, is it worth fixing the typo here?

I'll put that on my todo list

kain88-de · 2017-01-09T15:25:07Z

@richardjgowers can you have a look at why Travis fails? My fix does not seem to have worked.

richardjgowers · 2017-01-09T18:21:02Z

Cool we're up to +0.4%

kain88-de · 2017-01-09T19:41:52Z

Yeah removing code always has this nice effect.

wouterboomsma · 2017-01-12T14:51:16Z

package/MDAnalysis/analysis/encore/confdistmatrix.py

-        else:
-            a[0] = b[0]
-            a[1] = b[1] + 1
-


Was all this code unnecessary, or was it moved somewhere else? ( @mtiberti wrote this code, so I'm not entirely sure about this change )

joblib now takes care of generating well-sized batches to work on. It does some automatic adjustment at the beginning, but we can also give it a rough estimate of the batch size to use. This work-stealing approach will use all available power until all computations are done. In the old code, if a batch finished early, the core would just idle until all batches were finished.

I see. Makes sense.
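
The idle-core effect described above can be illustrated with a toy model. This is not joblib's scheduling algorithm, just a sketch comparing fixed contiguous batches with handing each task to whichever worker is free next; the task durations and function names are made up:

```python
import heapq


def static_makespan(durations, n_workers):
    """Split tasks into contiguous, equally sized chunks, one per worker."""
    chunk = -(-len(durations) // n_workers)  # ceiling division
    loads = [sum(durations[i:i + chunk])
             for i in range(0, len(durations), chunk)]
    return max(loads)  # everyone waits for the slowest worker


def dynamic_makespan(durations, n_workers):
    """Hand the next task to whichever worker becomes free first."""
    free_at = [0] * n_workers  # time at which each worker is next idle
    heapq.heapify(free_at)
    for d in durations:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + d)
    return max(free_at)


# Four short tasks followed by four long ones, shared between two workers.
durations = [1, 1, 1, 1, 5, 5, 5, 5]
print(static_makespan(durations, 2))
print(dynamic_makespan(durations, 2))
```

In this made-up case the fixed split takes 20 time units because one worker receives all the long tasks, while the dynamic assignment takes 12, which matches the ideal of 24 units of work spread over two workers.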

wouterboomsma · 2017-01-12T14:53:19Z

testsuite/MDAnalysisTests/analysis/test_encore.py

-    def test_rmsd_matrix_with_superimposition(self):        
+    @dec.skipif(module_not_found('sklearn'),
+                "Test skipped because sklearn is not available.")
+    def test_rmsd_matrix_with_superimposition(self):


Why is sklearn required here? In general, most of encore does not require sklearn. sklearn is only needed when selecting particular clustering or dimensionality reduction methods, but the default options for both use a built-in method.

Well it does now. I'm using joblib to do the parallel load balancing. This allows me to debug the changes in #1136 since it actually returns the exceptions that are thrown in the multiprocessing.

We could also add joblib as a dependency to the mdanalysis package. Then the sklearn guards can be removed.

Not sure I understand the connection between sklearn and joblib. Are you saying we could have added a guard on joblib here instead of sklearn? - and that the above works because joblib is a dependency of sklearn?

sklearn has developed the joblib library for easier parallel programming in python. They ship that library as an external dependency in sklearn.externals.joblib but the joblib library can also be used as a separate package. I currently just decided to use the version bundled with sklearn. I could go to use the standalone package. I hope that makes it clearer.

So everything right now that calculates a conformation distance matrix needs to use sklearn.

Nope. I might have to replace it with `@dec.skipif(module_not_found('sklearn.externals.joblib'))` to detect only the joblib inside of sklearn. But joblib is a pure Python package, so installing it isn't a problem. I'll have some time on the weekend to do this.

To be clear, I'll use the standalone joblib package, so sklearn won't need to be installed.

I see. Thanks. I thought maybe sklearn would expose joblib as a global module when importing sklearn - although now that I think about it that would be a pretty ugly side effect.

I don't want to turn this into a huge deal. I can live with it either way - and you certainly have a better idea of what the overall strategy is for MDAnalysis. So, your call.

I'd also go with explicit joblib dependency. After all, we might use it elsewhere, too, without sklearn. More discussion in #1159.

... ah sorry, late to the party, you already did it #1145 (comment)
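
For reference, a test guard on joblib itself rather than on sklearn could look roughly like the sketch below. The `module_not_found` here is a hypothetical stand-in written with the standard library, not the helper from the MDAnalysis test suite, and `unittest.skipIf` is swapped in for `dec.skipif`:

```python
import importlib.util
import unittest


def module_not_found(name):
    """Return True if `name` cannot be located (hypothetical helper)."""
    try:
        return importlib.util.find_spec(name) is None
    except (ImportError, ValueError):
        # A dotted name whose parent package is missing raises
        # ModuleNotFoundError (a subclass of ImportError).
        return True


class TestNeedsJoblib(unittest.TestCase):
    @unittest.skipIf(module_not_found("joblib"),
                     "Test skipped because joblib is not available.")
    def test_parallel_is_exposed(self):
        import joblib
        self.assertTrue(hasattr(joblib, "Parallel"))


print(module_not_found("json"))
print(module_not_found("no_such_package_xyz123.sub"))
```

Checking the dotted name `sklearn.externals.joblib` the same way would report it as missing whenever sklearn itself is not installed, which is why the guard target matters.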

wouterboomsma · 2017-01-12T14:56:00Z

@kain88-de Thanks for all the efforts in refactoring this code. I agree with most of what you did. However, I'm a bit puzzled about all the sklearn guards that have been added to test_encore. The test guards should only be necessary under very specific circumstances (when using non-standard clustering or dimensionality reduction methods).

mtiberti · 2017-01-12T17:00:33Z

Hi everyone,
thanks a lot for this work and sorry to have kept you waiting - I recently got back from winter break and had a bit of backlog to clear. Happy to hear that joblib will improve the performance and usability of the code!

kain88-de · 2017-01-13T22:25:51Z

I switched to using the standalone joblib package now instead of the version shipped in scikit-learn. This follows our current unwritten rule that compiled packages in mda.analysis.* should only be an optional dependency for small parts of the code.

richardjgowers · 2017-01-14T12:50:09Z

testsuite/MDAnalysisTests/analysis/test_encore.py

+            if 'encore' in mod:
+                sys.modules.pop(mod, None)
+
+    @block_import('sklearn')


I'm a little confused why this still passes, it should warn when joblib is blocked? I'll have to look into this before merging

Those tested packages still depend on sklearn. I shortened the list of tested imported modules. The joblib library is now a dependency and always installed.

Ah ok, thanks!

This is included in scikit-learn. It does work-stealing job balancing for us, which helps to use the full processor power. The biggest advantage, though, is that I can now include print statements and exceptions for debugging in the `conf_dist_function`.

kain88-de · 2017-01-14T14:31:35Z

I removed the merge conflicts. Anything else that needs changing?

wouterboomsma · 2017-01-15T15:00:22Z

@kain88-de Thanks. Looks great to me.

kain88-de requested a review from mtiberti December 31, 2016 14:11

kain88-de force-pushed the refactor-encore-confdist branch 6 times, most recently from e1d72b9 to 5dcc40a Compare January 5, 2017 08:25

richardjgowers reviewed Jan 5, 2017

kain88-de force-pushed the refactor-encore-confdist branch from 400199d to 7076b4c Compare January 8, 2017 20:27

kain88-de mentioned this pull request Jan 11, 2017

Refactor align #1136

Closed

7 tasks

wouterboomsma reviewed Jan 12, 2017

kain88-de mentioned this pull request Jan 13, 2017

Handling Optional dependencies for the analysis module #1159

Closed

kain88-de force-pushed the refactor-encore-confdist branch 2 times, most recently from 41b1bd2 to a5a0548 Compare January 13, 2017 22:24

richardjgowers requested changes Jan 14, 2017

richardjgowers approved these changes Jan 14, 2017

kain88-de added 5 commits January 14, 2017 15:29

remove multiprocessing completely

74be2a2

now we only rely on numpy functions and joblib

refactor TriangularMatrix

eb7bf3b

This gives the initialization more freedom. We can have more types to choose from and the metadata can be passed in as a dict and still be correctly handled.

remove unused progressbar code

a0c5455

use n_jobs instead of ncores

91e3a2a

kain88-de and others added 10 commits January 14, 2017 15:29

update docs

0f3dd34

deactivate tests in minimal build

47858ca

use include guards for joblib

e7ddb3f

remove ProgressBar

0d068da

it isn't used anymore and there were license issues with it

fix trm_indices spelling

8950320

TST: Added tests for analysis.encore import warnings

a541e15

Update test_encore.py

c2656bf

fix failing test suite

8e1e525

TST: Fixed block_import not blocking subpackages

0dcbd8d

switch to use standalone joblib package

c069958

We don't use the sklearn packaged version to have most of the encore distribution run normally inside of MDAnalysis

kain88-de force-pushed the refactor-encore-confdist branch from a5a0548 to c069958 Compare January 14, 2017 14:30

kain88-de merged commit 789a96c into MDAnalysis:develop Jan 15, 2017

jbarnoud mentioned this pull request Jan 17, 2017

Errors on running tests after setting up development environment #1166

Closed

richardjgowers mentioned this pull request Jan 17, 2017

Encore tests failing due to float precision #1168

Closed

kain88-de deleted the refactor-encore-confdist branch January 20, 2017 08:50
