-
Notifications
You must be signed in to change notification settings - Fork 653
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor encore conformational_distance_matrix
#1145
Refactor encore conformational_distance_matrix
#1145
Conversation
@mtiberti and @wouterboomsma could you have a look over this when you have time. |
e1d72b9
to
5dcc40a
Compare
yes it finally builds. So this is ready for a review. (also removing all the code gives a big bump to coverage) |
self.stdout.write(str(self)) | ||
self.stdout.flush() | ||
|
||
|
||
def trm_indeces(a, b): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be indices not indeces, is it worth fixing the typo here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll put that on my todo list
400199d
to
7076b4c
Compare
@richardjgowers can you have a look why travis fails. My fix seems not to have worked. |
Cool we're up to +0.4% |
Yeah removing code always has this nice effect. |
else: | ||
a[0] = b[0] | ||
a[1] = b[1] + 1 | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was all this code unnecessary, or was it moved somewhere else? ( @mtiberti wrote this code, so I'm not entirely sure about this change )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
joblib takes care now of generating good sized batches to work on. It does some automatic adjustment at the beginning. But we can also give it a rough estimate batchsize to use. This work stealling approach will use all available power until all computations are done. In the old code if a batch was done early the core would just idle until all batches were finished.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Makes sense.
def test_rmsd_matrix_with_superimposition(self): | ||
@dec.skipif(module_not_found('sklearn'), | ||
"Test skipped because sklearn is not available.") | ||
def test_rmsd_matrix_with_superimposition(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is sklearn
required here? In general, most of encore does not require sklearn
. Only when selecting particular clustering or dimensionality reduction methods, sklearn
is needed. But the default options for both use an a built-in method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well it does now. I'm using joblib to do the parallel load balancing. This allows me to debug the changes in #1136 since it actually returns the exceptions that are thrown in the multiprocessing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could also add joblib as a dependency to the mdanalysis package. Then the sklearn guards can be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I understand the connection between sklearn and joblib. Are you saying we could have added a guard on joblib here instead of sklearn? - and that the above works because joblib is a dependency of sklearn?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sklearn has developed the joblib
library for easier parallel programming in python. They ship that library as an external dependency in sklearn.externals.joblib
but the joblib library can also be used as a separate package. I currently just decided to use the version bundled with sklearn. I could go to use the standalone package. I hope that makes it clearer.
So everything right now that calculates a conformation distance matrix needs to use sklearn.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nope. I might have to replace it with @dec.skipif(module_not_found('sklearn.externals.joblib')
to detect only the joblib inside of sklearn. But joblib is a pure python package so installing it isn't any problem. I'll have some time on the weekend to do this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I mean I'll use the joblib package. Then so sklearn won't need to be installed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Thanks. I thought maybe sklearn would expose joblist as a global module when importing sklearn - although now that I think about it that would be a pretty ugly side effect.
I don't want to turn this into a huge deal. I can live with it either way - and you certainly have a better idea of what the overall strategy is for MDAnalysis. So, your call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd also go with explicit joblib dependency. After all, we might use it elsewhere, too, without sklearn. More discussion in #1159.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... ah sorry, late to the party, you already did it #1145 (comment)
@kain88-de Thanks for all the efforts in refactoring this code. I agree with most of what you did. However, I'm a bit puzzled about all the sklearn guards that have been added to test_encore. The test guards should only be necessary under very specific circumstances (when using non-standard clustering or dimensionality reduction methods. |
Hi everyone, |
41b1bd2
to
a5a0548
Compare
I switched to use the joblib package now instead of the version shipped in scikit-learn. This follows our current unwritten rule that compilied packages in |
if 'encore' in mod: | ||
sys.modules.pop(mod, None) | ||
|
||
@block_import('sklearn') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little confused why this still passes, it should warn when joblib is blocked? I'll have to look into this before merging
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those tested packages still depend on sklarn. I shortened the list of tested imported modules. The joblib library is now a dependency and always installed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah ok, thanks!
This is included in scikit-learn. It does work stealing job balancing for us that helps to use the full processor power. The absolute biggest advantage of this is though that I can now include print/exception errors for debugging in the `conf_dist_function`.
now we only rely on numpy functions and joblib
This gives the initialization more freedom. We can have more types to choose from and the metadata can be passed in as a dict and still be correctly handled.
it isn't used anymore and there were license issues with it
We don't use the sklearn packaged version to have most of the encore distribution run normally inside of MDAnalysis
a5a0548
to
c069958
Compare
I removed the merge conflicts. Anything else that needs changing? |
@kain88-de Thanks. Looks great to me. |
Fixes #1144, #1114
Changes made in this Pull Request:
PR Checklist
- [ ] CHANGELOG updated?