Pandas output #39
Conversation
Super, thanks. I'm super busy this week. I'll try to squeeze in a code review and test drive anyway.
scikit_mol/fingerprints.py
Outdated
Also sets the column prefix for use by the transform method with dataframe output.
"""
self._get_column_prefix()
This is not doing anything as the return is simply discarded???
Yes, I added it there when I was developing the code, then I forgot about it.
Done in 49d39d0
Hi Enrico,
First of all, thanks for the pull request, it's nice to see that people want to use and contribute to Scikit-Mol.
However, I won't merge it into main yet; I think we need some discussion on the approach. Your changes highlight a shortcoming/incompatibility in the smiles and mol converters that needs to be brought into compliance, but I'm reluctant to build column selector code directly into the transformer classes. You can see my comments, and we could have a brief meeting to discuss.
tests/test_smilestomol.py
Outdated
mols = smilestomol_transformer.transform(to_convert)
assert isinstance(mols, pd.DataFrame)
assert mols.columns.tolist() == ["molecule"]
I would stay with the PandasTools default of 'ROMol'; alternatively, this needs to be configurable.
Used the default ROMol.
Done in ac73c12
scikit_mol/fingerprints.py
Outdated
# contains multiple molecule columns (as if from a column selector transformer).
# In that case, the resulting array should be a concatenation of the fingerprint arrays
# for each molecule column.
return X.loc[:, "molecule"]
We can't really be sure that the molecule column is called 'molecule'. It should either be configurable or we should stay with the convention that input to the transformer is always a single column. I think we may be going beyond the scope of the classes by building in column selections to them.
scikit_mol/fingerprints.py
Outdated
If needed this property can be overwritten in the child class.
"""
return np.int8
Why not keep this simple and add a hidden property directly to the class?
You are right, I used a property to make sure that users were not tempted to override it.
But I agree that the most Pythonic way to achieve this is just to define private attributes.
Changed in 8cb5cfe
@@ -13,6 +13,13 @@ def __init__(self, parallel: Union[bool, int] = False):
        self.parallel = parallel
        self.start_method = None  # TODO implement handling of start_method

    def get_feature_names_out(self, input_features=None):
Is this needed by the base sklearn classes? I can't see where else it's used.
This is what makes the "pandas output" mechanism work. If this method is defined, then the base classes will use it to generate the column names, and return the result as a Pandas DataFrame. If this method is not defined, we get an error when trying to set the pandas output for transformers.
OK, thanks for the clarification
scikit_mol/utilities.py
Outdated
@@ -53,3 +53,18 @@ def sanitize(self, X_smiles_list, y=None):
        self.errors = pd.DataFrame({'SMILES': X_errors})

        return X_out, X_errors


def get_mols_from_X(X):
This is not the place to do this, I think. We should not make these kinds of column selections at this level. I would say that we need to fix the input and output of smilesconverter and molfromsmiles converter to accept 2D input and pandas dataframes, under the assumption that there's only one column. We can assert that and throw a sensible error message about using ColumnTransformer if picking it out of a dataframe. Currently they accept iterables and return lists or flat arrays, which isn't the way according to sklearn.
Although not completely in the vein of sklearn, I would like to keep the option to input lists and flat arrays, as it's so fast to convert a list of smiles or mols with the tools.
Thinking of it, this lack of column vector or pandas input support is likely contributing to crashing your attempts to use column transformer.
Thinking of it, this lack of column vector or pandas input support is likely contributing to crashing your attempts to use column transformer.
Yes, I also thought that if we change how scikit-mol handles transformer input in this PR (by making it more compatible with scikit-learn), ColumnTransformer is more likely to work. I will try after this PR is ready!
Thanks for the feedback @EBjerrum ! I agree with your comments; most of them are quite easy to address, and I will address them in my next commits. The main issue that emerged is about the input format for scikit-mol transformers. I implemented my solution with the goal of making as few changes as possible and maintaining backward compatibility. For this reason, I decided to write a quick utility function. But if you agree that we should implement a more radical change in how the scikit-mol transformers accept input, I would be happy to try to implement it! The main inconsistency between scikit-mol transformers and scikit-learn transformers is that:
Otherwise scikit-learn transformers raise an error. So, in my opinion, we could change scikit-mol transformers to make them natively accept 2D arrays with a single column.
Or do you think we should strive for maximum compatibility with scikit-learn?
Yes, exactly. I think the change is needed to accept pandas input/2D arrays, although I don't want to turn scikit-mol into a pipeline toolkit of its own with complex configurable column-selection code. If we implement the 2D input, I'm pretty sure the scikit-learn ColumnTransformer and similar will work. It could probably be in the form of a shared input sanitizer function, e.g. a function taking X and checking what it is (a list or iterable, a 2D column array, or a pandas dataframe with one column), and then transforming it to the expected format for the rest of the code. I think if we start to want to support multicolumn input to the smiles converter and other transformers, we open a can of complexity. So maybe instead of overthinking, we make the assumption that it's a single column, assert it, and give a sensible error message if it's not.

What are you thinking regarding the integration of this input sanitization: should we simply call the function from each .transform in each class, or is there a solution with a mixin class that wraps the subclass-implemented .transform, or a decorator?

I would really prefer to keep the ability to process flat lists; ideally we should have list = trf.transform(list). But that's not what scikit-learn is doing: they return numpy or pandas depending on a configurable global or object switch. It will also be difficult to support the return type of all kinds of standard iterables.
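The shared input sanitizer discussed above could look something like this (the function name and error messages are hypothetical, not the scikit-mol API): accept a flat list/iterable, a 2D single-column array, or a single-column DataFrame, and normalise everything to a flat list, asserting there is only one column.

```python
# Sketch of a shared input sanitizer, under the single-column assumption.
import numpy as np
import pandas as pd


def check_transform_input(X):
    """Return X as a flat list, raising a sensible error on multicolumn input."""
    if isinstance(X, pd.DataFrame):
        if X.shape[1] != 1:
            raise ValueError(
                "Expected a single-column DataFrame; use a ColumnTransformer "
                "to select the SMILES/molecule column first."
            )
        return X.iloc[:, 0].tolist()
    arr = np.asarray(X, dtype=object)
    if arr.ndim == 2:
        if arr.shape[1] != 1:
            raise ValueError("Expected a single column of input.")
        return arr[:, 0].tolist()
    # Flat list, 1D array, or other iterable: pass through as a list.
    return list(X)


print(check_transform_input([["CCO"], ["CCC"]]))  # prints: ['CCO', 'CCC']
```

Each `.transform` could then start by calling this function, so the rest of the code only ever sees one input type.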
I also think that is likely! I will try after this PR is ready.
I also thought about a very simple mixin class. Of course, I now realize that that function should not be in the scikit-mol utilities.
I think we should definitely keep supporting lists, but I think "preserving the input type" is a bad idea: that would definitely break compatibility with scikit-learn. In my opinion, we should return what scikit-learn transformers would. Btw, now I think I have enough information to write some more code. When I am done, I will kindly ask you to proceed with the review!
Yeah, mixins are probably not the choice here, although sometimes they are a good solution to a given challenge irrespective of some developers' spine responses. I would probably go with a function decorator over the in-function call. That way we can control both the input to, and later the output from, the function, and the transform functions will only need to care about their own functionality on the single input type. Have you made decorators before?
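The decorator idea can be sketched as follows (decorator and class names are illustrative, not the scikit-mol code): the wrapper normalises the input before the decorated `.transform` body runs, and could later also post-process the output.

```python
# Sketch of a decorator that normalises transform input to a flat list.
import functools
import numpy as np
import pandas as pd


def normalize_input(transform_method):
    @functools.wraps(transform_method)
    def wrapper(self, X, *args, **kwargs):
        # Flatten single-column DataFrames / 2D arrays to a plain list.
        if isinstance(X, pd.DataFrame):
            assert X.shape[1] == 1, "expected a single column"
            X = X.iloc[:, 0].tolist()
        else:
            arr = np.asarray(X, dtype=object)
            if arr.ndim == 2:
                assert arr.shape[1] == 1, "expected a single column"
                X = arr[:, 0].tolist()
        return transform_method(self, X, *args, **kwargs)

    return wrapper


class EchoTransformer:
    """Toy transformer: its transform body only ever sees a flat list."""

    @normalize_input
    def transform(self, X):
        return list(X)


print(EchoTransformer().transform([["a"], ["b"]]))  # prints: ['a', 'b']
```

A symmetric output-wrapping step in the same decorator is what could later enable a configurable "IO_types" return mode.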
In principle I agree with staying compatible, but I would maybe like to experiment with the other solution at some point in time, maybe configurable behaviour like the pd output type for scikit-learn. Then have a default "scikit_mode" and an "IO_types" mode for the transformers, that I can switch on when I get tired of not getting flat lists or arrays back. The decorator approach could be promising to enable that feature in the future.
Great! Happy Hacking!
Thanks for the effort, looking forward to going through your code later this week!
Hi @enricogandini , I think it really starts to shape up, thanks for the work. I made some edits during my testing of the changes, but I'm not exactly sure how I can contribute changes to your pull request. So I've created a branch pandas_output that contains your changes and my suggestions, and I'll now try to make a pull request for your branch in your clone. Can you take a look at the proposed changes one way or another? Maybe I misunderstood something.
Thanks @EBjerrum ! I will happily have a look at the suggested changes. In the end, did you manage to make a pull request from your pandas_output branch to the pandas_output branch of my fork? Because I don't see one. Otherwise, I'll try to make one myself. Also, I saw your comment on Slack. Yes, I think it's possible to skip some tests. The missing scikit-learn feature from the Python 3.7 environment was just the set_output API, am I right? In that case, I think I can easily skip the tests that require that API, for Python <= 3.7 (I will check the actual version from the scikit-learn docs, maybe it's higher than that).
@EBjerrum I managed to create a pull request from your version of the pandas_output branch to my version of the pandas_output branch. I will have a look at everything you added! This is the pull request link, if you are interested: enricogandini#2 . I am not very familiar with cross-fork pull requests, but I think I managed to do this correctly!
Changes by Esben
I reviewed all your changes Esben, they all make sense. I agree that we should also support 1D arrays; I will add a warning as the comment suggests. Thanks for spotting that useless `.reshape` call at line 56 of `conversions.py`. I probably added that there, then made the relevant changes elsewhere, and forgot about it! Also thanks for making the Standardizer class accept all the new inputs that the other transformers will support.
I think I managed to make a pull request, but GitHub was a little weird when I tried to pull request the other way; I probably pushed the wrong button or something, I'm only orange belt in Git-Fu. Good that you could set it up!
There are automatic tests for Python 3.7, 3.8 ... 3.10 on Linux. It only fails for 3.7; I really suspect the scikit-learn for that version is too old to have the pandas output support. A failing test means the GitHub workflow stops, and it can't auto-update PyPI.
Esben Jannik Bjerrum
cand.pharm, Ph.D
Phone +46 76 303 90 13
http://dk.linkedin.com/in/esbenbjerrum
https://www.cheminformania.com
Which was described on GitHub
So we test that everything works, regardless of the specific name of the pandas container.
I executed the code locally with 3.12.2, and it worked.
Hi @EBjerrum , I installed a local Python 3.7 environment, and the only tests that failed were the ones that specifically tried to use the scikit-learn set_output API (so, the "Pandas output" tests). I used pytest.skipif to skip those tests when the Python version is < 3.8. I also added some more tests. I took the liberty of configuring GitHub test runs for all Python versions <= 3.12. I had been running all the tests on Python 3.12.2, and everything worked! In my opinion this is a great testament to how stable the language and the main scientific libraries have become in recent years.
Yes, it's great with the increased stability. There's a rising focus on the addition of e.g. tests and proper installers with dependencies. Collaboration also helps: two pairs of eyes on a piece of code are better than one.
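The version-gated skip mentioned above (pytest's `skipif` mark) can be sketched like this; the marker and test names here are illustrative, not the actual test suite:

```python
# Sketch: skip the "Pandas output" tests on Python < 3.8, where the
# environment's scikit-learn lacks the set_output API.
import sys

import pytest

requires_set_output = pytest.mark.skipif(
    sys.version_info < (3, 8),
    reason="scikit-learn in the Python 3.7 environment lacks the set_output API",
)


@requires_set_output
def test_pandas_output_smoke():
    # Placeholder body; the real tests would call trf.set_output(transform="pandas").
    assert True
```

When collected by pytest on Python 3.7, the marked test is reported as skipped rather than failing the workflow.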
Sounds excellent with the updated tests. Let me know when you think it's ready, I think it's close for this feature addition and API fix. We should maybe afterwards also add a small notebook illustrating the pandas output, and your use-case.
As I understand it, it's a situation of having a dataframe with existing values (e.g. from QM calculations?) that we then want to add some extra descriptors to, using ColumnFeature to select the SMILES to pass into a pipeline with smilestomol and a descriptor calculator.
Thanks @EBjerrum ! I am thinking about a couple of possible tests to further stabilize the feature, and I'll add a notebook with possible use cases.
Yes, one possible use case is that users may have additional features, computed with some other packages. In this case, keeping everything in DataFrames with well-named features will simplify preprocessing and analysis of the results. Another use case would be to simplify feature importance analysis, as shown here. But in general, I think that making scikit-mol as compatible as possible with the scikit-learn features is a good idea. Scikit-learn is the best package for generic ML workflows. I think scikit-mol users shouldn't be constrained to a subset of scikit-learn features that supports only the most common QSAR tasks. ML experts should be able to leverage the full flexibility of scikit-learn with the chemical features provided by scikit-mol. Of course, some exceptions make sense. I think supporting flat lists and 1D arrays / Series in scikit-mol is a good idea, because in the end a lot of common QSAR tasks will need only that. I'll add an example notebook, possibly improve the tests, and I'll let you know when I think I'm done!
NOTE: this includes an additional mandatory dependency, which is very small and pure Python: the `packaging` package. It's a package from the Python Packaging Authority, and it's the best tool to handle version information. It wasn't strictly necessary for this simple version check, but in the future more complex version checks may be needed, and this package will help a lot.
It's not Python < 3.8 that doesn't support the sklearn set_output API: it's a sklearn version < 1.2. A sklearn version < 1.2 could be present in an environment with Python > 3.8. NOTE: should we raise a warning if the get_feature_names methods are used with sklearn < 1.2? In theory that method could still be used manually, but it would not be used automatically by the sklearn set_output machinery.
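The sklearn-version check described above can be sketched with `packaging` (the helper name is hypothetical, not the scikit-mol code):

```python
# Sketch: gate the pandas-output feature on the scikit-learn version
# rather than the Python version; set_output landed in scikit-learn 1.2.
from packaging.version import Version


def supports_set_output(sklearn_version: str) -> bool:
    """Return True if the given scikit-learn version has the set_output API."""
    return Version(sklearn_version) >= Version("1.2")


print(supports_set_output("1.2.2"))   # prints: True
print(supports_set_output("0.24.2"))  # prints: False
```

In practice, this would be called with `sklearn.__version__`, and `Version` also handles pre-release strings like "1.2rc1" correctly, which naive string comparison does not.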
Also, improve the message.
Previously, they were created with a single name (the last in the list of names!).
NOTE: there are no tests for the Standardizer class! I wanted to add a test about the sklearn pandas output for the Standardizer class, but there are no tests for it. Should we open an issue and a PR to add tests for it?
With all transformer classes that compute features (descriptors and fingerprints)
Having fingerprint column names of the same length is better for display purposes: fp_morgan_0001...fp_morgan_0200...fp_morgan_2048. But this may make it harder to test different fingerprint configurations: the column names will depend on the number of bits! For instance, if nBits is 512, the first fingerprint is called fp_morgan_001; if nBits is 1024, the first fingerprint is fp_morgan_0001. Now we will always call the first fingerprint fp_morgan_1. This will simplify processing the transformer outputs in production workflows. A helper method that returns the nice zero-prefixed column names is included.
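The two naming schemes contrasted above can be sketched as follows (the function names are illustrative, not the scikit-mol API): stable unpadded names for processing, plus a helper producing the zero-padded variants for display.

```python
# Sketch: stable fingerprint column names vs. zero-padded display names.
def feature_names(prefix: str, n_bits: int) -> list:
    """Stable column names, independent of the fingerprint width."""
    return [f"{prefix}_{i}" for i in range(1, n_bits + 1)]


def display_names(prefix: str, n_bits: int) -> list:
    """Zero-padded variants that align nicely when displayed."""
    width = len(str(n_bits))  # e.g. 3 digits for 512 bits, 4 for 1024
    return [f"{prefix}_{i:0{width}d}" for i in range(1, n_bits + 1)]


print(feature_names("fp_morgan", 3))        # prints: ['fp_morgan_1', 'fp_morgan_2', 'fp_morgan_3']
print(display_names("fp_morgan", 1024)[0])  # prints: fp_morgan_0001
```

With the unpadded scheme, the name of bit 1 is the same whether nBits is 512 or 2048, so downstream code selecting columns by name keeps working across configurations.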
Also, add a test dataset with CDDD pre-computed features that can be used with the notebook. TODO: create a test to demonstrate that ColumnTransformer can now be used.
with column transformer
Hi @EBjerrum , I think this is ready. Please have another look and let me know if you think this can be merged. I added an example notebook and some more tests.
Great, thanks for the effort! I'm super busy this week, but I will try and fit it in in one afternoon anyway. Otherwise it will have to be next week.
Hi @enricogandini , looking at the code and your notebook. It however fails for me, as there are 5 rows with NaN values in the CDDD computations. It seems to be empty values in the .csv.gz itself. How did it work for you? Was it an earlier version of the csv file that has since been updated? Otherwise we should either add some NaN imputation or filter the rows away.
I'll have a look @EBjerrum ! I didn't experience that issue with the CDDD sample file. Maybe I committed a wrong version. Btw, I am not sure exactly what you did: I see this pull request was closed, but I don't see the changes in main. I also see that you changed the base branch from main to pandas_output.
Yes, I merged into the pandas_output branch. Maybe it should have been done differently, but the code is now in the pandas_output branch on the top repository.
Hey @EBjerrum , I created a new pull request here, since the previous one was closed: #40 . Yes, there were a few rows with only NaNs in the CDDD sample file (molecules that didn't pass the CDDD validation). I didn't notice them in the notebook, since I am using a very recent version of scikit-learn, which by default doesn't crash on NaNs. Now this should work, if you merge it to the current "base" branch of scikit-mol (that is, the pandas_output branch).
Closes #38 .
All scikit-mol transformer classes now support the scikit-learn set_output API. They can keep the data in DataFrames with meaningful column names.