Add mixed tanimoto gp surrogate #318

Merged (33 commits) on Jan 29, 2024

xxEthene
Contributor

This PR adds a new surrogate type, MixedTanimotoGPSurrogate, designed for datasets that contain MolecularInputs with Fingerprints, Fragments, or FingerprintsFragments molecular features together with Continuous and/or Categorical features. This surrogate is therefore analogous to MixedSingleTaskGP, except that it involves MolecularInputs. To give this surrogate type the necessary flexibility, some other changes were made, which are also described below:

  • For MixedTanimotoGPSurrogate, the continuous, categorical, and molecular kernels are combined so that the final covar_module = sum of kernels + product of kernels. This is analogous to how the continuous and categorical kernels are combined in MixedSingleTaskGP.
  • TanimotoGPSurrogate has been modified to work only with Fingerprints, Fragments, and FingerprintsFragments molecular features.
  • MolecularInputs with MordredDescriptors molecular features can now be used with SingleTaskGP and MixedSingleTaskGP.
  • Therefore, MolecularInputs with MordredDescriptors molecular features do not require TanimotoGPSurrogate or MixedTanimotoGPSurrogate. This change allows the Mordred descriptors to be scaled and allows continuous kernels to be used with them (analogous to how descriptors in CategoricalDescriptorInput are treated). For example, when a dataset contains one MolecularInput with MordredDescriptors molecular features and another MolecularInput with FingerprintsFragments molecular features, they will be handled by different kernels and scalers in a MixedTanimotoGPSurrogate.
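
The "sum of kernels + product of kernels" structure from the first bullet can be sketched in plain Python. This is only an illustration under stated assumptions, not the code from this PR: the function names (`tanimoto`, `matern52`, `hamming_kernel`, `mixed_kernel`) are hypothetical, the Hamming-distance categorical kernel is a stand-in, and the learnable output scales and lengthscales that a GP library would fit are left out.

```python
import math


def tanimoto(x, y):
    """Tanimoto similarity between two non-negative fingerprint vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    denom = sum(a * a for a in x) + sum(b * b for b in y) - dot
    # Convention of this sketch: two all-zero fingerprints count as identical.
    return 1.0 if denom == 0 else dot / denom


def matern52(x, y, lengthscale=1.0):
    """Matern-5/2 kernel on continuous features."""
    r = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y))) / lengthscale
    s = math.sqrt(5.0) * r
    return (1.0 + s + s * s / 3.0) * math.exp(-s)


def hamming_kernel(x, y, lengthscale=1.0):
    """Categorical kernel based on the mean Hamming distance (an assumption)."""
    mismatches = sum(a != b for a, b in zip(x, y)) / len(x)
    return math.exp(-mismatches / lengthscale)


def mixed_kernel(p, q):
    """Sum of the three kernels plus their product."""
    k_cont = matern52(p["cont"], q["cont"])
    k_cat = hamming_kernel(p["cat"], q["cat"])
    k_mol = tanimoto(p["mol"], q["mol"])
    return (k_cont + k_cat + k_mol) + (k_cont * k_cat * k_mol)


# Hypothetical data points: continuous, categorical, and fingerprint parts.
a = {"cont": [0.2, 0.5], "cat": ["A"], "mol": [1, 1, 0, 1]}
b = {"cont": [0.2, 0.5], "cat": ["B"], "mol": [1, 0, 0, 1]}

print(mixed_kernel(a, a))
print(mixed_kernel(a, b))
```

For two identical points each individual kernel evaluates to 1, so the sum contributes 3 and the product contributes 1. Changing the category or the fingerprint lowers both terms.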

@jduerholt jduerholt left a comment

Hi Li,

thanks for this nice PR. It looks very good overall! I left some comments.

Best,

Johannes

bofire/surrogates/mixed_tanimoto_gp.py (outdated, resolved)
bofire/surrogates/mixed_tanimoto_gp.py (resolved)
bofire/surrogates/mixed_tanimoto_gp.py (resolved)
bofire/surrogates/mixed_tanimoto_gp.py (resolved)
bofire/surrogates/mixed_tanimoto_gp.py (outdated, resolved)
bofire/surrogates/single_task_gp.py (outdated, resolved)
tests/bofire/surrogates/test_gps.py (resolved)
@xxEthene
Contributor Author

I have made the changes as suggested.

In addition, I have also slightly cleaned up the MixedSingleTaskGPSurrogate _fit function to use the get_continuous_features, get_categorical_features, and get_feature_indices functions.

@jduerholt jduerholt left a comment

Hi Li,

looks really good, just a few minor things.

cc: @simonsung06

Best,

Johannes

bofire/data_models/domain/features.py (outdated, resolved)
bofire/data_models/domain/features.py (outdated, resolved)
bofire/data_models/domain/features.py (resolved)
bofire/data_models/domain/features.py (outdated, resolved)
@jduerholt jduerholt left a comment

Hi Li,

looks good overall, I left only a few minor comments. Sorry for being so picky ...

cc: @simonsung06

Best,

Johannes

bofire/surrogates/utils.py (outdated, resolved)
bofire/surrogates/utils.py (outdated, resolved)
tests/bofire/surrogates/test_gps.py (outdated, resolved)
tests/bofire/surrogates/test_gps.py (outdated, resolved)
tests/bofire/surrogates/test_tanimoto_related.py (outdated, resolved)
@jduerholt
Contributor

One more note: we merged a PR that refactored the test suite (#327). You need to merge main into your branch again, but the impact should be small. You just have to move whatever creates conflicts to its new location.

In case of problems, I can also help you or do it together with you!

@xxEthene
Contributor Author

Hi @jduerholt, I have merged my branch with the main branch and solved the conflicts :) Please help me check whether it is correct. Thanks for all your constructive feedback!

@jduerholt
Contributor

jduerholt commented Dec 22, 2023

Hi @xxEthene,

many thanks for the updates. I will have a look at them over the Christmas days. Sorry for the delay!

Best,

Johannes

@xxEthene
Contributor Author

Hi @jduerholt, I have made some changes to the code based on the errors that occurred previously. When I merged with the main branch, I accidentally removed the priors import in mixed_single_task_gp and kept the test_features.py file. These issues have been fixed, and I have moved some tests to the correct place :) However, I am not sure about the remaining failing tests related to strategies, as I did not modify those files and the tests pass on my computer. I am looking forward to your feedback on this! Wish you a great Christmas holiday and a happy new year~~

@jduerholt
Contributor

Hi @xxEthene, just ignore the failing tests. They are caused by a new version of formulaic released on the 25th of December (https://pypi.org/project/formulaic/#history), which breaks our tests. @Osburg, can you take care of this?

Regarding your PR: I will do a final review as soon as I am back in the office next Tuesday! I wish you a happy new year as well!

@jduerholt jduerholt left a comment

Already looks very good, only a few small change requests!

ord_dims = sorted(set(range(d)) - set(cat_dims) - set(mol_dims))  # type: ignore

if cont_kernel_factory is None:
    cont_kernel_factory = kernels.map_MaternKernel(  # type: ignore

  # these are the categorical dimensions after applying the OneHotToNumeric transform
  cat_dims = list(
-     range(len(ord_dims), len(ord_dims) + len(non_numerical_features))
+     range(len(ord_dims), len(ord_dims) + len(categorical_feature_keys))
Contributor

Why not also use get_feature_indices here?

Contributor Author

get_feature_indices returns categorical dimensions with OneHot transformation applied, but we need the categorical dimensions without the transformation here.

Contributor

You are correct, I overlooked it. This is also the place in the code where we run into problems when we have, for example, CategoricalDescriptorInputs of which some are used as one-hots and some as descriptors. This breaks the indices, because we rely here on the categorical features always being the last ones (this is also where the order_id comes into play). This is often the case, but not always. And with the molecular features coming in, it could be the case less often. But this bug already existed before your PR, so we do not have to fix it here. Maybe you have a smart idea for the problem? ;)
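
The index problem described in this thread can be illustrated with a small plain-Python sketch. It is not BoFire's `get_feature_indices` and not the BoTorch transform: the helpers `encoded_layout`, `cat_dims_after_collapse`, and `categoricals_are_last`, as well as the feature names, are hypothetical. It only shows how the one-hot column indices differ from the "categoricals come after all ordinal columns" indices, and when that assumption holds.

```python
def encoded_layout(features):
    """Map each feature key to its column indices in the one-hot encoded input.

    `features` is a list of (key, kind, width) tuples in column order, where
    `kind` is "ordinal" or "onehot" and `width` is the number of columns.
    """
    layout, start = {}, 0
    for key, _kind, width in features:
        layout[key] = list(range(start, start + width))
        start += width
    return layout


def cat_dims_after_collapse(features):
    """Categorical dims once each one-hot block is collapsed to one column,
    assuming the collapsed columns sit after all ordinal columns."""
    n_ord = sum(width for _, kind, width in features if kind == "ordinal")
    n_cat = sum(1 for _, kind, _ in features if kind == "onehot")
    return list(range(n_ord, n_ord + n_cat))


def categoricals_are_last(features):
    """True if no ordinal feature follows a one-hot feature in column order."""
    seen_onehot = False
    for _, kind, _ in features:
        if kind == "onehot":
            seen_onehot = True
        elif seen_onehot:
            return False
    return True


# Categorical feature last: the trailing-columns assumption holds.
good = [("x1", "ordinal", 1), ("x2", "ordinal", 1), ("solvent", "onehot", 3)]
# A descriptor-encoded feature placed after the one-hot block breaks it.
bad = [("x1", "ordinal", 1), ("solvent", "onehot", 3), ("desc", "ordinal", 2)]

for name, feats in [("good", good), ("bad", bad)]:
    print(name, encoded_layout(feats)["solvent"],
          cat_dims_after_collapse(feats), categoricals_are_last(feats))
```

A guard like `categoricals_are_last` is one possible way to fail early instead of silently using wrong indices; whether that fits the actual code base is an open question here.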

bofire/surrogates/mixed_tanimoto_gp.py (resolved)
scaler_enum,
input_preprocessing_specs,
expected_scaler,
expected_indices_length,
Contributor

Why only the length of the indices and not the indices themselves?

Contributor Author

I have changed it to check the indices instead :)

@jduerholt
Contributor

I also stumbled over this order vs order_id issue when working on PR #279. The change to order_id will result in some failing tests in test_inputs.py. I fixed this in the other PR, so you can just copy the fix over. I would not recommend merging the other PR into yours.

@jduerholt
Contributor

In PR #332 the problems regarding the failing tests in the DoE module are fixed. As soon as it is merged into main, you can merge it in from main.

@jduerholt
Contributor

Hi @xxEthene, I left some comments. Sorry for this mess with the order_id. I will investigate the issue further from the botorch side. Just try to change it in the way that I proposed above, and check whether it works then.

@jduerholt
Contributor

Hi @xxEthene, I created a PR in botorch fixing the issue with the OneHotToNumeric InputTransform. If you are interested, here is the PR: pytorch/botorch#2166

@jduerholt
Contributor

Hi @xxEthene, the PR has now been merged. Just tell me if you find time to finish this PR; if not, I will try to finish it ;)

@xxEthene
Contributor Author

> Hi @xxEthene, the PR has now been merged. Just tell me if you find time to finish this PR; if not, I will try to finish it ;)

Hi @jduerholt, sorry for the delay; my computer has been under repair recently. I will be working on the code this weekend!

@xxEthene
Contributor Author

Hi @jduerholt, the order_id values are as follows for now:

  • ContinuousInput: 1
  • ContinuousDescriptorInput: 2
  • DiscreteInput: 3
  • MolecularInput: 4
  • CategoricalDescriptorInput: 5
  • CategoricalInput: 6

and I have modified related tests in test_inputs.

@jduerholt
Contributor

jduerholt commented Jan 22, 2024

> Hi @jduerholt, the order_id values are as follows for now:
>
>   • ContinuousInput: 1
>   • ContinuousDescriptorInput: 2
>   • DiscreteInput: 3
>   • MolecularInput: 4
>   • CategoricalDescriptorInput: 5
>   • CategoricalInput: 6
>
> and I have modified related tests in test_inputs.

Hi @xxEthene,

looks good to me, just one thing: you forgot the CategoricalMolecularInput, which still has order_id 7. It should sit between MolecularInput and CategoricalDescriptorInput and carry the number 5. The following ones should then be 6 and 7.

Can you update this?

Best,

Johannes

@xxEthene
Contributor Author

Hi @jduerholt,

Sorry about this, I have updated the order. I have also formatted the files based on the error messages received. However, the error for the file bofire/strategies/samplers/universal_constraint.py does not show up on my side, so I am not sure about this one.

Best regards,
Yuxin

@jduerholt
Contributor

> However, the error for the file bofire/strategies/samplers/universal_constraint.py is not detected from my side.

Ignore it for now. I think there is another test regarding the sorting of the features still failing. Can you have a look at this one too? Sorry for iterating on this for such a long time!

@xxEthene
Contributor Author

Hi @jduerholt,

The failing tests regarding the sorting of the features are due to the unchanged order_id for outputs. I am so sorry for this. I have updated the order_id for outputs and the order now is:

  • ContinuousInput: 1
  • ContinuousDescriptorInput: 2
  • DiscreteInput: 3
  • MolecularInput: 4
  • CategoricalMolecularInput: 5
  • CategoricalDescriptorInput: 6
  • CategoricalInput: 7
  • ContinuousOutput: 8
  • CategoricalOutput: 9

Hope it can work now!

Best regards,
Yuxin
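
How the order_id values listed above drive feature sorting can be sketched as follows. The numbers are the ones from this comment; everything else is an assumption made for illustration: the `ORDER_ID` dictionary, the `sort_features` helper, the feature keys, and in particular the tie-break by key for features of the same type.

```python
# order_id values as listed in the comment above.
ORDER_ID = {
    "ContinuousInput": 1,
    "ContinuousDescriptorInput": 2,
    "DiscreteInput": 3,
    "MolecularInput": 4,
    "CategoricalMolecularInput": 5,
    "CategoricalDescriptorInput": 6,
    "CategoricalInput": 7,
    "ContinuousOutput": 8,
    "CategoricalOutput": 9,
}


def sort_features(features):
    """Sort (type_name, key) pairs by order_id, then by key.

    The secondary sort by key is an assumption of this sketch.
    """
    return sorted(features, key=lambda f: (ORDER_ID[f[0]], f[1]))


# Hypothetical features in arbitrary order.
features = [
    ("CategoricalInput", "solvent"),
    ("MolecularInput", "smiles"),
    ("ContinuousInput", "temp"),
    ("ContinuousOutput", "yield"),
    ("CategoricalMolecularInput", "ligand"),
    ("ContinuousInput", "conc"),
]

for type_name, key in sort_features(features):
    print(ORDER_ID[type_name], type_name, key)
```

With this ordering the continuous features come first, the molecular ones in the middle, the plain categorical ones last among the inputs, and outputs after all inputs.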

@jduerholt jduerholt left a comment

Almost done. Thank you @xxEthene! Just format the one file or revert your change in universal_constraint.py.

- samples = samples.iloc[
-     self.num_candidates :,
- ]
+ samples = samples.iloc[self.num_candidates :,]
Contributor

I think this is a problem with the formatter. Just revert it to the original version and it should be fine.

Contributor Author

Sure, I just committed it!

@xxEthene
Contributor Author

Hi @jduerholt, I am confused by the error messages from the tests, as everything works on my computer and I did not change anything except a few lines in universal_constraint.py.

@jduerholt
Contributor

It is strange that it does not occur locally for you, but the error comes from my side. I overlooked something in one of my last PRs, and this seems to be the reason for this behavior. I will put up a PR today and merge it in; then you can merge main into your PR again.

Sorry for this!

@jduerholt
Contributor

Should be fixed now, just merge main in. Big sorry for this!

@xxEthene
Contributor Author

Okay, I have done it! :)

@jduerholt jduerholt left a comment

Hi @xxEthene,

thank you very much for all your efforts. And sorry for the tedious review process!

Best,

Johannes

@jduerholt jduerholt merged commit afae85c into experimental-design:main Jan 29, 2024
10 checks passed
3 participants