Replies: 4 comments 2 replies
-
Hi @jlparkI, thank you for sharing your ideas. I don't have time right now to write a more thorough answer, but I will get back to you on this. In general, we are very much open to contributions like this. I do not see any major drawbacks in offering more SOAP options for users, as long as a sensible default is available for typical users.
-
Hi @jlparkI,
-
Hi Lauri,
-
OK, I've re-implemented these compression schemes in a way that is hopefully less likely to lead to confusion (on the mu_1_nu_1_soap_compression fork). The SOAP constructor now takes the following:
If you think this is something that might be useful, and unless you have any reservations about these changes, could I submit it as a pull request? Thanks again.
-
Hi,
Thanks for building this library, it's very useful. I had a couple of ideas for new features to contribute, one of which I've implemented on a fork, and I wanted to ask whether these might be useful to anyone else in the community / appropriate for the dscribe library.
The number of features in a SOAP descriptor scales quadratically with the number of elements in the dataset, which makes SOAP impractical for any system with a large number of elements. Several solutions to this problem have been proposed. One solution (originally proposed for ACSFs, see Gastegger et al.) is to weight each element using some element-specific value (e.g. the atomic number or the electronegativity) when building the density for a local environment, rather than generating separate power spectra for each element plus "crossover" power spectra. The weight for an atom is then the weight that would be calculated via distance weighting (if the user has selected some form of distance weighting) times the element-specific weight. The number of features in the resulting descriptor vector is invariant to the number of elements. Consequently, using (for example) n_max = 12 and l_max = 8 on a five-element system, we generate 702 features per center rather than more than 10,000.
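To make the scaling concrete, here is a quick back-of-the-envelope calculation. This is only a sketch: I'm assuming the standard feature-count formula for SOAP with crossover enabled, and the function names are mine, not dscribe's.

```python
def soap_n_features(n_max, l_max, n_species):
    """Standard SOAP with crossover: all unique (species, n) channel
    pairs are combined, times (l_max + 1) values of l."""
    n_radial = n_max * n_species
    return (l_max + 1) * n_radial * (n_radial + 1) // 2

def element_weighted_n_features(n_max, l_max):
    """Element weighting collapses all species into one weighted
    density, so the feature count is independent of n_species."""
    return (l_max + 1) * n_max * (n_max + 1) // 2

print(soap_n_features(12, 8, 5))           # 16470 for a five-element system
print(element_weighted_n_features(12, 8))  # 702, regardless of element count
```

Note that for a single-element system the two counts coincide, which is a useful sanity check on the formulas.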
I've written a rough preliminary implementation of this option as an additional class, ElementWeightedSOAP, under the compressed_features branch on this fork (see elweightedsoap.py under descriptors and elweightedsoap.cpp, elweightedsoapGTO.cpp under ext). In a preliminary test on QM9 for prediction of internal energy at 298 K, using a random-features-approximated Gaussian process with a GraphRBF kernel as the model, and with a combination of power-weighting for distance and Pauling electronegativity weighting for atom type, I was able to get a mean absolute error < 0.3 kcal/mol on a 20,000 molecule test set without too much effort.
Additionally, there are several schemes for reducing the size of the SOAP feature descriptor discussed in Darby et al. The one which I think might be worth implementing is under the "Generalized Kernel" section of the paper, which would scale linearly with the number of elements present. (They also briefly mention the scheme I've described above as a possibility.) They demonstrate that, using at least one of these representations, they can achieve good performance on QM9 and several other datasets with kernel ridge regression and a polynomial kernel as the model (mean absolute error on QM9 for prediction of internal energy of about 0.3 kcal/mol). Indeed, for \mu=1, \nu=1, they achieve roughly the same performance as uncompressed SOAP.
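For reference, my understanding of the \nu=1-style compression is that one of the two species channels in the power spectrum is summed over all elements before forming the products. Here is a toy NumPy sketch with random expansion coefficients; the array shapes and the exact contraction are my assumptions for illustration, not dscribe's internals or necessarily Darby et al.'s precise definition.

```python
import numpy as np

rng = np.random.default_rng(0)
n_species, n_max, l_max = 5, 12, 8

# Toy density-expansion coefficients c[z, n, l, m] for one center,
# padded so m runs over [-l_max, l_max] (entries with |m| > l would be zero).
c = rng.normal(size=(n_species, n_max, l_max + 1, 2 * l_max + 1))

# Full power spectrum: p[z1, n1, z2, n2, l] = sum_m c[z1,n1,l,m] * c[z2,n2,l,m]
p_full = np.einsum("anlm,bklm->anbkl", c, c)

# mu=1, nu=1 style compression (as I read Darby et al.): one species
# channel is summed over all elements, so only one species index survives
# and the feature count scales linearly with n_species.
c_sum = c.sum(axis=0)                        # shape (n_max, l_max+1, 2*l_max+1)
p_comp = np.einsum("anlm,klm->ankl", c, c_sum)

print(p_full.shape)   # quadratic in n_species
print(p_comp.shape)   # linear in n_species
```

With this definition the compressed spectrum is exactly the full spectrum contracted over one species index, so no information beyond that sum is discarded by the contraction itself.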
Would either approach potentially be useful to anyone else in the community, and if so, is the dscribe library the appropriate place to implement them? There are some obvious advantages, but there is also a possible drawback in that they might make it more complicated for end users to choose a good descriptor for their intended application. Thanks for any thoughts / suggestions you have.