Replies: 4 comments 2 replies
-
Hi @jlparkI, thank you for sharing your ideas. I don't have time right now to write a more thorough answer, but I will get back to you on this. In general, we are very much open to contributions like this. I do not see any major drawbacks in offering more SOAP options for users, as long as a sensible default is available for typical users.
-
Hi @jlparkI,
-
Hi Lauri,
-
OK, I've re-implemented these compression schemes in a way that is hopefully less likely to lead to confusion (on the mu_1_nu_1_soap_compression fork). The SOAP constructor now takes the following:
If you think this is something that might be useful, and unless you have any reservations about these changes, could I submit it as a pull request? Thanks again.
-
Hi,
Thanks for building this library, it's very useful. I had a couple of ideas for new features to contribute, one of which I've implemented on a fork, and I wanted to ask whether these might be useful to anyone else in the community / appropriate for the dscribe library.
The number of features in a SOAP descriptor scales quadratically with the number of elements in the dataset, which makes SOAP impractical for any system with a large number of elements. Several solutions to this problem have been proposed. One solution (originally proposed for ACSFs, see Gastegger et al.) is to weight each element using some element-specific value (e.g. the atomic number or the electronegativity) when building the density for a local environment, rather than generating separate power spectra for each element plus "crossover" power spectra. The weight for an atom is then the weight that would be calculated via distance weighting (if the user has selected some form of distance weighting) times the element-specific weight. The number of features in the resulting descriptor vector is invariant to the number of elements. Consequently, using (for example) n_max = 12 and l_max = 8 on a five-element system, we generate 702 features per center rather than more than 10,000.
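To make the scaling concrete, here is a quick back-of-the-envelope calculation. This is only a sketch: I'm assuming the standard feature-count formula for SOAP with crossover enabled, and the function names are mine, not dscribe's.

```python
def soap_n_features(n_max, l_max, n_species):
    """Standard SOAP with crossover: all unique (species, n) channel
    pairs are combined, times (l_max + 1) values of l."""
    n_radial = n_max * n_species
    return (l_max + 1) * n_radial * (n_radial + 1) // 2

def element_weighted_n_features(n_max, l_max):
    """Element weighting collapses all species into one weighted
    density, so the feature count is independent of n_species."""
    return (l_max + 1) * n_max * (n_max + 1) // 2

print(soap_n_features(12, 8, 5))           # 16470 for a five-element system
print(element_weighted_n_features(12, 8))  # 702, regardless of element count
```

Note that for a single-element system the two counts coincide, which is a useful sanity check on the formulas.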
I've written a rough preliminary implementation of this option as an additional class, ElementWeightedSOAP, under the compressed_features branch on this fork (see elweightedsoap.py under descriptors and elweightedsoap.cpp, elweightedsoapGTO.cpp under ext). In a preliminary test on QM9 for prediction of internal energy at 298 K, using a random-features-approximated Gaussian process with a GraphRBF kernel as the model, and with a combination of power-weighting for distance and Pauling electronegativity weighting for atom type, I was able to get a mean absolute error < 0.3 kcal/mol on a 20,000 molecule test set without too much effort.
Additionally, there are several schemes for reducing the size of the SOAP feature descriptor discussed in Darby et al. The one which I think might be worth implementing is under the "Generalized Kernel" section of the paper, which would scale linearly with the number of elements present. (They also briefly mention the scheme I've described above as a possibility.) They demonstrate that, using at least one of these representations, they can achieve good performance on QM9 and several other datasets with kernel ridge regression and a polynomial kernel as the model (mean absolute error on QM9 for prediction of internal energy of about 0.3 kcal/mol). Indeed, for \mu=1, \nu=1, they achieve roughly the same performance as uncompressed SOAP.
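For reference, my understanding of the \nu=1-style compression is that one of the two species channels in the power spectrum is summed over all elements before forming the products. Here is a toy NumPy sketch with random expansion coefficients; the array shapes and the exact contraction are my assumptions for illustration, not dscribe's internals or necessarily Darby et al.'s precise definition.

```python
import numpy as np

rng = np.random.default_rng(0)
n_species, n_max, l_max = 5, 12, 8

# Toy density-expansion coefficients c[z, n, l, m] for one center,
# padded so m runs over [-l_max, l_max] (entries with |m| > l would be zero).
c = rng.normal(size=(n_species, n_max, l_max + 1, 2 * l_max + 1))

# Full power spectrum: p[z1, n1, z2, n2, l] = sum_m c[z1,n1,l,m] * c[z2,n2,l,m]
p_full = np.einsum("anlm,bklm->anbkl", c, c)

# mu=1, nu=1 style compression (as I read Darby et al.): one species
# channel is summed over all elements, so only one species index survives
# and the feature count scales linearly with n_species.
c_sum = c.sum(axis=0)                        # shape (n_max, l_max+1, 2*l_max+1)
p_comp = np.einsum("anlm,klm->ankl", c, c_sum)

print(p_full.shape)   # quadratic in n_species
print(p_comp.shape)   # linear in n_species
```

With this definition the compressed spectrum is exactly the full spectrum contracted over one species index, so no information beyond that sum is discarded by the contraction itself.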
Would either approach potentially be useful to anyone else in the community, and if so, is the dscribe library the appropriate place to implement them? There are some obvious advantages, but there is also a possible drawback in that they might make it more complicated for end users to choose a good descriptor for their intended application. Thanks for any thoughts / suggestions you have.