This leverages some pretty cool functionality the PyTorch team is playing around with in https://github.com/pytorch/functorch to let us replace our bespoke derivative kernel implementations with a single, fully autograd-based one. You can use it like this:
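Roughly something like the following (a sketch only; the import path for `DerivativeKernel` and the exact block layout are placeholders, assuming it mirrors the existing `*KernelGrad` kernels):

```python
import torch
from gpytorch.kernels import RBFKernel

# Import path is a sketch -- DerivativeKernel is the wrapper added in this PR.
from gpytorch.kernels import DerivativeKernel

x1 = torch.randn(10, 2)
x2 = torch.randn(8, 2)

# Wrap any differentiable base kernel; the value, gradient, and Hessian blocks
# all come out of autograd instead of a hand-derived kernel implementation.
base_kernel = RBFKernel(ard_num_dims=2)
deriv_kernel = DerivativeKernel(base_kernel)

# If this follows the layout of the existing *KernelGrad kernels, the covariance
# is (d + 1) times larger per input point than the base kernel's, i.e. 30 x 24 here.
covar = deriv_kernel(x1, x2)
print(covar.shape)
```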
The only change necessary outside of actually implementing the kernel was that `vmap` currently can't batch over `torch.equal`, so I made `x1_eq_x2` an argument we can specify / set in `__call__` to bypass the comparison here: `gpytorch/gpytorch/kernels/kernel.py`, line 300 at fc2053b.
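In practice that just means stating the symmetry explicitly when calling the kernel, along these lines (sketch; only the `x1_eq_x2` keyword is new here, and the import path is a placeholder as above):

```python
import torch
from gpytorch.kernels import RBFKernel
from gpytorch.kernels import DerivativeKernel  # import path sketched as above

deriv_kernel = DerivativeKernel(RBFKernel(ard_num_dims=2))

x_train = torch.randn(10, 2)
x_test = torch.randn(5, 2)

# vmap has no batching rule for the bool-returning torch.equal, so instead of the
# kernel inferring whether x1 and x2 are the same inputs, the caller says so:
train_train = deriv_kernel(x_train, x_train, x1_eq_x2=True)
test_train = deriv_kernel(x_test, x_train, x1_eq_x2=False)
```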
## Problems
- For some reason, the Hessian block of `DerivativeKernel(MaternKernel(nu=2.5))` specifically is just the negative of what it should be. This is the only kernel this happens for as far as I can tell. I have no idea why, but it's annoying, since Matern would obviously be a fantastic kernel to get a derivative version of "for free." I suspect this has to do with the non-squared distance computations here?
- If you wrap a non-differentiable kernel in `DerivativeKernel`, it will still return a matrix, but a super non-PD one. I don't think there's a good solution, but it's problematic: if we're not using Cholesky (likely with default settings, since derivative kernel matrices get really big really fast), I'm not sure we'd fail loudly anywhere along the way (see the sketch after this list).
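In the meantime, a loud failure could be bolted on by the caller with an explicit eigenvalue check, something like this sketch (not part of this PR; kernel and import path are placeholders as above):

```python
import torch
from gpytorch.kernels import RBFKernel  # stand-in; imagine a non-differentiable kernel here
from gpytorch.kernels import DerivativeKernel  # import path sketched as above

x = torch.randn(20, 2)
deriv_kernel = DerivativeKernel(RBFKernel(ard_num_dims=2))

# Materialize the train/train block and check that it is at least numerically PSD
# before it silently flows into an iterative (non-Cholesky) solve.
covar = deriv_kernel(x, x, x1_eq_x2=True).evaluate()
min_eig = torch.linalg.eigvalsh(covar).min().item()
if min_eig < -1e-6:
    raise RuntimeError(f"derivative kernel matrix is not PSD (min eigenvalue {min_eig:.3e})")
```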