Hi,
For fun, kind of, I'm poking at reducing the memory usage of the standalone example so that more of it can run on my lower-end system.
I've only skimmed the paper so far, but while reading the code I've run into a few small points of confusion, so I'm opening this pull request to connect a little bit.
I'll answer these questions myself if and when I find the answers.
Questions:
I noticed that the kernel norm is detached from the gradient graph and cached between runs. How come this doesn't result in an out-of-date kernel norm as the kernels update during training?
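To make sure I'm reading the pattern right, here's a minimal sketch of what I mean; the class and attribute names are mine, not the repo's:

```python
import torch

# Hypothetical reconstruction of the pattern in question: the kernel norm is
# computed once, detached, and cached, so later forward passes reuse a value
# derived from older kernel weights.
class CachedNormKernel(torch.nn.Module):
    def __init__(self, channels: int, length: int):
        super().__init__()
        self.kernel = torch.nn.Parameter(torch.randn(channels, length))
        self._cached_norm = None  # filled on the first forward pass

    def forward(self) -> torch.Tensor:
        if self._cached_norm is None:
            # .detach() removes the norm from the autograd graph; caching it
            # means the scaling no longer tracks updates to self.kernel.
            self._cached_norm = self.kernel.norm(dim=-1, keepdim=True).detach()
        return self.kernel / (self._cached_norm + 1e-6)
```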
I noticed that the call to `F.interpolate` does not specify `align_corners`, leaving it to default to False. As far as I can tell, with linear interpolation this can leave flat artefacts at the edges and stretch the content between them by a fraction of a sample. Does this matter? My intuition would have been to do linear interpolation by dropping the last sample or wrapping around to the first. In my changes, I had to add a constant of 1 to the input size to get the same interpolation output for truncated kernels. (A small demonstration of the `align_corners` difference is sketched after the next question.)

Why is it helpful to scale the weights of the kernels by their distance? Wouldn't the training process learn this scaling itself?
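To make the `align_corners` question concrete, here is a small self-contained demonstration of the standard `F.interpolate` behaviour (not code from the repo):

```python
import torch
import torch.nn.functional as F

# 1D linear upsampling of a simple ramp, with and without align_corners.
# F.interpolate expects (batch, channels, length) for mode="linear".
x = torch.tensor([[[0.0, 1.0, 2.0, 3.0]]])

up_default = F.interpolate(x, size=8, mode="linear", align_corners=False)
up_aligned = F.interpolate(x, size=8, mode="linear", align_corners=True)

print(up_default)
# align_corners=False treats samples as pixel centres, so the first and last
# outputs repeat the edge values (the "flat artefacts" mentioned above):
# tensor([[[0.0000, 0.2500, 0.7500, 1.2500, 1.7500, 2.2500, 2.7500, 3.0000]]])

print(up_aligned)
# align_corners=True maps the first/last input samples onto the first/last
# output samples, giving an evenly stretched ramp:
# tensor([[[0.0000, 0.4286, 0.8571, 1.2857, 1.7143, 2.1429, 2.5714, 3.0000]]])
```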
Small changes to the backend can result in small (on the order of 1/100th) changes to outputs unless a lot of care is taken. How important is that kind of numerical stability?
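As a tiny illustration of where such differences come from (not code from this repo): floating-point addition isn't associative, so a backend that reduces in a different order gives slightly different results, and deep stacks of FFTs and matmuls can amplify this.

```python
import torch

torch.manual_seed(0)
x = torch.randn(1_000_000)

# Two mathematically equivalent reductions performed in different orders.
print(x.sum().item())
print(x.view(1000, 1000).sum(dim=0).sum().item())
# The two results typically agree only to several decimal places, because
# float32 addition is not associative.
```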
Resolved:

Using `n=2*L` in the FFT is to avoid performing a circular convolution (this took me some learning).
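A standalone sketch of the difference that padding makes (illustrative, not code from the repo):

```python
import torch
import torch.nn.functional as F

# Without zero-padding, the tail of the convolution wraps around and corrupts
# the start of the output (circular convolution). Padding the FFT to 2*L
# leaves room for the full linear convolution.
L = 8
x = torch.randn(L)
k = torch.randn(L)

# Circular convolution: FFT length equal to the signal length.
circular = torch.fft.irfft(torch.fft.rfft(x, n=L) * torch.fft.rfft(k, n=L), n=L)

# Linear (causal) convolution: zero-pad to 2*L, multiply spectra, keep the
# first L outputs.
linear = torch.fft.irfft(
    torch.fft.rfft(x, n=2 * L) * torch.fft.rfft(k, n=2 * L), n=2 * L
)[:L]

# Direct reference with conv1d (kernel flipped because conv1d cross-correlates).
ref = F.conv1d(x.view(1, 1, -1), k.flip(-1).view(1, 1, -1), padding=L - 1)[0, 0, :L]

print(torch.allclose(linear, ref, atol=1e-5))    # True
print(torch.allclose(circular, ref, atol=1e-5))  # False in general
```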
Idea:

Given the principles of this algorithm, it looks like it might be possible to run with very little RAM by using the FFT to perform the kernel interpolation in frequency space and streaming the convolution.
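A rough sketch of the streaming part of that idea, using overlap-add with a fixed FFT block size (the function name, block size, and layout are illustrative, and the interpolation-in-frequency-space part is not shown):

```python
import torch

def overlap_add_conv(x: torch.Tensor, k: torch.Tensor, block: int = 1024) -> torch.Tensor:
    """Causal linear convolution of a long signal x with kernel k, processed
    in fixed-size blocks so only one block needs to be resident at a time."""
    n = x.shape[-1]
    n_fft = block + k.shape[-1] - 1
    K = torch.fft.rfft(k, n=n_fft)            # kernel spectrum, computed once
    out = torch.zeros(n + k.shape[-1] - 1)
    for start in range(0, n, block):
        chunk = x[start:start + block]
        y = torch.fft.irfft(torch.fft.rfft(chunk, n=n_fft) * K, n=n_fft)
        stop = min(start + n_fft, out.shape[-1])
        out[start:stop] += y[: stop - start]  # overlap-add this block's result
    return out[:n]                            # keep the causal part

x = torch.randn(10_000)
k = torch.randn(512)
full = torch.fft.irfft(
    torch.fft.rfft(x, n=2 * x.numel()) * torch.fft.rfft(k, n=2 * x.numel()),
    n=2 * x.numel(),
)[: x.numel()]
print(torch.allclose(overlap_add_conv(x, k), full, atol=1e-3))  # True (up to float32 error)
```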