On the first order implementation #2

Open
teffland opened this issue Sep 1, 2021 · 2 comments


teffland commented Sep 1, 2021

Hi,

I really enjoyed the paper -- a super interesting read and I learned a lot!

In case you're interested, I wanted to point you to a nonprojective entropy implementation in torch_struct that is quite efficient (should you have future use cases for it). A similar derivation works for other additively factorable functions as well.

PR: harvardnlp/pytorch-struct#103

Colab comparison: https://colab.research.google.com/drive/1iUr78J901lMBlGVYpxSrRRmNJYX4FWyg?usp=sharing

Best,
Tom
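
For illustration, here is a minimal, self-contained sketch of the identity both implementations rely on: for an edge-factored (exponential-family) distribution over nonprojective trees, H = log Z - sum_e mu_e * theta_e, where the edge marginals mu come from differentiating log Z computed with the Matrix-Tree theorem. This is not the torch_struct PR or the paper's code; the function and argument names are made up for the example.

```python
import torch

def nonprojective_entropy(arc_scores, root_scores):
    """Entropy of an edge-factored nonprojective dependency CRF.

    arc_scores:  [n, n] tensor; arc_scores[h, m] is the score of edge h -> m.
    root_scores: [n] tensor; root_scores[m] is the score of root -> m.
    """
    arc_scores = arc_scores.detach().requires_grad_(True)
    root_scores = root_scores.detach().requires_grad_(True)
    n = arc_scores.shape[0]
    eye = torch.eye(n, dtype=arc_scores.dtype, device=arc_scores.device)

    # Edge potentials (self-loops zeroed out) and root potentials.
    A = arc_scores.exp() * (1 - eye)
    r = root_scores.exp()

    # Laplacian for the single-root Matrix-Tree theorem (Koo et al., 2007):
    # L[m, m] = sum_h A[h, m], L[h, m] = -A[h, m], first row replaced by r.
    L = torch.diag(A.sum(0)) - A
    L = torch.cat([r.unsqueeze(0), L[1:]], dim=0)

    log_Z = torch.logdet(L)
    # The edge marginals are the gradient of log Z with respect to the scores.
    mu_arc, mu_root = torch.autograd.grad(log_Z, (arc_scores, root_scores))

    # H = log Z - E[score]; the same recipe works for any additively
    # factorable function of the tree, not just entropy.
    return log_Z - (mu_arc * arc_scores).sum() - (mu_root * root_scores).sum()
```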

@ranzmigrod
Collaborator

Hi Tom,

Thank you for your interest in the paper and our work! That is a very good use of torch_struct; indeed, your derivation is very similar to ours, with just a slightly different final equation. As you mentioned in the Colab (I added another version that purely uses our library), I think the timing difference may come down to the constants, since what we do with the marginals is virtually the same in both cases. torch_struct is designed to be very efficient, so obtaining the marginal distribution from there rather than from our implementation is a likely reason for the gap in the comparison. Thanks for pointing this out!

Kind regards,
Ran


timvieira commented Sep 3, 2021

Possible explanations:

  • Your implementation of our version has a loop, while your method doesn't. Can you rewrite the code to make them more similar (vectorize both, or use loops for both)?
  • Is it possible that enabling no_grad added a bunch of overhead?

Other thoughts:
It might also be a good idea to factor out the computation the two methods share, so that only the part where they differ is being compared.

In order to see the big-O behavior, do a log-log plot (that makes a function like time = C*N^K look linear: log time = log C + K * log N). The slope K should be roughly equal for both methods and less than 3. The intercept gives the overhead coefficient C.
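
As a concrete illustration of that last suggestion, here is a small sketch (the cubic stand-in routine is just a placeholder for the two entropy implementations being compared): time a method over a range of sizes, fit a line to the log-log points, and read K off the slope and C off the intercept.

```python
import time
import numpy as np

def fit_loglog(method, sizes, reps=5):
    """Time method(n) for each n in sizes and fit log(time) = log C + K * log n.

    Returns (K, C): the fitted exponent and the overhead constant."""
    times = []
    for n in sizes:
        # Take the best of several runs to reduce timing noise.
        best = float("inf")
        for _ in range(reps):
            start = time.perf_counter()
            method(n)
            best = min(best, time.perf_counter() - start)
        times.append(best)
    K, log_C = np.polyfit(np.log(sizes), np.log(times), deg=1)
    return K, np.exp(log_C)

# Stand-in cubic-time routine; replace with the two entropy implementations.
cubic = lambda n: np.linalg.inv(np.random.rand(n, n))
K, C = fit_loglog(cubic, sizes=[64, 128, 256, 512, 1024])
print(f"time ~ {C:.2e} * n^{K:.2f}")  # the slope K should come out near 3
```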
