Improvement to Forward mode #96

Open
soraxas opened this issue May 17, 2021 · 5 comments


soraxas commented May 17, 2021

Hi @JamesYang007!

This repo seems pretty nice, especially the philosophy laid out in your paper, which argues for the benefit of having a pair of pointers to matrices rather than a matrix of dual numbers.

I had been using https://github.com/autodiff/autodiff for a while, which overloads Eigen's scalar type (i.e. the latter approach) to form a matrix of dual numbers, and I think there is quite a bit of overhead (and cache misses) compared to the reference function (a matrix of doubles). I wanted to test out your repo, but realised that it has mainly been focusing on reverse mode rather than forward mode (which is the focus of https://github.com/autodiff/autodiff). Do you have any plans to make some of the implementations in https://github.com/JamesYang007/FastAD/tree/master/include/fastad_bits/reverse/core applicable to both modes? On top of that, it seems like forward mode right now only works with scalar types (float/double) rather than matrix/vector types?
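
To make the two layouts I'm comparing concrete, here is a rough sketch with my own toy types (not either library's actual API): a container of dual numbers interleaves values and tangents, whereas a pair of value/tangent containers keeps each contiguous, which tends to be friendlier to the cache.

```cpp
#include <cstddef>
#include <vector>

// Illustrative only; neither autodiff's nor FastAD's real types.
struct Dual {
    double val;  // function value
    double dot;  // forward-mode tangent
};

// Layout A: "matrix of dual numbers" (array-of-structs).
std::vector<Dual> add_aos(const std::vector<Dual>& a, const std::vector<Dual>& b) {
    std::vector<Dual> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        out[i] = {a[i].val + b[i].val, a[i].dot + b[i].dot};
    return out;
}

// Layout B: "pair of matrices" (struct-of-arrays), in the spirit of the paper.
struct DualVec {
    std::vector<double> val;
    std::vector<double> dot;
};

DualVec add_soa(const DualVec& a, const DualVec& b) {
    DualVec out{std::vector<double>(a.val.size()), std::vector<double>(a.dot.size())};
    for (std::size_t i = 0; i < a.val.size(); ++i) out.val[i] = a.val[i] + b.val[i];
    for (std::size_t i = 0; i < a.dot.size(); ++i) out.dot[i] = a.dot[i] + b.dot[i];
    return out;
}
```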

Finally, one huge benefit of https://github.com/autodiff/autodiff is that it is immediately applicable to a lot of existing functions/solvers (since it uses a custom scalar type to do autodiff) and works out of the box with any function that takes a templated Eigen argument, while your approach requires manual work to implement the different operators (e.g. all the ad::XXX custom operators rather than the Eigen::XXX operators). Do you have any thoughts on that? (One possible approach I had thought of is to extend MatrixBase, rather than using a custom scalar type, to keep track of the adjoints of the input variables.)
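
And to make the "immediately applicable" point concrete, here is a toy sketch (using a hand-rolled dual type rather than the autodiff library's real one) of why a custom scalar type composes with existing templated code unchanged:

```cpp
#include <cmath>
#include <iostream>

// Toy dual number; not autodiff's actual type.
struct Dual {
    double val;
    double dot;
};

Dual operator+(Dual a, Dual b) { return {a.val + b.val, a.dot + b.dot}; }
Dual operator*(Dual a, Dual b) { return {a.val * b.val, a.dot * b.val + a.val * b.dot}; }
Dual sin(Dual a) { return {std::sin(a.val), std::cos(a.val) * a.dot}; }

// An "existing" generic function, written with no AD in mind.
template <typename Scalar>
Scalar f(Scalar x, Scalar y) {
    using std::sin;  // std::sin for double; ADL finds sin(Dual) for Dual
    return sin(x) * y + x;
}

int main() {
    double v = f(1.0, 2.0);                        // plain evaluation
    Dual   d = f(Dual{1.0, 1.0}, Dual{2.0, 0.0});  // seed dx/dx = 1, dy/dx = 0
    std::cout << v << " " << d.val << " " << d.dot << "\n";
    // d.dot == df/dx at (1, 2) == 2*cos(1) + 1
    return 0;
}
```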


JamesYang007 commented May 30, 2021

Hey @soraxas! Thanks for checking out this repo and the paper! Apologies for the late response, my email notifications were set to skip the inbox (rip :( ).

I'll look into fleshing out the forward mode more, maybe next week, but I'll start thinking about it now. I haven't been able to continue developing this whole school year, but I think I'll have a little bit of time soon. I remember I didn't think too much about forward mode because I couldn't think of a way to avoid having a matrix of duals.

I agree that one of the big downsides of FastAD is that we can't reuse existing functions. That said, I talked about this with some of the Stan developers a while back, and they argued that even though the Stan math library allows this reuse of existing functions, they ended up rewriting a lot of the drivers, like ODE solvers, because it was worth the extra optimization.

And even if we were to extend Eigen types, existing functions that use Eigen objects cannot work, because the way FastAD builds expressions and evaluates them is actually quite different from how Eigen does things. One big difference is that the user has to explicitly call autodiff on an AD expression to actually perform the AD operations. Another is that some expressions, like the assignment `Matrix x; x = expr;`, are interpreted differently: Eigen would evaluate the expression and store it into the matrix x, while FastAD overloads `operator=` to return an expression object representing the assignment (no evaluation done). So existing functions cannot be reused, because they are logically doing something very different depending on whether you treat the inputs as AD variables or as regular Eigen matrices of doubles. I think I would need to rewrite the expression-building logic entirely to be able to reuse existing functions. I tried my best when I was coming up with FastAD to have this feature, but I always had to sacrifice performance at some point, which I really did not want to do.
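
To illustrate the `operator=` difference with a toy expression-template sketch (my own types here, not FastAD's real classes): building the expression does no work, the "assignment" is itself a node, and nothing happens until an explicit call drives the forward/backward passes, which is the role of the explicit autodiff call mentioned above.

```cpp
#include <iostream>

struct Var {
    double value;
    double adjoint;
};

// A node representing lhs + rhs; constructing it performs no computation.
template <typename Lhs, typename Rhs>
struct AddExpr {
    Lhs& lhs;
    Rhs& rhs;
    double feval() { return lhs.value + rhs.value; }
    void beval(double seed) { lhs.adjoint += seed; rhs.adjoint += seed; }
};

// The "assignment" is also just a node instead of being evaluated eagerly.
template <typename Expr>
struct AssignExpr {
    Var& dest;
    Expr expr;
    double feval() { dest.value = expr.feval(); return dest.value; }
    void beval(double seed) { expr.beval(seed); }
};

int main() {
    Var x{1.0, 0.0}, y{2.0, 0.0}, z{0.0, 0.0};
    AssignExpr<AddExpr<Var, Var>> assign{z, {x, y}};  // z = x + y, built lazily
    // Nothing has been computed yet; an explicit call drives the evaluation.
    assign.feval();      // forward pass: z.value == 3
    assign.beval(1.0);   // backward pass: x.adjoint == y.adjoint == 1
    std::cout << z.value << " " << x.adjoint << " " << y.adjoint << "\n";
    return 0;
}
```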

I'd be really happy to discuss any design approaches you might have for solving this problem! And I'll keep you posted about the forward mode.

@JamesYang007 JamesYang007 added the enhancement New feature or request label May 30, 2021
@JamesYang007 JamesYang007 self-assigned this May 30, 2021
@JamesYang007

I'll add that I think there's a way to reuse existing functions for forward mode with no performance sacrifice. The above is strictly talking about reverse mode.


soraxas commented May 31, 2021

Thank you very much for your detailed message and your thoughts on the matter! And all is good, no need to worry about delays on these things when life is busy.

I do agree that it makes sense that specialising certain functions or solvers can achieve further optimisation. However, at that point I think directly deriving the derivative function by hand might let you immediately eliminate certain terms, without the need to keep track of adjoints, to squeeze out the last bit of performance. That said, the availability of algorithmic differentiation definitely makes this process more approachable.

That is an interesting note on the inner workings of how FastAD stores its intermediate representation and never fully evaluates the expression until the AD operation is explicitly called.
A question on that note: does that mean it's hard to eagerly evaluate intermediate results in order to avoid re-computation (e.g. for results that will be used several times throughout the expression)? Would that be similar to the pitfall of using the `auto` keyword in Eigen? (see https://eigen.tuxfamily.org/dox/TopicPitfalls.html)

And once again, thank you for your detailed and enthusiastic response. I myself am not in a research area that focuses on AD; I only recently stumbled upon its inner workings when I was trying to implement gradient computation in C++ for GMMs and related algorithms. It's really fascinating how the chain rule can be applied in such an algorithmic way, how the different approaches (forward/reverse) trade off different benefits, and how much hardware-level consideration goes into optimising the AD operations. I don't know the topic in depth, but I'm happy to help :)


JamesYang007 commented Jun 1, 2021

Re: optimization, AD actually provides more benefit than simply automating the derivation of a derivative function, which I only realized as I was writing this library. In most cases, we need to compute both the function value and its derivative. In that case, you can actually automate the caching of certain results for backward evaluation during reverse AD. As a simple example, the exponential function's derivative is precisely the function value. So in FastAD, any time there's a computation of an exponential, it will always (automatically) just read the function value rather than recompute it. This performance boost is huge, especially because the exponential function is very expensive, and making such an optimization when manually writing out the derivative function will not scale once the function gets even slightly complicated.
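
A toy sketch of what that caching looks like for the exponential (not FastAD's actual node class, just the idea): the forward pass calls exp once and stores the result, and the backward pass reuses the stored value since d/dx exp(x) = exp(x).

```cpp
#include <cmath>

// Illustrative reverse-mode node for y = exp(x); not FastAD's real class.
struct ExpNode {
    double x_value   = 0.0;
    double x_adjoint = 0.0;
    double y_value   = 0.0;  // cached forward result
    double y_adjoint = 0.0;

    void forward(double x) {
        x_value = x;
        y_value = std::exp(x);  // the only call to exp
    }
    void backward(double seed) {
        y_adjoint = seed;
        // d/dx exp(x) = exp(x): reuse the cached y_value, no second exp call.
        x_adjoint += y_adjoint * y_value;
    }
};

int main() {
    ExpNode n;
    n.forward(2.0);   // y_value == exp(2)
    n.backward(1.0);  // x_adjoint == exp(2), read from the cache
    return 0;
}
```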

There may be other reasons why we might want to rewrite certain drivers, such as memory management. For example, assuming that we could even reuse an existing solver, it might make unnecessary re-allocations of the values/adjoints if we blindly feed in our AD variables. It is then worth rewriting some of the logic to allocate a region of memory one time and reuse this region. This logic is very specific to the AD procedure (and the AD library), so if you want this kind of performance boost, there's no hope of reusing existing function templates, as I see it.
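
A minimal sketch (my own illustration, not FastAD's or Stan's actual buffers) of the allocate-once-and-reuse idea: the driver owns one scratch region for values/adjoints, and each gradient evaluation only resets it instead of reallocating.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative scratch region a rewritten driver would allocate exactly once.
struct AdScratch {
    std::vector<double> values;
    std::vector<double> adjoints;

    explicit AdScratch(std::size_t n) : values(n, 0.0), adjoints(n, 0.0) {}

    // Cheap between iterations: zero the adjoints, no reallocation.
    void reset_adjoints() { std::fill(adjoints.begin(), adjoints.end(), 0.0); }
};
```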

Re: saving intermediate results, you can save them! I call them "placeholder expressions" in my README, which you can read more about. Basically, x = expr returns another expression that represents this assignment, and you "chain" it with subsequent expressions that use x. So, even the assignment is done lazily.
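
Connecting this back to your re-computation question, here's a toy illustration in the spirit of the sketch further up (again, not FastAD's actual placeholder syntax; the README examples are the authoritative reference): the chained statements are evaluated in order, so the shared sub-expression is computed once in the forward pass and the later statement just reads the placeholder's stored value.

```cpp
// Two lazily-built, chained statements:
//   s   = x + x    (placeholder)
//   out = s * s    (reads s's stored value; x + x is computed only once)
struct Var {
    double value;
    double adjoint;
};

struct Chain {
    Var &x, &s, &out;
    void feval() {
        s.value   = x.value + x.value;
        out.value = s.value * s.value;
    }
    void beval(double seed) {
        out.adjoint = seed;
        s.adjoint  += 2.0 * s.value * out.adjoint;  // d(out)/ds = 2s
        x.adjoint  += 2.0 * s.adjoint;              // d(s)/dx  = 2
    }
};

int main() {
    Var x{3.0, 0.0}, s{0.0, 0.0}, out{0.0, 0.0};
    Chain c{x, s, out};
    c.feval();     // out.value == 36
    c.beval(1.0);  // x.adjoint == 8 * x.value == 24
    return 0;
}
```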

I'm actually not in AD research myself xD I just took up this project for fun back in undergrad. I'm really glad to hear you're interested in this topic though! I find AD incredibly satisfying and super cool because it's such a fundamental tool in a lot of applications today.


soraxas commented Jun 1, 2021

I totally agree with your comment on the huge benefit of automating the reuse of expressions within the derivative function itself. When I was manually deriving the derivative function for the GMM, I did end up pre-computing and reusing (caching) terms that are used throughout the expression, especially the exponential part. Furthermore, in general, there might be some terms used in f(x) that are not needed for computing df(x)/dx if you only care about the gradient and not the function value (and in AD a compiler may or may not end up optimising them out). However, as you mentioned, deriving manually is cumbersome and does not scale well, so automatic AD is definitely much more beneficial.
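
As a made-up concrete instance of such a term, with c an expensive constant that drops out of the gradient entirely:

```latex
f(x) = \exp\!\big(x^\top A x\big) + \log c
\qquad\Longrightarrow\qquad
\nabla_x f(x) = \exp\!\big(x^\top A x\big)\,(A + A^\top)\,x
```

so the \log c term never needs to be evaluated if only the gradient is wanted.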

On a related note, this article on AD is very interesting to me, as it talks about the benefit of letting the compiler perform optimisation, then source-to-source transforming f(x) into df(x)/dx, then performing another round of compiler optimisation (see fig. 8 there). The idea is very ingenious to me, but their implementation is in LLVM and I can't even begin to grasp the code haha

Oh, I hadn't noticed that section; it looks very decent and seems like it would be very beneficial for reducing computation, so I will definitely try it out. And haha, I hadn't realised this project had been around for quite a while! It's great to see someone who is intrigued by a wide range of topics. And yeah, I've mostly just been using autograd from packages like PyTorch as a black box without fully understanding the inner workings; it's only after I tried to look into it that I realised how much is going on under the hood.
