Improve gradient ascent implementation #193

Open · b-r-u opened this issue Sep 2, 2020 · 2 comments

b-r-u commented Sep 2, 2020

There are two related issues with the current gradient ascent implementation.

1. Attributes are assumed to be independent

The documentation states that attributes are assumed to be independent, and thus they are optimized independently. But if two or more attributes are dependent, this approach will likely fail to find the local optimum.

Consider a function like f(x,y) = (x-y)^2 + (0.5*(x+y))^2. Starting from the lower-left quadrant, the current approach will end up somewhere in the lower-left quadrant as well and will not reach the optimum at x=0, y=0.

[Figure "contour": contour plot of f(x, y)]
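
To make this concrete, here is a minimal sketch (not the project's code) of axis-aligned hill climbing on this function. It maximizes -f so that gradient ascent applies; the step size and start point are arbitrary choices for the demonstration:

def f(x, y):
    return (x - y) ** 2 + (0.5 * (x + y)) ** 2

def fitness(x, y):
    # Maximize -f, so the optimum of the ascent is at (0, 0)
    return -f(x, y)

step = 0.5
x, y = -2.0, -2.0  # start in the lower-left quadrant

while True:
    # Each candidate changes only one attribute at a time,
    # as the current per-attribute approach does
    candidates = [(x + step, y), (x - step, y), (x, y + step), (x, y - step)]
    best = max(candidates, key=lambda p: fitness(*p))
    if fitness(*best) <= fitness(x, y):
        break  # no single-attribute step improves the fitness
    x, y = best

print((x, y))  # stalls at (-0.5, -0.5), still in the lower-left quadrant
print(fitness(x + step, y + step) > fitness(x, y))  # True: a joint step would still improve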

2. Central difference is expensive

The current implementation uses a central difference to approximate the gradient, which takes two samples, one from each side of the current point. While optimizing a single attribute, one of these two samples is likely cached from the previous step and does not have to be computed again. But when changing more than one attribute at a time (as I propose), the new samples are unlikely to be cached.

A forward/backward difference only needs one sample per attribute in addition to the current point.
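
For illustration, a sketch of both schemes for a fitness function of n attributes (the names and signatures are illustrative, not taken from the code base). The central difference costs 2n evaluations per gradient; the forward difference costs n plus the current point, which is usually already known:

def central_difference(f, point, steps):
    # Two evaluations per attribute: one sample on each side of the point
    grad = []
    for i, h in enumerate(steps):
        fwd = list(point); fwd[i] += h
        bwd = list(point); bwd[i] -= h
        grad.append((f(fwd) - f(bwd)) / (2 * h))
    return grad

def forward_difference(f, point, steps):
    # One evaluation per attribute, plus f(point), which is typically cached
    f0 = f(point)
    grad = []
    for i, h in enumerate(steps):
        fwd = list(point); fwd[i] += h
        grad.append((f(fwd) - f0) / h)
    return grad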

A proposal

The gradient at point (x, y) with the fitness function f could be computed by using the forward difference for each attribute independently (or by switching to the backward difference if a value would be out of bounds):

gradient_x = f(x + step_x, y         ) - f(x, y)
gradient_y = f(x         , y + step_y) - f(x, y)

New individuals can now be generated along the direction of the gradient by first normalizing the gradient and then adding multiples of the gradient to the current point.

# A scaling factor that normalizes the largest gradient component to one step
scale = 1.0 / max(abs(gradient_x), abs(gradient_y))

# Evaluate points at multiples of the gradient from the current point
for i in range(1, 10):
    new_fitness = f(
        x + round(i * gradient_x * scale) * step_x,
        y + round(i * gradient_y * scale) * step_y,
    )
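
Putting the pieces together, a sketch of the whole proposal for two attributes (the bounds handling, the helper names, and the number of candidates are my assumptions, not a definitive implementation):

def gradient(f, x, y, step_x, step_y, x_max, y_max):
    # Forward difference per attribute, falling back to the backward
    # difference where the probed value would be out of bounds
    f0 = f(x, y)
    gx = f(x + step_x, y) - f0 if x + step_x <= x_max else f0 - f(x - step_x, y)
    gy = f(x, y + step_y) - f0 if y + step_y <= y_max else f0 - f(x, y - step_y)
    return gx, gy

def candidates(x, y, gx, gy, step_x, step_y, n=10):
    # Points along the normalized gradient, in whole attribute steps
    largest = max(abs(gx), abs(gy))
    if largest == 0.0:
        return []  # flat in both directions, nowhere to go
    scale = 1.0 / largest
    return [
        (x + round(i * gx * scale) * step_x,
         y + round(i * gy * scale) * step_y)
        for i in range(1, n)
    ]

The candidate points would then be evaluated with f and kept or discarded by the existing selection step.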
stefansc1 (Contributor) commented

I see two main issues with your approach:

A forward difference makes assumptions about the backward difference

What if one attribute is already at an optimum? Consider the function f(x,y) = -x² + g(y) that shall be maximized. Assume a step size of 1 and that the optimum at x=0 has already been found, while y is still unknown. Following your proposal, the gradients are

gradient_x = f(x + step_x, y) - f(x,y) = f(1,y) - f(0,y) = (-(1²) + g(y)) - (-0² + g(y)) = -1
gradient_y = g(y + 1) - g(y)

Ignoring the scale, the next point would be at x = -1, y = whatever. But we already know that x=0 is the optimum! Because your proposal only checks one side (which in this case is worse), it makes an automatic assumption about the other side (namely, that it must be better, when in fact it isn't). All your new points will be worse than the one you started at.

The obvious solution is to compute the gradients using points on both sides of the current value, e.g. at +val_step and -val_step. This doubles the number of computations needed.
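
The counterexample can be checked numerically (the choice of g and the sample values are arbitrary picks for the demonstration):

def g(y):
    return 0.0  # any g will do; only the x attribute matters here

def f(x, y):
    return -x ** 2 + g(y)

x, y, step = 0.0, 3.0, 1.0  # x is already at its optimum

forward = f(x + step, y) - f(x, y)  # -1.0: wrongly points away from x = 0
central = (f(x + step, y) - f(x - step, y)) / 2.0  # 0.0: correctly flags the optimum
print(forward, central)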

Our problems are multi-objective

All the examples so far have had only one objective. Our optimization has two objectives. Therefore, you get two different gradients for each attribute. The current implementation sidesteps this problem by checking dominance between generations.

b-r-u (Member, Author) commented Sep 10, 2020

A forward difference makes assumptions about the backward difference

[...] All your new points will be worse than the one you started at.

You're right, I did not consider this case! When dealing with only one attribute, it's fine to move away from the optimum, because you will find out at the next step, and hopefully not too much time is wasted computing multiples in the wrong direction. But when dealing with multiple attributes at the same time, it's important that optimal dimensions have a gradient of zero.

Our problems are multi-objective

I'm not sure yet how to approach gradient ascent with multiple objectives. There is some research to be found when searching for "Multiple-Gradient Descent Algorithm". In my understanding, it would still be beneficial to find the optimum for each objective, as these points are guaranteed to be on the Pareto front.

The implementation could perform a gradient ascent for each objective and maybe generate a third gradient that is a random linear interpolation between these two gradients.
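
A sketch of that last idea (my reading of it, not MGDA proper; the names are assumptions):

import random

def interpolated_gradient(grad_a, grad_b):
    # Random convex combination of the two per-objective gradients
    t = random.random()  # interpolation factor in [0, 1]
    return [t * ga + (1.0 - t) * gb for ga, gb in zip(grad_a, grad_b)]

Candidates generated from all three directions could then be filtered by dominance, as the current implementation already does between generations.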
