Improve gradient ascent implementation #193

Open · b-r-u opened this issue Sep 2, 2020 · 2 comments

b-r-u commented Sep 2, 2020

There are two related issues with the current gradient ascent implementation.

1. Attributes are assumed to be independent

The documentation states that attributes are assumed to be independent, and thus they are optimized independently. But if two or more attributes are dependent, this approach will likely fail to find the local optimum.

Consider a function like f(x,y) = (x-y)^2 + (0.5*(x+y))^2. Starting from the lower-left quadrant, the current approach will end up somewhere in the lower-left quadrant as well and will not reach the optimum at x=0, y=0.

[Figure "contour": contour plot of f(x, y)]
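
To make this concrete, here is a minimal sketch (not the project's code) of axis-aligned hill climbing on this function. It maximizes -f so that gradient ascent applies; the step size and start point are arbitrary choices for the demonstration:

def f(x, y):
    return (x - y) ** 2 + (0.5 * (x + y)) ** 2

def fitness(x, y):
    # Maximize -f, so the optimum of the ascent is at (0, 0)
    return -f(x, y)

step = 0.5
x, y = -2.0, -2.0  # start in the lower-left quadrant

while True:
    # Each candidate changes only one attribute at a time,
    # as the current per-attribute approach does
    candidates = [(x + step, y), (x - step, y), (x, y + step), (x, y - step)]
    best = max(candidates, key=lambda p: fitness(*p))
    if fitness(*best) <= fitness(x, y):
        break  # no single-attribute step improves the fitness
    x, y = best

print((x, y))  # stalls at (-0.5, -0.5), still in the lower-left quadrant
print(fitness(x + step, y + step) > fitness(x, y))  # True: a joint step would still improve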

2. Central difference is expensive

The current implementation uses a central difference to approximate the gradient, which takes two samples, one from each side of the current point. While optimizing a single attribute, one of these two samples is likely cached from the previous step and does not have to be computed again. But when changing more than one attribute at a time (as I propose), the new samples are unlikely to be cached.

A forward/backward difference only needs one sample per attribute in addition to the current point.
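
For illustration, a sketch of both schemes for a fitness function of n attributes (the names and signatures are illustrative, not taken from the code base). The central difference costs 2n evaluations per gradient; the forward difference costs n plus the current point, which is usually already known:

def central_difference(f, point, steps):
    # Two evaluations per attribute: one sample on each side of the point
    grad = []
    for i, h in enumerate(steps):
        fwd = list(point); fwd[i] += h
        bwd = list(point); bwd[i] -= h
        grad.append((f(fwd) - f(bwd)) / (2 * h))
    return grad

def forward_difference(f, point, steps):
    # One evaluation per attribute, plus f(point), which is typically cached
    f0 = f(point)
    grad = []
    for i, h in enumerate(steps):
        fwd = list(point); fwd[i] += h
        grad.append((f(fwd) - f0) / h)
    return grad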

A proposal

The gradient at point (x, y) with the fitness function f could be computed by using the forward difference for each attribute independently (or by switching to the backward difference if a value would be out of bounds):

gradient_x = f(x + step_x, y         ) - f(x, y)
gradient_y = f(x         , y + step_y) - f(x, y)

New individuals can now be generated along the direction of the gradient by first normalizing the gradient and then adding multiples of the gradient to the current point.

# A scaling factor that normalizes the largest gradient component to one step
scale = 1.0 / max(abs(gradient_x), abs(gradient_y))

# Evaluate points at multiples of the gradient from the current point
for i in range(1, 10):
    new_fitness = f(
        x + round(i * gradient_x * scale) * step_x,
        y + round(i * gradient_y * scale) * step_y,
    )
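
Putting the pieces together, a sketch of the whole proposal for two attributes (the bounds handling, the helper names, and the number of candidates are my assumptions, not a definitive implementation):

def gradient(f, x, y, step_x, step_y, x_max, y_max):
    # Forward difference per attribute, falling back to the backward
    # difference where the probed value would be out of bounds
    f0 = f(x, y)
    gx = f(x + step_x, y) - f0 if x + step_x <= x_max else f0 - f(x - step_x, y)
    gy = f(x, y + step_y) - f0 if y + step_y <= y_max else f0 - f(x, y - step_y)
    return gx, gy

def candidates(x, y, gx, gy, step_x, step_y, n=10):
    # Points along the normalized gradient, in whole attribute steps
    largest = max(abs(gx), abs(gy))
    if largest == 0.0:
        return []  # flat in both directions, nowhere to go
    scale = 1.0 / largest
    return [
        (x + round(i * gx * scale) * step_x,
         y + round(i * gy * scale) * step_y)
        for i in range(1, n)
    ]

The candidate points would then be evaluated with f and kept or discarded by the existing selection step.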
stefansc1 (Contributor) commented

I see two main issues with your approach:

A forward difference makes assumptions about the backward difference

What if one attribute is already at an optimum? Consider the function f(x,y) = -x² + g(y) that shall be maximized. Assume a step size of 1 and that the optimum at x=0 has already been found, while y is still unknown. Following your proposal, the gradients are

gradient_x = f(x + step_x, y) - f(x,y) = f(1,y) - f(0,y) = (-(1²) + g(y)) - (-0² + g(y)) = -1
gradient_y = g(y + 1) - g(y)

Ignoring the scale, the next point would be at x = -1, y = whatever. But we already know that x=0 is the optimum! Because your proposal only checks one side (which in this case is worse), it makes an automatic assumption about the other side (namely, that it must be better, when in fact it isn't). All your new points will be worse than the one you started at.

The obvious solution is to compute the gradients using points on both sides of the current value, e.g. at +val_step and -val_step. This doubles the number of computations needed.
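
The counterexample can be checked numerically (the choice of g and the sample values are arbitrary picks for the demonstration):

def g(y):
    return 0.0  # any g will do; only the x attribute matters here

def f(x, y):
    return -x ** 2 + g(y)

x, y, step = 0.0, 3.0, 1.0  # x is already at its optimum

forward = f(x + step, y) - f(x, y)  # -1.0: wrongly points away from x = 0
central = (f(x + step, y) - f(x - step, y)) / 2.0  # 0.0: correctly flags the optimum
print(forward, central)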

Our problems are multi-objective

All the examples so far have had only one objective. Our optimization has two objectives. Therefore, you get two different gradients for each attribute. The current implementation sidesteps this problem by checking dominance between generations.

b-r-u (Member, Author) commented Sep 10, 2020

A forward difference makes assumptions about the backward difference

[...] All your new points will be worse than the one you started at.

You're right, I did not consider this case! When dealing with only one attribute, it's fine to move away from the optimum, because you will find out at the next step, and hopefully not too much time is wasted computing multiples in the wrong direction. But when dealing with multiple attributes at the same time, it's important that optimal dimensions have a gradient of zero.

Our problems are multi-objective

I'm not sure yet how to approach gradient ascent with multiple objectives. There is some research to be found when searching for "Multiple-Gradient Descent Algorithm". In my understanding, it would still be beneficial to find the optimum for each objective, as these points are guaranteed to be on the Pareto front.

The implementation could perform a gradient ascent for each objective and maybe generate a third gradient that is a random linear interpolation between these two gradients.
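
A sketch of that last idea (my reading of it, not MGDA proper; the names are assumptions):

import random

def interpolated_gradient(grad_a, grad_b):
    # Random convex combination of the two per-objective gradients
    t = random.random()  # interpolation factor in [0, 1]
    return [t * ga + (1.0 - t) * gb for ga, gb in zip(grad_a, grad_b)]

Candidates generated from all three directions could then be filtered by dominance, as the current implementation already does between generations.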
