
bug: SeaSolver timeout #2211

Closed
sunce86 opened this issue Dec 26, 2023 · 5 comments · Fixed by #2222 or #2234
Labels
bug Something isn't working

Comments

@sunce86
Contributor

sunce86 commented Dec 26, 2023

Problem

SeaSolver reported receiving a lower `time_limit` parameter than expected.
Right now the value usually varies between 9-11s, but sometimes time limits of only 6s or 7s come through.

One weird thing I noticed is this log:
https://production-6de61f.kb.eu-central-1.aws.cloud.es.io/app/discover#/?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-24h%2Fh,to:now))&_a=(columns:!(log),filters:!(),index:c0e240e0-d9b3-11ed-b0e6-e361adffce0b,interval:auto,query:(language:kuery,query:'log:%20%228209421%22%20and%20%22seasolver%22'),sort:!(!('@timestamp',desc)))

External solvers should get a 0.9 share (90%) of the available deadline for solving. SeaSolver timed out in this example, but was still left with 3.35s for postprocessing? In the best case that should be 15s * 0.1 = 1.5s.
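
For reference, a minimal sketch of the intended budget split (function name and values here are illustrative, not the actual driver code):

```rust
use std::time::Duration;

/// Splits the total deadline: the solving share goes to the solver
/// engine, the remainder stays with the driver for postprocessing.
fn split_deadline(total: Duration, solving_share: f64) -> (Duration, Duration) {
    let solving = total.mul_f64(solving_share);
    (solving, total - solving)
}

fn main() {
    let (solving, postprocessing) = split_deadline(Duration::from_secs(15), 0.9);
    // Prints: solving: 13.5s, postprocessing: 1.5s
    println!("solving: {solving:?}, postprocessing: {postprocessing:?}");
}
```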

Impact

External solvers get less time for solving and end up timing out.

sunce86 added the bug label Dec 26, 2023
@fleupold
Contributor

There is another 500ms network delay in the driver (code) + 1s network delay in the legacy solver (code), but yeah, things still look a bit off.
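
For a 15s total deadline that would come out at roughly 15s × 0.9 = 13.5s of solving budget, minus the 0.5s driver delay and the legacy solver's 1s buffer, i.e. about 12s — roughly consistent with the usual 9-11s, but not with the observed 6-7s limits.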

@fleupold
Contributor

It was an issue with our configuration.

@fleupold
Contributor

Turns out the issue still doesn't seem resolved.

fleupold reopened this Dec 28, 2023
fleupold added a commit that referenced this issue Dec 29, 2023
# Description
Despite increasing the auction-solving share, the average solving time
for external solvers didn't increase. This PR adds a debug log that
should allow us to further pin down where the issue lies.

# Changes

- [x] Print deadline when it is computed

## How to test
Observe log statement locally.

## Related Issues
#2211
fleupold added a commit that referenced this issue Dec 30, 2023
# Description
There seems to be an issue with the deadline we are passing down to
solvers leading to timeouts while still having significant time left for
postprocessing. Take Laertes in barn auction 8222857 for instance:

> 2023-12-29T09:37:47.818Z DEBUG
request{id="2710"}:/solve{solver=laertes-solve auction_id=8222857}:
driver::infra::observe: computed deadline deadline=Deadline { driver:
2023-12-29T09:38:02.090509218Z, solvers:
2023-12-29T**09:38:00**.663289792Z } timeouts=Timeouts { http_delay:
Duration { secs: 0, nanos: 500000000 }, solving_share_of_deadline:
Percent(0.9) }
> 2023-12-29T09:37:49.476Z TRACE request{id="2710"}:/solve:
solvers::api::routes::solve: auction=Auction { id: Solve(8222857), ...
deadline: Deadline(2023-12-29T**09:37:58**.702187297Z}
...
> 2023-12-29T09:37:58.738Z WARN
request{id="2710"}:/solve:auction{id=8222857}:
solvers::domain::solver::legacy: failed to solve auction err=**Timeout**
> 2023-12-29T09:37:58.739Z DEBUG
request{id="2710"}:/solve{solver=laertes-solve auction_id=8222857}:
driver::infra::observe: postprocessing solutions solutions=0
**remaining=3.351353404s**

Here we aborted solver computation at 09:37:58 despite the original
solver deadline (first log) being almost two seconds later (09:38:00).
We can see that the deadline received by the solver engine is already
much earlier than the one we computed in the driver. Looking at the
code, we expect a reduction of `http_delay` (0.5s), but not 2s.

One thing to note is that the way we pass down solver deadlines is
surprisingly 🤡. We convert the deadline into a duration only to later
convert it back into a timestamp. My hunch is that this causes us to
lose precision and thus time. This PR simplifies this logic in the hope
of resolving the 2s time loss.

# Changes
- [x] Remove the SolverTimeout type and argument being passed around in
favour of the timeouts that are already part of the auction
- [x] Return DateTime instead of durations for solver/driver deadlines
- [x] Move the `remaining` helper method into an extension trait so it
can be used where needed
- [x] Remove the 500ms http-delay reduction in the request to the
solver.

We already have a buffer for postprocessing in the driver, and really it
should be the consumer (the solver engine in this case) who adjusts the
deadline to take network latency into account. We do the same for the
autopilot<>driver deadline (the driver attempts to send its response
500ms before the deadline) and in fact already account for another
1s buffer inside the legacy solver-engine
([code](https://github.com/cowprotocol/services/blob/1a8261857a726ffa6180533dac548ff0a0b696be/crates/shared/src/http_solver.rs#L121-L126)).
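
A minimal sketch of the extension-trait idea (names assumed, not the exact driver API): the deadline stays an absolute `DateTime` and the remaining time is computed only at the point of use, avoiding the duration → timestamp round trip.

```rust
use chrono::{DateTime, Duration, Utc};

/// Hypothetical extension trait; the real helper lives in the driver crate.
trait DeadlineExt {
    /// Time left until the deadline, clamped to zero if it has already passed.
    fn remaining(&self) -> Duration;
}

impl DeadlineExt for DateTime<Utc> {
    fn remaining(&self) -> Duration {
        (*self - Utc::now()).max(Duration::zero())
    }
}
```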

## How to test
Existing tests

## Related Issues

Fixes #2211
@fleupold
Contributor

I think I found the issue. During prioritisation we share the entire auction struct across threads, which includes the deadline:

```rust
let fut = tokio::task::spawn_blocking(move || {
    let start = std::time::Instant::now();
    Self::sort(&mut auction);
    let mut balances =
        rt.block_on(async { Self::fetch_balances(&eth, &auction.orders).await });
    Self::filter_orders(&mut balances, &mut auction.orders, &eth);
    tracing::debug!(auction_id = new_id.0, time =? start.elapsed(), "auction preprocessing done");
    // Note: the *entire* auction, deadline included, is what gets shared.
    Arc::new(auction)
})
.map(|res| {
    res.expect(
        "Either runtime was shut down before spawning the task or no OS threads are \
         available; no sense in handling those errors",
    )
})
.boxed()
.shared();
```

Thus all auctions end up having the same deadline (the one of the first processed solver), which happens to be a solver with 80% solving time.
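
A minimal sketch of the resulting fix (illustrative types, not the actual driver code): only the expensive-to-compute order list is shared across requests, and each solver's auction is rebuilt with its own deadline.

```rust
use std::sync::Arc;
use futures::FutureExt;

#[derive(Clone)]
struct Order;
struct Deadline;

/// Built per solver request: shared orders + solver-specific deadline.
struct SolverAuction {
    orders: Arc<Vec<Order>>,
    deadline: Deadline,
}

#[tokio::main]
async fn main() {
    // Shared across all solver requests: only the prioritised orders.
    let orders = async { Arc::new(Vec::<Order>::new()) }.boxed().shared();

    // Each solver request gets its own deadline; only the orders are reused.
    let _auction = SolverAuction {
        orders: orders.clone().await,
        deadline: Deadline,
    };
}
```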

fleupold reopened this Dec 30, 2023
fleupold added a commit that referenced this issue Jan 3, 2024
# Description
In order to test the bugfix for #2211 using driver tests, we need to be
able to set up a driver that is connected to multiple solver engines.
Up until now, all tests assumed a single driver <> solver engine setup.

# Changes

- [x] Move solver-related configuration params (name, slippage, etc.)
into their own struct
- [x] Pass a `Vec<Solver>` into the driver instead of a single solver
(sketched below)
- [x] Expose methods to call `solve` for specific solvers
- [x] Leave the current single-solver setup as the default for
convenience, since most tests use it.
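
A rough sketch of what this looks like (names assumed, not the actual test harness API):

```rust
/// Solver-specific configuration, moved into its own struct.
struct Solver {
    name: String,
    slippage: f64,
}

/// The driver test setup now accepts several solver engines.
struct TestSetup {
    solvers: Vec<Solver>,
}

impl TestSetup {
    /// Single-solver setup stays the default for existing tests.
    fn with_solver(solver: Solver) -> Self {
        Self { solvers: vec![solver] }
    }

    fn with_solvers(solvers: Vec<Solver>) -> Self {
        Self { solvers }
    }

    /// Call `solve` for one specific solver by name.
    fn solve(&self, name: &str) {
        let _solver = self.solvers.iter().find(|s| s.name == name);
        // ... issue the request against this solver's endpoint ...
    }
}
```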

## How to test
This is the test

## Related Issues

Part of #2211
fleupold added a commit that referenced this issue Jan 4, 2024
# Description
For performance reason we use a future that is shared across requests
for different solver engine instances when prioritising orders. However,
this future returns the entire auction instead of just the list of
orders. This means that other aspects of the auction, such as deadline
are also shared across solver instances.

This leads to solver engines that are configured with a specific
deadline to sometimes receive auctions with deadlines that are
configured for a different solver engine.

# Changes
- [x] Only share list of orders in the future (not the entire auction)

## How to test
Added an integration test, which fails before this change.

## Related Issues

Fixes #2211
sunce86 pushed a commit that referenced this issue Jan 6, 2024
sunce86 pushed a commit that referenced this issue Jan 6, 2024
sunce86 pushed a commit that referenced this issue Jan 6, 2024