-
Notifications
You must be signed in to change notification settings - Fork 93
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bug: SeaSolver timeout #2211
Comments
It was an issue with our configuration. |
Turns out the issue still doesn't seem resolved. |
# Description Despite increasing the auction-solving share the average solving time for external solvers didn't increase. This PR adds a debug log that should allow us further pin down where the issue lies. # Changes <!-- List of detailed changes (how the change is accomplished) --> - [x] Print deadline when it is computed ## How to test Observe log statement locally. ## Related Issues #2211
# Description There seems to be an issue with the deadline we are passing down to solvers leading to timeouts while still having significant time left for postprocessing. Take Laertes in barn auction 8222857 for instance: > 2023-12-29T09:37:47.818Z DEBUG request{id="2710"}:/solve{solver=laertes-solve auction_id=8222857}: driver::infra::observe: computed deadline deadline=Deadline { driver: 2023-12-29T09:38:02.090509218Z, solvers: 2023-12-29T**09:38:00**.663289792Z } timeouts=Timeouts { http_delay: Duration { secs: 0, nanos: 500000000 }, solving_share_of_deadline: Percent(0.9) } > 2023-12-29T09:37:49.476Z TRACE request{id="2710"}:/solve: solvers::api::routes::solve: auction=Auction { id: Solve(8222857), ... deadline: Deadline(2023-12-29T**09:37:58**.702187297Z} ... > 2023-12-29T09:37:58.738Z WARN request{id="2710"}:/solve:auction{id=8222857}: solvers::domain::solver::legacy: failed to solve auction err=**Timeout** > 2023-12-29T09:37:58.739Z DEBUG request{id="2710"}:/solve{solver=laertes-solve auction_id=8222857}: driver::infra::observe: postprocessing solutions solutions=0 **remaining=3.351353404s** Here we aborted solver computation at 09:37:58 despite the original solver deadline (first log) being almost two seconds later (09:38:00). We can see that the deadline that is received by the solver engine is already much smaller than what we computed in the driver. Looking at the code we expect a reduction of `http_delay` (0.5s) but not 2s. One thing to note is that the way we pass down solver deadlines is surprisingly 🤡. We convert it into a duration to later on convert it into a timestamp again. My hunch is that this is causing us to lose precision and thus time. This PR simplifies this logic and hopes that this will resolve the 2s time loss. # Changes - [x] Remove SolverTimeout type and argument being passed around in favour of the timeouts that are already part of the auciton - [x] Return DateTime instead of durations for solver/driver deadlines - [x] Move `remaining` helper method into an extension trait to allow it being used where needed - [x] Remove the 500ms http-delay reduction in the request to the solver. We already have a buffer for postprocessing in the driver, and really it should be the consumer (solver engine in this case) who adjusts the deadline to take network latency into account. We do the same for the autopilot<>driver deadline (the driver attempts to send their response 500ms before the deadline) and in fact also already account for another 1s buffer inside the legacy solver-engine ([code](https://github.com/cowprotocol/services/blob/1a8261857a726ffa6180533dac548ff0a0b696be/crates/shared/src/http_solver.rs#L121-L126)) ## How to test Existing tests ## Related Issues Fixes #2211
I think I found the issue. During prioritisation we share the entire auction struct across threads, which includes the deadline services/crates/driver/src/domain/competition/auction.rs Lines 151 to 167 in ddeb9b9
Thus all auctions end up having the same deadline (the one of the first processed solver), which happens to be a solver with 80% solving time. |
# Description In order to test the bugfix for #2211 using driver tests we need to allow setting up a driver that is connected to multiple solver engines. Up until now all tests assumed a single driver <> solver engine setup. # Changes <!-- List of detailed changes (how the change is accomplished) --> - [x] Move solver related configuration params (name, slippage, etc) into its own struct - [x] Pass a `Vec<Solver>` into the driver instead of a single solver - [x] Expose methods to call `solve` for specific solvers - [x] Leave current single solver setup as default for convenience since most tests use this. ## How to test This is the test ## Related Issues Part of #2211
# Description For performance reason we use a future that is shared across requests for different solver engine instances when prioritising orders. However, this future returns the entire auction instead of just the list of orders. This means that other aspects of the auction, such as deadline are also shared across solver instances. This leads to solver engines that are configured with a specific deadline to sometimes receive auctions with deadlines that are configured for a different solver engine. # Changes - [x] Only share list of orders in the future (not the entire auction) ## How to test Added an integration test, which is failing before that change ## Related Issues Fixes #2211
# Description There seems to be an issue with the deadline we are passing down to solvers leading to timeouts while still having significant time left for postprocessing. Take Laertes in barn auction 8222857 for instance: > 2023-12-29T09:37:47.818Z DEBUG request{id="2710"}:/solve{solver=laertes-solve auction_id=8222857}: driver::infra::observe: computed deadline deadline=Deadline { driver: 2023-12-29T09:38:02.090509218Z, solvers: 2023-12-29T**09:38:00**.663289792Z } timeouts=Timeouts { http_delay: Duration { secs: 0, nanos: 500000000 }, solving_share_of_deadline: Percent(0.9) } > 2023-12-29T09:37:49.476Z TRACE request{id="2710"}:/solve: solvers::api::routes::solve: auction=Auction { id: Solve(8222857), ... deadline: Deadline(2023-12-29T**09:37:58**.702187297Z} ... > 2023-12-29T09:37:58.738Z WARN request{id="2710"}:/solve:auction{id=8222857}: solvers::domain::solver::legacy: failed to solve auction err=**Timeout** > 2023-12-29T09:37:58.739Z DEBUG request{id="2710"}:/solve{solver=laertes-solve auction_id=8222857}: driver::infra::observe: postprocessing solutions solutions=0 **remaining=3.351353404s** Here we aborted solver computation at 09:37:58 despite the original solver deadline (first log) being almost two seconds later (09:38:00). We can see that the deadline that is received by the solver engine is already much smaller than what we computed in the driver. Looking at the code we expect a reduction of `http_delay` (0.5s) but not 2s. One thing to note is that the way we pass down solver deadlines is surprisingly 🤡. We convert it into a duration to later on convert it into a timestamp again. My hunch is that this is causing us to lose precision and thus time. This PR simplifies this logic and hopes that this will resolve the 2s time loss. # Changes - [x] Remove SolverTimeout type and argument being passed around in favour of the timeouts that are already part of the auciton - [x] Return DateTime instead of durations for solver/driver deadlines - [x] Move `remaining` helper method into an extension trait to allow it being used where needed - [x] Remove the 500ms http-delay reduction in the request to the solver. We already have a buffer for postprocessing in the driver, and really it should be the consumer (solver engine in this case) who adjusts the deadline to take network latency into account. We do the same for the autopilot<>driver deadline (the driver attempts to send their response 500ms before the deadline) and in fact also already account for another 1s buffer inside the legacy solver-engine ([code](https://github.com/cowprotocol/services/blob/1a8261857a726ffa6180533dac548ff0a0b696be/crates/shared/src/http_solver.rs#L121-L126)) ## How to test Existing tests ## Related Issues Fixes #2211
# Description In order to test the bugfix for #2211 using driver tests we need to allow setting up a driver that is connected to multiple solver engines. Up until now all tests assumed a single driver <> solver engine setup. # Changes <!-- List of detailed changes (how the change is accomplished) --> - [x] Move solver related configuration params (name, slippage, etc) into its own struct - [x] Pass a `Vec<Solver>` into the driver instead of a single solver - [x] Expose methods to call `solve` for specific solvers - [x] Leave current single solver setup as default for convenience since most tests use this. ## How to test This is the test ## Related Issues Part of #2211
# Description For performance reason we use a future that is shared across requests for different solver engine instances when prioritising orders. However, this future returns the entire auction instead of just the list of orders. This means that other aspects of the auction, such as deadline are also shared across solver instances. This leads to solver engines that are configured with a specific deadline to sometimes receive auctions with deadlines that are configured for a different solver engine. # Changes - [x] Only share list of orders in the future (not the entire auction) ## How to test Added an integration test, which is failing before that change ## Related Issues Fixes #2211
Problem
SeaSolver reported receiving lower
time_limit
parameter.Right now, usually the value varies between 9-11s, while there are sometimes time limits of 7s or 6s.
One weird thing I noticed is this log:
https://production-6de61f.kb.eu-central-1.aws.cloud.es.io/app/discover#/?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-24h%2Fh,to:now))&_a=(columns:!(log),filters:!(),index:c0e240e0-d9b3-11ed-b0e6-e361adffce0b,interval:auto,query:(language:kuery,query:'log:%20%228209421%22%20and%20%22seasolver%22'),sort:!(!('@timestamp',desc)))
External solvers should allocate 0.9 percent of available deadline for solving. SeaSolver timed out in this example, but still was left with 3.35s for the postprocessing? In the best case, it should be 15s * 0.1s = 1.5s.
Impact
External solvers less time for solving and timing out.
The text was updated successfully, but these errors were encountered: