
bug: SeaSolver timeout #2211

Closed
sunce86 opened this issue Dec 26, 2023 · 5 comments · Fixed by #2222 or #2234
Labels
bug Something isn't working

Comments

@sunce86
Contributor

sunce86 commented Dec 26, 2023

Problem

SeaSolver reported receiving a lower `time_limit` parameter than expected.
Right now the value usually varies between 9-11s, but sometimes time limits of only 6s or 7s come through.

One weird thing I noticed is this log:
https://production-6de61f.kb.eu-central-1.aws.cloud.es.io/app/discover#/?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-24h%2Fh,to:now))&_a=(columns:!(log),filters:!(),index:c0e240e0-d9b3-11ed-b0e6-e361adffce0b,interval:auto,query:(language:kuery,query:'log:%20%228209421%22%20and%20%22seasolver%22'),sort:!(!('@timestamp',desc)))

External solvers should get a 0.9 share (90%) of the available deadline for solving. SeaSolver timed out in this example, but was still left with 3.35s for postprocessing? In the best case that should be 15s * 0.1 = 1.5s.
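
For reference, a minimal sketch of the intended budget split (function name and values here are illustrative, not the actual driver code):

```rust
use std::time::Duration;

/// Splits the total deadline: the solving share goes to the solver
/// engine, the remainder stays with the driver for postprocessing.
fn split_deadline(total: Duration, solving_share: f64) -> (Duration, Duration) {
    let solving = total.mul_f64(solving_share);
    (solving, total - solving)
}

fn main() {
    let (solving, postprocessing) = split_deadline(Duration::from_secs(15), 0.9);
    // Prints: solving: 13.5s, postprocessing: 1.5s
    println!("solving: {solving:?}, postprocessing: {postprocessing:?}");
}
```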

Impact

External solvers get less time for solving and end up timing out.

sunce86 added the bug label Dec 26, 2023
@fleupold
Contributor

There is another 500ms network delay in the driver (code) + 1s network delay in the legacy solver (code), but yeah, things still look a bit off.
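
For a 15s total deadline that would come out at roughly 15s × 0.9 = 13.5s of solving budget, minus the 0.5s driver delay and the legacy solver's 1s buffer, i.e. about 12s — roughly consistent with the usual 9-11s, but not with the observed 6-7s limits.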

@fleupold
Contributor

It was an issue with our configuration.

@fleupold
Contributor

Turns out the issue still doesn't seem resolved.

fleupold reopened this Dec 28, 2023
fleupold added a commit that referenced this issue Dec 29, 2023
# Description
Despite increasing the auction-solving share, the average solving time
for external solvers didn't increase. This PR adds a debug log that
should allow us to further pin down where the issue lies.

# Changes

- [x] Print deadline when it is computed

## How to test
Observe log statement locally.

## Related Issues
#2211
fleupold added a commit that referenced this issue Dec 30, 2023
# Description
There seems to be an issue with the deadline we are passing down to
solvers leading to timeouts while still having significant time left for
postprocessing. Take Laertes in barn auction 8222857 for instance:

> 2023-12-29T09:37:47.818Z DEBUG
request{id="2710"}:/solve{solver=laertes-solve auction_id=8222857}:
driver::infra::observe: computed deadline deadline=Deadline { driver:
2023-12-29T09:38:02.090509218Z, solvers:
2023-12-29T**09:38:00**.663289792Z } timeouts=Timeouts { http_delay:
Duration { secs: 0, nanos: 500000000 }, solving_share_of_deadline:
Percent(0.9) }
> 2023-12-29T09:37:49.476Z TRACE request{id="2710"}:/solve:
solvers::api::routes::solve: auction=Auction { id: Solve(8222857), ...
deadline: Deadline(2023-12-29T**09:37:58**.702187297Z}
...
> 2023-12-29T09:37:58.738Z WARN
request{id="2710"}:/solve:auction{id=8222857}:
solvers::domain::solver::legacy: failed to solve auction err=**Timeout**
> 2023-12-29T09:37:58.739Z DEBUG
request{id="2710"}:/solve{solver=laertes-solve auction_id=8222857}:
driver::infra::observe: postprocessing solutions solutions=0
**remaining=3.351353404s**

Here we aborted solver computation at 09:37:58 despite the original
solver deadline (first log) being almost two seconds later (09:38:00).
We can see that the deadline received by the solver engine is already
much earlier than the one we computed in the driver. Looking at the
code, we expect a reduction of `http_delay` (0.5s), but not 2s.

One thing to note is that the way we pass down solver deadlines is
surprisingly 🤡. We convert the deadline into a duration only to later
convert it back into a timestamp. My hunch is that this causes us to
lose precision and thus time. This PR simplifies this logic in the hope
of resolving the 2s time loss.

# Changes
- [x] Remove the SolverTimeout type and argument being passed around in
favour of the timeouts that are already part of the auction
- [x] Return DateTime instead of durations for solver/driver deadlines
- [x] Move the `remaining` helper method into an extension trait so it
can be used where needed
- [x] Remove the 500ms http-delay reduction in the request to the
solver.

We already have a buffer for postprocessing in the driver, and really it
should be the consumer (the solver engine in this case) who adjusts the
deadline to take network latency into account. We do the same for the
autopilot<>driver deadline (the driver attempts to send its response
500ms before the deadline) and in fact already account for another
1s buffer inside the legacy solver-engine
([code](https://github.com/cowprotocol/services/blob/1a8261857a726ffa6180533dac548ff0a0b696be/crates/shared/src/http_solver.rs#L121-L126)).
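
A minimal sketch of the extension-trait idea (names assumed, not the exact driver API): the deadline stays an absolute `DateTime` and the remaining time is computed only at the point of use, avoiding the duration → timestamp round trip.

```rust
use chrono::{DateTime, Duration, Utc};

/// Hypothetical extension trait; the real helper lives in the driver crate.
trait DeadlineExt {
    /// Time left until the deadline, clamped to zero if it has already passed.
    fn remaining(&self) -> Duration;
}

impl DeadlineExt for DateTime<Utc> {
    fn remaining(&self) -> Duration {
        (*self - Utc::now()).max(Duration::zero())
    }
}
```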

## How to test
Existing tests

## Related Issues

Fixes #2211
@fleupold
Contributor

I think I found the issue. During prioritisation we share the entire auction struct across threads, which includes the deadline:

```rust
let fut = tokio::task::spawn_blocking(move || {
    let start = std::time::Instant::now();
    Self::sort(&mut auction);
    let mut balances =
        rt.block_on(async { Self::fetch_balances(&eth, &auction.orders).await });
    Self::filter_orders(&mut balances, &mut auction.orders, &eth);
    tracing::debug!(auction_id = new_id.0, time =? start.elapsed(), "auction preprocessing done");
    // Note: the *entire* auction, deadline included, is what gets shared.
    Arc::new(auction)
})
.map(|res| {
    res.expect(
        "Either runtime was shut down before spawning the task or no OS threads are \
         available; no sense in handling those errors",
    )
})
.boxed()
.shared();
```

Thus all auctions end up having the same deadline (the one of the first processed solver), which happens to be a solver with 80% solving time.
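
A minimal sketch of the resulting fix (illustrative types, not the actual driver code): only the expensive-to-compute order list is shared across requests, and each solver's auction is rebuilt with its own deadline.

```rust
use std::sync::Arc;
use futures::FutureExt;

#[derive(Clone)]
struct Order;
struct Deadline;

/// Built per solver request: shared orders + solver-specific deadline.
struct SolverAuction {
    orders: Arc<Vec<Order>>,
    deadline: Deadline,
}

#[tokio::main]
async fn main() {
    // Shared across all solver requests: only the prioritised orders.
    let orders = async { Arc::new(Vec::<Order>::new()) }.boxed().shared();

    // Each solver request gets its own deadline; only the orders are reused.
    let _auction = SolverAuction {
        orders: orders.clone().await,
        deadline: Deadline,
    };
}
```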

fleupold reopened this Dec 30, 2023
fleupold added a commit that referenced this issue Jan 3, 2024
# Description
In order to test the bugfix for #2211 using driver tests, we need to be
able to set up a driver that is connected to multiple solver engines.
Up until now, all tests assumed a single driver <> solver engine setup.

# Changes

- [x] Move solver-related configuration params (name, slippage, etc.)
into their own struct
- [x] Pass a `Vec<Solver>` into the driver instead of a single solver
(sketched below)
- [x] Expose methods to call `solve` for specific solvers
- [x] Leave the current single-solver setup as the default for
convenience, since most tests use it.
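
A rough sketch of what this looks like (names assumed, not the actual test harness API):

```rust
/// Solver-specific configuration, moved into its own struct.
struct Solver {
    name: String,
    slippage: f64,
}

/// The driver test setup now accepts several solver engines.
struct TestSetup {
    solvers: Vec<Solver>,
}

impl TestSetup {
    /// Single-solver setup stays the default for existing tests.
    fn with_solver(solver: Solver) -> Self {
        Self { solvers: vec![solver] }
    }

    fn with_solvers(solvers: Vec<Solver>) -> Self {
        Self { solvers }
    }

    /// Call `solve` for one specific solver by name.
    fn solve(&self, name: &str) {
        let _solver = self.solvers.iter().find(|s| s.name == name);
        // ... issue the request against this solver's endpoint ...
    }
}
```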

## How to test
This is the test

## Related Issues

Part of #2211
fleupold added a commit that referenced this issue Jan 4, 2024
# Description
For performance reason we use a future that is shared across requests
for different solver engine instances when prioritising orders. However,
this future returns the entire auction instead of just the list of
orders. This means that other aspects of the auction, such as deadline
are also shared across solver instances.

This leads to solver engines that are configured with a specific
deadline to sometimes receive auctions with deadlines that are
configured for a different solver engine.

# Changes
- [x] Only share list of orders in the future (not the entire auction)

## How to test
Added an integration test, which fails before this change.

## Related Issues

Fixes #2211
sunce86 pushed a commit that referenced this issue Jan 6, 2024
sunce86 pushed a commit that referenced this issue Jan 6, 2024
sunce86 pushed a commit that referenced this issue Jan 6, 2024