Improve performance in CPU-bound programs #622
We've also faced this problem. While testing fairness of large data stream processing with httpaf, we observed that one stream hogs all of the processing power and all other streams just time out, probably because data arrives at a high rate and the read loop keeps reading from the socket without ever blocking (and thus without invoking the event loop). We just added […]. We're thinking about yielding once per some fixed number of bytes read, but that looks a bit weird to solve at the application level. This issue has the libuv milestone; does that mean that implementing some heuristic within current Lwt is not considered viable? Does an application-level workaround have any drawbacks compared to a heuristic within Lwt itself?
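For reference, the application-level workaround looks roughly like this (a minimal sketch, assuming a plain Lwt_unix read loop; `threshold` and `on_chunk` are illustrative names, not part of httpaf or Lwt):

```ocaml
(* Sketch of the application-level workaround described above: yield
   with Lwt.pause once per [threshold] bytes read, so one busy socket
   cannot starve every other connection. *)
let read_loop fd buf ~threshold ~on_chunk =
  let open Lwt.Infix in
  let rec loop bytes_since_yield =
    Lwt_unix.read fd buf 0 (Bytes.length buf) >>= fun n ->
    if n = 0 then Lwt.return_unit (* EOF *)
    else begin
      on_chunk (Bytes.sub buf 0 n);
      let total = bytes_since_yield + n in
      if total >= threshold then
        (* Give the scheduler a chance to run other promises. *)
        Lwt.pause () >>= fun () -> loop 0
      else loop total
    end
  in
  loop 0
```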
You may also find lwt/src/unix/lwt_unix.cppo.mli, lines 58 to 65 (at commit 336566d), helpful.
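Those lines appear to cover `Lwt_unix.auto_yield` (an assumption, based on the yield-interval discussion below). A minimal usage sketch, with `consume` and `process` as illustrative names:

```ocaml
(* Assuming the referenced lines document Lwt_unix.auto_yield:
   [Lwt_unix.auto_yield d] returns a function that actually yields at
   most once per [d] seconds and otherwise returns an already-resolved
   promise, so it is cheap to call in a tight loop. *)
let maybe_yield = Lwt_unix.auto_yield 0.05

let rec consume ~process stream =
  let open Lwt.Infix in
  maybe_yield () >>= fun () ->
  Lwt_stream.get stream >>= function
  | None -> Lwt.return_unit
  | Some chunk -> process chunk; consume ~process stream
```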
This is indeed best solved inside the scheduler. The only reason for the libuv milestone is that until now, the only places where I had observed this issue, and where it definitely required an in-library fix, were related to some libuv work I was doing (in repromise_lwt and luv). Do you have time to work on this in Lwt? If not, could you share your test/benchmark so I can use it to measure the effects of various approaches when I work on this (slightly later)?
I'll try using […]. As for the benchmark, basically we just replicate the httpaf streaming example, patched so that the next read is only scheduled once the previous write has been flushed:

```diff
--- a/examples/lib/httpaf_examples.ml
+++ b/examples/lib/httpaf_examples.ml
@@ -39,8 +39,9 @@ module Server = struct
     let request_body = Reqd.request_body reqd in
     let response_body = Reqd.respond_with_streaming reqd response in
     let rec on_read buffer ~off ~len =
-      Body.write_bigstring response_body buffer ~off ~len;
-      Body.schedule_read request_body ~on_eof ~on_read;
+      Body.schedule_bigstring response_body buffer ~off ~len;
+      Body.flush response_body (fun () ->
+        Body.schedule_read request_body ~on_eof ~on_read);
     and on_eof () =
       Body.close_writer response_body
     in
```

Clients are simple curl launches like this: […]. You might need to add a header specifying a chunked-encoding response, but it should work without it as well. I might find some time in the future to try implementing this in Lwt, depending on how well the approach above mitigates the issue.
Looks like […]. But changing the yield interval to every X bytes read/written works better! Yielding every megabyte gives nearly the same performance as no yields at all, while multiple streams share the bandwidth fairly.
Ok, that's good :)
@Lupus, I guess another library solution to your case would be to add a variant of `Lwt.pause` that yields only to other paused ("CPU") promises, without forcing an event-loop iteration.
Yeah, that should probably work as well. When all of your sockets always have data to read, you don't need an event-loop iteration :) On the other hand, when a slow connection is active alongside a fast one, there won't be any fairness in that scenario: the fast one will hog the CPU if it uses the "yield only to the other CPU guys" strategy.
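For concreteness, a hypothetical sketch of such a pause variant. Note that this amortizes, rather than fully skips, the event-loop iteration: one `Lwt.pause` wakes a whole batch of paused CPU tasks, so N tasks cost one scheduler round instead of N. None of these names are Lwt APIs:

```ocaml
(* Hypothetical "yield only to other CPU promises": paused computations
   park in a queue, and a single drainer wakes the whole batch at once. *)
let waiting : unit Lwt.u Queue.t = Queue.create ()

let pause_cpu () =
  let p, wakener = Lwt.wait () in
  Queue.push wakener waiting;
  p

(* Run exactly one instance of this alongside the application. A real
   implementation would park the drainer while the queue is empty. *)
let rec drainer () =
  let open Lwt.Infix in
  Lwt.pause () >>= fun () ->
  let batch = Queue.length waiting in
  for _ = 1 to batch do
    Lwt.wakeup_later (Queue.pop waiting) ()
  done;
  drainer ()
```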
Some CPU-bound Lwt programs call `Lwt.pause ()` a lot, in order to yield to any potential I/O that may be pending. However, doing this results in Lwt calling `select`/`kevent`/`epoll` at the rate that `Lwt.pause ()` is called. This forces the Lwt user to worry about Lwt implementation details, and to think about how often they are calling `Lwt.pause ()`.

We should probably have Lwt adapt automatically by checking whether the last I/O poll actually resolved any promises and, if not, skipping calls to `select`, etc., on future scheduler iterations, with some kind of limited exponential backoff, or a similar scheme.

See https://discuss.ocaml.org/t/2567/5. This will also improve performance of CPU-bound Repromise programs.
cc @kennetpostigo