
Improve performance in CPU-bound programs #622

Open · aantron opened this issue Sep 19, 2018 · 8 comments

aantron (Collaborator) commented Sep 19, 2018

Some CPU-bound Lwt programs call Lwt.pause () a lot, in order to yield to any potential I/O that may be pending.

However, doing this results in Lwt calling select/kevent/epoll at the rate at which Lwt.pause () is called. This forces the Lwt user to worry about Lwt implementation details, and to think about how often they are calling Lwt.pause ().

We should probably have Lwt adapt automatically: check whether the last I/O poll actually resolved any promises and, if not, skip the calls to select, etc., on future scheduler iterations, with some kind of capped exponential backoff or a similar scheme.
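
Roughly, the idea is something like the sketch below (illustrative only, not actual Lwt scheduler code; here, poll stands in for the engine's select/epoll step and returns whether it resolved any promises):

let max_backoff = 64

let backoff = ref 0   (* current backoff length, in scheduler iterations *)
let to_skip = ref 0   (* iterations remaining before the next poll *)

let maybe_poll poll =
  if !to_skip > 0 then
    (* Still backing off: skip the syscall on this iteration. *)
    decr to_skip
  else if poll () then
    (* The poll resolved promises: I/O is active, so poll every iteration. *)
    backoff := 0
  else begin
    (* Nothing resolved: double the backoff, up to a cap. *)
    backoff := min max_backoff (max 1 (2 * !backoff));
    to_skip := !backoff
  end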

See https://discuss.ocaml.org/t/2567/5. This will also improve performance of CPU-bound Repromise programs.

cc @kennetpostigo

Lupus commented Oct 14, 2019

We've also faced this problem. While testing the fairness of large data stream processing with httpaf, we observed that one stream hogs all of the processing power and all other streams just time out.

This is probably because data arrives at a high rate, so the read loop keeps reading from the socket without ever blocking (and thus without ever entering the event loop).

We just added Lwt_main.yield () between read loop steps, but that resulted in an insane rate of calls to epoll and, as a result, a tremendous slowdown.

We're thinking about yielding once per certain number of bytes read, but solving that at the application level looks a bit weird. This issue has the libuv milestone; does that mean that implementing some heuristic within the current Lwt is not considered viable? Does the application-level workaround have any drawbacks compared to a heuristic within Lwt itself?

aantron (Collaborator, Author) commented Oct 14, 2019

You may also find Lwt_unix.auto_yield useful, with current Lwt:

val auto_yield : float -> (unit -> unit Lwt.t)
(** [auto_yield timeout] returns a function [f], and [f ()] has the following
    behavior:
    - If it has been more than [timeout] seconds since the last time [f ()]
      behaved like {!Lwt_unix.yield}, [f ()] calls {!Lwt_unix.yield}.
    - Otherwise, if it has been less than [timeout] seconds, [f ()] behaves
      like {!Lwt.return_unit}, i.e. it does not yield. *)
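
For example, a read loop could use it like this (a hypothetical sketch; read_chunk and process stand in for your application code):

let yield_if_needed = Lwt_unix.auto_yield 0.05

let rec read_loop socket =
  let open Lwt.Infix in
  yield_if_needed () >>= fun () ->
  read_chunk socket >>= function
  | None -> Lwt.return_unit   (* end of stream *)
  | Some chunk -> process chunk >>= fun () -> read_loop socket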

This is indeed best solved inside the scheduler. The only reason for the libuv milestone is that, until now, the only places where I had observed this issue that definitely required an in-library fix were related to some libuv work I was doing (in repromise_lwt and luv).

Do you have time to work on this in Lwt? If not, could you share your test/benchmark, so I can use it to measure the effects of various approaches when I work on this (slightly later)?

aantron modified the milestones: libuv, 4.5.0 (Oct 14, 2019)
Lupus commented Oct 14, 2019

I'll try using auto_yield; it looks like it doesn't depend on any Lwt internals, so I can just embed it in my service directly. So far it looks like it's going to solve the issue with unfair streams.

As for the benchmark, we basically replicate the lwt_echo_post.ml example from httpaf in our service. The httpaf example itself should be sufficient to illustrate the issue. I recommend modifying it as below to avoid excessive buffering (see httpaf/139 for more context):

--- a/examples/lib/httpaf_examples.ml
+++ b/examples/lib/httpaf_examples.ml
@@ -39,8 +39,9 @@ module Server = struct
       let request_body  = Reqd.request_body reqd in
       let response_body = Reqd.respond_with_streaming reqd response in
       let rec on_read buffer ~off ~len =
-        Body.write_bigstring response_body buffer ~off ~len;
-        Body.schedule_read request_body ~on_eof ~on_read;
+        Body.schedule_bigstring response_body buffer ~off ~len;
+        Body.flush response_body (fun () ->
+        Body.schedule_read request_body ~on_eof ~on_read);
       and on_eof () =
         Body.close_writer response_body
       in

Clients are simple curl launches like this:

dd if=/dev/zero bs=1M | curl -XPOST -T - http://127.0.0.1:8080/ -o /dev/null

You might need to add a header specifying a chunked-encoding response, but it should work without it as well.

I might find some time in the future to try implementing this in Lwt, depending on the mitigation effect of auto_yield :)

Lupus commented Oct 14, 2019

Looks like auto_yield 0.05 does not cut it so far... I'll try lower values, but it is already starting to hurt performance.

[kolkhovskiy@home ~]$ dd if=/dev/zero bs=1M | curl -XPOST -T - http://127.0.0.1:8080/echo -o /dev/null 
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 12.1G    0 6230M    0 6242M   115M   115M --:--:--  0:00:54 --:--:--  176M

[kolkhovskiy@home ~]$ dd if=/dev/zero bs=1M | curl -XPOST -T - http://127.0.0.1:8080/echo -o /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 9370M    0 4679M    0 4690M  75.6M  75.7M --:--:--  0:01:01 --:--:--  103M

[kolkhovskiy@home ~]$ dd if=/dev/zero bs=1M | curl -XPOST -T - http://127.0.0.1:8080/echo -o /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 26.4G    0 13.2G    0 13.2G   111M   111M --:--:--  0:02:01 --:--:-- 34.0M

[kolkhovskiy@home ~]$ dd if=/dev/zero bs=1M | curl -XPOST -T - http://127.0.0.1:8080/echo -o /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 45.8G    0 22.9G    0 22.9G   153M   153M --:--:--  0:02:32 --:--:-- 6608k

Lupus commented Oct 14, 2019

But changing the yield interval to every X bytes read/written works better! Yielding every megabyte gives nearly the same performance as no yields at all, while multiple streams share the bandwidth fairly.
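
For reference, here is a minimal sketch of what we mean by byte-based yielding (names and the one-megabyte threshold are illustrative, not our actual service code):

let yield_threshold = 1_048_576   (* yield roughly once per megabyte *)

let bytes_since_yield = ref 0

let maybe_yield n_bytes_read =
  bytes_since_yield := !bytes_since_yield + n_bytes_read;
  if !bytes_since_yield >= yield_threshold then begin
    bytes_since_yield := 0;
    Lwt_main.yield ()   (* give other streams a chance to run *)
  end else
    Lwt.return_unit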

aantron (Collaborator, Author) commented Oct 14, 2019

Ok that's good :)

aantron (Collaborator, Author) commented Oct 14, 2019

@Lupus, I guess another library-level solution to your case would be to add a variant of yield that yields only to other callbacks (CPU) and not to I/O. But before adding something like that, I would want to see whether there is a generic solution that addresses all these cases, some variant of what I described in the main comment of this issue.
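
Something like that could even be prototyped outside the library; a rough sketch (hypothetical code, not an existing Lwt API, and a real version would live inside the scheduler):

(* A yield that only hands control to other queued CPU callbacks,
   without ever triggering an I/O poll. *)
let queue : unit Lwt.u Queue.t = Queue.create ()

let cpu_yield () =
  let p, w = Lwt.wait () in
  Queue.push w queue;
  p

(* Resume queued CPU tasks in FIFO order until none re-enqueue
   themselves. Note that I/O is never polled while draining. *)
let drain () =
  while not (Queue.is_empty queue) do
    Lwt.wakeup (Queue.pop queue) ()
  done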

Lupus commented Oct 18, 2019

a variant of yield that yields only to other callbacks (CPU) and not I/O

Yeah, that should probably work as well. When all of your sockets always have data to read, you don't need the event loop iteration :) On the other hand, when a slow connection is active alongside a fast one, there won't be any fairness in this scenario: the fast one will hog the CPU if it uses the "yield only to other CPU callbacks" strategy.
