Refactor dequeue to return option #51
base: master
Conversation
Like the httpaf PR, this one introduces allocation, which needs to be avoided in order to reduce garbage collection. An approach that would avoid both the allocation and the try/with would be to check whether the queue is empty before calling dequeue_exn.
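For illustration, a minimal sketch of that shape using the stdlib Queue (faraday's Deque exposes the analogous is_empty/dequeue_exn pair); the names below are not from this PR:

```ocaml
(* Sketch only: guard with is_empty so the dequeue can neither raise
   nor need to allocate an option. *)
let drain (q : int Queue.t) (handle : int -> unit) =
  while not (Queue.is_empty q) do
    (* Queue.pop raises Queue.Empty, but the guard makes that unreachable. *)
    handle (Queue.pop q)
  done
```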
```diff
@@ -51,7 +49,7 @@ module Deque(T:sig type t val sentinel : t end) : sig
   val is_empty : t -> bool
 
   val enqueue : elem -> t -> unit
-  val dequeue_exn : t -> elem
+  val dequeue : t -> elem option
```
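A hedged sketch of what a caller looks like under the option-returning signature (the generic `dequeue` and `drain` names here are assumptions for illustration, not code from this PR):

```ocaml
(* Assumed signature: [dequeue : t -> elem option]. Matching on the
   result replaces the try/with around [dequeue_exn]. *)
let rec drain dequeue t f =
  match dequeue t with
  | None -> ()
  | Some x -> f x; drain dequeue t f
```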
Using options will cause allocation, which is what the use of exceptions was avoiding.
Hm, isn't that option nearly guaranteed not to escape the minor heap? In that case the allocation costs only about as much as an addition to a machine word in memory (which will likely be in CPU cache, if not in a register). A minor-heap GC just moves that pointer back; promoting live objects to the major heap is the most expensive part. But maybe I'm just plain wrong on this, as I'm not an expert on the OCaml GC. Explicit checks for queue lengths would look a bit ugly... Do you have some way of profiling this? I'm not experienced with the proper tools for profiling OCaml code either.
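For reference, the option-returning dequeue under discussion presumably looks along these lines (a sketch inferred from the signature change above, not the exact PR code); the `Some` cell is the per-call minor-heap allocation in question:

```ocaml
let dequeue t =
  if is_empty t then None
  else begin
    let result = Array.unsafe_get t.elements t.front in
    Array.unsafe_set t.elements t.front sentinel;
    t.front <- t.front + 1;
    Some result  (* one small block on the minor heap per call *)
  end
```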
I've run some benchmarks with httpaf (while troubleshooting httpaf/139), and it looks like this PR does not introduce a slowdown but actually improves performance a bit. It also fixes the stack overflow described in the linked httpaf issue. Here are some numbers. Each run is a curl invocation performing a POST of a large file to the lwt_echo_post example from the httpaf repository:
Development and release correspond to the dune profiles used to build the examples.
I may be misunderstanding your column labels and the units being reported. But assuming that "patched" is this PR and the numbers are latencies, it looks like the patched version is slightly slower. In any case, the point of avoiding allocation is not to avoid the cost of allocation but to avoid the cost of garbage collection. The only way to collect less is to generate less garbage, which was my explicit goal when writing the library.
Sorry for the poorly presented benchmark results; the columns are megabytes per second, as reported in curl's output (columns "Average Dload" and "Speed Upload"). "patched" is this PR, "vanilla" is e82d6a0. I've added printing of GC stats at server exit and compared the numbers; the option-based version indeed increases the amount of work for the GC. I've implemented the approach that you suggested as follows:

```diff
diff --git a/lib/faraday.ml b/lib/faraday.ml
index f25db14..4429e7a 100644
--- a/lib/faraday.ml
+++ b/lib/faraday.ml
@@ -39,8 +39,6 @@ type 'a iovec =
   ; off : int
   ; len : int }
 
-exception Dequeue_empty
-
 module Deque(T:sig type t val sentinel : t end) : sig
   type elem = T.t
 
@@ -95,13 +93,10 @@ end = struct
     t.back <- t.back + 1
 
   let dequeue_exn t =
-    if is_empty t then
-      raise Dequeue_empty
-    else
-      let result = Array.unsafe_get t.elements t.front in
-      Array.unsafe_set t.elements t.front sentinel;
-      t.front <- t.front + 1;
-      result
+    let result = Array.unsafe_get t.elements t.front in
+    Array.unsafe_set t.elements t.front sentinel;
+    t.front <- t.front + 1;
+    result
 
   let enqueue_front e t =
     (* This is in general not true for Deque data structures, but the usage
@@ -367,21 +362,22 @@ let yield t =
   t.yield <- true
 
 let rec shift_buffers t written =
-  try
+  if not (Buffers.is_empty t.scheduled) then begin
     let { len; _ } as iovec = Buffers.dequeue_exn t.scheduled in
     if len <= written then begin
       shift_buffers t (written - len)
     end else
       Buffers.enqueue_front (IOVec.shift iovec written) t.scheduled
-  with Dequeue_empty ->
+  end else begin
     assert (written = 0);
     if t.scheduled_pos = t.write_pos then begin
       t.scheduled_pos <- 0;
       t.write_pos <- 0
     end
+  end
 
 let rec shift_flushes t =
-  try
+  if not (Flushes.is_empty t.flushed) then begin
     let (threshold, f) as flush = Flushes.dequeue_exn t.flushed in
     (* Edited notes from @dinosaure:
      *
@@ -401,8 +397,7 @@ let rec shift_flushes t =
     if t.bytes_written - min_int >= threshold - min_int
     then begin f (); shift_flushes t end
     else Flushes.enqueue_front flush t.flushed
-  with Dequeue_empty ->
-    ()
+  end else ()
 
 let shift t written =
   shift_buffers t written;
```

Then I ran 3 tests with wrk to measure latencies. Below are the outputs of wrk, followed by the GC stats output.

Vanilla version
Patched (option)
Patched (is_empty call)
I'm not sure why, but the direct is_empty check is the slowest, although it generates the least amount of work for the GC.
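For context, the GC-stat printing mentioned above can be as simple as the following (a hypothetical sketch, not the exact code used in these runs):

```ocaml
(* Print a few Gc.stat counters when the server process exits. *)
let () =
  at_exit (fun () ->
    let s = Gc.stat () in
    Printf.eprintf
      "minor_words: %.0f  promoted_words: %.0f  major_words: %.0f  \
       minor_collections: %d  major_collections: %d\n"
      s.Gc.minor_words s.Gc.promoted_words s.Gc.major_words
      s.Gc.minor_collections s.Gc.major_collections)
```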
When compiling to JS (via js_of_ocaml), local exceptions used for control flow become very expensive (in V8 at least). This PR tries to alleviate some of the places where functions with local exceptions appeared at the top of the Node.js profile.
See #835 in the js_of_ocaml repo.
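To illustrate the pattern in question (a generic example, not code from faraday): js_of_ocaml compiles raise/with into a JavaScript throw inside try/catch, which V8 handles poorly on hot paths per the comment above, whereas the exception-free version stays in straight-line code.

```ocaml
(* Local exception as control flow: cheap in native OCaml, costly under
   js_of_ocaml because every raise becomes a JS throw. *)
exception Found of int

let find_index_exn p arr =
  try
    Array.iteri (fun i x -> if p x then raise (Found i)) arr;
    None
  with Found i -> Some i

(* Equivalent exception-free version: a plain loop, no throw/try-catch. *)
let find_index p arr =
  let n = Array.length arr in
  let rec go i =
    if i >= n then None
    else if p arr.(i) then Some i
    else go (i + 1)
  in
  go 0
```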