perf-measures: re-introduce httpz #132

Open · wants to merge 1 commit into master

Conversation

@renerocksai (Member)

In the 0.12.0 branch, [httpz](https://github.com/karlseguin/http.zig) was added to the perf measurements.

Somehow this got lost along the way, which is a pity: httpz is super promising.

Given the perf benchmarks in [this PR comment](antonputra/tutorials#280 (comment)), I would have expected httpz to be on par with or better than zap in our `measure.sh` tests.
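
For reference, all runs below are 10 s wrk tests with 4 threads and 400 connections against http://127.0.0.1:3000; if you want to reproduce them without `measure.sh`, that corresponds to an invocation like `wrk -t4 -c400 -d10s --latency http://127.0.0.1:3000` (flags inferred from the output below, not copied from the script).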

However, on my M3 Max Mac, I get the following:

**ZAP**:

```
➜  zap git:(reintroduce_httpz_perf) ✗ ./wrk/measure.sh zig-zap
INFO: Listening on port 3000
Listening on 0.0.0.0:3000
INFO: Server is running 4 workers X 4 threads with facil.io 0.7.4 (kqueue)
* Detected capacity: 131056 open file limit
* Root pid: 73099
* Press ^C to stop

INFO: 73110 is running.
INFO: 73111 is running.
INFO: 73112 is running.
INFO: 73113 is running.
========================================================================
                          zig-zap
========================================================================
Running 10s test @ http://127.0.0.1:3000
  4 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.31ms  533.41us  18.77ms   90.57%
    Req/Sec    76.67k     9.04k   86.46k    84.25%
  Latency Distribution
     50%    1.15ms
     75%    1.17ms
     90%    1.75ms
     99%    2.94ms
  3052064 requests in 10.02s, 462.80MB read
  Socket errors: connect 0, read 135, write 0, timeout 0
Requests/sec: 304601.19
Transfer/sec:     46.19MB
```

**httpz**:

```
➜  zap git:(reintroduce_httpz_perf) ✗ ./wrk/measure.sh httpz
========================================================================
                          httpz
========================================================================
Running 10s test @ http://127.0.0.1:3000
  4 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.26ms  528.72us  18.84ms   84.61%
    Req/Sec    44.46k     7.35k   85.50k    88.00%
  Latency Distribution
     50%    2.35ms
     75%    2.39ms
     90%    2.43ms
     99%    3.26ms
  1768925 requests in 10.01s, 91.10MB read
  Socket errors: connect 0, read 230, write 0, timeout 0
Requests/sec: 176712.50
Transfer/sec:      9.10MB
```

This looks way off. I must admit I might have done a bad httpz implementation.

Seeking help from @karlseguin. My motivation: route people away from zap to alternatives like httpz or even zzz, since those are pure Zig and seem to perform really well. I want a world in which we don't have to resort to C frameworks to build good, Zig-worthy servers 😄.

@renerocksai (Member, Author)

BTW: I am aware that taking perf measurements on a Mac is not what I usually do. I just don't have access to that Linux box ATM.

@zigster64

Super interesting!

I'm getting similar httpz numbers on both M2 Pro and Ryzen 5/Linux, but nowhere near 300k req/s for zap; it's more in the ballpark with the others. Impressive.

You are making me want to get an M3 Max!

@karlseguin (Sep 21, 2024)

It is weird how different my results are.

On an M2:

```
./wrk/measure.sh httpz
========================================================================
                          httpz
========================================================================
Running 10s test @ http://127.0.0.1:3000
  4 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.10ms    5.27ms 106.64ms   96.99%
    Req/Sec    61.13k    28.18k  241.88k    83.84%
  Latency Distribution
     50%    1.56ms
     75%    1.85ms
     90%    2.01ms
     99%   25.35ms
  2426670 requests in 10.10s, 124.97MB read
  Socket errors: connect 0, read 386, write 0, timeout 0
Requests/sec: 240254.03
Transfer/sec:     12.37MB
./wrk/measure.sh zig-zap
========================================================================
                          zig-zap
========================================================================
Running 10s test @ http://127.0.0.1:3000
  4 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.57ms    8.69ms 160.88ms   95.37%
    Req/Sec    58.60k    35.92k  250.13k    74.94%
  Latency Distribution
     50%  665.00us
     75%    1.13ms
     90%    3.79ms
     99%   45.87ms
  2327928 requests in 10.09s, 352.99MB read
  Socket errors: connect 0, read 387, write 0, timeout 0
Requests/sec: 230727.51
Transfer/sec:     34.99MB
```

On an E3-1275 v6:

```
========================================================================
                          httpz
========================================================================
Running 10s test @ http://127.0.0.1:3000
  4 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.28ms  377.20us  21.02ms   98.00%
    Req/Sec    78.78k     3.40k  103.86k    74.00%
  Latency Distribution
     50%    1.26ms
     75%    1.29ms
     90%    1.37ms
     99%    1.53ms
  3136631 requests in 10.03s, 161.53MB read
Requests/sec: 312587.49
========================================================================
                          zig-zap
========================================================================
Running 10s test @ http://127.0.0.1:3000
  4 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.34ms    1.92ms  54.45ms   95.66%
    Req/Sec    93.75k     3.88k  111.99k    91.00%
  Latency Distribution
     50%    1.00ms
     75%    1.09ms
     90%    1.41ms
     99%   10.63ms
  3732001 requests in 10.05s, 565.90MB read
Requests/sec: 371458.89
Transfer/sec:     56.33MB
```

For a "hello world" example, the main thing you can do is tweak the worker count and thread pool size. But if you're running wrk on the same machine as the server, I don't think you have any cores to spare. I tried various settings for both, and they largely just hurt performance:

```zig
var server = try httpz.Server(void).init(allocator, .{
    .port = 3000,
    .workers = .{ .count = 2 },
    .thread_pool = .{ .count = 6 },
}, {});
```
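
For anyone who wants to reproduce this outside the zap repo, a self-contained version of that setup might look roughly like the sketch below. This is a reconstruction from httpz's documented API, not code from this PR; router and handler signatures have shifted between httpz versions, so treat it as a sketch:

```zig
const std = @import("std");
const httpz = @import("httpz");

// Minimal "hello world" handler of the kind measure.sh benchmarks.
fn hello(_: *httpz.Request, res: *httpz.Response) !void {
    res.body = "Hello, World!";
}

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    const allocator = gpa.allocator();

    // Same tuning knobs as above: worker count and thread pool size.
    var server = try httpz.Server(void).init(allocator, .{
        .port = 3000,
        .workers = .{ .count = 2 },
        .thread_pool = .{ .count = 6 },
    }, {});
    defer server.deinit();

    var router = try server.router(.{});
    router.get("/", hello, .{});

    try server.listen(); // blocks until the server shuts down
}
```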

@karlseguin

I'm starting to suspect/fear that httpz has some scaling issues. I tested it on a 32 vCPU cloud instance and couldn't get it to scale linearly (or close to it) with the number of threads. Although, for the life of me, I can't figure out where the bottleneck is. Profiling shows `writev` as the largest bottleneck. Gonna wait for Anton's newest video to see if he runs into the same thing, since his testing setup is better than mine.

@renerocksai (Member, Author)

> Super interesting!
>
> I'm getting similar httpz numbers on both M2 Pro and Ryzen 5/Linux, but nowhere near 300k req/s for zap; it's more in the ballpark with the others. Impressive.
>
> You are making me want to get an M3 Max!

Thanks for sharing! Interesting to see the M2 Pro numbers!

@renerocksai (Member, Author)

> It is weird how different my results are.
>
> On an M2:
>
> [...]

Awesome! Thanks for trying it with your configurations! The httpz Linux Transfer/sec readings would have been interesting (they got cut off), but nvm.

Looking at the differences on a Linux machine, httpz and zap don't seem that far off. Those are the only numbers that really matter IMHO, because if you are serious about a server, you don't run it on a Mac; might be a hot take, IDK.

For a "hello world" example, the main thing you can do is tweak the worker count and thread pool size. But if you're running wrk on the same machine as the server, I don't think you have any cores to spare. I tried various settings for both, and they largely just hurt performance:

    var server = try httpz.Server(void).init(allocator, .{
        .port = 3000,
        .workers = .{.count = 2},
        .thread_pool = .{.count = 6},
    }, {});

Yeah, you have to be careful not to allocate more cores than you have, and cores are not created equal, especially on new Macs with their mix of performance and efficiency cores.

@renerocksai (Member, Author)

> I'm starting to suspect/fear that httpz has some scaling issues. I tested it on a 32 vCPU cloud instance and couldn't get it to scale linearly (or close to it) with the number of threads. Although, for the life of me, I can't figure out where the bottleneck is. Profiling shows `writev` as the largest bottleneck. Gonna wait for Anton's newest video to see if he runs into the same thing, since his testing setup is better than mine.

Very interesting! I have no clue wrt httpz either. Do you mean `writev` being a (syscall) contention bottleneck, or time actually spent in `writev`?

Hypotheticals that come to mind: ... wait.

Actually, as food for thought, here are some ideas from ChatGPT :-)

1. **Thread Contention on Shared Resources**
   - File Descriptor Contention: If the worker threads are contending for access to shared file descriptors or other shared resources (e.g. logging, connection state), this can create bottlenecks. Even if epoll is only used for accepting connections, contention on these resources can slow down the overall processing.
2. **CPU Cache Contention**
   - Cache Line Contention: As the number of threads increases, the worker threads might start contending for CPU cache lines, especially if they are working on shared data structures or frequently accessing similar memory addresses. This can reduce performance and prevent linear scaling.
   - False Sharing: If threads are working on variables that are close together in memory but are supposed to be independent, they could cause false sharing, where updates to these variables cause unnecessary cache invalidations (see the padding sketch after this list).
3. **I/O Subsystem Bottlenecks**
   - Network or Disk I/O Saturation: The worker threads are likely performing `readv` and `writev` on network sockets or disk files. If the I/O subsystem (network or disk) is saturated, adding more threads won't increase throughput because the underlying hardware has reached its limit.
   - TCP/IP Stack Limits: On a heavily loaded server, the TCP/IP stack itself might become a bottleneck, especially if it's handling a large number of connections. This can happen even if there are plenty of CPU cores available.
4. **NUMA Effects**
   - NUMA Node Misalignment: If your server is running on a NUMA (Non-Uniform Memory Access) architecture, threads might be running on different NUMA nodes and accessing memory that is not local to their node. This can significantly increase memory access latency and reduce scalability. Ensuring that threads are properly pinned to cores and that memory is allocated on the same NUMA node as the thread can help.
5. **Load Imbalance**
   - Uneven Workload Distribution: If some worker threads are processing more data or more complex requests than others, it can lead to load imbalance. This imbalance can cause some threads to become bottlenecks while others are underutilized, reducing overall scalability.
6. **Thread Pool Overhead**
   - Thread Management Overhead: The overhead of managing a large number of threads (e.g. context switching, task queue management) might increase as more threads are added. This overhead can limit scalability, especially if the thread pool implementation is not optimized for high concurrency.
7. **Suboptimal Use of `readv` and `writev`**
   - Inefficient Buffer Management: If the buffers used in `readv` and `writev` are not optimally sized or aligned, the performance benefits of these system calls may not be fully realized. This could lead to suboptimal I/O performance as the number of threads increases.
   - Partial Reads/Writes: If the worker threads are not handling partial reads/writes efficiently, this can lead to increased syscall overhead or I/O blocking, which can degrade performance as more threads are added (see the `writev` loop sketch further below).
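
To make the false-sharing point (2.) concrete, here is a minimal Zig sketch of the usual mitigation: padding per-thread state out to a cache line. `PaddedCounter` is a made-up name for illustration and has nothing to do with httpz internals:

```zig
const std = @import("std");

// Hypothetical per-worker stats. Aligning each counter to a cache line
// keeps two workers' counters off the same line, so an update by one
// core does not invalidate the other core's cached copy.
const PaddedCounter = struct {
    value: std.atomic.Value(u64) align(std.atomic.cache_line) =
        std.atomic.Value(u64).init(0),
};

pub fn main() void {
    var counters: [4]PaddedCounter = .{ .{}, .{}, .{}, .{} };
    // Each worker thread would bump only its own slot:
    _ = counters[0].value.fetchAdd(1, .monotonic);
    std.debug.print("counter[0] = {}\n", .{counters[0].value.load(.monotonic)});
}
```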

Potential Solutions:

- **Optimize Resource Access**: Minimize contention on shared resources by using thread-local storage, lock-free data structures, or reducing the critical section of code that needs to be synchronized.
- **NUMA Awareness**: Ensure that threads are properly pinned to cores and that memory is allocated on the same NUMA node to reduce latency.
- **Balance Load**: Implement or improve load balancing mechanisms to ensure that work is evenly distributed among worker threads.
- **Profile I/O Operations**: Profile your I/O subsystem to identify any bottlenecks in the network or disk I/O, and optimize `readv` and `writev` usage by fine-tuning buffer sizes and ensuring efficient handling of partial reads/writes.
- **Reduce Thread Pool Overhead**: Consider tuning the thread pool size and task management to reduce overhead. Sometimes fewer, more efficiently managed threads can outperform a larger pool with more overhead.
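
On the partial reads/writes point, a single `writev` can return a short count, and the resubmit loop has to advance the iovec array itself. Below is a sketch of that bookkeeping in Zig; the `writevAll` helper name is hypothetical (`std.net.Stream` ships a similar helper), and the iovec field names `base`/`len` follow recent Zig std:

```zig
const std = @import("std");

/// Hypothetical helper: call writev until every buffer is fully written.
fn writevAll(fd: std.posix.fd_t, iovecs: []std.posix.iovec_const) !void {
    var i: usize = 0;
    while (i < iovecs.len) {
        var n = try std.posix.writev(fd, iovecs[i..]);
        // Skip every iovec that was written out completely.
        while (i < iovecs.len and n >= iovecs[i].len) {
            n -= iovecs[i].len;
            i += 1;
        }
        // Partial write into the current iovec: advance past the bytes
        // that made it out and resubmit the rest next iteration.
        if (i < iovecs.len and n > 0) {
            iovecs[i].base += n;
            iovecs[i].len -= n;
        }
    }
}
```

Getting this loop wrong, or falling back to one write per buffer, multiplies syscalls under load, which is exactly the kind of thing that would make `writev` dominate a profile.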

Sorry, it's a bit verbose, but I find it did a great job anyway :-)
