Potential startup regression on Node.js >= 22 #180
Comments
I cannot reproduce on a Linux server, though I can reproduce on an arm64 MacBook. There has been a regression in the range of
|
Some regressions can also be observed from d8.
|
Whatever it is, it seems to be happening at the global level, because the regression is still there after adding the following to the main function to skip everything of substance later:

```c
for (int i = 1; i < argc; i++) {  // Start from 1 to skip program name
  if (strcmp(argv[i], "--version") == 0) {
    printf("%s\n", NODE_VERSION);
    return 0;
  }
}
```
|
I checked on that on a Linux server with x64; there is a measurable but tiny regression.
node 20
node 23
|
In my previous comment, regarding linux/x64, the regression is small and possibly irrelevant. However, the regression under ARM-based macOS seems significant. Let us create a file with the following content:
Node 20 (linux x64)
Node 23 (linux x64)
Node 20 (macos ARM)
Node 23 (macos ARM)
For reference, bun is under 5 ms in this test on my mac. |
With this diff applied to skip most things in the main function:
The regression does seem to primarily come from the V8 11.3 -> 11.8 upgrade, and specifically from global initializations as the main function isn't doing much with the patch applied:
|
I started a Neoverse N1 node on AWS to check if the problem might be with ARM. I think it is not. It looks like a macOS issue OR an LLVM issue OR an LLVM on macOS issue. Node 20
Node 23
|
By the way testing |
@joyeecheung Sure. But the effect is large (~10 ms), which is quite significant, especially if you account for the fact that bun can print "hello" in 5 ms. |
That should be a separate issue about the performance of |
Since we are checking global initializations here... I changed the patch to make it actually a no-op:

```diff
diff --git a/src/node_main.cc b/src/node_main.cc
index f66099a557..86cc30600c 100644
--- a/src/node_main.cc
+++ b/src/node_main.cc
@@ -94,6 +94,7 @@ int wmain(int argc, wchar_t* wargv[]) {
 // UNIX
 int main(int argc, char* argv[]) {
-  return node::Start(argc, argv);
+  return 0;
+  //return node::Start(argc, argv);
 }
 #endif
```

With the patch applied to make the main function a no-op, the regression continues (86cb23d0 is the last commit updating the gypfiles for the 11.8 upgrade, 5633c622 is where main is today).
|
Let us use Node 16
Node 17
Node 18
Node 19
Node 20
Node 21
Node 22
Node 23
Analysis

So it seems that the last few versions of Node.js have added well over 12 ms to the startup time on some macOS systems. The regression does not appear to have happened all at once.
|
@just-js mentioned that it might be a binary size threshold, just to comment that I previously tried to build with |
i've tried running on my 8th generation core i5 laptop on Ubuntu 22.04 (kernel 6.8.0-48-generic) and i can see a regression on linux using both hyperfine and Andrew Kelley's poop in a couple of node 22 releases before the latest one, which seems to be working normally again. using poop we can see that for those releases the number of cpu instructions executed was far higher than before or after. i will see if i can run the same benchmarks on macos over the weekend and will report back. i'll also try to capture more detail on the instructions executed if i can repro the issue there. the script i am running is basically this:

```sh
./node --version
ls -la ./node
hyperfine --warmup 20 "./node semicolon.js" 2>/dev/null
sudo poop "./node semicolon.js"
```

v20.11.0
-rwxr-xr-x 1 andrew andrew 96331384 Nov 7 16:04 ./node-20
Benchmark 1: ./node-20 semicolon.js
Time (mean ± σ): 24.3 ms ± 0.2 ms [User: 16.2 ms, System: 8.3 ms]
Range (min … max): 24.0 ms … 25.5 ms 120 runs
Benchmark 1 (198 runs): ./node-20 semicolon.js
measurement mean ± σ min … max outliers
wall_time 25.2ms ± 959us 24.1ms … 35.4ms 1 ( 1%)
peak_rss 39.1MB ± 143KB 38.3MB … 39.2MB 38 (19%)
cpu_cycles 55.3M ± 2.75M 52.7M … 84.7M 4 ( 2%)
instructions 80.4M ± 130K 80.1M … 80.8M 4 ( 2%)
cache_references 2.13M ± 43.7K 2.06M … 2.38M 14 ( 7%)
cache_misses 380K ± 54.9K 304K … 604K 1 ( 1%)
branch_misses 388K ± 24.8K 366K … 582K 11 ( 6%)
v21.7.3
-rwxr-xr-x 1 andrew andrew 102202728 Nov 7 16:05 ./node-21
Benchmark 1: ./node-21 semicolon.js
Time (mean ± σ): 22.7 ms ± 0.1 ms [User: 15.4 ms, System: 7.0 ms]
Range (min … max): 22.3 ms … 23.0 ms 128 runs
Benchmark 1 (213 runs): ./node-21 semicolon.js
measurement mean ± σ min … max outliers
wall_time 23.4ms ± 700us 22.5ms … 27.3ms 13 ( 6%)
peak_rss 40.2MB ± 41.8KB 40.0MB … 40.4MB 13 ( 6%)
cpu_cycles 51.4M ± 1.89M 49.6M … 63.6M 11 ( 5%)
instructions 77.4M ± 130K 77.0M … 77.8M 2 ( 1%)
cache_references 1.83M ± 44.2K 1.70M … 2.03M 23 (11%)
cache_misses 322K ± 56.5K 257K … 448K 0 ( 0%)
branch_misses 357K ± 21.1K 342K … 539K 23 (11%)
v22.6.0
-rwxr-xr-x 1 andrew andrew 119322416 Nov 7 16:05 ./node-22
Benchmark 1: ./node-22 semicolon.js
Time (mean ± σ): 32.7 ms ± 0.5 ms [User: 25.2 ms, System: 7.1 ms]
Range (min … max): 32.3 ms … 36.7 ms 89 runs
Benchmark 1 (147 runs): ./node-22 semicolon.js
measurement mean ± σ min … max outliers
wall_time 34.0ms ± 1.58ms 32.6ms … 49.1ms 2 ( 1%)
peak_rss 46.0MB ± 52.9KB 45.7MB … 46.0MB 14 (10%)
cpu_cycles 85.0M ± 4.77M 81.8M … 133M 6 ( 4%)
instructions 142M ± 146K 142M … 143M 0 ( 0%)
cache_references 2.02M ± 53.2K 1.89M … 2.32M 7 ( 5%)
cache_misses 396K ± 71.8K 300K … 587K 0 ( 0%)
branch_misses 559K ± 58.8K 527K … 1.15M 13 ( 9%)
v22.7.0
-rwxr-xr-x 1 andrew andrew 116859992 Nov 7 16:05 ./node-22-7
Benchmark 1: ./node-22-7 semicolon.js
Time (mean ± σ): 32.5 ms ± 0.2 ms [User: 24.9 ms, System: 7.3 ms]
Range (min … max): 32.1 ms … 32.9 ms 89 runs
Benchmark 1 (148 runs): ./node-22-7 semicolon.js
measurement mean ± σ min … max outliers
wall_time 33.7ms ± 980us 32.5ms … 37.1ms 6 ( 4%)
peak_rss 45.7MB ± 47.2KB 45.5MB … 45.7MB 15 (10%)
cpu_cycles 84.2M ± 2.66M 81.4M … 94.8M 5 ( 3%)
instructions 143M ± 187K 142M … 143M 4 ( 3%)
cache_references 2.06M ± 38.9K 1.94M … 2.21M 4 ( 3%)
cache_misses 404K ± 77.8K 303K … 539K 0 ( 0%)
branch_misses 551K ± 30.3K 524K … 701K 10 ( 7%)
v22.11.0
-rwxr-xr-x 1 andrew andrew 117383456 Nov 7 16:05 ./node-22-11
Benchmark 1: ./node-22-11 semicolon.js
Time (mean ± σ): 21.2 ms ± 0.1 ms [User: 13.9 ms, System: 6.9 ms]
Range (min … max): 20.7 ms … 21.7 ms 137 runs
Benchmark 1 (225 runs): ./node-22-11 semicolon.js
measurement mean ± σ min … max outliers
wall_time 22.2ms ± 1.22ms 21.1ms … 33.5ms 22 (10%)
peak_rss 42.6MB ± 63.1KB 42.1MB … 42.7MB 32 (14%)
cpu_cycles 45.8M ± 2.91M 43.5M … 69.6M 17 ( 8%)
instructions 75.2M ± 142K 74.8M … 75.7M 6 ( 3%)
cache_references 1.16M ± 38.2K 1.06M … 1.35M 23 (10%)
cache_misses 355K ± 65.7K 277K … 499K 0 ( 0%)
branch_misses 273K ± 25.8K 256K … 446K 30 (13%)
v23.0.0
-rwxr-xr-x 1 andrew andrew 120543848 Nov 7 16:05 ./node-23
Benchmark 1: ./node-23 semicolon.js
Time (mean ± σ): 20.7 ms ± 0.1 ms [User: 12.9 ms, System: 7.4 ms]
Range (min … max): 20.5 ms … 21.2 ms 138 runs
Benchmark 1 (232 runs): ./node-23 semicolon.js
measurement mean ± σ min … max outliers
wall_time 21.5ms ± 777us 20.6ms … 24.6ms 23 (10%)
peak_rss 43.5MB ± 135KB 42.7MB … 43.5MB 36 (16%)
cpu_cycles 43.6M ± 2.10M 41.6M … 51.6M 21 ( 9%)
instructions 69.2M ± 100.0K 69.0M … 69.6M 4 ( 2%)
cache_references 1.15M ± 34.7K 1.05M … 1.34M 22 ( 9%)
cache_misses 355K ± 57.5K 281K … 491K 0 ( 0%)
branch_misses 269K ± 24.1K 251K … 433K 24 (10%)
v23.1.0
-rwxr-xr-x 1 andrew andrew 121761992 Oct 24 13:22 ./node-23-1
Benchmark 1: ./node-23-1 semicolon.js
Time (mean ± σ): 21.7 ms ± 1.0 ms [User: 13.3 ms, System: 8.1 ms]
Range (min … max): 21.1 ms … 29.0 ms 134 runs
Benchmark 1 (223 runs): ./node-23-1 semicolon.js
measurement mean ± σ min … max outliers
wall_time 22.3ms ± 1.15ms 21.2ms … 33.5ms 22 (10%)
peak_rss 43.9MB ± 123KB 43.5MB … 43.9MB 42 (19%)
cpu_cycles 44.2M ± 2.84M 41.9M … 69.7M 21 ( 9%)
instructions 69.3M ± 96.9K 69.1M … 69.8M 4 ( 2%)
cache_references 1.15M ± 37.0K 1.07M … 1.39M 21 ( 9%)
cache_misses 357K ± 58.6K 285K … 528K 0 ( 0%)
branch_misses 272K ± 26.3K 252K … 471K 20 ( 9%) |
this is what i am seeing on macos:

andrew@Andrews-MacBook-Pro node-test % ls -la ./node-*
-rwxr-xr-x 1 andrew wheel 78614896 8 Aug 2023 ./node-16-20
-rwxr-xr-x 1 andrew wheel 79202736 1 Jun 2022 ./node-17-9
-rwxr-xr-x 1 andrew wheel 87121776 8 Jul 13:26 ./node-18-20
-rwxr-xr-x 1 andrew wheel 88683840 10 Apr 2023 ./node-19-9
-rwxr-xr-x 1 andrew wheel 94556304 3 Oct 07:09 ./node-20-18
-rwxr-xr-x 1 andrew wheel 100725664 10 Apr 2024 ./node-21-7
-rwxr-xr-x 1 andrew wheel 111818288 8 Nov 23:54 ./node-22-1
-rwxr-xr-x 1 andrew wheel 119121584 8 Nov 23:50 ./node-22-11
-rwxr-xr-x 1 andrew wheel 113470352 8 Nov 23:50 ./node-22-4
-rwxr-xr-x 1 andrew wheel 125356400 8 Nov 23:52 ./node-23-1
andrew@Andrews-MacBook-Pro node-test % hyperfine --warmup 100 --runs 200 "./node-16-20 semicolon.js" "./node-17-9 semicolon.js" "./node-18-20 semicolon.js" "./node-19-9 semicolon.js" "./node-20-18 semicolon.js" "./node-21-7 semicolon.js" "./node-22-1 semicolon.js" "./node-22-4 semicolon.js" "./node-22-11 semicolon.js" "./node-23-1 semicolon.js" 2>/dev/null
Benchmark 1: ./node-16-20 semicolon.js
Time (mean ± σ): 24.1 ms ± 0.4 ms [User: 20.0 ms, System: 2.7 ms]
Range (min … max): 23.8 ms … 28.2 ms 200 runs
Benchmark 2: ./node-17-9 semicolon.js
Time (mean ± σ): 28.9 ms ± 0.4 ms [User: 20.4 ms, System: 2.9 ms]
Range (min … max): 28.0 ms … 32.1 ms 200 runs
Benchmark 3: ./node-18-20 semicolon.js
Time (mean ± σ): 20.4 ms ± 0.4 ms [User: 16.0 ms, System: 2.9 ms]
Range (min … max): 20.0 ms … 23.8 ms 200 runs
Benchmark 4: ./node-19-9 semicolon.js
Time (mean ± σ): 26.1 ms ± 0.6 ms [User: 17.0 ms, System: 3.2 ms]
Range (min … max): 24.6 ms … 29.9 ms 200 runs
Benchmark 5: ./node-20-18 semicolon.js
Time (mean ± σ): 20.7 ms ± 0.3 ms [User: 16.4 ms, System: 2.7 ms]
Range (min … max): 20.3 ms … 23.5 ms 200 runs
Benchmark 6: ./node-21-7 semicolon.js
Time (mean ± σ): 32.3 ms ± 0.6 ms [User: 23.1 ms, System: 3.0 ms]
Range (min … max): 30.6 ms … 36.5 ms 200 runs
Benchmark 7: ./node-22-1 semicolon.js
Time (mean ± σ): 34.3 ms ± 0.6 ms [User: 24.6 ms, System: 3.0 ms]
Range (min … max): 32.5 ms … 39.8 ms 200 runs
Benchmark 8: ./node-22-4 semicolon.js
Time (mean ± σ): 34.5 ms ± 0.6 ms [User: 24.8 ms, System: 3.1 ms]
Range (min … max): 32.8 ms … 37.9 ms 200 runs
Benchmark 9: ./node-22-11 semicolon.js
Time (mean ± σ): 36.5 ms ± 0.7 ms [User: 26.5 ms, System: 3.2 ms]
Range (min … max): 34.7 ms … 41.0 ms 200 runs
Benchmark 10: ./node-23-1 semicolon.js
Time (mean ± σ): 38.1 ms ± 0.7 ms [User: 27.8 ms, System: 3.3 ms]
Range (min … max): 36.1 ms … 42.8 ms 200 runs
Summary
./node-18-20 semicolon.js ran
1.01 ± 0.02 times faster than ./node-20-18 semicolon.js
1.18 ± 0.03 times faster than ./node-16-20 semicolon.js
1.28 ± 0.04 times faster than ./node-19-9 semicolon.js
1.41 ± 0.03 times faster than ./node-17-9 semicolon.js
1.58 ± 0.04 times faster than ./node-21-7 semicolon.js
1.68 ± 0.04 times faster than ./node-22-1 semicolon.js
1.69 ± 0.04 times faster than ./node-22-4 semicolon.js
1.78 ± 0.05 times faster than ./node-22-11 semicolon.js
1.86 ± 0.05 times faster than ./node-23-1 semicolon.js |
Intriguing. |
another couple of things:
|
i am no expert in tracing processes on macos but i have managed to capture flamegraphs for node 20.18 (which is fast) and node 22.11 (which is slow). am having a look now to see if there is anything useful here that might shed some light on the issue. these are the commands i ran in order to capture these flamegraphs, assuming the two node processes are named node-20-18 and node-22-11:

```sh
curl -L -o stackcollapse.pl https://raw.githubusercontent.com/brendangregg/FlameGraph/refs/heads/master/stackcollapse.pl
chmod +x stackcollapse.pl
curl -L -o flamegraph.pl https://raw.githubusercontent.com/brendangregg/FlameGraph/refs/heads/master/flamegraph.pl
chmod +x flamegraph.pl
sudo dtrace -c "./node-20-18 semicolon.js" -o node-20-18.stacks -n 'profile-5000 /execname == "node-20-18"/ { @[ustack(100)] = count(); }'
./stackcollapse.pl node-20-18.stacks | ./flamegraph.pl > node-20-18.svg
sudo dtrace -c "./node-22-11 semicolon.js" -o node-22-11.stacks -n 'profile-5000 /execname == "node-22-11"/ { @[ustack(100)] = count(); }'
./stackcollapse.pl node-22-11.stacks | ./flamegraph.pl > node-22-11.svg
```

[flamegraph: node-22-11 (slow)]
[flamegraph: node-20-18 (fast)]
|
A few things that jump out immediately:
i tried a few different things in order to capture these flamegraphs, most of which were super painful and involved using xcode gui and Instruments tool etc. but then i came across this article which outlined the approach i ended up using. i'll keep digging into this over next few days when i can find the time, but if anyone has any insight or suggestions for better ways of capturing the required detail on macos, please let me know. 🙏 |
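As an aside on capture tooling: one lower-friction alternative worth trying is samply, which wraps the OS sampling profiler and opens the result in the Firefox Profiler UI. This is only a suggestion, not what was used above; it assumes a Rust toolchain is installed and that the slow binary is named ./node-22-11.

```sh
# Hypothetical alternative capture (not the method used above): sample a single run with samply.
cargo install samply                      # one-time install, requires Rust/cargo
samply record ./node-22-11 semicolon.js   # records the run and opens the Firefox Profiler UI
```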
interesting. i'm not sure i understand what you are suggesting though - can you explain further? if we click on
those differences add up to 2.7 + 0.4 + 2.5 + 2.5 = ~8%, which is the total difference between the two versions for

so, maybe we can assume, without digging into code, that cppgc being added and some differences between openssl versions in setup and teardown are the source of the differences in this particular function?

as an aside, hyperfine is probably not the best tool to investigate this problem as it will also account for the time taken tearing the process down, which we are not particularly interested in, unless we are actually concerned about the time to cycle a very small program.

and, possibly related, i have read about openssl v3 introducing some pretty serious overhead compared to 1.1. maybe this is something to do with it? but then, why would node major versions 17 and 19 also be affected? or does node use system libssl on macos? again, i would have to have a dig into the node.js source to understand exactly what is going on here. if anyone else knows already, please feel free to comment.

also, these are obv. percentages of different overall wall clock times - ~36ms versus ~20ms, so for this function that would be 30% of 36ms or ~12ms versus 22% of 20ms or 4.4ms - a ~7.6ms difference in total - roughly half of the total difference we are seeing. |
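On the teardown point in the comment above: one rough way to exclude process teardown is to measure only launch-to-first-JS-statement latency. A minimal sketch, assuming GNU date with nanosecond support (gdate from coreutils on macOS) and a binary named ./node-22-11:

```sh
# Rough launch-to-JS latency: capture wall-clock time just before exec,
# then subtract it from Date.now() inside the script. Teardown is excluded.
start=$(date +%s%N)   # nanoseconds since epoch; needs GNU date (use gdate on macOS)
./node-22-11 -e "console.log(((Date.now() * 1e6 - $start) / 1e6).toFixed(1) + ' ms to reach JS')"
```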
it seems all the node versions in question here - 17, 18, 19, 20, 22 - are on openssl v3. node seems to use a fork of openssl even on macos? |
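One way to sanity-check which OpenSSL each binary bundles, offered as a hedged aside since it only reports the version string baked into the build: Node.js exposes it via process.versions. The binary names below match the ones used earlier in this thread.

```sh
# Print the OpenSSL version each Node.js binary was compiled against.
for n in ./node-17-9 ./node-20-18 ./node-22-11; do
  printf '%s: ' "$n"
  "$n" -p "process.versions.openssl"
done
```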
if we break down the overhead between versions 22.11 and 20.18 and multiply by the percentage of total time shown in the flamegraphs above we can see the following: total overhead:
so, we are not so interested in the teardown overhead and should probably focus on overhead in

afaict, most of the overhead in
|
Is there a quick way to build node with OpenSSL disabled? (Evidently, such a node version might be dysfunctional.)
I think that your analysis leads to the conclusion that there is significant overhead in InitializeOncePerProcessInternal. Your numbers indicate it is worth a closer look.
Though there is contradictory evidence in this thread: on a Linux box (as opposed to macOS), using various node versions, I do not see much of a regression as we advance through the versions, see #180 (comment). You get somewhat complicated results under Linux: #180 (comment). If the problem is OpenSSL initialization, can we link the performance regressions to changes in OpenSSL? |
|
yes, i do not see any regressions on linux/x64 on my pretty old i5 machine when i use the final releases for each major node version. as i said in the comment above, i only saw the regressions in intermediate releases, so agree it is confusing exactly what is going on there. but if we stick to the final release for each version then i think we can say clearly on linux/x64 on my i5 laptop performance is steadily improving, while on macos/arm64, as shown above, it seems to be getting worse. 🤷

hyperfine --warmup 100 --runs 200 "node-v17.9.1-linux-x64/bin/node semicolon.js" "node-v18.20.4-linux-x64/bin/node semicolon.js" "node-v19.9.0-linux-x64/bin/node semicolon.js" "node-v20.18.0-linux-x64/bin/node semicolon.js" "node-v21.7.3-linux-x64/bin/node semicolon.js" "node-v22.11.0-linux-x64/bin/node semicolon.js" "node-v23.2.0-linux-x64/bin/node semicolon.js" 2>/dev/null
Benchmark 1: node-v17.9.1-linux-x64/bin/node semicolon.js
Time (mean ± σ): 36.2 ms ± 0.8 ms [User: 27.5 ms, System: 8.4 ms]
Range (min … max): 35.6 ms … 45.1 ms 200 runs
Benchmark 2: node-v18.20.4-linux-x64/bin/node semicolon.js
Time (mean ± σ): 27.6 ms ± 1.3 ms [User: 18.3 ms, System: 9.0 ms]
Range (min … max): 27.0 ms … 43.0 ms 200 runs
Benchmark 3: node-v19.9.0-linux-x64/bin/node semicolon.js
Time (mean ± σ): 28.1 ms ± 0.2 ms [User: 18.9 ms, System: 8.9 ms]
Range (min … max): 27.6 ms … 28.9 ms 200 runs
Benchmark 4: node-v20.18.0-linux-x64/bin/node semicolon.js
Time (mean ± σ): 23.7 ms ± 0.6 ms [User: 16.1 ms, System: 7.3 ms]
Range (min … max): 23.2 ms … 29.7 ms 200 runs
Benchmark 5: node-v21.7.3-linux-x64/bin/node semicolon.js
Time (mean ± σ): 23.5 ms ± 0.5 ms [User: 15.5 ms, System: 7.6 ms]
Range (min … max): 23.1 ms … 29.9 ms 200 runs
Benchmark 6: node-v22.11.0-linux-x64/bin/node semicolon.js
Time (mean ± σ): 22.2 ms ± 0.4 ms [User: 14.0 ms, System: 7.9 ms]
Range (min … max): 21.8 ms … 26.4 ms 200 runs
Benchmark 7: node-v23.2.0-linux-x64/bin/node semicolon.js
Time (mean ± σ): 21.8 ms ± 0.2 ms [User: 13.7 ms, System: 8.1 ms]
Range (min … max): 21.4 ms … 23.5 ms 200 runs
Summary
'node-v23.2.0-linux-x64/bin/node semicolon.js' ran
1.02 ± 0.02 times faster than 'node-v22.11.0-linux-x64/bin/node semicolon.js'
1.08 ± 0.03 times faster than 'node-v21.7.3-linux-x64/bin/node semicolon.js'
1.09 ± 0.03 times faster than 'node-v20.18.0-linux-x64/bin/node semicolon.js'
1.27 ± 0.06 times faster than 'node-v18.20.4-linux-x64/bin/node semicolon.js'
1.29 ± 0.02 times faster than 'node-v19.9.0-linux-x64/bin/node semicolon.js'
1.66 ± 0.04 times faster than 'node-v17.9.1-linux-x64/bin/node semicolon.js'

to summarise the findings from various people above, i think we can say the regression only affects arm64 and is likely down to an issue with the ssl library being used and/or v8 snapshot initialisation on arm64 linux and macos?
|
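On the snapshot half of that summary: a quick way to test it is to disable the built-in startup snapshot at run time and re-benchmark. This is a sketch under the assumption that the runtime flag is --no-node-snapshot (available in recent releases); the binary names are the ones from the macOS runs above.

```sh
# If the snapshot were the culprit, disabling it should narrow the gap between versions.
hyperfine --warmup 20 \
  "./node-22-11 semicolon.js" \
  "./node-22-11 --no-node-snapshot semicolon.js" \
  "./node-20-18 semicolon.js"
```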
That's very helpful. We can use this to try to see if openssl is to blame.
That's consistent with my guess that this is a macOS problem. (Meaning a problem that occurs on macOS with node.) |
ok, so i did the following on my mac:

```sh
git clone --depth 1 --single-branch -b v22.11.0 git@github.com:nodejs/node.git
./configure
make -j 16
```

and to build the non-ssl version of node i did

```sh
./configure --without-ssl
make -j 16
```

interestingly, the binary for the default build with ssl is

these are my compiler/linker versions:

Xcode 16.1
Build version 16B40
Homebrew LLD 16.0.6
Apple clang version 16.0.0 (clang-1600.0.26.4)
Target: arm64-apple-darwin24.1.0
Thread model: posix

and these are the results of the hyperfine test comparing the official release against the local builds:

node-startup % hyperfine --warmup 100 --runs 200 "./node-22-11 semicolon.js" "./node-22-11-new semicolon.js" "./node-22-11-nossl semicolon.js" "./node-20-18 semicolon.js" 2>/dev/null
Benchmark 1: ./node-22-11 semicolon.js
Time (mean ± σ): 35.3 ms ± 0.6 ms [User: 26.5 ms, System: 3.1 ms]
Range (min … max): 33.4 ms … 40.0 ms 200 runs
Benchmark 2: ./node-22-11-new semicolon.js
Time (mean ± σ): 30.4 ms ± 0.7 ms [User: 25.5 ms, System: 3.2 ms]
Range (min … max): 29.8 ms … 35.1 ms 200 runs
Benchmark 3: ./node-22-11-nossl semicolon.js
Time (mean ± σ): 26.9 ms ± 0.4 ms [User: 22.5 ms, System: 2.9 ms]
Range (min … max): 26.5 ms … 31.1 ms 200 runs
Benchmark 4: ./node-20-18 semicolon.js
Time (mean ± σ): 19.9 ms ± 0.4 ms [User: 16.4 ms, System: 2.7 ms]
Range (min … max): 19.6 ms … 23.6 ms 200 runs
Summary
./node-20-18 semicolon.js ran
1.35 ± 0.03 times faster than ./node-22-11-nossl semicolon.js
1.53 ± 0.04 times faster than ./node-22-11-new semicolon.js
1.77 ± 0.05 times faster than ./node-22-11 semicolon.js

so, a local build is ~12% faster without ssl support than a local build with ssl support. not a huge difference, and obv. not as fast as the v20 official release with ssl support. i can/will run some more tests now i have everything building locally. if anyone has any suggestions for further tests i can do, please let me know.

also, @richardlau, @RafaelGSS, @joyeecheung - are there specific flags that are set when building a release version of node in CI that i should enable locally? right now i am just using the defaults.

PS - something interesting to note is the fact that user + sys time for the local ssl build is only 1ms less than the officially released version but wall clock time is 5ms less. this would i think indicate there is some blocking/async work happening in the official release that isn't happening with my local build. 🤷♂️
|
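One way to poke at that wall-vs-CPU gap, as a hedged suggestion rather than something done in the thread: macOS's /usr/bin/time -l prints context-switch and fault counts alongside the timing, which can hint at blocking work that user+sys time does not capture.

```sh
# Compare voluntary/involuntary context switches and faults between the two binaries.
/usr/bin/time -l ./node-22-11 semicolon.js       # official release
/usr/bin/time -l ./node-22-11-new semicolon.js   # local build
```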
Thanks for running these tests. It suggests to me that openssl is affected by the issue but is not the cause of the issue. So I am back at my theory: this is a case where dynamic linking and rebasing is expensive under macOS for some reason. |
i'm not clear on your reasoning for this. can you explain a bit more? if i look at what libraries the binary is dynamically linking, it's only the following. and why would this be getting worse over time if we are not linking against more external libs?

```
otool -L out/Release/node
out/Release/node:
	/System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation (compatibility version 150.0.0, current version 3107.0.0)
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1351.0.0)
	/usr/lib/libc++.1.dylib (compatibility version 1.0.0, current version 1800.101.0)
```

i'll keep digging and see what else i can find.
|
It is a theory, a hypothesis. It could be wrong. I don't know at this point what the issue is. But... You can see visually dyld taking more time on your flamegraphs. So dyld definitively takes more time. It is just a fact. I do expect that the libraries used are the same, but I do not think that implies the time used by dyld will be the same. The only way to check my theory would be to statically link or prebind the shared libraries, but I don't think you can generally statically link under macOS, and I am not sure prebinding is currently supported. |
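A possible way to test the dyld theory directly, hedged because newer macOS releases (dyld4) may ignore the first variable: dyld can report its own loading/rebasing/binding work via environment variables.

```sh
# Ask dyld to report its own startup cost (may be a no-op on recent macOS).
DYLD_PRINT_STATISTICS=1 ./node-22-11 semicolon.js
# At minimum this lists every image actually loaded at startup.
DYLD_PRINT_LIBRARIES=1 ./node-22-11 semicolon.js
```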
i tried another little experiment, compiling node with shared sqlite and zlib and without ssl and some other options, and we see a slight improvement. so, i'm not sure in general dynamic linking is the issue here. i'll see if i can get a flamegraph of the non-ssl version next. the binary is only 53 MB when compiled with these flags:

```sh
./configure --without-ssl --without-intl --enable-lto --shared-sqlite --shared-zlib --without-npm --without-node-options --without-amaro --without-corepack --without-node-code-cache --disable-single-executable-application
```

hyperfine --warmup 100 --runs 200 "./node-22-11 semicolon.js" "./node-22-11-new semicolon.js" "./node-22-11-nossl semicolon.js" "./node-20-18 semicolon.js" "./node-22-11-min semicolon.js" 2>/dev/null
Benchmark 1: ./node-22-11 semicolon.js
Time (mean ± σ): 35.4 ms ± 0.7 ms [User: 26.6 ms, System: 3.0 ms]
Range (min … max): 33.4 ms … 39.7 ms 200 runs
Benchmark 2: ./node-22-11-new semicolon.js
Time (mean ± σ): 29.4 ms ± 0.3 ms [User: 25.7 ms, System: 2.9 ms]
Range (min … max): 28.9 ms … 31.8 ms 200 runs
Benchmark 3: ./node-22-11-nossl semicolon.js
Time (mean ± σ): 26.1 ms ± 0.4 ms [User: 22.7 ms, System: 2.7 ms]
Range (min … max): 25.5 ms … 28.8 ms 200 runs
Benchmark 4: ./node-20-18 semicolon.js
Time (mean ± σ): 20.0 ms ± 0.4 ms [User: 16.6 ms, System: 2.6 ms]
Range (min … max): 19.5 ms … 23.5 ms 200 runs
Benchmark 5: ./node-22-11-min semicolon.js
Time (mean ± σ): 24.9 ms ± 0.4 ms [User: 21.3 ms, System: 2.4 ms]
Range (min … max): 24.1 ms … 28.6 ms 200 runs
Summary
./node-20-18 semicolon.js ran
1.24 ± 0.03 times faster than ./node-22-11-min semicolon.js
1.30 ± 0.03 times faster than ./node-22-11-nossl semicolon.js
1.47 ± 0.03 times faster than ./node-22-11-new semicolon.js
1.77 ± 0.05 times faster than ./node-22-11 semicolon.js |
Hi, I'll try to catch up when I get some time. I don't have much bandwidth right now. |
no worries. i will dump as much info as i can find here and we can hopefully make some sense of it all. 🙏 |
Bun is 58MB and it runs in about 7 ms on my mac with the same benchmark. One maybe significant difference between bun and node is that node depends on Core Foundation whereas bun does not appear to do so... |
here is the flamegraph for the non-ssl locally built node version 22.11.0. also uploaded here as github seems to screw with the svg files i upload somehow. |
Maybe we should change the communication to the slack channel (#nodejs-core-performance) and send only assertive thoughts on the issue? It should make it easier for others (like me) to contribute to something. |
yes bun is crazy fast for a micro bench like this. i think it is more to do with JSC than anything. i have tried building a minimal runtime on v8 for macos and the best i can do is

it also could be down to the fact we are also measuring isolate/process teardown time when using hyperfine, because in experiments i have done on v8 we can bootstrap a v8 isolate in
|
yes - that is my feeling from looking at those flamegraphs. i will see if i can build without snapshot but i imagine that will be slower again. hard to tell without benchmarking. |
sure. sorry for spamming the issue with lots of results. i am just trying to go through as many scenarios as i can rn. @lemire are you good to continue discussion on slack? |
Have you tried making the main function a no-op and benchmarking again, like what I did in #180 (comment)? My testing showed that whatever regressed the startup mostly happened before/after the main function is run. After the V8 11.3 -> 11.8 upgrade, the Node.js binary with a no-op main function regressed by about 8ms (or, before the V8 upgrade, it was 1.6x faster). I am a bit surprised that we are still focusing on what happens inside the main function after 15 days of that comment...

Also on Snapshot::Initialize: if you look at the flame graph you will notice that in recent versions of Node.js no internal JS code is compiled at all when executing a CJS script; that's because we moved the internal JS code compilation to build time and serialized the bytecode into the snapshot. You may see a percentage increase in snapshot initialization, but that is just because the code is now deserialized instead of being compiled, which is what makes the startup overall faster on other platforms, so it's compensated by all the V8 compile/parse frames disappearing from the flamegraph. You could check whether the snapshot makes any difference at all by using the --without-node-snapshot runtime flag, and my previous testing showed that it didn't matter. But then again, my testing also showed that whatever happens in the main function doesn't matter, because the regression persists even if you make the main function a no-op... |
that's my fault. sorry if i sidetracked the investigation. at least i learned a lot more about how to debug/trace these issues on macos. i can spend some time over next few days following your advice. will post in slack if i find anything useful. 🙏 |
No worries, maybe I wasn't clear enough about my findings. Also, I am not 100% sure about the theory unless someone reproduces it on their machine (I was able to reproduce this on two different generations of MacBooks, though). Thanks for helping out and looking into this! |
I tried to profile the empty main function binary on macOS - the binary with an empty main function already takes ~25ms on my laptop, meanwhile v23.x running an empty script takes 32ms, so I was trying to figure out what it's trying to do in the 25ms outside of the main function. Unfortunately, neither Instruments, dtrace nor samply gave me adequate information, because it seems none of these work well when you are trying to profile what happens outside of the main function...

I do have a theory, though: since it isn't actually doing much outside of the main function (not enough for a sampling profiler to pick up, anyway), I suspect it could be some compilation option changed in the V8 11.3 -> 11.8 upgrade that turned off some optimization or introduced something that makes loading the executable more costly. That could explain why this is macOS-only, because the compiler commands are typically generated in a platform-specific manner. I haven't looked into the actual compiler commands or how they have differed over the years, though. |
@joyeecheung i am wondering about the best approaches for investigating this. tracing in any serious way on macos seems to require booting into recovery mode afaict? i can see if that sheds any light, but probably not til next weekend. apart from that i have been looking at the build.gn options defined for v8 in the relevant builds of node.js, but it's going to be incredibly tedious work trying to bisect the differences between them and seeing if they make a difference by trial and error. wondering if it would be possible to embed these options in the node runtime going forward so they could easily be dumped with a cli flag at least? 🤔 |
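Part of that wish already exists, with the caveat that it covers Node's gyp configure variables rather than V8's gn args: process.config is populated from config.gypi at build time, so each binary can dump its own build options.

```sh
# Dump the configure-time variables baked into a given binary.
./node-22-11 -p "JSON.stringify(process.config.variables, null, 2)" | head -n 40
```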
You could be talking about
Yes, that's dtrace/dtruss. I tried to use them but they didn't really reveal anything interesting to me (for something as short-lived as ~20ms, I could sample it at 5000Hz at most and didn't get many samples anyway, and didn't find anything particularly interesting in the dtruss output, but I might not be using it correctly/using a buggy version of it). |
We don't build Node.js with gn, but with gyp. I don't really see any change to compiler options in nodejs/node#49639 |
Following up on my theory in #180 (comment), I built V8's Node.js fork and the benchmark result is... amusing:

From Node.js main branch:
From V8's fork
I feel that we should probably take a look at their configuration (granted, it's gn, but something might have been missing in the gyp -> gn porting) |
Dumping the v8_features.json:
TIL they enabled pointer compression in their CI (I thought they were consistent with the main branch). Maybe that's the key because I did notice a particularly pricy mprotect call in the dtruss output. But I don't know the right way to disable it yet (or it might not build without pointer compression and need to be rolled back a bunch of commits). Also V8's fork uses their own llvm/clang/libc++ and all that (chromium, you know), so that might be important as well.. |
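To compare that setting across binaries without rebuilding, a small sketch; the variable name is an assumption based on it normally being carried from configure into config.gypi and hence into process.config.

```sh
# Expect 0 for official Node.js builds and 1 for builds with pointer compression enabled.
for n in ./node-22-11 ./node-23-1; do
  printf '%s: ' "$n"
  "$n" -p "process.config.variables.v8_enable_pointer_compression"
done
```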
There were some attempts in nodejs/node#56138 to give LTO a try, though I think that's a false positive, see nodejs/node#56138 (comment) and nodejs/node#56138 (comment) - also AFAICT, V8's fork doesn't use LTO, judging by the fact that it takes normal time to build and there aren't any flags hinting at real LTO in the linker commands (
|
Tried switching to a newer version of clang + lld (built from this commit) and
|
FWIW: i see these results using the head of node-ci and building with default args, versus the latest node23 downloaded from the node.js download page.

root@inspiron:/tmp/foo# ./node23 --version
v23.3.0
root@inspiron:/tmp/foo# ./node-ci --version
v24.0.0-pre
root@inspiron:/tmp/foo# cat /proc/cpuinfo | head -n 5 | grep "model name" | cut -d " " -f 3,4,5
Intel(R) Core(TM) i5-8250U
root@inspiron:/tmp/foo# nice -n 20 taskset --cpu-list 0 hyperfine --warmup 50 --runs 100 "./node-ci semi.js" "./node23 semi.js" 2>/dev/null
Benchmark 1: ./node-ci semi.js
Time (mean ± σ): 18.0 ms ± 0.2 ms [User: 11.6 ms, System: 6.0 ms]
Range (min … max): 17.8 ms … 19.1 ms 100 runs
Benchmark 2: ./node23 semi.js
Time (mean ± σ): 21.5 ms ± 0.3 ms [User: 13.1 ms, System: 8.1 ms]
Range (min … max): 21.3 ms … 23.3 ms 100 runs
Summary
'./node-ci semi.js' ran
1.19 ± 0.02 times faster than './node23 semi.js'
root@inspiron:/tmp/foo# ldd ./node-ci
linux-vdso.so.1 (0x00007ffce95e6000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x0000777165d19000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x0000777169c96000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x0000777165a00000)
/lib64/ld-linux-x86-64.so.2 (0x0000777169ccd000)
root@inspiron:/tmp/foo# ldd ./node23
linux-vdso.so.1 (0x00007ffe405f7000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007d402d374000)
libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007d402d000000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007d402d28d000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007d402d26d000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007d402d268000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007d402cc00000)
/lib64/ld-linux-x86-64.so.2 (0x00007d402d390000)

andrew@Mac node-ci % ./node23 --version
v23.3.0
andrew@Mac node-ci % ./node-ci --version
v24.0.0-pre
andrew@Mac node-ci % model 2>/dev/null | head -n 5 | tail -n 1
Apple M3 Max
andrew@Mac node-ci % hyperfine --warmup 50 --runs 100 "./node-ci semi.js" "./node23 semi.js" 2>/dev/null
Benchmark 1: ./node-ci semi.js
Time (mean ± σ): 15.6 ms ± 0.1 ms [User: 12.1 ms, System: 2.4 ms]
Range (min … max): 15.4 ms … 15.9 ms 100 runs
Benchmark 2: ./node23 semi.js
Time (mean ± σ): 37.2 ms ± 0.2 ms [User: 27.5 ms, System: 3.0 ms]
Range (min … max): 36.4 ms … 37.7 ms 100 runs
Summary
./node-ci semi.js ran
2.39 ± 0.02 times faster than ./node23 semi.js
andrew@Mac node-ci % otool -L ./node-ci
./node-ci:
/System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation (compatibility version 150.0.0, current version 3107.0.0)
/System/Library/Frameworks/Foundation.framework/Versions/C/Foundation (compatibility version 300.0.0, current version 3107.0.0)
/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1351.0.0)
andrew@Mac node-ci % otool -L ./node23
./node23:
/System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation (compatibility version 150.0.0, current version 1971.0.0)
/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1319.100.3)
/usr/lib/libc++.1.dylib (compatibility version 1.0.0, current version 1500.65.0)

so... huge difference on M3 Max but very small difference on Ubuntu 22.04/8th gen i5. on linux, we can see libc++ is statically linked for node-ci but not for v23. seems to be the same on macos but node-ci is linking against an extra foundation library.
|
if i just do --version:

linux/i5

nice -n 20 taskset --cpu-list 0 hyperfine --warmup 100 --runs 200 "./node-ci --version" "./node23 --version" 2>/dev/null
Benchmark 1: ./node-ci --version
Time (mean ± σ): 3.1 ms ± 0.1 ms [User: 1.2 ms, System: 2.0 ms]
Range (min … max): 3.0 ms … 3.9 ms 200 runs
Benchmark 2: ./node23 --version
Time (mean ± σ): 3.8 ms ± 0.1 ms [User: 1.7 ms, System: 2.1 ms]
Range (min … max): 3.6 ms … 4.5 ms 200 runs
Summary
'./node-ci --version' ran
1.21 ± 0.06 times faster than './node23 --version'

macos/m3max

hyperfine --warmup 100 --runs 200 "./node23 --version" "./node-ci --version" 2>/dev/null
Benchmark 1: ./node23 --version
Time (mean ± σ): 30.9 ms ± 0.6 ms [User: 23.0 ms, System: 1.9 ms]
Range (min … max): 29.4 ms … 35.7 ms 200 runs
Benchmark 2: ./node-ci --version
Time (mean ± σ): 9.3 ms ± 0.1 ms [User: 7.2 ms, System: 1.2 ms]
Range (min … max): 9.1 ms … 9.9 ms 200 runs
Summary
./node-ci --version ran
3.31 ± 0.07 times faster than ./node23 --version

you can use this shell script to build node-ci on macos or linux:

```sh
mkdir google
cd google
git clone https://chromium.googlesource.com/chromium/tools/depot_tools.git
export PATH=$(pwd)/depot_tools:$PATH
git clone https://chromium.googlesource.com/v8/node-ci
gclient config https://chromium.googlesource.com/v8/node-ci --unmanaged
gclient sync
CPUS=$(nproc --all)
cd node-ci
make clean
make deps
make -j $CPUS node
```

i did a clean build on the machines i have here. macos M3Max is >10x quicker to do a clean build on 16 threads versus ubuntu 22.04 on 8th gen core i5 / 8 threads.
if there are any other tests i can run over coming days let me know @joyeecheung. |
I think I found out what's happening, which was along the lines of my theory in #180 (comment) - in the V8 upgrade, a lot of templated instantiations were added, leading to an explosion of weak symbols that are resolved at run time (I think we weren't excluding them very well in the first place, but the explosion of weak symbols from V8's use of templates made the problem more obvious), and then it continued to regress as V8 added, er, more templated functions.. It's also easy to understand why V8's fork is free from this regression - their builds apply -fvisibility=hidden properly. Fix in: nodejs/node#56275 - locally the startup time is slightly faster than 20.x (as we are properly excluding the symbols now).
Also, it produces a smaller binary
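A rough way to see the scale of what the fix changes, offered as a sketch rather than the method used in the PR: count the dynamically visible weak symbols in a binary before and after.

```sh
# macOS: count "weak external" symbols in the Mach-O symbol table.
nm -m ./node | grep -c "weak external"
# Linux: count weak defined symbols in the dynamic symbol table.
nm -D --defined-only ./node | grep -c " W "
```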
|
Hi folks,
I've been investigating some areas of regression on Node.js 22 and apparently the Node.js startup has degraded. I've run a pretty simple benchmark,
where semicolon.js is just a js file with
;
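The exact command got lost in the formatting above; a hedged reconstruction, consistent with the hyperfine invocations used later in the thread (the binary paths are placeholders):

```sh
# Placeholder paths; compare a Node 20 and a Node 22 binary on the one-semicolon script.
echo ";" > semicolon.js
hyperfine --warmup 20 "./node-20 semicolon.js" "./node-22 semicolon.js" 2>/dev/null
```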
Although this regression isn't confirmed by our benchmarks:
Please note, the hyperfine benchmarks were run on a local macOS machine -- I couldn't run them on a dedicated one as I'm travelling and don't have SSH access to my machines -- and the nodejs benchmark (misc/startup-core.js) was run on a dedicated AWS machine.
Does someone have an idea? cc @joyeecheung