BuiltinFunction: Use a generated DAFSA instead of a generated enum #524
In the latest builtin .def files, `__builtin_altivec_crypto_vsbox` was modified to require the "power8-vector" feature.
Some notes:
The generate script outputs some information about what is skipped and why; here's its output:

```
$ zig run scripts/generate_builtins_dafsa.zig --main-pkg-path . -- scripts/data/Builtins*.def
__builtin_arm_prefetch: param_str changed to "!"
__builtin_arm_rsr64: param_str changed to "!"
__builtin_arm_wsr64: param_str changed to "!"
__builtin_is_constant_evaluated: unknown language 'CXX_LANG', skipping
objc_msgSend: unknown header 'OBJC_MESSAGE_H', skipping
objc_msgSend_fpret: unknown header 'OBJC_MESSAGE_H', skipping
objc_msgSend_fp2ret: unknown header 'OBJC_MESSAGE_H', skipping
objc_msgSend_stret: unknown header 'OBJC_MESSAGE_H', skipping
objc_msgSendSuper: unknown header 'OBJC_MESSAGE_H', skipping
objc_msgSendSuper_stret: unknown header 'OBJC_MESSAGE_H', skipping
objc_getClass: unknown header 'OBJC_RUNTIME_H', skipping
objc_getMetaClass: unknown header 'OBJC_RUNTIME_H', skipping
objc_enumerationMutation: unknown header 'OBJC_RUNTIME_H', skipping
objc_read_weak: unknown header 'OBJC_OBJC_AUTO_H', skipping
objc_assign_weak: unknown header 'OBJC_OBJC_AUTO_H', skipping
objc_assign_ivar: unknown header 'OBJC_OBJC_AUTO_H', skipping
objc_assign_global: unknown header 'OBJC_OBJC_AUTO_H', skipping
objc_assign_strongCast: unknown header 'OBJC_OBJC_AUTO_H', skipping
objc_exception_extract: unknown header 'OBJC_OBJC_EXCEPTION_H', skipping
objc_exception_try_enter: unknown header 'OBJC_OBJC_EXCEPTION_H', skipping
objc_exception_try_exit: unknown header 'OBJC_OBJC_EXCEPTION_H', skipping
objc_exception_match: unknown header 'OBJC_OBJC_EXCEPTION_H', skipping
objc_exception_throw: unknown header 'OBJC_OBJC_EXCEPTION_H', skipping
objc_sync_enter: unknown header 'OBJC_OBJC_SYNC_H', skipping
objc_sync_exit: unknown header 'OBJC_OBJC_SYNC_H', skipping
NSLog: unknown header 'FOUNDATION_NSOBJCRUNTIME_H', skipping
NSLogv: unknown header 'FOUNDATION_NSOBJCRUNTIME_H', skipping
addressof: '&' in params str 'v*v&', skipping
__addressof: '&' in params str 'v*v&', skipping
as_const: '&' in params str 'v&v&', skipping
forward: '&' in params str 'v&v&', skipping
forward_like: '&' in params str 'v&v&', skipping
move: '&' in params str 'v&v&', skipping
move_if_noexcept: '&' in params str 'v&v&', skipping
__builtin_addressof: '&' in params str 'v*v&', skipping
__builtin_function_start: '&' in params str 'v*v&', skipping
__builtin_coro_resume: unknown language 'COR_LANG', skipping
__builtin_coro_destroy: unknown language 'COR_LANG', skipping
__builtin_coro_done: unknown language 'COR_LANG', skipping
__builtin_coro_promise: unknown language 'COR_LANG', skipping
__builtin_coro_size: unknown language 'COR_LANG', skipping
__builtin_coro_align: unknown language 'COR_LANG', skipping
__builtin_coro_frame: unknown language 'COR_LANG', skipping
__builtin_coro_noop: unknown language 'COR_LANG', skipping
__builtin_coro_free: unknown language 'COR_LANG', skipping
__builtin_coro_id: unknown language 'COR_LANG', skipping
__builtin_coro_alloc: unknown language 'COR_LANG', skipping
__builtin_coro_begin: unknown language 'COR_LANG', skipping
__builtin_coro_end: unknown language 'COR_LANG', skipping
__builtin_coro_suspend: unknown language 'COR_LANG', skipping
read_pipe: unknown language 'OCL_PIPE', skipping
write_pipe: unknown language 'OCL_PIPE', skipping
reserve_read_pipe: unknown language 'OCL_PIPE', skipping
reserve_write_pipe: unknown language 'OCL_PIPE', skipping
commit_write_pipe: unknown language 'OCL_PIPE', skipping
commit_read_pipe: unknown language 'OCL_PIPE', skipping
sub_group_reserve_read_pipe: unknown language 'OCL_PIPE', skipping
sub_group_reserve_write_pipe: unknown language 'OCL_PIPE', skipping
sub_group_commit_read_pipe: unknown language 'OCL_PIPE', skipping
sub_group_commit_write_pipe: unknown language 'OCL_PIPE', skipping
work_group_reserve_read_pipe: unknown language 'OCL_PIPE', skipping
work_group_reserve_write_pipe: unknown language 'OCL_PIPE', skipping
work_group_commit_read_pipe: unknown language 'OCL_PIPE', skipping
work_group_commit_write_pipe: unknown language 'OCL_PIPE', skipping
get_pipe_num_packets: unknown language 'OCL_PIPE', skipping
get_pipe_max_packets: unknown language 'OCL_PIPE', skipping
enqueue_kernel: unknown language 'OCL_DSE', skipping
get_kernel_work_group_size: unknown language 'OCL_DSE', skipping
get_kernel_preferred_work_group_size_multiple: unknown language 'OCL_DSE', skipping
get_kernel_max_sub_group_size_for_ndrange: unknown language 'OCL_DSE', skipping
get_kernel_sub_group_count_for_ndrange: unknown language 'OCL_DSE', skipping
to_global: unknown language 'OCL_GAS', skipping
to_local: unknown language 'OCL_GAS', skipping
to_private: unknown language 'OCL_GAS', skipping
__builtin_store_half: unknown language 'ALL_OCL_LANGUAGES', skipping
__builtin_store_halff: unknown language 'ALL_OCL_LANGUAGES', skipping
__builtin_load_half: unknown language 'ALL_OCL_LANGUAGES', skipping
__builtin_load_halff: unknown language 'ALL_OCL_LANGUAGES', skipping
__builtin_get_device_side_mangled_name: unknown language 'CUDA_LANG', skipping
__builtin_hlsl_wave_active_count_bits: unknown language 'HLSL_LANG', skipping
__builtin_hlsl_create_handle: unknown language 'HLSL_LANG', skipping
__builtin_ms_va_start: '&' in params str 'vc*&.', skipping
__builtin_ms_va_end: '&' in params str 'vc*&', skipping
__builtin_ms_va_copy: '&' in params str 'vc*&c*&', skipping
unknown target for file 'scripts/data/BuiltinsSME.def', skipping
3948 builtins, max params: 12
```
I can't view all the changes from my phone, but I'm curious how your manual generation compares to the code that the compiler would generate for you automatically. If you sort the builtins by length (greatest to least) and do:

```zig
inline for (builtins) |builtin| {
    if (std.mem.eql(u8, cur_identifier, builtin)) {
        return comptime lookup(<...>);
    }
}
return null;
```

How good does LLVM do by default?
@Validark, that code is basically exactly what
Considerably better than the unsorted version, but the manually generated DAFSA is still quite a bit faster. EDIT: This does show that there's an opportunity for an easy
Sorry for not digging into the code more before asking. Very interesting! It looks like there are some improvements that could be made to LLVM (and `stringToEnum`).
This looks like a great improvement!
A future improvement for this would be to generate the file as a build step and store a `Builtins.def`-like source of truth instead. The same process could then be used for other enums like `Diagnostics.Tag` and `Attribute.Tag`.
Could the `Tag` be changed to a non-exhaustive enum? There also doesn't seem to be a fast path for rejecting a looked-up name that is shorter or longer than any name in the set.
Currently the information required for this is not directly stored anywhere in the generated code. It is possible to get a list of builtins indirectly by iterating over the data array indexes, and then using
Will add that (there's already a
By non-exhaustive enum I meant
A DAFSA (deterministic acyclic finite state automaton) is essentially a trie flattened into an array, but one that also uses techniques to minimize redundant nodes. This provides fast lookups while minimizing the required data size, but normally does not allow associating data with each word. However, by adding a count of the number of possible words from each node, it becomes possible to also use it for minimal perfect hashing of the set of words (which allows going from word -> unique index as well as unique index -> word). This allows us to store a second array of data so that the DAFSA can be used as a lookup for e.g. `BuiltinFunction`.

Some resources:

- https://en.wikipedia.org/wiki/Deterministic_acyclic_finite_state_automaton
- https://web.archive.org/web/20220722224703/http://pages.pathcom.com/~vadco/dawg.html
- https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.56.5272 (this is the origin of the minimal perfect hashing technique)
- http://stevehanov.ca/blog/?id=115

What all is going on here:

- `process_builtins.py` was removed in favor of a Zig implementation that generates the DAFSA
- The new Zig implementation uses the latest .def files from the LLVM repository, so the set of builtins is different and there are some new macro definitions that are not yet supported (e.g. `TARGET_BUILTIN`), but otherwise the logic matches the Python script wherever possible
- The new Zig implementation imports `src/builtins/TypeDescription.zig` directly, so the calculated max parameter count will always be in sync

The results:

Microbenchmark of lookup time (tested with the same set of builtins in the dafsa and enum code):

```
# Does a builtin name exist?:
------------------- dafsa ------------------
always found: 88ns per lookup
not found (random bytes): 33ns per lookup
not found (1 char diff): 86ns per lookup
------------------- enum -------------------
always found: 3648ns per lookup
not found (random bytes): 6310ns per lookup
not found (1 char diff): 3604ns per lookup
dafsa is 41.5x / 191.2x / 41.9x faster

# Name -> BuiltinFunction lookup:
------------------- dafsa ------------------
always found: 154ns per lookup
not found (random bytes): 32ns per lookup
not found (1 char diff): 154ns per lookup
------------------- enum -------------------
always found: 3649ns per lookup
not found (random bytes): 6317ns per lookup
not found (1 char diff): 3649ns per lookup
dafsa is 23.7x / 197.4x / 23.7x faster
```

About 500KiB removed from a ReleaseSmall build of aro:

```
1.8M arocc-master
1.3M arocc-dafsa

$ bloaty arocc-dafsa -- arocc-master
    FILE SIZE        VM SIZE
 --------------  --------------
  +8.1% +22.9Ki  +8.1% +22.9Ki    .data.rel.ro
  -0.1%      -8  -0.1%      -8    .eh_frame_hdr
  -1.4% -2.41Ki  -1.4% -2.41Ki    .data
 -35.5% -49.4Ki -35.5% -49.4Ki    .eh_frame
 -56.8%  -201Ki -56.8%  -201Ki    .rodata
 -30.4%  -258Ki -30.4%  -258Ki    .text
 -26.9%  -488Ki -26.9%  -488Ki    TOTAL
```

Compile time of `zig build` is reduced by around a second:

```
$ hyperfine "git checkout master; zig build" "git checkout dafsa; zig build" --prepare "rm -rf zig-cache" --runs 3 --warmup 1
Benchmark 1: git checkout master; zig build
  Time (mean ± σ):     11.760 s ±  0.024 s    [User: 11.387 s, System: 0.605 s]
  Range (min … max):   11.745 s … 11.788 s    3 runs

Benchmark 2: git checkout dafsa; zig build
  Time (mean ± σ):     10.523 s ±  0.016 s    [User: 10.121 s, System: 0.608 s]
  Range (min … max):   10.505 s … 10.534 s    3 runs

Summary
  'git checkout dafsa; zig build' ran
    1.12 ± 0.00 times faster than 'git checkout master; zig build'
```

Unit tests run faster:

```
Benchmark 1: ./unit-tests-master
  Time (mean ± σ):     224.9 ms ±   2.6 ms    [User: 210.2 ms, System: 14.5 ms]
  Range (min … max):   220.7 ms … 230.1 ms    13 runs

Benchmark 2: ./unit-tests-dafsa
  Time (mean ± σ):      97.4 ms ±   2.6 ms    [User: 83.1 ms, System: 14.2 ms]
  Range (min … max):    94.2 ms … 103.5 ms    30 runs

Summary
  './unit-tests-dafsa' ran
    2.31 ± 0.07 times faster than './unit-tests-master'
```

Record tests run slightly slower (not sure why):

```
Benchmark 1: ./record-runner-master test/records
  Time (mean ± σ):     10.283 s ±  0.037 s    [User: 118.281 s, System: 0.261 s]
  Range (min … max):   10.262 s … 10.327 s    3 runs

Benchmark 2: ./record-runner-dafsa test/records
  Time (mean ± σ):     10.950 s ±  0.209 s    [User: 123.950 s, System: 0.284 s]
  Range (min … max):   10.796 s … 11.187 s    3 runs

Summary
  './record-runner-master test/records' ran
    1.06 ± 0.02 times faster than './record-runner-dafsa test/records'
```

Integration test times are ~unchanged:

```
Benchmark 1: ./test-runner-master /home/ryan/Programming/zig/arocc/test/cases zig
  Time (mean ± σ):     382.8 ms ±   7.2 ms    [User: 300.9 ms, System: 88.0 ms]
  Range (min … max):   373.2 ms … 393.2 ms    10 runs

Benchmark 2: ./test-runner-dafsa /home/ryan/Programming/zig/arocc/test/cases zig
  Time (mean ± σ):     392.9 ms ±   3.5 ms    [User: 308.5 ms, System: 90.8 ms]
  Range (min … max):   388.9 ms … 399.7 ms    10 runs

Summary
  './test-runner-master /home/ryan/Programming/zig/arocc/test/cases zig' ran
    1.03 ± 0.02 times faster than './test-runner-dafsa /home/ryan/Programming/zig/arocc/test/cases zig'
```
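The word -> unique index scheme described above can be modeled in a few lines. This is an illustrative Python sketch (the real implementation is generated Zig, and all names here are made up for the example): each node stores the count of words ending at or below it, and summing the counts of lexicographically-earlier sibling edges while walking yields a minimal perfect hash.

```python
# Illustrative model of DAFSA-based minimal perfect hashing. A plain trie
# is used here; a true DAFSA additionally merges identical suffix subtrees,
# which shrinks storage but leaves this indexing logic unchanged.
class Node:
    def __init__(self):
        self.edges = {}       # char -> Node
        self.is_word = False
        self.count = 0        # number of words ending at or below this node

def build(words):
    root = Node()
    for w in words:
        node = root
        for ch in w:
            node = node.edges.setdefault(ch, Node())
        node.is_word = True

    def fill(node):
        node.count = int(node.is_word) + sum(fill(c) for c in node.edges.values())
        return node.count

    fill(root)
    return root

def index_of(root, word):
    """Minimal-perfect-hash index of `word` (its rank in sorted order), or None."""
    idx = 0
    node = root
    for i, ch in enumerate(word):
        if ch not in node.edges:
            return None
        # Words routed through lexicographically-earlier siblings sort first.
        for label, child in node.edges.items():
            if label < ch:
                idx += child.count
        node = node.edges[ch]
        # A word ending here is a strict prefix of ours, so it sorts earlier.
        if node.is_word and i < len(word) - 1:
            idx += 1
    return idx if node.is_word else None

# The index doubles as a key into a side array of per-builtin data:
builtins = sorted(["__builtin_abs", "__builtin_memcpy", "memcpy", "memset"])
root = build(builtins)
for rank, name in enumerate(builtins):
    assert index_of(root, name) == rank
assert index_of(root, "mem") is None  # prefix of a word, but not a word
```

Because the hash is minimal and perfect, a second array of the same length (e.g. one `BuiltinFunction` entry per word) can be indexed directly by the returned rank, which is what makes the two-array layout in the PR possible.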
Updated:
The length checks only affect the
Did the "fast path" improve actual compile times?
Unsure why you're asking about compile times, since the length check is runtime-only, but the fast path seems to have had either no effect or a very slightly negative effect on the "always found" and "not found (1 char diff)" benchmarks, which always fail the fast path (compare the dafsa results in #524 (comment) vs the OP). It may or may not be worth it, but it doesn't seem to have enough of an obvious downside to be worth worrying about too much.
Sorry for the ambiguity; by "compile times" I meant how long it takes to parse and do the pass that needs to identify builtins. You say this "fast path" optimization may slightly worsen performance; I'm saying that if that's the case, it's not worth it as an optimization. An early-out leans on your branch predictor, and that can backfire. If half of all identifiers went down the early-out "fast path" and an equal number went down the DAFSA, and the order was completely random so the branch predictor cannot "find a pattern" (get lucky), you would expect it to guess wrong 50% of the time. A branch mispredict can waste a lot of cycles; given the very low number of nanoseconds in your measurements, it seems to me a mispredict might take up something like 10% of execution time. In my Accelerated Zig Parser's tokenizer, there is a fast keyword-identifier algorithm. I tried adding a "fast path" (actually several different early-outs), and they all hurt performance: the work avoided by the "fast path" was less than the cost of a mispredict. Since your main path is extremely fast, adding an early-out can hurt performance.
I currently don't have a benchmark that would let me test it properly in a real-world scenario; I'd need to set something up for that. I can fully believe the "fast path" is detrimental, and I'm sure there are plenty of other performance gains that could be squeezed out of my implementation, but I'm okay with leaving that on the table for now.
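For concreteness, the length-based early-out being debated above can be sketched like this (a Python sketch with hypothetical names, not the actual arocc code; as the thread notes, whether it pays off depends on how predictable the branch is):

```python
# Hypothetical sketch of a length-range fast path: reject names whose
# length falls outside the [min, max] range of builtin names before
# paying for the full DAFSA walk.
BUILTIN_NAMES = ["memset", "__builtin_abs", "__builtin_altivec_crypto_vsbox"]
MIN_LEN = min(len(n) for n in BUILTIN_NAMES)
MAX_LEN = max(len(n) for n in BUILTIN_NAMES)

def lookup(name, full_lookup):
    # Fast path: a length outside the range can never match, so skip
    # the (more expensive) full lookup entirely.
    if not MIN_LEN <= len(name) <= MAX_LEN:
        return None
    return full_lookup(name)

# Usage with a stand-in for the real DAFSA lookup:
full = lambda n: n if n in BUILTIN_NAMES else None
assert lookup("ab", full) is None          # rejected by length alone
assert lookup("memset", full) == "memset"  # falls through to full lookup
```

The trade-off discussed above is that when most candidate names *do* fall inside the length range, this check is pure overhead plus a potential branch mispredict, which is why measuring it in a realistic workload matters.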
Shout out to @Validark for making me aware of DAFSAs.