
BuiltinFunction: Use a generated DAFSA instead of a generated enum #524

Merged 3 commits into Vexu:master on Oct 15, 2023

Conversation

@squeek502 (Contributor) commented Oct 14, 2023

A DAFSA (deterministic acyclic finite state automaton) is essentially a trie flattened into an array, but that also uses techniques to minimize redundant nodes. This provides fast lookups while minimizing the required data size, but normally does not allow for associating data related to each word. However, by adding a count of the number of possible words from each node, it becomes possible to also use it to achieve minimal perfect hashing for the set of words (which allows going from word -> unique index as well as unique index -> word). This allows us to store a second array of data so that the DAFSA can be used as a lookup for e.g. BuiltinFunction.
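The counting trick can be sketched in a few lines. Below is a hypothetical Python model using a plain trie rather than a minimized DAFSA (merging equivalent suffix subtrees shrinks the data but does not change the indexing logic):

```python
class Node:
    def __init__(self):
        self.children = {}     # edge label -> child Node
        self.terminal = False  # a word ends at this node
        self.count = 0         # number of words reachable from this node

def build(words):
    root = Node()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, Node())
        node.terminal = True

    def annotate(node):  # fill in the per-node word counts bottom-up
        node.count = int(node.terminal) + sum(annotate(c) for c in node.children.values())
        return node.count

    annotate(root)
    return root

def unique_index(root, word):
    """word -> unique index in 0..n-1, or None if not in the set."""
    index, node = 0, root
    for ch in word:
        if node.terminal:  # a shorter word ends here and sorts first
            index += 1
        child = node.children.get(ch)
        if child is None:
            return None
        for label, sibling in node.children.items():
            if label < ch:  # words through earlier edges sort first
                index += sibling.count
        node = child
    return index if node.terminal else None

def word_from_index(node, index):
    """unique index -> word (the inverse mapping)."""
    if node.terminal:
        if index == 0:
            return ""
        index -= 1
    for label, child in sorted(node.children.items()):
        if index < child.count:
            return label + word_from_index(child, index)
        index -= child.count
    raise IndexError(index)
```

Each word's index is its lexicographic rank in the set, so a parallel flat array indexed by that rank can hold the per-builtin data, and `word_from_index` recovers the name for a given index.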

Some resources:

  • https://en.wikipedia.org/wiki/Deterministic_acyclic_finite_state_automaton
  • https://web.archive.org/web/20220722224703/http://pages.pathcom.com/~vadco/dawg.html
  • https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.56.5272 (the origin of the minimal perfect hashing technique)
  • http://stevehanov.ca/blog/?id=115

What all is going on here:

  • process_builtins.py was removed in favor of a Zig implementation that generates the DAFSA
  • The new Zig implementation uses the latest .def files from the LLVM repository, so the set of builtins is different and there are some new macro definitions that are not yet supported (e.g. TARGET_BUILTIN), but otherwise the logic matches the Python script wherever possible
  • The new Zig implementation imports src/builtins/TypeDescription.zig directly, so the calculated max parameter count will always stay in sync

The results:

Microbenchmark of lookup time:

# Does a builtin name exist?:

------------------- dafsa ------------------
            always found: 88ns per lookup
not found (random bytes): 33ns per lookup
 not found (1 char diff): 86ns per lookup
------------------- enum -------------------
            always found: 3648ns per lookup
not found (random bytes): 6310ns per lookup
 not found (1 char diff): 3604ns per lookup

dafsa is 41.5x / 191.2x / 41.9x faster


# Name -> BuiltinFunction lookup:

------------------- dafsa ------------------
            always found: 154ns per lookup
not found (random bytes): 32ns per lookup
 not found (1 char diff): 154ns per lookup
------------------- enum -------------------
            always found: 3649ns per lookup
not found (random bytes): 6317ns per lookup
 not found (1 char diff): 3649ns per lookup

dafsa is 23.7x / 197.4x / 23.7x faster

(tested with the same set of builtins in the dafsa and enum code)

Microbenchmark code
const std = @import("std");
const dafsa_ver = @import("generated_dafsa.zig");
const enum_ver = @import("generated_enum.zig");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer std.debug.assert(gpa.deinit() == .ok);
    const allocator = gpa.allocator();

    var builtins_list = std.ArrayList([]const u8).init(allocator);
    defer builtins_list.deinit();

    var contents = try std.fs.cwd().readFileAlloc(allocator, "builtins.txt", std.math.maxInt(u32));
    defer allocator.free(contents);

    var line_it = std.mem.tokenizeAny(u8, contents, "\r\n");
    while (line_it.next()) |line| {
        try builtins_list.append(line);
    }

    {
        std.debug.print("Lookup time (does name exist?):\n\n", .{});
        std.debug.print("------------------- dafsa ------------------\n", .{});
        const dafsa_found = try benchFound(builtins_list.items, dafsa_ver.exists);
        std.debug.print("            always found: {}ns per lookup\n", .{dafsa_found});
        const dafsa_random = try benchRandomBytes(dafsa_ver.exists);
        std.debug.print("not found (random bytes): {}ns per lookup\n", .{dafsa_random});
        const dafsa_modified = try benchModified(builtins_list.items, dafsa_ver.exists);
        std.debug.print(" not found (1 char diff): {}ns per lookup\n", .{dafsa_modified});
        std.debug.print("------------------- enum -------------------\n", .{});
        const enum_found = try benchFound(builtins_list.items, enum_ver.exists);
        std.debug.print("            always found: {}ns per lookup\n", .{enum_found});
        const enum_random = try benchRandomBytes(enum_ver.exists);
        std.debug.print("not found (random bytes): {}ns per lookup\n", .{enum_random});
        const enum_modified = try benchModified(builtins_list.items, enum_ver.exists);
        std.debug.print(" not found (1 char diff): {}ns per lookup\n", .{enum_modified});

        std.debug.print("\ndafsa is {d:.1}x / {d:.1}x / {d:.1}x faster\n", .{
            @as(f64, @floatFromInt(enum_found)) / @as(f64, @floatFromInt(dafsa_found)),
            @as(f64, @floatFromInt(enum_random)) / @as(f64, @floatFromInt(dafsa_random)),
            @as(f64, @floatFromInt(enum_modified)) / @as(f64, @floatFromInt(dafsa_modified)),
        });
    }
    {
        std.debug.print("\n\nLookup time (name -> BuiltinFunction):\n\n", .{});
        std.debug.print("------------------- dafsa ------------------\n", .{});
        const dafsa_found = try benchFound(builtins_list.items, dafsa_ver.fromName);
        std.debug.print("            always found: {}ns per lookup\n", .{dafsa_found});
        const dafsa_random = try benchRandomBytes(dafsa_ver.fromName);
        std.debug.print("not found (random bytes): {}ns per lookup\n", .{dafsa_random});
        const dafsa_modified = try benchModified(builtins_list.items, dafsa_ver.fromName);
        std.debug.print(" not found (1 char diff): {}ns per lookup\n", .{dafsa_modified});
        std.debug.print("------------------- enum -------------------\n", .{});
        const enum_found = try benchFound(builtins_list.items, enum_ver.fromName);
        std.debug.print("            always found: {}ns per lookup\n", .{enum_found});
        const enum_random = try benchRandomBytes(enum_ver.fromName);
        std.debug.print("not found (random bytes): {}ns per lookup\n", .{enum_random});
        const enum_modified = try benchModified(builtins_list.items, enum_ver.fromName);
        std.debug.print(" not found (1 char diff): {}ns per lookup\n", .{enum_modified});

        std.debug.print("\ndafsa is {d:.1}x / {d:.1}x / {d:.1}x faster\n", .{
            @as(f64, @floatFromInt(enum_found)) / @as(f64, @floatFromInt(dafsa_found)),
            @as(f64, @floatFromInt(enum_random)) / @as(f64, @floatFromInt(dafsa_random)),
            @as(f64, @floatFromInt(enum_modified)) / @as(f64, @floatFromInt(dafsa_modified)),
        });
    }
}

fn benchFound(builtins: []const []const u8, func: anytype) !u64 {
    var prng = std.rand.DefaultPrng.init(0);
    const rand = prng.random();

    var timer = try std.time.Timer.start();

    const iterations = 1_000_000;
    for (0..iterations) |_| {
        const name = builtins[rand.intRangeLessThanBiased(usize, 0, builtins.len)];
        var result = func(name);
        std.mem.doNotOptimizeAway(&result);
    }

    const elapsed = timer.read();

    return elapsed / iterations;
}

fn benchRandomBytes(func: anytype) !u64 {
    var prng = std.rand.DefaultPrng.init(0);
    const rand = prng.random();

    var timer = try std.time.Timer.start();

    const iterations = 1_000_000;
    for (0..iterations) |_| {
        var name_buf: [128]u8 = undefined;
        const name_len = rand.intRangeAtMostBiased(usize, 1, name_buf.len);
        var name = name_buf[0..name_len];
        rand.bytes(name);
        var result = func(name);
        std.mem.doNotOptimizeAway(&result);
    }

    const elapsed = timer.read();

    return elapsed / iterations;
}

fn benchModified(builtins: []const []const u8, func: anytype) !u64 {
    var prng = std.rand.DefaultPrng.init(0);
    const rand = prng.random();

    var timer = try std.time.Timer.start();

    const iterations = 1_000_000;
    for (0..iterations) |_| {
        const name = builtins[rand.intRangeLessThanBiased(usize, 0, builtins.len)];
        var name_buf: [128]u8 = undefined;
        var modified_name = name_buf[0..name.len];
        @memcpy(modified_name, name);
        modified_name[rand.intRangeLessThanBiased(usize, 0, modified_name.len)] = '?';
        var result = func(modified_name);
        std.mem.doNotOptimizeAway(&result);
    }

    const elapsed = timer.read();

    return elapsed / iterations;
}

About 500KiB removed from a ReleaseSmall build of aro:

1.8M arocc-master
1.3M arocc-dafsa

$ bloaty arocc-dafsa -- arocc-master
    FILE SIZE        VM SIZE    
 --------------  -------------- 
  +8.1% +22.9Ki  +8.1% +22.9Ki    .data.rel.ro
  -0.1%      -8  -0.1%      -8    .eh_frame_hdr
  -1.4% -2.41Ki  -1.4% -2.41Ki    .data
 -35.5% -49.4Ki -35.5% -49.4Ki    .eh_frame
 -56.8%  -201Ki -56.8%  -201Ki    .rodata
 -30.4%  -258Ki -30.4%  -258Ki    .text
 -26.9%  -488Ki -26.9%  -488Ki    TOTAL

Compile time of zig build is reduced by around a second:

$ hyperfine "git checkout master; zig build" "git checkout dafsa; zig build" --prepare "rm -rf zig-cache" --runs 3 --warmup 1
Benchmark #1: git checkout master; zig build
  Time (mean ± σ):     11.760 s ±  0.024 s    [User: 11.387 s, System: 0.605 s]
  Range (min … max):   11.745 s … 11.788 s    3 runs

Benchmark #2: git checkout dafsa; zig build
  Time (mean ± σ):     10.523 s ±  0.016 s    [User: 10.121 s, System: 0.608 s]
  Range (min … max):   10.505 s … 10.534 s    3 runs

Summary
  'git checkout dafsa; zig build' ran
    1.12 ± 0.00 times faster than 'git checkout master; zig build'

Unit tests run faster:

Benchmark #1: ./unit-tests-master
  Time (mean ± σ):     224.9 ms ±   2.6 ms    [User: 210.2 ms, System: 14.5 ms]
  Range (min … max):   220.7 ms … 230.1 ms    13 runs

Benchmark #2: ./unit-tests-dafsa
  Time (mean ± σ):      97.4 ms ±   2.6 ms    [User: 83.1 ms, System: 14.2 ms]
  Range (min … max):    94.2 ms … 103.5 ms    30 runs

Summary
  './unit-tests-dafsa' ran
    2.31 ± 0.07 times faster than './unit-tests-master'

Record tests run slightly slower (not sure why):

Benchmark #1: ./record-runner-master test/records
  Time (mean ± σ):     10.283 s ±  0.037 s    [User: 118.281 s, System: 0.261 s]
  Range (min … max):   10.262 s … 10.327 s    3 runs

Benchmark #2: ./record-runner-dafsa test/records
  Time (mean ± σ):     10.950 s ±  0.209 s    [User: 123.950 s, System: 0.284 s]
  Range (min … max):   10.796 s … 11.187 s    3 runs

Summary
  './record-runner-master test/records' ran
    1.06 ± 0.02 times faster than './record-runner-dafsa test/records'

Integration test times are ~unchanged:

Benchmark #1: ./test-runner-master /home/ryan/Programming/zig/arocc/test/cases zig
  Time (mean ± σ):     382.8 ms ±   7.2 ms    [User: 300.9 ms, System: 88.0 ms]
  Range (min … max):   373.2 ms … 393.2 ms    10 runs

Benchmark #2: ./test-runner-dafsa /home/ryan/Programming/zig/arocc/test/cases zig
  Time (mean ± σ):     392.9 ms ±   3.5 ms    [User: 308.5 ms, System: 90.8 ms]
  Range (min … max):   388.9 ms … 399.7 ms    10 runs

Summary
  './test-runner-master /home/ryan/Programming/zig/arocc/test/cases zig' ran
    1.03 ± 0.02 times faster than './test-runner-dafsa /home/ryan/Programming/zig/arocc/test/cases zig'

Shout out to @Validark for making me aware of DAFSAs.

In the latest builtin .def files, __builtin_altivec_crypto_vsbox was modified to require the "power8-vector" feature
@squeek502 (Contributor, Author) commented Oct 14, 2023

Some notes:

  • Here's a link to the new BuiltinFunction.zig since GitHub gives up on showing the diff: https://github.com/squeek502/arocc/blob/dafsa/src/builtins/BuiltinFunction.zig
  • The fields of BuiltinFunction were kept the same just to make the changes to the source code outside of BuiltinFunction.zig minimal, but because going from Tag to BuiltinFunction is now essentially free (it's just an array access), it probably makes sense to always just store BuiltinFunction.Tag and never BuiltinFunction. This would also allow removing the tag field from BuiltinFunction (but it doesn't reduce the @sizeOf(BuiltinFunction) so removing the field wouldn't necessarily provide an immediate benefit)
  • I just downloaded whatever the latest .def files were from https://github.com/llvm/llvm-project/tree/main/clang/include/clang/Basic and used those, which may not be what you want (maybe a tagged release of LLVM would be preferred?).
  • The new generate_builtins_dafsa.zig cannot handle macros that span multiple lines, which there are some of in the .def files (but AFAICT they are only used in new macros that aren't supported like TARGET_BUILTIN)
  • There are new macros that are unsupported: TARGET_BUILTIN, CUSTOM_BUILTIN, and UNALIASED_CUSTOM_BUILTIN
  • There is a new .def file that is skipped: BuiltinsSME.def

The generate script outputs some information about what is skipped and why, here's its output for generating the BuiltinFunction.zig in this PR:

$ zig run scripts/generate_builtins_dafsa.zig --main-pkg-path . -- scripts/data/Builtins*.def
__builtin_arm_prefetch: param_str changed to "!"
__builtin_arm_rsr64: param_str changed to "!"
__builtin_arm_wsr64: param_str changed to "!"
__builtin_is_constant_evaluated: unknown language 'CXX_LANG', skipping
objc_msgSend: unknown header 'OBJC_MESSAGE_H', skipping
objc_msgSend_fpret: unknown header 'OBJC_MESSAGE_H', skipping
objc_msgSend_fp2ret: unknown header 'OBJC_MESSAGE_H', skipping
objc_msgSend_stret: unknown header 'OBJC_MESSAGE_H', skipping
objc_msgSendSuper: unknown header 'OBJC_MESSAGE_H', skipping
objc_msgSendSuper_stret: unknown header 'OBJC_MESSAGE_H', skipping
objc_getClass: unknown header 'OBJC_RUNTIME_H', skipping
objc_getMetaClass: unknown header 'OBJC_RUNTIME_H', skipping
objc_enumerationMutation: unknown header 'OBJC_RUNTIME_H', skipping
objc_read_weak: unknown header 'OBJC_OBJC_AUTO_H', skipping
objc_assign_weak: unknown header 'OBJC_OBJC_AUTO_H', skipping
objc_assign_ivar: unknown header 'OBJC_OBJC_AUTO_H', skipping
objc_assign_global: unknown header 'OBJC_OBJC_AUTO_H', skipping
objc_assign_strongCast: unknown header 'OBJC_OBJC_AUTO_H', skipping
objc_exception_extract: unknown header 'OBJC_OBJC_EXCEPTION_H', skipping
objc_exception_try_enter: unknown header 'OBJC_OBJC_EXCEPTION_H', skipping
objc_exception_try_exit: unknown header 'OBJC_OBJC_EXCEPTION_H', skipping
objc_exception_match: unknown header 'OBJC_OBJC_EXCEPTION_H', skipping
objc_exception_throw: unknown header 'OBJC_OBJC_EXCEPTION_H', skipping
objc_sync_enter: unknown header 'OBJC_OBJC_SYNC_H', skipping
objc_sync_exit: unknown header 'OBJC_OBJC_SYNC_H', skipping
NSLog: unknown header 'FOUNDATION_NSOBJCRUNTIME_H', skipping
NSLogv: unknown header 'FOUNDATION_NSOBJCRUNTIME_H', skipping
addressof: '&' in params str 'v*v&', skipping
__addressof: '&' in params str 'v*v&', skipping
as_const: '&' in params str 'v&v&', skipping
forward: '&' in params str 'v&v&', skipping
forward_like: '&' in params str 'v&v&', skipping
move: '&' in params str 'v&v&', skipping
move_if_noexcept: '&' in params str 'v&v&', skipping
__builtin_addressof: '&' in params str 'v*v&', skipping
__builtin_function_start: '&' in params str 'v*v&', skipping
__builtin_coro_resume: unknown language 'COR_LANG', skipping
__builtin_coro_destroy: unknown language 'COR_LANG', skipping
__builtin_coro_done: unknown language 'COR_LANG', skipping
__builtin_coro_promise: unknown language 'COR_LANG', skipping
__builtin_coro_size: unknown language 'COR_LANG', skipping
__builtin_coro_align: unknown language 'COR_LANG', skipping
__builtin_coro_frame: unknown language 'COR_LANG', skipping
__builtin_coro_noop: unknown language 'COR_LANG', skipping
__builtin_coro_free: unknown language 'COR_LANG', skipping
__builtin_coro_id: unknown language 'COR_LANG', skipping
__builtin_coro_alloc: unknown language 'COR_LANG', skipping
__builtin_coro_begin: unknown language 'COR_LANG', skipping
__builtin_coro_end: unknown language 'COR_LANG', skipping
__builtin_coro_suspend: unknown language 'COR_LANG', skipping
read_pipe: unknown language 'OCL_PIPE', skipping
write_pipe: unknown language 'OCL_PIPE', skipping
reserve_read_pipe: unknown language 'OCL_PIPE', skipping
reserve_write_pipe: unknown language 'OCL_PIPE', skipping
commit_write_pipe: unknown language 'OCL_PIPE', skipping
commit_read_pipe: unknown language 'OCL_PIPE', skipping
sub_group_reserve_read_pipe: unknown language 'OCL_PIPE', skipping
sub_group_reserve_write_pipe: unknown language 'OCL_PIPE', skipping
sub_group_commit_read_pipe: unknown language 'OCL_PIPE', skipping
sub_group_commit_write_pipe: unknown language 'OCL_PIPE', skipping
work_group_reserve_read_pipe: unknown language 'OCL_PIPE', skipping
work_group_reserve_write_pipe: unknown language 'OCL_PIPE', skipping
work_group_commit_read_pipe: unknown language 'OCL_PIPE', skipping
work_group_commit_write_pipe: unknown language 'OCL_PIPE', skipping
get_pipe_num_packets: unknown language 'OCL_PIPE', skipping
get_pipe_max_packets: unknown language 'OCL_PIPE', skipping
enqueue_kernel: unknown language 'OCL_DSE', skipping
get_kernel_work_group_size: unknown language 'OCL_DSE', skipping
get_kernel_preferred_work_group_size_multiple: unknown language 'OCL_DSE', skipping
get_kernel_max_sub_group_size_for_ndrange: unknown language 'OCL_DSE', skipping
get_kernel_sub_group_count_for_ndrange: unknown language 'OCL_DSE', skipping
to_global: unknown language 'OCL_GAS', skipping
to_local: unknown language 'OCL_GAS', skipping
to_private: unknown language 'OCL_GAS', skipping
__builtin_store_half: unknown language 'ALL_OCL_LANGUAGES', skipping
__builtin_store_halff: unknown language 'ALL_OCL_LANGUAGES', skipping
__builtin_load_half: unknown language 'ALL_OCL_LANGUAGES', skipping
__builtin_load_halff: unknown language 'ALL_OCL_LANGUAGES', skipping
__builtin_get_device_side_mangled_name: unknown language 'CUDA_LANG', skipping
__builtin_hlsl_wave_active_count_bits: unknown language 'HLSL_LANG', skipping
__builtin_hlsl_create_handle: unknown language 'HLSL_LANG', skipping
__builtin_ms_va_start: '&' in params str 'vc*&.', skipping
__builtin_ms_va_end: '&' in params str 'vc*&', skipping
__builtin_ms_va_copy: '&' in params str 'vc*&c*&', skipping
unknown target for file 'scripts/data/BuiltinsSME.def', skipping
3948 builtins, max params: 12

@Validark commented Oct 14, 2023

I can't view all the changes from my phone, but I'm curious how your manual generation compares to the code that the compiler would generate for you automatically. If you sort the builtins by length (greatest to least) and do:

inline for (builtins) |builtin| {
    if (std.mem.eql(u8, cur_identifier, builtin)) {
        return comptime lookup(<...>);
    }
}
return null;

How well does LLVM do by default?

@squeek502 (Contributor, Author) commented Oct 14, 2023

@Validark, that code is basically exactly what std.meta.stringToEnum does for large enums, so here's the benchmark with the enum fields sorted by length:

Lookup time (does name exist?):

------------------- dafsa ------------------
            always found: 83ns per lookup
not found (random bytes): 31ns per lookup
 not found (1 char diff): 83ns per lookup
------------------- enum -------------------
            always found: 1191ns per lookup
not found (random bytes): 68ns per lookup
 not found (1 char diff): 1184ns per lookup

dafsa is 14.3x / 2.2x / 14.3x faster


Lookup time (name -> BuiltinFunction):

------------------- dafsa ------------------
            always found: 154ns per lookup
not found (random bytes): 34ns per lookup
 not found (1 char diff): 156ns per lookup
------------------- enum -------------------
            always found: 1186ns per lookup
not found (random bytes): 68ns per lookup
 not found (1 char diff): 1176ns per lookup

dafsa is 7.7x / 2.0x / 7.5x faster

Considerably better than the unsorted version, but the manually generated DAFSA is still quite a bit faster.

EDIT: This does show that there's an opportunity for an easy stringToEnum optimization (at the cost of compilation time) if the fields were sorted by length before the inline for.
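The sort-by-length idea can be modeled roughly like this (hypothetical Python, not the actual stringToEnum implementation): bucketing candidate names by length means a lookup only ever byte-compares against names whose length already matches, instead of rejecting most candidates one at a time inside the comparison loop:

```python
from collections import defaultdict

def make_lookup(names):
    # Bucket the candidate names by length once, up front.
    by_len = defaultdict(list)
    for index, name in enumerate(names):
        by_len[len(name)].append((name, index))

    def lookup(s):
        # Only names of the same length can possibly match.
        for name, index in by_len.get(len(s), ()):
            if name == s:
                return index
        return None

    return lookup
```

A mismatched length now skips an entire group of candidates in one step, which mirrors why sorting the enum fields by length before the inline for speeds up the failure path.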

@Validark:

Sorry for not digging into the code more before asking. Very interesting! Looks like there are some improvements that could be made to LLVM (and to stringToEnum).

@Vexu (Owner) left a comment

This looks like a great improvement!

A future improvement would be to generate the file as a build step and to store a Builtins.def-like source of truth instead. The same process could then be used for other enums like Diagnostics.Tag and Attribute.Tag.

@Vexu (Owner) left a comment

Could the Tag be changed to a non-exhaustive enum? There also doesn't seem to be a fast path for trying to look up a name that is shorter or longer than any name in the set.

@squeek502 (Contributor, Author) commented Oct 14, 2023

Could the Tag be changed to a non-exhaustive enum?

Currently the information required for this is not directly stored anywhere in the generated code. It is possible to get a list of builtins indirectly by iterating over the data array indexes and then using nameFromTag to get the name from the DAFSA, which requires a buffer to write the name into (this is what BuiltinsIterator does).

There also doesn't seem to be a fast path for trying to lookup a name shorter than or longer than exist in the set.

Will add that (there's already a longest_builtin_name const used by BuiltinsIterator for its name buffer). However, the shortest name is currently 3 characters, so I'm unsure whether it'll make anything faster.

@Vexu (Owner) commented Oct 14, 2023

By non-exhaustive enum I meant enum(u16) { _ } so that it doesn't cast automatically to other types.


enum + fast path
@squeek502 (Contributor, Author) commented Oct 15, 2023

Updated:

  • Tag is now enum(u16) { _ } (this never even crossed my mind as a possibility)
  • Updated --main-pkg-path to --main-mod-path
  • There are fast path length checks in exists and uniqueIndex

The length checks only affect the not found (random bytes) benchmark (since that's the only one where the length could be outside of the shortest...longest range):

Lookup time (does name exist?):

------------------- dafsa ------------------
            always found: 88ns per lookup
not found (random bytes): 24ns per lookup
 not found (1 char diff): 87ns per lookup
------------------- enum -------------------
            always found: 3682ns per lookup
not found (random bytes): 6681ns per lookup
 not found (1 char diff): 3729ns per lookup

dafsa is 41.8x / 278.4x / 42.9x faster


Lookup time (name -> BuiltinFunction):

------------------- dafsa ------------------
            always found: 164ns per lookup
not found (random bytes): 25ns per lookup
 not found (1 char diff): 159ns per lookup
------------------- enum -------------------
            always found: 3741ns per lookup
not found (random bytes): 6919ns per lookup
 not found (1 char diff): 3688ns per lookup

dafsa is 22.8x / 276.8x / 23.2x faster
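The shape of the length fast path, modeled in hypothetical Python (the bounds and the set-based lookup are illustrative stand-ins for the generated constants and the actual DAFSA traversal):

```python
# Illustrative bounds; the generated code derives these from the builtin set.
SHORTEST_NAME = 3
LONGEST_NAME = 45

def exists(name, builtin_set):
    # Reject impossible lengths before doing any DAFSA traversal.
    # This only helps inputs whose length falls outside the range,
    # e.g. the "not found (random bytes)" benchmark.
    if not (SHORTEST_NAME <= len(name) <= LONGEST_NAME):
        return False
    return name in builtin_set  # stand-in for the real DAFSA walk
```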

@Vexu merged commit 0ef9708 into Vexu:master on Oct 15, 2023 (3 checks passed)
@Validark commented Oct 15, 2023

Did the "fast path" improve actual compile times?

@squeek502 (Contributor, Author):

Unsure why you're asking about compile times since the length check is runtime-only, but the fast path seems to have had either no effect or a very slightly negative effect on the "always found" and "not found (1 char diff)" benchmarks, which always fail the fast path (compare the dafsa results in #524 (comment) vs the OP).

It may or may not be worth it, but it doesn't seem to have a clear enough downside to be worth worrying about too much.

@Validark:
Sorry for the ambiguity; by "compile times" I meant how long it takes to parse and do the pass that needs to identify builtins. You say this "fast path" optimization may only slightly worsen performance; I'm saying that if that's the case, it's not worth it as an optimization.

A branch predictor can backfire. If half of all identifiers went down the early-out "fast path" and an equal number went down the DAFSA, and the order were completely random so the branch predictor couldn't "find a pattern" (get lucky), you would expect it to guess wrong 50% of the time. A branch mispredict can waste a lot of cycles; given the very low nanosecond counts in your measurements, it seems to me a mispredict might take up something like 10% of execution time.

In my Accelerated Zig Parser's tokenizer, there is a fast keyword/identifier algorithm. I tried adding a "fast path" (in fact, several different early-outs), and they all hurt performance. It turned out that the work the "fast path" avoided cost less than a mispredict. Since your main path is extremely fast, adding an early out can hurt performance.

@squeek502 (Contributor, Author):

I currently don't have a benchmark that would let me test it properly in a real-world scenario; I'd need to set something up for that. I can fully believe the "fast path" is detrimental, and I'm sure there are plenty of other performance gains that could be squeezed out of my implementation, but I'm okay with leaving that on the table for now.
