
BuiltinFunction: Use a generated DAFSA instead of a generated enum #524

Merged 3 commits into Vexu:master on Oct 15, 2023

Conversation

@squeek502 (Contributor) commented Oct 14, 2023

A DAFSA (deterministic acyclic finite state automaton) is essentially a trie flattened into an array, but that also uses techniques to minimize redundant nodes. This provides fast lookups while minimizing the required data size, but normally does not allow for associating data related to each word. However, by adding a count of the number of possible words from each node, it becomes possible to also use it to achieve minimal perfect hashing for the set of words (which allows going from word -> unique index as well as unique index -> word). This allows us to store a second array of data so that the DAFSA can be used as a lookup for e.g. BuiltinFunction.
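The counting trick can be sketched in a few lines. Below is a hypothetical Python model using a plain trie rather than a minimized DAFSA (merging equivalent suffix subtrees shrinks the data but does not change the indexing logic):

```python
class Node:
    def __init__(self):
        self.children = {}     # edge label -> child Node
        self.terminal = False  # a word ends at this node
        self.count = 0         # number of words reachable from this node

def build(words):
    root = Node()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, Node())
        node.terminal = True

    def annotate(node):  # fill in the per-node word counts bottom-up
        node.count = int(node.terminal) + sum(annotate(c) for c in node.children.values())
        return node.count

    annotate(root)
    return root

def unique_index(root, word):
    """word -> unique index in 0..n-1, or None if not in the set."""
    index, node = 0, root
    for ch in word:
        if node.terminal:  # a shorter word ends here and sorts first
            index += 1
        child = node.children.get(ch)
        if child is None:
            return None
        for label, sibling in node.children.items():
            if label < ch:  # words through earlier edges sort first
                index += sibling.count
        node = child
    return index if node.terminal else None

def word_from_index(node, index):
    """unique index -> word (the inverse mapping)."""
    if node.terminal:
        if index == 0:
            return ""
        index -= 1
    for label, child in sorted(node.children.items()):
        if index < child.count:
            return label + word_from_index(child, index)
        index -= child.count
    raise IndexError(index)
```

Each word's index is its lexicographic rank in the set, so a parallel flat array indexed by that rank can hold the per-builtin data, and `word_from_index` recovers the name for a given index.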

Some resources:

  • https://en.wikipedia.org/wiki/Deterministic_acyclic_finite_state_automaton
  • https://web.archive.org/web/20220722224703/http://pages.pathcom.com/~vadco/dawg.html
  • https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.56.5272 (the origin of the minimal perfect hashing technique)
  • http://stevehanov.ca/blog/?id=115

What all is going on here:

  • process_builtins.py was removed in favor of a Zig implementation that generates the DAFSA
  • The new Zig implementation uses the latest .def files from the LLVM repository, so the set of builtins is different and there are some new macro definitions that are not yet supported (e.g. TARGET_BUILTIN), but otherwise the logic matches the Python script wherever possible
  • The new Zig implementation imports src/builtins/TypeDescription.zig directly, so the calculated max parameter count will always stay in sync

The results:

Microbenchmark of lookup time:

# Does a builtin name exist?:

------------------- dafsa ------------------
            always found: 88ns per lookup
not found (random bytes): 33ns per lookup
 not found (1 char diff): 86ns per lookup
------------------- enum -------------------
            always found: 3648ns per lookup
not found (random bytes): 6310ns per lookup
 not found (1 char diff): 3604ns per lookup

dafsa is 41.5x / 191.2x / 41.9x faster


# Name -> BuiltinFunction lookup:

------------------- dafsa ------------------
            always found: 154ns per lookup
not found (random bytes): 32ns per lookup
 not found (1 char diff): 154ns per lookup
------------------- enum -------------------
            always found: 3649ns per lookup
not found (random bytes): 6317ns per lookup
 not found (1 char diff): 3649ns per lookup

dafsa is 23.7x / 197.4x / 23.7x faster

(tested with the same set of builtins in the dafsa and enum code)

Microbenchmark code
const std = @import("std");
const dafsa_ver = @import("generated_dafsa.zig");
const enum_ver = @import("generated_enum.zig");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer std.debug.assert(gpa.deinit() == .ok);
    const allocator = gpa.allocator();

    var builtins_list = std.ArrayList([]const u8).init(allocator);
    defer builtins_list.deinit();

    var contents = try std.fs.cwd().readFileAlloc(allocator, "builtins.txt", std.math.maxInt(u32));
    defer allocator.free(contents);

    var line_it = std.mem.tokenizeAny(u8, contents, "\r\n");
    while (line_it.next()) |line| {
        try builtins_list.append(line);
    }

    {
        std.debug.print("Lookup time (does name exist?):\n\n", .{});
        std.debug.print("------------------- dafsa ------------------\n", .{});
        const dafsa_found = try benchFound(builtins_list.items, dafsa_ver.exists);
        std.debug.print("            always found: {}ns per lookup\n", .{dafsa_found});
        const dafsa_random = try benchRandomBytes(dafsa_ver.exists);
        std.debug.print("not found (random bytes): {}ns per lookup\n", .{dafsa_random});
        const dafsa_modified = try benchModified(builtins_list.items, dafsa_ver.exists);
        std.debug.print(" not found (1 char diff): {}ns per lookup\n", .{dafsa_modified});
        std.debug.print("------------------- enum -------------------\n", .{});
        const enum_found = try benchFound(builtins_list.items, enum_ver.exists);
        std.debug.print("            always found: {}ns per lookup\n", .{enum_found});
        const enum_random = try benchRandomBytes(enum_ver.exists);
        std.debug.print("not found (random bytes): {}ns per lookup\n", .{enum_random});
        const enum_modified = try benchModified(builtins_list.items, enum_ver.exists);
        std.debug.print(" not found (1 char diff): {}ns per lookup\n", .{enum_modified});

        std.debug.print("\ndafsa is {d:.1}x / {d:.1}x / {d:.1}x faster\n", .{
            @as(f64, @floatFromInt(enum_found)) / @as(f64, @floatFromInt(dafsa_found)),
            @as(f64, @floatFromInt(enum_random)) / @as(f64, @floatFromInt(dafsa_random)),
            @as(f64, @floatFromInt(enum_modified)) / @as(f64, @floatFromInt(dafsa_modified)),
        });
    }
    {
        std.debug.print("\n\nLookup time (name -> BuiltinFunction):\n\n", .{});
        std.debug.print("------------------- dafsa ------------------\n", .{});
        const dafsa_found = try benchFound(builtins_list.items, dafsa_ver.fromName);
        std.debug.print("            always found: {}ns per lookup\n", .{dafsa_found});
        const dafsa_random = try benchRandomBytes(dafsa_ver.fromName);
        std.debug.print("not found (random bytes): {}ns per lookup\n", .{dafsa_random});
        const dafsa_modified = try benchModified(builtins_list.items, dafsa_ver.fromName);
        std.debug.print(" not found (1 char diff): {}ns per lookup\n", .{dafsa_modified});
        std.debug.print("------------------- enum -------------------\n", .{});
        const enum_found = try benchFound(builtins_list.items, enum_ver.fromName);
        std.debug.print("            always found: {}ns per lookup\n", .{enum_found});
        const enum_random = try benchRandomBytes(enum_ver.fromName);
        std.debug.print("not found (random bytes): {}ns per lookup\n", .{enum_random});
        const enum_modified = try benchModified(builtins_list.items, enum_ver.fromName);
        std.debug.print(" not found (1 char diff): {}ns per lookup\n", .{enum_modified});

        std.debug.print("\ndafsa is {d:.1}x / {d:.1}x / {d:.1}x faster\n", .{
            @as(f64, @floatFromInt(enum_found)) / @as(f64, @floatFromInt(dafsa_found)),
            @as(f64, @floatFromInt(enum_random)) / @as(f64, @floatFromInt(dafsa_random)),
            @as(f64, @floatFromInt(enum_modified)) / @as(f64, @floatFromInt(dafsa_modified)),
        });
    }
}

fn benchFound(builtins: []const []const u8, func: anytype) !u64 {
    var prng = std.rand.DefaultPrng.init(0);
    const rand = prng.random();

    var timer = try std.time.Timer.start();

    const iterations = 1_000_000;
    for (0..iterations) |_| {
        const name = builtins[rand.intRangeLessThanBiased(usize, 0, builtins.len)];
        var result = func(name);
        std.mem.doNotOptimizeAway(&result);
    }

    const elapsed = timer.read();

    return elapsed / iterations;
}

fn benchRandomBytes(func: anytype) !u64 {
    var prng = std.rand.DefaultPrng.init(0);
    const rand = prng.random();

    var timer = try std.time.Timer.start();

    const iterations = 1_000_000;
    for (0..iterations) |_| {
        var name_buf: [128]u8 = undefined;
        const name_len = rand.intRangeAtMostBiased(usize, 1, name_buf.len);
        var name = name_buf[0..name_len];
        rand.bytes(name);
        var result = func(name);
        std.mem.doNotOptimizeAway(&result);
    }

    const elapsed = timer.read();

    return elapsed / iterations;
}

fn benchModified(builtins: []const []const u8, func: anytype) !u64 {
    var prng = std.rand.DefaultPrng.init(0);
    const rand = prng.random();

    var timer = try std.time.Timer.start();

    const iterations = 1_000_000;
    for (0..iterations) |_| {
        const name = builtins[rand.intRangeLessThanBiased(usize, 0, builtins.len)];
        var name_buf: [128]u8 = undefined;
        var modified_name = name_buf[0..name.len];
        @memcpy(modified_name, name);
        modified_name[rand.intRangeLessThanBiased(usize, 0, modified_name.len)] = '?';
        var result = func(modified_name);
        std.mem.doNotOptimizeAway(&result);
    }

    const elapsed = timer.read();

    return elapsed / iterations;
}

About 500KiB removed from a ReleaseSmall build of aro:

1.8M arocc-master
1.3M arocc-dafsa

$ bloaty arocc-dafsa -- arocc-master
    FILE SIZE        VM SIZE    
 --------------  -------------- 
  +8.1% +22.9Ki  +8.1% +22.9Ki    .data.rel.ro
  -0.1%      -8  -0.1%      -8    .eh_frame_hdr
  -1.4% -2.41Ki  -1.4% -2.41Ki    .data
 -35.5% -49.4Ki -35.5% -49.4Ki    .eh_frame
 -56.8%  -201Ki -56.8%  -201Ki    .rodata
 -30.4%  -258Ki -30.4%  -258Ki    .text
 -26.9%  -488Ki -26.9%  -488Ki    TOTAL

Compile time of zig build is reduced by around a second:

$ hyperfine "git checkout master; zig build" "git checkout dafsa; zig build" --prepare "rm -rf zig-cache" --runs 3 --warmup 1
Benchmark #1: git checkout master; zig build
  Time (mean ± σ):     11.760 s ±  0.024 s    [User: 11.387 s, System: 0.605 s]
  Range (min … max):   11.745 s … 11.788 s    3 runs

Benchmark #2: git checkout dafsa; zig build
  Time (mean ± σ):     10.523 s ±  0.016 s    [User: 10.121 s, System: 0.608 s]
  Range (min … max):   10.505 s … 10.534 s    3 runs

Summary
  'git checkout dafsa; zig build' ran
    1.12 ± 0.00 times faster than 'git checkout master; zig build'

Unit tests run faster:

Benchmark #1: ./unit-tests-master
  Time (mean ± σ):     224.9 ms ±   2.6 ms    [User: 210.2 ms, System: 14.5 ms]
  Range (min … max):   220.7 ms … 230.1 ms    13 runs

Benchmark #2: ./unit-tests-dafsa
  Time (mean ± σ):      97.4 ms ±   2.6 ms    [User: 83.1 ms, System: 14.2 ms]
  Range (min … max):    94.2 ms … 103.5 ms    30 runs

Summary
  './unit-tests-dafsa' ran
    2.31 ± 0.07 times faster than './unit-tests-master'

Record tests run slightly slower (not sure why):

Benchmark #1: ./record-runner-master test/records
  Time (mean ± σ):     10.283 s ±  0.037 s    [User: 118.281 s, System: 0.261 s]
  Range (min … max):   10.262 s … 10.327 s    3 runs

Benchmark #2: ./record-runner-dafsa test/records
  Time (mean ± σ):     10.950 s ±  0.209 s    [User: 123.950 s, System: 0.284 s]
  Range (min … max):   10.796 s … 11.187 s    3 runs

Summary
  './record-runner-master test/records' ran
    1.06 ± 0.02 times faster than './record-runner-dafsa test/records'

Integration test times are ~unchanged:

Benchmark #1: ./test-runner-master /home/ryan/Programming/zig/arocc/test/cases zig
  Time (mean ± σ):     382.8 ms ±   7.2 ms    [User: 300.9 ms, System: 88.0 ms]
  Range (min … max):   373.2 ms … 393.2 ms    10 runs

Benchmark #2: ./test-runner-dafsa /home/ryan/Programming/zig/arocc/test/cases zig
  Time (mean ± σ):     392.9 ms ±   3.5 ms    [User: 308.5 ms, System: 90.8 ms]
  Range (min … max):   388.9 ms … 399.7 ms    10 runs

Summary
  './test-runner-master /home/ryan/Programming/zig/arocc/test/cases zig' ran
    1.03 ± 0.02 times faster than './test-runner-dafsa /home/ryan/Programming/zig/arocc/test/cases zig'

Shout out to @Validark for making me aware of DAFSAs.

In the latest builtin .def files, __builtin_altivec_crypto_vsbox was modified to require the "power8-vector" feature
@squeek502 (Contributor, Author) commented Oct 14, 2023

Some notes:

  • Here's a link to the new BuiltinFunction.zig since GitHub gives up on showing the diff: https://github.com/squeek502/arocc/blob/dafsa/src/builtins/BuiltinFunction.zig
  • The fields of BuiltinFunction were kept the same just to make the changes to the source code outside of BuiltinFunction.zig minimal, but because going from Tag to BuiltinFunction is now essentially free (it's just an array access), it probably makes sense to always just store BuiltinFunction.Tag and never BuiltinFunction. This would also allow removing the tag field from BuiltinFunction (but it doesn't reduce the @sizeOf(BuiltinFunction) so removing the field wouldn't necessarily provide an immediate benefit)
  • I just downloaded whatever the latest .def files were from https://github.com/llvm/llvm-project/tree/main/clang/include/clang/Basic and used those, which may not be what you want (maybe a tagged release of LLVM would be preferred?).
  • The new generate_builtins_dafsa.zig cannot handle macros that span multiple lines, which there are some of in the .def files (but AFAICT they are only used in new macros that aren't supported like TARGET_BUILTIN)
  • There are new macros that are unsupported: TARGET_BUILTIN, CUSTOM_BUILTIN, and UNALIASED_CUSTOM_BUILTIN
  • There is a new .def file that is skipped: BuiltinsSME.def

The generate script outputs some information about what is skipped and why, here's its output for generating the BuiltinFunction.zig in this PR:

$ zig run scripts/generate_builtins_dafsa.zig --main-pkg-path . -- scripts/data/Builtins*.def
__builtin_arm_prefetch: param_str changed to "!"
__builtin_arm_rsr64: param_str changed to "!"
__builtin_arm_wsr64: param_str changed to "!"
__builtin_is_constant_evaluated: unknown language 'CXX_LANG', skipping
objc_msgSend: unknown header 'OBJC_MESSAGE_H', skipping
objc_msgSend_fpret: unknown header 'OBJC_MESSAGE_H', skipping
objc_msgSend_fp2ret: unknown header 'OBJC_MESSAGE_H', skipping
objc_msgSend_stret: unknown header 'OBJC_MESSAGE_H', skipping
objc_msgSendSuper: unknown header 'OBJC_MESSAGE_H', skipping
objc_msgSendSuper_stret: unknown header 'OBJC_MESSAGE_H', skipping
objc_getClass: unknown header 'OBJC_RUNTIME_H', skipping
objc_getMetaClass: unknown header 'OBJC_RUNTIME_H', skipping
objc_enumerationMutation: unknown header 'OBJC_RUNTIME_H', skipping
objc_read_weak: unknown header 'OBJC_OBJC_AUTO_H', skipping
objc_assign_weak: unknown header 'OBJC_OBJC_AUTO_H', skipping
objc_assign_ivar: unknown header 'OBJC_OBJC_AUTO_H', skipping
objc_assign_global: unknown header 'OBJC_OBJC_AUTO_H', skipping
objc_assign_strongCast: unknown header 'OBJC_OBJC_AUTO_H', skipping
objc_exception_extract: unknown header 'OBJC_OBJC_EXCEPTION_H', skipping
objc_exception_try_enter: unknown header 'OBJC_OBJC_EXCEPTION_H', skipping
objc_exception_try_exit: unknown header 'OBJC_OBJC_EXCEPTION_H', skipping
objc_exception_match: unknown header 'OBJC_OBJC_EXCEPTION_H', skipping
objc_exception_throw: unknown header 'OBJC_OBJC_EXCEPTION_H', skipping
objc_sync_enter: unknown header 'OBJC_OBJC_SYNC_H', skipping
objc_sync_exit: unknown header 'OBJC_OBJC_SYNC_H', skipping
NSLog: unknown header 'FOUNDATION_NSOBJCRUNTIME_H', skipping
NSLogv: unknown header 'FOUNDATION_NSOBJCRUNTIME_H', skipping
addressof: '&' in params str 'v*v&', skipping
__addressof: '&' in params str 'v*v&', skipping
as_const: '&' in params str 'v&v&', skipping
forward: '&' in params str 'v&v&', skipping
forward_like: '&' in params str 'v&v&', skipping
move: '&' in params str 'v&v&', skipping
move_if_noexcept: '&' in params str 'v&v&', skipping
__builtin_addressof: '&' in params str 'v*v&', skipping
__builtin_function_start: '&' in params str 'v*v&', skipping
__builtin_coro_resume: unknown language 'COR_LANG', skipping
__builtin_coro_destroy: unknown language 'COR_LANG', skipping
__builtin_coro_done: unknown language 'COR_LANG', skipping
__builtin_coro_promise: unknown language 'COR_LANG', skipping
__builtin_coro_size: unknown language 'COR_LANG', skipping
__builtin_coro_align: unknown language 'COR_LANG', skipping
__builtin_coro_frame: unknown language 'COR_LANG', skipping
__builtin_coro_noop: unknown language 'COR_LANG', skipping
__builtin_coro_free: unknown language 'COR_LANG', skipping
__builtin_coro_id: unknown language 'COR_LANG', skipping
__builtin_coro_alloc: unknown language 'COR_LANG', skipping
__builtin_coro_begin: unknown language 'COR_LANG', skipping
__builtin_coro_end: unknown language 'COR_LANG', skipping
__builtin_coro_suspend: unknown language 'COR_LANG', skipping
read_pipe: unknown language 'OCL_PIPE', skipping
write_pipe: unknown language 'OCL_PIPE', skipping
reserve_read_pipe: unknown language 'OCL_PIPE', skipping
reserve_write_pipe: unknown language 'OCL_PIPE', skipping
commit_write_pipe: unknown language 'OCL_PIPE', skipping
commit_read_pipe: unknown language 'OCL_PIPE', skipping
sub_group_reserve_read_pipe: unknown language 'OCL_PIPE', skipping
sub_group_reserve_write_pipe: unknown language 'OCL_PIPE', skipping
sub_group_commit_read_pipe: unknown language 'OCL_PIPE', skipping
sub_group_commit_write_pipe: unknown language 'OCL_PIPE', skipping
work_group_reserve_read_pipe: unknown language 'OCL_PIPE', skipping
work_group_reserve_write_pipe: unknown language 'OCL_PIPE', skipping
work_group_commit_read_pipe: unknown language 'OCL_PIPE', skipping
work_group_commit_write_pipe: unknown language 'OCL_PIPE', skipping
get_pipe_num_packets: unknown language 'OCL_PIPE', skipping
get_pipe_max_packets: unknown language 'OCL_PIPE', skipping
enqueue_kernel: unknown language 'OCL_DSE', skipping
get_kernel_work_group_size: unknown language 'OCL_DSE', skipping
get_kernel_preferred_work_group_size_multiple: unknown language 'OCL_DSE', skipping
get_kernel_max_sub_group_size_for_ndrange: unknown language 'OCL_DSE', skipping
get_kernel_sub_group_count_for_ndrange: unknown language 'OCL_DSE', skipping
to_global: unknown language 'OCL_GAS', skipping
to_local: unknown language 'OCL_GAS', skipping
to_private: unknown language 'OCL_GAS', skipping
__builtin_store_half: unknown language 'ALL_OCL_LANGUAGES', skipping
__builtin_store_halff: unknown language 'ALL_OCL_LANGUAGES', skipping
__builtin_load_half: unknown language 'ALL_OCL_LANGUAGES', skipping
__builtin_load_halff: unknown language 'ALL_OCL_LANGUAGES', skipping
__builtin_get_device_side_mangled_name: unknown language 'CUDA_LANG', skipping
__builtin_hlsl_wave_active_count_bits: unknown language 'HLSL_LANG', skipping
__builtin_hlsl_create_handle: unknown language 'HLSL_LANG', skipping
__builtin_ms_va_start: '&' in params str 'vc*&.', skipping
__builtin_ms_va_end: '&' in params str 'vc*&', skipping
__builtin_ms_va_copy: '&' in params str 'vc*&c*&', skipping
unknown target for file 'scripts/data/BuiltinsSME.def', skipping
3948 builtins, max params: 12

@Validark commented Oct 14, 2023

I can't view all the changes from my phone, but I'm curious how your manual generation compares to the code that the compiler would generate for you automatically. If you sort the builtins by length (greatest to least) and do:

inline for (builtins) |builtin| {
    if (std.mem.eql(u8, cur_identifier, builtin)) {
        return comptime lookup(<...>);
    }
}
return null;

How well does LLVM do by default?

@squeek502 (Contributor, Author) commented Oct 14, 2023

@Validark, that code is basically exactly what std.meta.stringToEnum does for large enums, so here's the benchmark with the enum fields sorted by length:

Lookup time (does name exist?):

------------------- dafsa ------------------
            always found: 83ns per lookup
not found (random bytes): 31ns per lookup
 not found (1 char diff): 83ns per lookup
------------------- enum -------------------
            always found: 1191ns per lookup
not found (random bytes): 68ns per lookup
 not found (1 char diff): 1184ns per lookup

dafsa is 14.3x / 2.2x / 14.3x faster


Lookup time (name -> BuiltinFunction):

------------------- dafsa ------------------
            always found: 154ns per lookup
not found (random bytes): 34ns per lookup
 not found (1 char diff): 156ns per lookup
------------------- enum -------------------
            always found: 1186ns per lookup
not found (random bytes): 68ns per lookup
 not found (1 char diff): 1176ns per lookup

dafsa is 7.7x / 2.0x / 7.5x faster

Considerably better than the unsorted version, but the manually generated DAFSA is still quite a bit faster.

EDIT: This does show that there's an opportunity for an easy stringToEnum optimization (at the cost of compilation time) if the fields were sorted by length before the inline for.
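The sort-by-length idea can be modeled roughly like this (hypothetical Python, not the actual stringToEnum implementation): bucketing candidate names by length means a lookup only ever byte-compares against names whose length already matches, instead of rejecting most candidates one at a time inside the comparison loop:

```python
from collections import defaultdict

def make_lookup(names):
    # Bucket the candidate names by length once, up front.
    by_len = defaultdict(list)
    for index, name in enumerate(names):
        by_len[len(name)].append((name, index))

    def lookup(s):
        # Only names of the same length can possibly match.
        for name, index in by_len.get(len(s), ()):
            if name == s:
                return index
        return None

    return lookup
```

A mismatched length now skips an entire group of candidates in one step, which mirrors why sorting the enum fields by length before the inline for speeds up the failure path.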

@Validark:

Sorry for not digging into the code more before asking. Very interesting! Looks like there are some improvements that could be made to LLVM (and to stringToEnum).

@Vexu (Owner) left a comment

This looks like a great improvement!

A future improvement would be to generate the file as a build step and to store a Builtins.def-like source of truth instead. The same process could then be used for other enums like Diagnostics.Tag and Attribute.Tag.

@Vexu (Owner) left a comment

Could the Tag be changed to a non-exhaustive enum? There also doesn't seem to be a fast path for trying to look up a name that is shorter or longer than any name in the set.

@squeek502 (Contributor, Author) commented Oct 14, 2023

Could the Tag be changed to a non-exhaustive enum?

Currently the information required for this is not directly stored anywhere in the generated code. It is possible to get a list of builtins indirectly by iterating over the data array indexes and then using nameFromTag to get the name from the DAFSA, which requires a buffer to write the name into (this is what BuiltinsIterator does).

There also doesn't seem to be a fast path for trying to lookup a name shorter than or longer than exist in the set.

Will add that (there's already a longest_builtin_name const used by BuiltinsIterator for its name buffer). However, the shortest name is currently 3 characters, so I'm unsure whether it'll make anything faster.

@Vexu (Owner) commented Oct 14, 2023

By non-exhaustive enum I meant enum(u16) { _ } so that it doesn't cast automatically to other types.


enum + fast path
@squeek502 (Contributor, Author) commented Oct 15, 2023

Updated:

  • Tag is now enum(u16) { _ } (this never even crossed my mind as a possibility)
  • Updated --main-pkg-path to --main-mod-path
  • There are fast path length checks in exists and uniqueIndex

The length checks only affect the not found (random bytes) benchmark (since that's the only one where the length could be outside of the shortest...longest range):

Lookup time (does name exist?):

------------------- dafsa ------------------
            always found: 88ns per lookup
not found (random bytes): 24ns per lookup
 not found (1 char diff): 87ns per lookup
------------------- enum -------------------
            always found: 3682ns per lookup
not found (random bytes): 6681ns per lookup
 not found (1 char diff): 3729ns per lookup

dafsa is 41.8x / 278.4x / 42.9x faster


Lookup time (name -> BuiltinFunction):

------------------- dafsa ------------------
            always found: 164ns per lookup
not found (random bytes): 25ns per lookup
 not found (1 char diff): 159ns per lookup
------------------- enum -------------------
            always found: 3741ns per lookup
not found (random bytes): 6919ns per lookup
 not found (1 char diff): 3688ns per lookup

dafsa is 22.8x / 276.8x / 23.2x faster
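The shape of the length fast path, modeled in hypothetical Python (the bounds and the set-based lookup are illustrative stand-ins for the generated constants and the actual DAFSA traversal):

```python
# Illustrative bounds; the generated code derives these from the builtin set.
SHORTEST_NAME = 3
LONGEST_NAME = 45

def exists(name, builtin_set):
    # Reject impossible lengths before doing any DAFSA traversal.
    # This only helps inputs whose length falls outside the range,
    # e.g. the "not found (random bytes)" benchmark.
    if not (SHORTEST_NAME <= len(name) <= LONGEST_NAME):
        return False
    return name in builtin_set  # stand-in for the real DAFSA walk
```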

@Vexu merged commit 0ef9708 into Vexu:master on Oct 15, 2023 (3 checks passed)
@Validark commented Oct 15, 2023

Did the "fast path" improve actual compile times?

@squeek502 (Contributor, Author):

Unsure why you're asking about compile times since the length check is runtime-only, but the fast path seems to have had either no effect or a very slightly negative effect on the "always found" and "not found (1 char diff)" benchmarks, which always fail the fast path (compare the dafsa results in #524 (comment) vs the OP).

It may or may not be worth it, but it doesn't seem to have a clear enough downside to be worth worrying about too much.

@Validark:
Sorry for the ambiguity; by "compile times" I meant how long it takes to parse and do the pass that needs to identify builtins. You say this "fast path" optimization may only slightly worsen performance; I'm saying that if that's the case, it's not worth it as an optimization.

A branch predictor can backfire. If half of all identifiers went down the early-out "fast path" and an equal number went down the DAFSA, and the order were completely random so the branch predictor couldn't "find a pattern" (get lucky), you would expect it to guess wrong 50% of the time. A branch mispredict can waste a lot of cycles; given the very low nanosecond counts in your measurements, it seems to me a mispredict might take up something like 10% of execution time.

In my Accelerated Zig Parser's tokenizer, there is a fast keyword/identifier algorithm. I tried adding a "fast path" (in fact, several different early-outs), and they all hurt performance. It turned out that the work the "fast path" avoided cost less than a mispredict. Since your main path is extremely fast, adding an early out can hurt performance.

@squeek502 (Contributor, Author):

I currently don't have a benchmark that would let me test it properly in a real-world scenario; I'd need to set something up for that. I can fully believe the "fast path" is detrimental, and I'm sure there are plenty of other performance gains that could be squeezed out of my implementation, but I'm okay with leaving that on the table for now.
