Update symbol naming scheme to avoid long and duplicate names. #382

brson · 2023-10-05T20:58:07Z

This fixes two problems with symbol naming.

Name collisions

Two functions with the same module name and function name but different addresses get the same symbol name, which causes a link error. Example:

module 0x1::foo {
  public fun a(): u32 {
    2
  }
}

module 0x2::foo {
  public fun a(): u32 {
    2
  }
}
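
To make the collision concrete, here is a minimal sketch. The `name_without_address` helper and its `module__function` format are hypothetical and only stand in for any naming that ignores the address; they are not the compiler's actual old scheme.

```rust
use std::collections::HashMap;

/// A fully qualified Move function: (address, module, function).
type FunctionId = (u64, &'static str, &'static str);

/// Hypothetical address-free naming, used only to illustrate the collision.
fn name_without_address(id: &FunctionId) -> String {
    format!("{}__{}", id.1, id.2)
}

/// Returns every symbol name that is claimed by more than one function.
fn find_collisions(ids: &[FunctionId]) -> Vec<String> {
    let mut owners: HashMap<String, usize> = HashMap::new();
    for id in ids {
        *owners.entry(name_without_address(id)).or_insert(0) += 1;
    }
    let mut dups: Vec<String> = owners
        .into_iter()
        .filter(|(_, count)| *count > 1)
        .map(|(name, _)| name)
        .collect();
    dups.sort();
    dups
}

fn main() {
    // 0x1::foo::a and 0x2::foo::a from the example above.
    let ids = [(0x1, "foo", "a"), (0x2, "foo", "a")];
    println!("colliding symbols: {:?}", find_collisions(&ids));
}
```

Any scheme that drops the address maps both functions to one name, which is what the linker then rejects.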

Long symbols

rbpf supports symbol names up to 64 characters (63 plus a NUL byte). Our current symbol naming easily generates symbol names that are too long.

This patch uses an encoding scheme that guarantees short and unique symbols. That scheme is described fully in the comments.

Here is an example of a symbol generated by this scheme:

0000000000000010_tests_test_vec_struct_71fWuFLGmmLpqR

It is different from the one suggested in #378 (comment) for a few reasons:

Type params need to be included. Here they are just part of the hash, not part of the readable name.
I included three visual separators for readability.
I stuffed every datum into a single hash instead of multiple hashes. The main downside to this is it is not possible to perfectly identify e.g. which module a symbol is from by looking at a dedicated module hash. The upside is that it allows an arbitrary amount of data to be stashed in the one hash (like the type params) without spending bytes on separate hashes.

I also allocated 15 bytes to each of the module name and the function name. It is arguable that the module name is less important than the function name and often shorter, so some of its bytes could be reallocated elsewhere. For example, the hash here is significantly truncated, so bytes could be added to it. I don't feel strongly that more bytes elsewhere would meaningfully improve this scheme, though.
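
The byte budget can be sketched as follows. The 16, 15, 15 and 3 figures come from this thread; the 14 bytes left for the hash are inferred from the example symbol above, and `layout_symbol` and `clip` are hypothetical helpers, not the functions in the patch.

```rust
// Byte budget for one symbol, taken from the figures in this thread.
const MAX_SYMBOL_LEN: usize = 63; // rbpf limit, excluding the NUL byte
const ADDRESS_LEN: usize = 16; // low 8 bytes of the address, hex encoded
const MODULE_LEN: usize = 15;
const FUNCTION_LEN: usize = 15;
const SEPARATORS: usize = 3;
// Whatever is left over goes to the hash: 63 - 16 - 15 - 15 - 3 = 14.
const HASH_LEN: usize = MAX_SYMBOL_LEN - ADDRESS_LEN - MODULE_LEN - FUNCTION_LEN - SEPARATORS;

/// Keeps at most `max` characters (assumes ASCII identifiers, one byte each).
fn clip(s: &str, max: usize) -> String {
    s.chars().take(max).collect()
}

/// Hypothetical helper: lays out the readable parts and an already-encoded hash.
fn layout_symbol(addr_low: u64, module: &str, function: &str, hash: &str) -> String {
    let symbol = format!(
        "{:016x}_{}_{}_{}",
        addr_low,
        clip(module, MODULE_LEN),
        clip(function, FUNCTION_LEN),
        clip(hash, HASH_LEN),
    );
    assert!(symbol.len() <= MAX_SYMBOL_LEN);
    symbol
}

fn main() {
    let s = layout_symbol(0x10, "tests", "test_vec_struct", "71fWuFLGmmLpqR");
    println!("{} ({} bytes)", s, s.len());
}
```

Shrinking `MODULE_LEN` in this sketch would automatically widen `HASH_LEN`, which is the trade-off discussed above.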

Encoding the address and hash into all symbols ensures that all symbols are fairly long, so there could be concern about binary size. The only symbol names that will appear in the final binary, though, are ones that need to be relocated, which currently seems to include only public functions. If the compilation model changes in the future, e.g. using LTO to combine compilation units (or just compiling all modules as one compilation unit to begin with), we could possibly avoid those relocations, but I am not sure.

All the work needed to generate symbols here is arguably inefficient and could be cached for later lookups, but with our small workloads I don't think it matters. In casual testing rbpf-tests takes approximately the same time to execute after this patch.

This uses blake3 for the hash because it is strong and fast, and base58 to encode the hash to valid symbol names.
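
A dependency-free sketch of the hash-then-encode step follows. It is not the patch's code: `DefaultHasher` from the standard library stands in for blake3 so the example needs no external crates, the base58 encoder is hand-written with the Bitcoin alphabet, and `hash_suffix` and its 14-character truncation are assumptions for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Bitcoin-style base58 alphabet: alphanumeric, without 0, O, I and l.
const ALPHABET: &[u8; 58] = b"123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz";

/// Encodes bytes as base58 by treating them as one big-endian integer.
fn base58_encode(input: &[u8]) -> String {
    // Base-58 digits of the running value, least significant first.
    let mut digits: Vec<u8> = Vec::new();
    for &byte in input {
        // value = value * 256 + byte, carried through the digits.
        let mut carry = byte as u32;
        for d in digits.iter_mut() {
            carry += (*d as u32) << 8;
            *d = (carry % 58) as u8;
            carry /= 58;
        }
        while carry > 0 {
            digits.push((carry % 58) as u8);
            carry /= 58;
        }
    }
    // Each leading zero byte is written as the first alphabet character.
    let zeros = input.iter().take_while(|&&b| b == 0).count();
    let mut out = String::with_capacity(zeros + digits.len());
    out.extend(std::iter::repeat('1').take(zeros));
    out.extend(digits.iter().rev().map(|&d| ALPHABET[d as usize] as char));
    out
}

/// Hypothetical: hashes every datum into one digest and keeps 14 characters.
/// DefaultHasher is a stand-in here; the patch itself uses blake3.
fn hash_suffix(address: &[u8], module: &str, function: &str, type_params: &[&str]) -> String {
    let mut hasher = DefaultHasher::new();
    address.hash(&mut hasher);
    module.hash(&mut hasher);
    function.hash(&mut hasher);
    type_params.hash(&mut hasher);
    let digest = hasher.finish().to_be_bytes();
    base58_encode(&digest).chars().take(14).collect()
}

fn main() {
    println!("{}", hash_suffix(&[0x10], "tests", "test_vec_struct", &["u64"]));
}
```

Because the base58 alphabet is purely alphanumeric, the encoded hash can be appended to a symbol name without further escaping.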

It leaves the naming of the entrypoint symbols alone, as I don't understand that code well enough to know if and how it should change.

Fixes #303
Fixes #378

ksolana · 2023-10-06T16:25:06Z

language/solana/move-to-solana/src/stackless/extensions.rs

-            for ty in tyvec {
-                name += &format!("_{}", ty.display(&self.get_type_display_ctx()))
-            }
-            name.replace([':', '<', '>'], "_").replace(", ", "_")


is there a reason to call replace twice?

brson · 2023-10-19

In this case, the first call to replace replaces any of several single-char patterns with a string. The second call replaces a multi-char string with a string.

Both replacements can't be done in one call to replace because the patterns are of two different types (array of char vs. string).

This could probably be written with the input just as an array of strings:

name.replace([":", "<", ">", ", "], "_")

It's (possibly) slightly less efficient because searching for the patterns then requires comparing strings instead of single chars. On the other hand, two separate calls to replace iterate over name twice and generate an intermediate string, so a single replace is probably more efficient overall.

Note this snippet of code is being removed in this patch.
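
For reference, a sketch of the two options using only the standard library. As far as I can tell, `Pattern` is implemented for `char`, `&str`, arrays and slices of `char`, and closures, but not for arrays of `&str`, so the array-of-strings form above would not compile as written. The function names below are hypothetical; the single-pass version is one way to get the same result without an intermediate string.

```rust
/// The original approach: one call for the single chars, one for ", ".
fn sanitize_two_calls(name: &str) -> String {
    name.replace([':', '<', '>'], "_").replace(", ", "_")
}

/// Single pass over the input, producing the same output as the two calls.
fn sanitize_single_pass(name: &str) -> String {
    let mut out = String::with_capacity(name.len());
    let mut chars = name.chars().peekable();
    while let Some(c) = chars.next() {
        match c {
            ':' | '<' | '>' => out.push('_'),
            // ", " collapses to one underscore; a lone ',' is kept as-is.
            ',' if chars.peek() == Some(&' ') => {
                chars.next();
                out.push('_');
            }
            other => out.push(other),
        }
    }
    out
}

fn main() {
    let name = "0x1::foo::a<u64, bool>";
    println!("{}", sanitize_two_calls(name));
    println!("{}", sanitize_single_pass(name));
}
```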

ksolana · 2023-10-06T16:28:09Z

language/solana/move-to-solana/src/stackless/extensions.rs

+    ///
+    /// The scheme is:
+    ///
+    /// - 16 bytes - The low 8 bytes of the module address, hex encoded.


aren't we losing bytes by converting to hex?

brson · 2023-10-19

Yes. All the human-readable parts of the symbol encoding are lossy and exist only for human readability. The hash is the only thing in the encoding that ensures uniqueness.

It isn't possible to hex-encode the full module address into symbol names because it would take more than the 63 bytes available to the symbol.

ksolana · 2023-10-20

not sure how much readability is lost if we could just print the module address as is. either way is fine if we don't need the extra 8 bytes for other purposes.
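
The length argument can be checked with a short sketch. It assumes a 32-byte address stored big-endian, which is not stated in this thread; `hex` and `low8_hex` are hypothetical helpers.

```rust
/// Hex-encodes bytes as lowercase, two characters per byte.
fn hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{:02x}", b)).collect()
}

/// Readable prefix: only the low 8 bytes (assumes a 32-byte, big-endian address).
fn low8_hex(address: &[u8; 32]) -> String {
    hex(&address[24..])
}

fn main() {
    let mut a = [0u8; 32];
    a[31] = 0x10;
    let mut b = a;
    b[0] = 0xff; // differs from `a` only in a high byte

    // The full address alone would already exceed the 63-byte limit.
    println!("full address: {} hex chars", hex(&a).len());
    // Both addresses share the same readable prefix, so only the hash
    // component of the symbol can tell them apart.
    println!("a: {}, b: {}", low8_hex(&a), low8_hex(&b));
}
```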

ksolana · 2023-10-06T16:32:09Z

language/solana/move-to-solana/src/stackless/extensions.rs

+        );
+        assert!(symbol.len() < 64);
+
+        symbol


does hasher generate a deterministic symbol name? in that case we'd need a test case to check for the name.

brson · 2023-10-19

Yes, it is deterministic. I'll add a test case.

dmakarov · 2023-10-06T17:34:33Z

This looks good to me. It seems that RBPF entrypoint tests need to be updated to pass correct entrypoint function name in instruction_data.

brson · 2023-10-19T22:57:48Z

> This looks good to me. It seems that RBPF entrypoint tests need to be updated to pass correct entrypoint function name in instruction_data.

This patch doesn't change the names of the entry point functions since I wasn't sure whether or how I should change them, so as of now I don't think any of the tests need to change, unless I'm misunderstanding.

brson · 2023-10-19T22:58:10Z

I've addressed the reviews and fixed the expected IRs to pass CI.

ksolana

LGTM. Thanks for clarifying my comments.

ksolana requested review from dmakarov and nvjle and removed request for dmakarov October 6, 2023 15:57

ksolana reviewed Oct 6, 2023

Update symbol naming scheme to avoid long and duplicate names.

f612b3a

brson force-pushed the symbol-names-2 branch from b75bf8e to ebab65d Compare October 19, 2023 20:04

Add test of symbol naming

7c80f1b

brson force-pushed the symbol-names-2 branch 2 times, most recently from 8938ad5 to 4f3eadc Compare October 19, 2023 21:53

Update expected IRs in tests

413784a

brson force-pushed the symbol-names-2 branch from 4f3eadc to 413784a Compare October 19, 2023 22:19

dmakarov approved these changes Oct 19, 2023

ksolana approved these changes Oct 20, 2023

brson merged commit 84574fc into anza-xyz:llvm-sys Oct 20, 2023
8 checks passed
