This repository has been archived by the owner on Mar 20, 2024. It is now read-only.

Update symbol naming scheme to avoid long and duplicate names. #382

Merged (3 commits) on Oct 20, 2023

Collaborator

@brson brson commented Oct 5, 2023

This fixes two problems with symbol naming.

Name collisions

Two functions with the same module name and function name, but different module addresses, get the same symbol name, which causes a link error. Example:

module 0x1::foo {
  public fun a(): u32 {
    2
  }
}

module 0x2::foo {
  public fun a(): u32 {
    2
  }
}

Long symbols

rbpf supports symbol names up to 64 bytes (63 characters plus a nil byte). Our current naming scheme easily generates symbol names that are too long.


This patch uses an encoding scheme that guarantees short and unique symbols. That scheme is described fully in the comments.

Here is an example of a symbol generated by this scheme:

0000000000000010_tests_test_vec_struct_71fWuFLGmmLpqR

It is different from the one suggested in #378 (comment) for a few reasons:

  • Type params need to be included. Here they are just part of the hash, not part of the readable name.
  • I included three visual separators for readability.
  • I stuffed every datum into a single hash instead of multiple hashes. The main downside to this is it is not possible to perfectly identify e.g. which module a symbol is from by looking at a dedicated module hash. The upside is that it allows an arbitrary amount of data to be stashed in the one hash (like the type params) without spending bytes on separate hashes.

I also allocated 15 bytes to each of the module name and the function name. It is arguable that the module name is less important than the function name and often shorter, so some of its bytes could be reallocated elsewhere. For example, the hash here is significantly truncated, so bytes could be added to it. I don't feel strongly that moving bytes around will meaningfully improve this scheme, though.
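The byte budget implied by these numbers can be written out as a small sketch. The constant names are hypothetical, and the conclusion that 14 bytes remain for the hash is only inferred from the figures in this description and the length of the example hash above; the patch's comments are the authoritative statement of the scheme.

```rust
// Byte budget for one symbol, using the figures quoted in this description.
// All names here are illustrative, not taken from the patch.
const MAX_SYMBOL_LEN: usize = 63; // rbpf limit, excluding the trailing nil byte
const ADDRESS_HEX_LEN: usize = 16; // low 8 bytes of the address, 2 hex digits each
const MODULE_NAME_MAX: usize = 15;
const FUNCTION_NAME_MAX: usize = 15;
const SEPARATORS: usize = 3; // the three `_` visual separators

/// Bytes left for the encoded hash when both names use their full allowance.
const fn hash_budget() -> usize {
    MAX_SYMBOL_LEN - ADDRESS_HEX_LEN - MODULE_NAME_MAX - FUNCTION_NAME_MAX - SEPARATORS
}
```

Under these assumptions the budget works out to 14 bytes, which matches the length of the hash part of the example symbol. Shrinking the module name allowance would free bytes for the hash one for one.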

Encoding the address and hash into all symbols ensures that all symbols are fairly long, so there could be concern about binary size. The only symbol names that will appear in the final binary, though, are ones that need to be relocated, which currently seems to include only public functions. If the compilation model changes in the future, e.g. using LTO to combine compilation units (or just compiling all modules as one compilation unit to begin with), we could possibly avoid those relocations, but I am not sure.

All the work needed to generate symbols here is arguably inefficient and could be cached for later lookups, but with our small workloads I don't think it matters. In casual testing, rbpf-tests takes approximately the same time to execute after this patch.

This uses blake3 for the hash because it is strong and fast, and base58 to encode the hash to valid symbol names.
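For readers who want to see the shape of the scheme without opening the patch, here is a rough, standard-library-only sketch. It is not the code in this PR: FNV-1a stands in for blake3, the base58 encoder is hand-rolled instead of using a crate, the hash part is 11 characters rather than the length the patch uses, and the 32-byte address with its low bytes last is an assumption. The function names `symbol`, `fnv1a`, `base58_u64`, and `truncate` are all hypothetical.

```rust
/// Base58 alphabet (Bitcoin variant): omits 0, O, I, and l.
const B58: &[u8; 58] = b"123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz";

/// FNV-1a, 64-bit. A stand-in for blake3 so the sketch needs no crates.
fn fnv1a(data: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for b in data {
        h ^= *b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// Encode a u64 as exactly 11 base58 digits (58^11 > 2^64), padded with '1'.
fn base58_u64(mut n: u64) -> String {
    let mut out = [b'1'; 11];
    for slot in out.iter_mut().rev() {
        *slot = B58[(n % 58) as usize];
        n /= 58;
    }
    String::from_utf8(out.to_vec()).unwrap()
}

/// Keep at most `max` bytes. Assumes ASCII identifiers.
fn truncate(s: &str, max: usize) -> &str {
    &s[..s.len().min(max)]
}

/// Build `<address hex>_<module>_<function>_<hash>`.
/// Assumes a 32-byte address whose low 8 bytes are the last 8.
fn symbol(address: &[u8; 32], module: &str, function: &str, type_params: &[&str]) -> String {
    // Everything goes into the one hash, including data that the
    // human-readable prefix truncates or omits (full address, type params).
    let mut data = address.to_vec();
    for part in [module, function].iter().chain(type_params.iter()) {
        data.extend_from_slice(part.as_bytes());
        data.push(0); // separator, so ("ab", "c") and ("a", "bc") hash differently
    }
    let addr_hex: String = address[24..].iter().map(|b| format!("{:02x}", b)).collect();
    let sym = format!(
        "{}_{}_{}_{}",
        addr_hex,
        truncate(module, 15),
        truncate(function, 15),
        base58_u64(fnv1a(&data))
    );
    assert!(sym.len() < 64);
    sym
}
```

With an address whose last byte is 0x10, `symbol(&addr, "tests", "test_vec_struct", &[])` yields a name starting with `0000000000000010_tests_test_vec_struct_`. Two addresses that differ only in their high bytes share that readable prefix and are told apart by the hash alone, which is the trade-off described in the bullet list above.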

It leaves the naming of the entrypoint symbols alone, as I don't understand that code well enough to know if and how it should change.

Fixes #303
Fixes #378

@ksolana ksolana requested review from dmakarov and nvjle and removed request for dmakarov October 6, 2023 15:57
for ty in tyvec {
    name += &format!("_{}", ty.display(&self.get_type_display_ctx()))
}
name.replace([':', '<', '>'], "_").replace(", ", "_")
Collaborator

is there a reason to call replace twice?

Collaborator Author

@brson brson Oct 19, 2023

In this case, the first call to replace replaces any of several single-char patterns with a string. The second call replaces a multi-char string with a string.

Doing both can't be done in one call to replace because the first parameters are two different types (array of char vs. string).

This could probably be written with the input just as an array of strings:

name.replace([":", "<", ">", ", "], "_")

It's (possibly) slightly less efficient because searching for the patterns then requires iterating over strings instead of single chars. On the other hand, two separate calls to replace iterate over name twice and generate an intermediate string, so a single replace is probably more efficient overall.

Note this snippet of code is being removed in this patch.
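For reference, a minimal sketch of the two-step replacement under discussion, pulled out into a hypothetical helper named `sanitize` (the original code operates on `name` inline). One caveat on the single-call alternative: as far as I know, the standard library implements `Pattern` for arrays of `char` but not for arrays of `&str`, so the `[":", "<", ">", ", "]` form may not compile as written; treat it as an idea rather than a drop-in change.

```rust
/// Hypothetical helper mirroring the snippet this patch removes.
fn sanitize(name: &str) -> String {
    name
        // First pass: a `[char; 3]` pattern matches any one of the three chars.
        .replace([':', '<', '>'], "_")
        // Second pass: a `&str` pattern matches the two-char sequence ", ".
        .replace(", ", "_")
}
```

For an input such as `0x1::foo::a<u8, u64>` this produces `0x1__foo__a_u8_u64_`. That readable form grows with every type parameter, which is part of why the new scheme moves type parameters into the hash.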

///
/// The scheme is:
///
/// - 16 bytes - The low 8 bytes of the module address, hex encoded.
Collaborator

aren't we losing bytes by converting to hex?

Collaborator Author

Yes. All the human-readable parts of the symbol encoding are lossy and just for human readability. The hash is the only thing in the encoding that ensures uniqueness.

It isn't possible to hex-encode the full module address into symbol names because it would take more than the 63 bytes available to the symbol.

Collaborator

not sure how much readability is lost if we could just print the module address as is. either way is fine if we don't need extra 8 bytes for other purposes.

);
assert!(symbol.len() < 64);

symbol
Collaborator

does hasher generate a deterministic symbol name? in that case we'd need a test case to check for the name.

Collaborator Author

Yes, it is deterministic. I'll add a test case.

@dmakarov
Collaborator

dmakarov commented Oct 6, 2023

This looks good to me. It seems that RBPF entrypoint tests need to be updated to pass correct entrypoint function name in instruction_data.

@brson brson force-pushed the symbol-names-2 branch 2 times, most recently from 8938ad5 to 4f3eadc Compare October 19, 2023 21:53
@brson
Collaborator Author

brson commented Oct 19, 2023

> This looks good to me. It seems that RBPF entrypoint tests need to be updated to pass correct entrypoint function name in instruction_data.

This patch doesn't change the names of the entry point functions since I wasn't sure whether or how I should change them, so as of now I don't think any of the tests need to change, unless I'm misunderstanding.

@brson
Collaborator Author

brson commented Oct 19, 2023

I've addressed the reviews and fixed the expected IRs to pass CI.

Collaborator

@ksolana ksolana left a comment


LGTM. Thanks for clarifying my comments.

@brson brson merged commit 84574fc into anza-xyz:llvm-sys Oct 20, 2023
8 checks passed
Successfully merging this pull request may close these issues.

[Bug] Force move symbols to be less than 64 chars
[Bug] Disambiguate symbol names by module address