-
Notifications
You must be signed in to change notification settings - Fork 32
Update symbol naming scheme to avoid long and duplicate names. #382
Conversation
for ty in tyvec { | ||
name += &format!("_{}", ty.display(&self.get_type_display_ctx())) | ||
} | ||
name.replace([':', '<', '>'], "_").replace(", ", "_") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a reason to call replace
twice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case, the first call to replace is replacing any of several single-char patterns with a string. The second case is replace a multi-char string with a string.
Doing both can't be done in one call to replace because the first parameters are two different types (array of char vs. string).
This could probably be written with the input just as an array of strings:
name.replace([":", "<", ">", ", "], "_")
It's (possibly) slightly less efficient because searching for the patterns then requires iterating over strings instead of single chars, though two separate calls to replace
could also be less efficient for iterating name
twice, and generating an intermediate string (yeah it's probably more efficient to use one replace
instead of two).
Note this snippet of code is being removed in this patch.
/// | ||
/// The scheme is: | ||
/// | ||
/// - 16 bytes - The low 8 bytes of the module address, hex encoded. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aren't we losing bytes by converting to hex?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. All the human-readable parts of the symbol encoding are lossy and just for human readability. The hash is the only thing in the encoding that ensures uniqueness.
It isn't possible to hex-encode the full module address into symbol names because it would take more than the 63 bytes available to the symbol.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure how much readability is lost if we could just print the module address as is. either way is fine if we don't need extra 8 bytes for other purposes.
); | ||
assert!(symbol.len() < 64); | ||
|
||
symbol |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does hasher generate a deterministic symbol name? in that case we'd need a test case to check for the name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is deterministic. I'll add a test case.
This looks good to me. It seems that RBPF entrypoint tests need to be updated to pass correct entrypoint function name in instruction_data. |
8938ad5
to
4f3eadc
Compare
This patch doesn't change the names of the entry point functions since I wasn't sure whether or how I should change them, so as of now I don't think any of the tests need to change, unless I'm misunderstanding. |
I've addressed the reviews and fixed the expected IRs to pass CI. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks for clarifying my comments.
This fixes two problems with symbol naming.
Name collisions
Two functions with the same module and function name, but different addresses, will have the same symbol name, and generate a link error. Example
Long symbols
rbpf supports symbol names up to 64 characters (63 + a nil byte). Our current symbol naming will easily generate symbol names that are too long.
This patch uses an encoding scheme that guarantees short and unique symbols. That scheme is described fully in the comments.
Here is an example of a symbol generated by this scheme:
It is different from the one suggested in #378 (comment) for a few reasons:
I also allocated 15 bytes to each of the module name and the function name. I think it is arguable that the module name is less important and often short compared to the function name, and those bytes could be reduced to add bytes elsewhere. e.g. the hash here is significantly truncated, so bytes could be added to it, but I also don't feel strongly that more bytes elsewhere will meaningfully improve this scheme.
Encoding the address and hash into all symbols ensures that all symbols are fairly long, so there could be concern about binary size. The only symbols names that will appear in the final binary though are ones that need to be relocated, which currently seems to include only public functions. If the compilation model changes in the future, e.g. using LTO to combine compilation units (or just compiling all modules as one compilation unit to begin with), we could possibly avoid those relocations, but I am not sure.
All the work needed to generate symbols here is arguably inefficient and could be cached for later lookups, but with our small workloads I am not thinking it matters. In casual testing rbf-tests takes approximately the same time to execute after this patch.
This uses blake3 for the hash because it is strong and fast, and base58 to encode the hash to valid symbol names.
It leaves alone the naming of the entrypoint symbols as I don't understand that code enough to know if and how it should change.
Fixes #303
Fixes #378