Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Yaml serialization includes std::filesystem::path and vector<uint8_t> #22277

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

SeanCurtis-TRI
Copy link
Contributor

@SeanCurtis-TRI SeanCurtis-TRI commented Dec 5, 2024

Path member can now be serialized like other basic member types.

Serializing something of type vector<uint8_t> is interpreted as a byte string. If all bytes are printable, it is equivalent to a string. If there are any non-printable characters, it is converted to base64 and prefixed with YAML's !!binary tag.

MemoryFile makes use of this to serialize its file contents (it creates a temporary vector<uint8_t> for serialization to trigger the potential binary yaml encoding.


This change is Reviewable

Path member can now be serialized like other basic member types.

Serializing something of type vector<uint8_t> is interpreted as a byte
string. If all bytes are printable, it is equivalent to a string. If there
are any non-printable characters, it is converted to base64 and prefixed
with YAML's !!binary tag.

MemoryFile makes use of this to serialize its file contents (it creates
a temporary vector<uint8_t> for serialization to trigger the potential
binary yaml encoding.

TODO:
 1. I need unit tests in python.
 2. Doing: `content: !binary lskdjflskdjf` is different in python vs C++.
    - python complains that there is no constructor for !binary.
    - C++ ignores !binary.
    Resolve this discrepancy when operating on undefined local tags.
Copy link
Contributor Author

@SeanCurtis-TRI SeanCurtis-TRI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 unresolved discussion, needs platform reviewer assigned, needs at least two assigned reviewers, missing label for release notes (waiting on @SeanCurtis-TRI)


a discussion (no related file):
@jwnimmer-tri I'd like you to take a look at this.

First, while cpp_yaml has Binary it is not a broadly supported type. It can only be used directly with an Emitter and not the EmitFromEvents we use. The reading has similar issues. Seeing the global !!binary tag does not lead to the creation of a Binary (or automatic decoding). So, the mapping is provided by hand in Drake's read/write io.

Several open questions I'd like feedback on.

  1. Are we good with the logic regarding std::vector<uint8_t> as maybe producing !!binary ....?
  2. Where should that get documented?
  3. If there's a typo on the !!binary tag (e.g., a local tag !binary) our YAML functionality behaves quite differently from, eg., pyyaml. Pyyaml complains it has no constructor for !binary. We simply ignore it (treating it as if it weren't there).
  4. Can you elaborate on what particular testing in python you'd like to see. I haven't implemented the straightforward stuff yet (it'll come tomorrow). But I got the impression from our conversation that there's something more nuanced I haven't considered.

Copy link
Contributor Author

@SeanCurtis-TRI SeanCurtis-TRI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 unresolved discussion, needs platform reviewer assigned, needs at least two assigned reviewers, missing label for release notes (waiting on @SeanCurtis-TRI)


a discussion (no related file):

Previously, SeanCurtis-TRI (Sean Curtis) wrote…

@jwnimmer-tri I'd like you to take a look at this.

First, while cpp_yaml has Binary it is not a broadly supported type. It can only be used directly with an Emitter and not the EmitFromEvents we use. The reading has similar issues. Seeing the global !!binary tag does not lead to the creation of a Binary (or automatic decoding). So, the mapping is provided by hand in Drake's read/write io.

Several open questions I'd like feedback on.

  1. Are we good with the logic regarding std::vector<uint8_t> as maybe producing !!binary ....?
  2. Where should that get documented?
  3. If there's a typo on the !!binary tag (e.g., a local tag !binary) our YAML functionality behaves quite differently from, eg., pyyaml. Pyyaml complains it has no constructor for !binary. We simply ignore it (treating it as if it weren't there).
  4. Can you elaborate on what particular testing in python you'd like to see. I haven't implemented the straightforward stuff yet (it'll come tomorrow). But I got the impression from our conversation that there's something more nuanced I haven't considered.

I suspect the answer to (2) is probably yaml_doxygen.h, so I'll also peruse that at the same time I'm dealing with the python tests.

@jwnimmer-tri jwnimmer-tri self-assigned this Dec 5, 2024
Copy link
Collaborator

@jwnimmer-tri jwnimmer-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 13 files at r1.
Reviewable status: 5 unresolved discussions, LGTM missing from assignee jwnimmer-tri(platform), needs at least two assigned reviewers, missing label for release notes (waiting on @SeanCurtis-TRI)


a discussion (no related file):

Previously, SeanCurtis-TRI (Sean Curtis) wrote…

I suspect the answer to (2) is probably yaml_doxygen.h, so I'll also peruse that at the same time I'm dealing with the python tests.

(1) I did some study, and this is actually a long discussion. Let's have a call about it? The two basic questions are:

(a) When reading yaml and we encounter a !!str but our Python field schema says bytes, is that an error or should we automatically call value.encode('utf-8') to convert the unicode string into bytes as a convenience? This a question of what semantics we want our loaders to obey. Similarly if the scalar has the non-specific tag ? then (how) do we resolve the tag?

(b) How does C++ signal to the yaml writer that a certain field is an octet stream? I don't think std::vector<uint8_t> can cut it -- we need to allow for actual lists of (small) integers to output as a yaml !!seq of !!int. The options are either to key on std::vector<std::byte> as the static type, or to add a little flag to MakeNameValue to say that that the T = std::string is not unicode.


(3) If this is not unique to !binary (and I imagine it's not), we can defer it for another PR. Our error-checking of assigned tags vs scalar fields in out structs (& dataclasses) is probably lacking.


(4) For yaml stuff, it's actually best to write the test suite first and the implementation second. The suite should be roughly the same for C++ and python, and cover all of the relevant and interesting cases in the yaml specs for the types we're working with.

We'll need to agree on the full contract of (1a) first before we can write out that test matrix.


common/test/memory_file_test.cc line 126 at r1 (raw file):

/** Confirm that this can be serialized appropriately. */
GTEST_TEST(MemoryFileTest, Serialization) {
  // Emtpy file.

typo


common/test/memory_file_test.cc line 181 at r1 (raw file):

  // Invalid characters in base64 are not ignored; they become zeros in the
  // decoding. We'll put *four* invalid characters in; that leads to exactly
  // five null bytes before we get back into the expected string.

Skimming the yaml-cpp code, its YAML::DecodeBase64 function does seem to detect errors properly, so we should probably just use that instead of CRU?


common/test/memory_file_test.cc line 181 at r1 (raw file):

  // Invalid characters in base64 are not ignored; they become zeros in the
  // decoding. We'll put *four* invalid characters in; that leads to exactly
  // five null bytes before we get back into the expected string.

In any case, testing yaml error handling details belongs in the yaml_read test(s) in drake/common, not here. (Maybe this was just a temporary home.)


common/yaml/yaml_node.h line 174 at r1 (raw file):

  static constexpr std::string_view kTagStr{"tag:yaml.org,2002:str"};

  // https://yaml.org/spec/1.2.2/#24-tags (see Example 2.23).

The example is good, but isn't really a specification. I think we should link to either only the spec (https://yaml.org/type/binary.html) or else both the spec and the example.

Copy link
Contributor Author

@SeanCurtis-TRI SeanCurtis-TRI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 5 unresolved discussions, LGTM missing from assignee jwnimmer-tri(platform), needs at least two assigned reviewers, missing label for release notes


a discussion (no related file):

Previously, jwnimmer-tri (Jeremy Nimmer) wrote…

(1) I did some study, and this is actually a long discussion. Let's have a call about it? The two basic questions are:

(a) When reading yaml and we encounter a !!str but our Python field schema says bytes, is that an error or should we automatically call value.encode('utf-8') to convert the unicode string into bytes as a convenience? This a question of what semantics we want our loaders to obey. Similarly if the scalar has the non-specific tag ? then (how) do we resolve the tag?

(b) How does C++ signal to the yaml writer that a certain field is an octet stream? I don't think std::vector<uint8_t> can cut it -- we need to allow for actual lists of (small) integers to output as a yaml !!seq of !!int. The options are either to key on std::vector<std::byte> as the static type, or to add a little flag to MakeNameValue to say that that the T = std::string is not unicode.


(3) If this is not unique to !binary (and I imagine it's not), we can defer it for another PR. Our error-checking of assigned tags vs scalar fields in out structs (& dataclasses) is probably lacking.


(4) For yaml stuff, it's actually best to write the test suite first and the implementation second. The suite should be roughly the same for C++ and python, and cover all of the relevant and interesting cases in the yaml specs for the types we're working with.

We'll need to agree on the full contract of (1a) first before we can write out that test matrix.

For now I'm going to do the following:

  1. Strip out std::filesystem::path support into its own PR.
  2. Make MemoryFile::Serialize() simply throw for now. We can resolve the question of how to serialize its contents later.

common/test/memory_file_test.cc line 181 at r1 (raw file):

Previously, jwnimmer-tri (Jeremy Nimmer) wrote…

Skimming the yaml-cpp code, its YAML::DecodeBase64 function does seem to detect errors properly, so we should probably just use that instead of CRU?

Right. In that case, it simply returns a zero-length vector.


common/test/memory_file_test.cc line 181 at r1 (raw file):

Previously, jwnimmer-tri (Jeremy Nimmer) wrote…

In any case, testing yaml error handling details belongs in the yaml_read test(s) in drake/common, not here. (Maybe this was just a temporary home.)

A relic from when the encoding/decoding happened in MemoryFile.

Copy link
Contributor Author

@SeanCurtis-TRI SeanCurtis-TRI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 5 unresolved discussions, LGTM missing from assignee jwnimmer-tri(platform), needs at least two assigned reviewers, missing label for release notes


a discussion (no related file):

Previously, SeanCurtis-TRI (Sean Curtis) wrote…

For now I'm going to do the following:

  1. Strip out std::filesystem::path support into its own PR.
  2. Make MemoryFile::Serialize() simply throw for now. We can resolve the question of how to serialize its contents later.

Take a look at #22298. It's a simplified version of handling strings with "unprintable" characters. That and #22288 supplant this PR.

Copy link
Contributor Author

@SeanCurtis-TRI SeanCurtis-TRI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 5 unresolved discussions, LGTM missing from assignee jwnimmer-tri(platform), needs at least two assigned reviewers, missing label for release notes


a discussion (no related file):

Previously, SeanCurtis-TRI (Sean Curtis) wrote…

Take a look at #22298. It's a simplified version of handling strings with "unprintable" characters. That and #22288 supplant this PR.

P.S. I'm leaving this open for a day or so, so if there were any comments made here that should be ported there, it's more easily accessible. But I anticipate simply closing this out by the end of the week.

Copy link
Collaborator

@jwnimmer-tri jwnimmer-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 5 unresolved discussions, LGTM missing from assignee jwnimmer-tri(platform), needs at least two assigned reviewers, missing label for release notes (waiting on @SeanCurtis-TRI)


a discussion (no related file):

Previously, SeanCurtis-TRI (Sean Curtis) wrote…

P.S. I'm leaving this open for a day or so, so if there were any comments made here that should be ported there, it's more easily accessible. But I anticipate simply closing this out by the end of the week.

I'm not sure the other PR is any better for discussion than this one. The only next step is for you to call me so I can explain how this is all supposed to work, drawing from the ideas upthread here. None of your existing proposals are viable.

@SeanCurtis-TRI
Copy link
Contributor Author

Previously, jwnimmer-tri (Jeremy Nimmer) wrote…

I'm not sure the other PR is any better for discussion than this one. The only next step is for you to call me so I can explain how this is all supposed to work, drawing from the ideas upthread here. None of your existing proposals are viable.

I hoped I'd simplified it to work nicely. The idea was that using !!binary was just a representation thing. Every value in yaml is fundamentally a string. The goal was to give an alternate string encoding in the yaml. And between parsing the yaml and assigning the type, we'd convert !!binary to the equivalent "string" and then the types would deal with the result.

In C++ it's simple because the only type that gets passed to the static types is std::string and the only time we convert from string is when assigning it to a static type. It's murkier in python because the yaml "value" may be a str and may be bytes. And some values get converted prior to Drake's typed yaml code. I still hoped the logic for mapping str/bytes to type would happen in the Drake code. But reading is always the challenge.

The writing is really just frosting. Representing a string in yaml defaults to !!binary if it has "ugly" bytes. Obviously, I'm still missing something fundamental. So, we'll chat. I'm available first thing in the morning but will be afk from about 7:15-8:30 while I get my daughter to school.

Copy link
Collaborator

@jwnimmer-tri jwnimmer-tri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed all commit messages.
Reviewable status: 5 unresolved discussions, LGTM missing from assignee jwnimmer-tri(platform), needs at least two assigned reviewers, missing label for release notes (waiting on @SeanCurtis-TRI)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants