-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Yaml serialization includes std::filesystem::path and vector<uint8_t> #22277
base: master
Are you sure you want to change the base?
Yaml serialization includes std::filesystem::path and vector<uint8_t> #22277
Conversation
Path member can now be serialized like other basic member types. Serializing something of type vector<uint8_t> is interpreted as a byte string. If all bytes are printable, it is equivalent to a string. If there are any non-printable characters, it is converted to base64 and prefixed with YAML's !!binary tag. MemoryFile makes use of this to serialize its file contents (it creates a temporary vector<uint8_t> for serialization to trigger the potential binary yaml encoding. TODO: 1. I need unit tests in python. 2. Doing: `content: !binary lskdjflskdjf` is different in python vs C++. - python complains that there is no constructor for !binary. - C++ ignores !binary. Resolve this discrepancy when operating on undefined local tags.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 1 unresolved discussion, needs platform reviewer assigned, needs at least two assigned reviewers, missing label for release notes (waiting on @SeanCurtis-TRI)
a discussion (no related file):
@jwnimmer-tri I'd like you to take a look at this.
First, while cpp_yaml has Binary
it is not a broadly supported type. It can only be used directly with an Emitter
and not the EmitFromEvents
we use. The reading has similar issues. Seeing the global !!binary
tag does not lead to the creation of a Binary
(or automatic decoding). So, the mapping is provided by hand in Drake's read/write io.
Several open questions I'd like feedback on.
- Are we good with the logic regarding
std::vector<uint8_t>
as maybe producing!!binary ....
? - Where should that get documented?
- If there's a typo on the
!!binary
tag (e.g., a local tag!binary
) our YAML functionality behaves quite differently from, eg., pyyaml. Pyyaml complains it has no constructor for!binary
. We simply ignore it (treating it as if it weren't there). - Can you elaborate on what particular testing in python you'd like to see. I haven't implemented the straightforward stuff yet (it'll come tomorrow). But I got the impression from our conversation that there's something more nuanced I haven't considered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 1 unresolved discussion, needs platform reviewer assigned, needs at least two assigned reviewers, missing label for release notes (waiting on @SeanCurtis-TRI)
a discussion (no related file):
Previously, SeanCurtis-TRI (Sean Curtis) wrote…
@jwnimmer-tri I'd like you to take a look at this.
First, while cpp_yaml has
Binary
it is not a broadly supported type. It can only be used directly with anEmitter
and not theEmitFromEvents
we use. The reading has similar issues. Seeing the global!!binary
tag does not lead to the creation of aBinary
(or automatic decoding). So, the mapping is provided by hand in Drake's read/write io.Several open questions I'd like feedback on.
- Are we good with the logic regarding
std::vector<uint8_t>
as maybe producing!!binary ....
?- Where should that get documented?
- If there's a typo on the
!!binary
tag (e.g., a local tag!binary
) our YAML functionality behaves quite differently from, eg., pyyaml. Pyyaml complains it has no constructor for!binary
. We simply ignore it (treating it as if it weren't there).- Can you elaborate on what particular testing in python you'd like to see. I haven't implemented the straightforward stuff yet (it'll come tomorrow). But I got the impression from our conversation that there's something more nuanced I haven't considered.
I suspect the answer to (2) is probably yaml_doxygen.h
, so I'll also peruse that at the same time I'm dealing with the python tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 13 files at r1.
Reviewable status: 5 unresolved discussions, LGTM missing from assignee jwnimmer-tri(platform), needs at least two assigned reviewers, missing label for release notes (waiting on @SeanCurtis-TRI)
a discussion (no related file):
Previously, SeanCurtis-TRI (Sean Curtis) wrote…
I suspect the answer to (2) is probably
yaml_doxygen.h
, so I'll also peruse that at the same time I'm dealing with the python tests.
(1) I did some study, and this is actually a long discussion. Let's have a call about it? The two basic questions are:
(a) When reading yaml and we encounter a !!str
but our Python field schema says bytes
, is that an error or should we automatically call value.encode('utf-8')
to convert the unicode string into bytes
as a convenience? This a question of what semantics we want our loaders to obey. Similarly if the scalar has the non-specific tag ?
then (how) do we resolve the tag?
(b) How does C++ signal to the yaml writer that a certain field is an octet stream? I don't think std::vector<uint8_t>
can cut it -- we need to allow for actual lists of (small) integers to output as a yaml !!seq
of !!int
. The options are either to key on std::vector<std::byte>
as the static type, or to add a little flag to MakeNameValue
to say that that the T = std::string
is not unicode.
(3) If this is not unique to !binary
(and I imagine it's not), we can defer it for another PR. Our error-checking of assigned tags vs scalar fields in out structs (& dataclasses) is probably lacking.
(4) For yaml stuff, it's actually best to write the test suite first and the implementation second. The suite should be roughly the same for C++ and python, and cover all of the relevant and interesting cases in the yaml specs for the types we're working with.
We'll need to agree on the full contract of (1a) first before we can write out that test matrix.
common/test/memory_file_test.cc
line 126 at r1 (raw file):
/** Confirm that this can be serialized appropriately. */ GTEST_TEST(MemoryFileTest, Serialization) { // Emtpy file.
typo
common/test/memory_file_test.cc
line 181 at r1 (raw file):
// Invalid characters in base64 are not ignored; they become zeros in the // decoding. We'll put *four* invalid characters in; that leads to exactly // five null bytes before we get back into the expected string.
Skimming the yaml-cpp code, its YAML::DecodeBase64
function does seem to detect errors properly, so we should probably just use that instead of CRU?
common/test/memory_file_test.cc
line 181 at r1 (raw file):
// Invalid characters in base64 are not ignored; they become zeros in the // decoding. We'll put *four* invalid characters in; that leads to exactly // five null bytes before we get back into the expected string.
In any case, testing yaml error handling details belongs in the yaml_read test(s) in drake/common, not here. (Maybe this was just a temporary home.)
common/yaml/yaml_node.h
line 174 at r1 (raw file):
static constexpr std::string_view kTagStr{"tag:yaml.org,2002:str"}; // https://yaml.org/spec/1.2.2/#24-tags (see Example 2.23).
The example is good, but isn't really a specification. I think we should link to either only the spec (https://yaml.org/type/binary.html) or else both the spec and the example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 5 unresolved discussions, LGTM missing from assignee jwnimmer-tri(platform), needs at least two assigned reviewers, missing label for release notes
a discussion (no related file):
Previously, jwnimmer-tri (Jeremy Nimmer) wrote…
(1) I did some study, and this is actually a long discussion. Let's have a call about it? The two basic questions are:
(a) When reading yaml and we encounter a
!!str
but our Python field schema saysbytes
, is that an error or should we automatically callvalue.encode('utf-8')
to convert the unicode string intobytes
as a convenience? This a question of what semantics we want our loaders to obey. Similarly if the scalar has the non-specific tag?
then (how) do we resolve the tag?(b) How does C++ signal to the yaml writer that a certain field is an octet stream? I don't think
std::vector<uint8_t>
can cut it -- we need to allow for actual lists of (small) integers to output as a yaml!!seq
of!!int
. The options are either to key onstd::vector<std::byte>
as the static type, or to add a little flag toMakeNameValue
to say that that theT = std::string
is not unicode.
(3) If this is not unique to
!binary
(and I imagine it's not), we can defer it for another PR. Our error-checking of assigned tags vs scalar fields in out structs (& dataclasses) is probably lacking.
(4) For yaml stuff, it's actually best to write the test suite first and the implementation second. The suite should be roughly the same for C++ and python, and cover all of the relevant and interesting cases in the yaml specs for the types we're working with.
We'll need to agree on the full contract of (1a) first before we can write out that test matrix.
For now I'm going to do the following:
- Strip out
std::filesystem::path
support into its own PR. - Make
MemoryFile::Serialize()
simply throw for now. We can resolve the question of how to serialize its contents later.
common/test/memory_file_test.cc
line 181 at r1 (raw file):
Previously, jwnimmer-tri (Jeremy Nimmer) wrote…
Skimming the yaml-cpp code, its
YAML::DecodeBase64
function does seem to detect errors properly, so we should probably just use that instead of CRU?
Right. In that case, it simply returns a zero-length vector.
common/test/memory_file_test.cc
line 181 at r1 (raw file):
Previously, jwnimmer-tri (Jeremy Nimmer) wrote…
In any case, testing yaml error handling details belongs in the yaml_read test(s) in drake/common, not here. (Maybe this was just a temporary home.)
A relic from when the encoding/decoding happened in MemoryFile
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 5 unresolved discussions, LGTM missing from assignee jwnimmer-tri(platform), needs at least two assigned reviewers, missing label for release notes
a discussion (no related file):
Previously, SeanCurtis-TRI (Sean Curtis) wrote…
For now I'm going to do the following:
- Strip out
std::filesystem::path
support into its own PR.- Make
MemoryFile::Serialize()
simply throw for now. We can resolve the question of how to serialize its contents later.
Take a look at #22298. It's a simplified version of handling strings with "unprintable" characters. That and #22288 supplant this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 5 unresolved discussions, LGTM missing from assignee jwnimmer-tri(platform), needs at least two assigned reviewers, missing label for release notes
a discussion (no related file):
Previously, SeanCurtis-TRI (Sean Curtis) wrote…
Take a look at #22298. It's a simplified version of handling strings with "unprintable" characters. That and #22288 supplant this PR.
P.S. I'm leaving this open for a day or so, so if there were any comments made here that should be ported there, it's more easily accessible. But I anticipate simply closing this out by the end of the week.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 5 unresolved discussions, LGTM missing from assignee jwnimmer-tri(platform), needs at least two assigned reviewers, missing label for release notes (waiting on @SeanCurtis-TRI)
a discussion (no related file):
Previously, SeanCurtis-TRI (Sean Curtis) wrote…
P.S. I'm leaving this open for a day or so, so if there were any comments made here that should be ported there, it's more easily accessible. But I anticipate simply closing this out by the end of the week.
I'm not sure the other PR is any better for discussion than this one. The only next step is for you to call me so I can explain how this is all supposed to work, drawing from the ideas upthread here. None of your existing proposals are viable.
Previously, jwnimmer-tri (Jeremy Nimmer) wrote…
I hoped I'd simplified it to work nicely. The idea was that using In C++ it's simple because the only type that gets passed to the static types is The writing is really just frosting. Representing a string in yaml defaults to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed all commit messages.
Reviewable status: 5 unresolved discussions, LGTM missing from assignee jwnimmer-tri(platform), needs at least two assigned reviewers, missing label for release notes (waiting on @SeanCurtis-TRI)
Path member can now be serialized like other basic member types.
Serializing something of type vector<uint8_t> is interpreted as a byte string. If all bytes are printable, it is equivalent to a string. If there are any non-printable characters, it is converted to base64 and prefixed with YAML's !!binary tag.
MemoryFile makes use of this to serialize its file contents (it creates a temporary vector<uint8_t> for serialization to trigger the potential binary yaml encoding.
This change is