Provide access to the raw byte data for a JsonElement for efficient transcoding of simple, custom types without allocating an intermediate string. #42839

Closed
mwadams opened this issue Sep 29, 2020 · 4 comments
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Text.Json
mwadams commented Sep 29, 2020

Background and Motivation

There are a large number of datatypes that need to be implemented as formats of the string type. The canonical example is time handling, but there are many others. For instance, we choose to model our .NET Date/Time types using the NodaTime entities, to better map to the intent of JSON schema. (See this issue for details.)

On the writing side this is straightforward, as we can use Utf8JsonWriter to encode and write the values efficiently.

On the reading side, we can do this efficiently if we drop to Utf8JsonReader; otherwise we have to allocate .NET strings and then parse them into the target types.
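For illustration, here is a minimal sketch of the string-allocating path that is available today. The property name "id" and the payload are made up for the example; GetString(), long.Parse and AsSpan are existing APIs.

```csharp
using System;
using System.Globalization;
using System.Text.Json;

// Hypothetical payload: one string property shaped like "#######-#######".
using JsonDocument doc = JsonDocument.Parse("{\"id\":\"1234567-7654321\"}");
JsonElement element = doc.RootElement.GetProperty("id");

// GetString() transcodes the UTF-8 bytes into a newly allocated .NET string.
string text = element.GetString()!;

// Only after that allocation can we parse the two halves.
int separatorIndex = text.IndexOf('-');
long first = long.Parse(text.AsSpan(0, separatorIndex), NumberStyles.None, CultureInfo.InvariantCulture);
long second = long.Parse(text.AsSpan(separatorIndex + 1), NumberStyles.None, CultureInfo.InvariantCulture);

Console.WriteLine($"{first}, {second}");
```

The intermediate string exists only to be parsed and discarded, which is the allocation this proposal is trying to avoid.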

Dropping to Utf8JsonReader means we then have to do all of the heavy lifting, all the time, when the JsonDocument provides us with a perfectly efficient parse over the higher-level structure and efficient access to properties.
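As a rough sketch of what that heavy lifting looks like, the following reads the same kind of value with Utf8JsonReader directly. The payload and the property name "id" are assumptions for the example, and it only handles the simple case of a single contiguous input buffer.

```csharp
using System;
using System.Buffers.Text;
using System.Text;
using System.Text.Json;

// Hypothetical payload: one string property shaped like "#######-#######".
byte[] json = Encoding.UTF8.GetBytes("{\"id\":\"1234567-7654321\"}");
var reader = new Utf8JsonReader(json);

long first = 0, second = 0;
bool parsed = false;

// With the reader we have to walk the tokens ourselves to find the property.
while (reader.Read())
{
    if (reader.TokenType == JsonTokenType.PropertyName && reader.ValueTextEquals("id"))
    {
        reader.Read(); // advance to the property's value

        // The input is one contiguous buffer, so ValueSpan holds the string content
        // (without quotes, still escaped). Multi-segment input would also need the
        // HasValueSequence/ValueSequence path.
        ReadOnlySpan<byte> value = reader.ValueSpan;
        int separatorIndex = value.IndexOf((byte)'-');

        parsed = separatorIndex > 0
            && Utf8Parser.TryParse(value.Slice(0, separatorIndex), out first, out int consumedFirst)
            && consumedFirst == separatorIndex
            && Utf8Parser.TryParse(value.Slice(separatorIndex + 1), out second, out int consumedSecond)
            && consumedSecond == value.Length - (separatorIndex + 1);
        break;
    }
}

Console.WriteLine(parsed ? $"{first}, {second}" : "not parsed");
```

No string is allocated here, but the navigation that JsonDocument would otherwise provide has to be written by hand.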

We would like to be able to access the raw bytes that back the JsonElement when necessary, to efficiently decode an instance of a type like this.

JsonDocument has parsed the data using its internal db/row structure, and decoding of values internally is handled via GetRawValue() which returns a ReadOnlyMemory<byte> based on the given start index, and searching for an end-index.

JsonElement offers GetRawText() which uses JsonDocument.GetRawValue() and then transcodes it to a string.

We would like a JsonElement.GetRawValue() method, which simply eliminates the transcoding to a string.

The downside of this is that people could potentially misuse the content (it has not been unescaped or validated, for example), but it would enable a lot of customization scenarios without throwing out all the good (efficient!) work JsonDocument has already done for us.

Proposed API

public readonly partial struct JsonElement
{
+    /// <summary>
+    ///   Gets the original input data backing this value, returning it as a <see cref="ReadOnlyMemory{T}"/> of <see cref="byte"/> representing the UTF8 encoded text.
+    /// </summary>
+    /// <returns>
+    ///   The original input data backing this value, returning it as a <see cref="ReadOnlyMemory{T}"/> of <see cref="byte"/>.
+    /// </returns>
+    /// <exception cref="ObjectDisposedException">
+    ///   The parent <see cref="JsonDocument"/> has been disposed.
+    /// </exception>
+    /// <remarks>
+    ///   This provides the raw, escaped, UTF8 encoded text. You should consider using <see cref="GetString()"/> if you
+    ///   require an unescaped <see cref="string"/> value.
+    /// </remarks>
+    public ReadOnlyMemory<byte> GetRawValue()
+    {
+        CheckValidInstance();
+
+        return _parent.GetRawValue(_idx);
+    }
}

Usage Examples

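// Example of parsing a value which is structured like "#######-#######".
// FindSeparator and Utf8Hyphen are helpers assumed to be defined elsewhere, and
// result is the out parameter of the enclosing TryParse-style method (not shown).
ReadOnlyMemory<byte> value = myElement.GetRawValue();
int separatorIndex = FindSeparator(Utf8Hyphen, value);

// Parse the digits before the separator, requiring that all of them are consumed.
if (Utf8Parser.TryParse(value.Slice(0, separatorIndex).Span, out long firstLong, out int consumedFirst) && consumedFirst == separatorIndex)
{
    // Parse the digits after the separator, again requiring full consumption.
    if (Utf8Parser.TryParse(value.Slice(separatorIndex + 1).Span, out long secondLong, out int consumedSecond) && consumedSecond == value.Length - (separatorIndex + 1))
    {
        result = new MyPairOfLongs(firstLong, secondLong);
        return true;
    }
}
result = MyPairOfLongs.Empty;
return false;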

Alternative Designs

You can always fall back to Utf8JsonReader for this, but that entails dealing with the ValueSequence/ValueSpan and managing the buffers. JsonDocument has already done that for you with this approach.

You could also use WriteTo(Utf8JsonWriter) to write the value into a buffer, but that would require you to allocate (and potentially recycle, although that is not guaranteed) a target buffer. Whether or not you need to allocate, it also means making additional copies of data, which is never good for performance.
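A minimal sketch of the WriteTo alternative, using the existing JsonElement.WriteTo(Utf8JsonWriter) with an ArrayBufferWriter<byte> as the target buffer. The payload and the property name "id" are assumptions for the example.

```csharp
using System;
using System.Buffers;
using System.Buffers.Text;
using System.Text.Json;

// Hypothetical payload: one string property shaped like "#######-#######".
using JsonDocument doc = JsonDocument.Parse("{\"id\":\"1234567-7654321\"}");
JsonElement element = doc.RootElement.GetProperty("id");

// Target buffer that WriteTo copies the value into (an extra allocation and copy).
var buffer = new ArrayBufferWriter<byte>();
using (var writer = new Utf8JsonWriter(buffer))
{
    element.WriteTo(writer);
} // disposing the writer flushes any pending bytes into the buffer

// A JSON string is written with its surrounding quotes; strip them before parsing.
ReadOnlySpan<byte> written = buffer.WrittenSpan;
ReadOnlySpan<byte> value = written.Slice(1, written.Length - 2);

int separatorIndex = value.IndexOf((byte)'-');
long first = 0, second = 0;
bool parsed = separatorIndex > 0
    && Utf8Parser.TryParse(value.Slice(0, separatorIndex), out first, out int consumedFirst)
    && consumedFirst == separatorIndex
    && Utf8Parser.TryParse(value.Slice(separatorIndex + 1), out second, out int consumedSecond)
    && consumedSecond == value.Length - (separatorIndex + 1);

Console.WriteLine(parsed ? $"{first}, {second}" : "not parsed");
```

This avoids the string, but the bytes are still copied into a second buffer before they can be parsed.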

Risks

The chief risk is that people try to use this low-level API without understanding how the underlying UTF8 byte stream actually works. Ensuring that people are pointed at System.Buffers.Text.Utf8Parser should help mitigate this.

@mwadams mwadams added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Sep 29, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-System.Text.Json untriaged New issue has not been triaged by the area owner labels Sep 29, 2020

mwadams commented Sep 29, 2020

It is also possible that people could try to hang on to the ReadOnlyMemory<byte> after the underlying document is disposed (just as they could the JsonElement itself).
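To illustrate the existing lifetime consideration for JsonElement, here is a small sketch using only current APIs (the payload and property name are made up). A JsonElement that outlives its document throws when used, while a Clone() taken beforehand does not. A retained ReadOnlyMemory<byte> would presumably have no such guard, which is the concern here.

```csharp
using System;
using System.Text.Json;

JsonElement element;
JsonElement copy;

using (JsonDocument doc = JsonDocument.Parse("{\"id\":\"1234567-7654321\"}"))
{
    element = doc.RootElement.GetProperty("id");

    // Clone() produces an element that no longer depends on the original document.
    copy = element.Clone();
}

// The original element now refers to a disposed document.
bool threw = false;
try
{
    element.GetRawText();
}
catch (ObjectDisposedException)
{
    threw = true;
}

Console.WriteLine(threw);
Console.WriteLine(copy.GetString());
```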


mwadams commented Sep 29, 2020

If a future implementation were to create a synthetic JsonElement that was not backed by a JsonDocument (and a corresponding slice of memory), then this would become an 'expensive' method, as it would need to allocate a buffer and write the text into it. Of course, this would still be no worse than the existing GetRawText() method, since there would be no "raw text" to get.

However, I don't believe the primary use case for this (mapping to a specific .NET type from a string-like value) is likely to be using these synthetic types - it is optimising for reading from source. YMMV.

@layomia layomia removed the untriaged New issue has not been triaged by the area owner label Sep 30, 2020
@layomia layomia added this to the Future milestone Sep 30, 2020

mwadams commented Oct 1, 2020

It is also possible that people could try to hang on to the ReadOnlyMemory<byte> after the underlying document is disposed (just as they could the JsonElement itself).

@idg10 has suggested flipping the API to pass a callback which takes a ReadOnlySpan<byte>.

That's definitely worth considering. It would involve allocating a delegate per call, and would also make the API somewhat more complex for what I consider to be a negligible benefit given that we already have the lifetime consideration for the JsonElement itself.
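To make the trade-off concrete, here is one possible shape for such a callback API. This is only a sketch: UseRawValue is a hypothetical helper, not an existing or proposed member, and since it cannot reach the document's internal buffer it emulates the raw bytes via WriteTo. The existing System.Buffers.ReadOnlySpanAction<T, TArg> delegate stands in for whatever delegate type a real API would use.

```csharp
using System;
using System.Buffers;
using System.Buffers.Text;
using System.Text.Json;

// Hypothetical payload: one string property shaped like "#######-#######".
using JsonDocument doc = JsonDocument.Parse("{\"id\":\"1234567-7654321\"}");
JsonElement element = doc.RootElement.GetProperty("id");

// Results flow out through an explicit state argument, so the lambda captures nothing.
long[] pair = new long[2];

UseRawValue(element, pair, static (utf8, state) =>
{
    // The emulated bytes include the surrounding quotes; strip them first.
    ReadOnlySpan<byte> value = utf8.Slice(1, utf8.Length - 2);
    int separatorIndex = value.IndexOf((byte)'-');

    if (separatorIndex > 0
        && Utf8Parser.TryParse(value.Slice(0, separatorIndex), out long a, out _)
        && Utf8Parser.TryParse(value.Slice(separatorIndex + 1), out long b, out _))
    {
        state[0] = a;
        state[1] = b;
    }
});

Console.WriteLine($"{pair[0]}, {pair[1]}");

// Hypothetical helper (an assumption for this sketch). The span is only valid for
// the duration of the callback, so it cannot be held after the document is disposed.
static void UseRawValue<TArg>(JsonElement element, TArg arg, ReadOnlySpanAction<byte, TArg> action)
{
    var buffer = new ArrayBufferWriter<byte>();
    using (var writer = new Utf8JsonWriter(buffer))
    {
        element.WriteTo(writer);
    }

    action(buffer.WrittenSpan, arg);
}
```

A non-capturing lambda like this is typically cached by the compiler, which could avoid the per-call delegate allocation, but only at the cost of threading state through an extra argument, which is part of the added complexity mentioned above.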

In implementation, I also considered adding the method to JsonProperty (for symmetry) but that does not support this specific use case, so I think that should be a separate change with a separate justification.

@mwadams mwadams mentioned this issue Oct 1, 2020
eiriktsarpalis commented

Quoting from @bartonjs in an older issue:

That's something we're explicitly keeping out of the API. JsonDocument and JsonElement can apply over UTF-8 or UTF-16 data, exposing the span removes that abstraction.

Related to #54410, we're planning on implementing this for .NET 7.

@ghost ghost locked as resolved and limited conversation to collaborators Nov 20, 2021