-
Notifications
You must be signed in to change notification settings - Fork 837
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JSON parser fails on maps with bool keys #6525
Comments
I think this might actually be a feature request to perform string -> bool coercion which we don't currently? I don't think this is specific to MapArray, I'm also not entirely sure what the semantics should be. In particular is |
First, I'm happy to classify this however is useful. However, when I read your response, I wonder if I didn't communicate my issue clearly.
This is actually specific to MapArray. I am not talking about general string -> bool conversion. The issue is that in JSON, the only syntactically valid way to represent a Map<bool, T> is similar to To be clear, I am not asking that a Map<string, bool> accept If you check out my gist, you'll see that arrow_json already does the key-string conversion for ints. To be specific, the only valid way to represent a Map<i32, i32> is like I can certainly see that you might phrase this as "arrow_json does not currently support Maps with bool keys; adding that support is a feature request." In that case, it would be helpful to specify which types are not supported by arrow_json in the documentation, so people know whether it's a bug or a feature request :).
IMO, the semantics should be that when parsing Map(K, V), that the JSON string in the key position is parsed with Does that clarify things? |
I understand what you're saying, however, arrow-json does not distinguish between the case of decoding a boolean from a key and decoding a boolean from a value, the decoders simply operate on the fields - https://github.com/apache/arrow-rs/blob/master/arrow-json/src/reader/map_array.rs#L131. I also personally think it would be rather odd for the decoders to treat these two cases differently. I'm not really sure what the best solution here is, we certainly could support coercing strings to booleans, perhaps gated by a feature flag, but I'm not really sold on this. Perhaps you could expand upon the use-case for this? I can't help feeling a boolean keyed MapArray is a really inefficient way to represent what could simply be a StructArray with two nullable fields of As an aside you raise the point of integers parsing from strings, this is largely because many systems encode integers and floats as strings to get around precision ambiguities in JSON parsers - with some treating all numbers as double-precision floating point. |
Thanks for the clarification, and apologies for a long-winded explanation of something you already understood. As prior art, serde_json (the standard json deserializer) allows "true" in the key position but not the value position: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=dea421d1f8c2075ee9ed791155a516db As for why I care about Map<bool, T>, it's because it's a valid SQL type and I maintain a Rust client for Presto. Presto returns its results in JSON (with a schema) (more details at https://prestodb.io/docs/current/develop/client-protocol.html). When users want the results in Arrow RecordBatches, I use arrow_json to read the JSON into Arrow. Thus I don't control whether the data contains Map<bool, T> (the users do), nor do I control the return format (syntactically valid JSON). Currently Map<bool, T> is supported for non-arrow, but not for arrow, due to the limitation in this issue. I was hoping to bring feature parity to the arrow return format. |
We moved away from serde_json both for performance and to be able to better handle arbitrary precision values. We used to do something similar to this. I think given that |
Thanks for the quick response! Out of curiosity, do you have the benchmarks for the serde_json vs arrow_json performance? It would be interesting to me for my own client. |
As an addendum, I've just had a concrete use case from a user. They were using Presto's |
If someone is willing to implement this and is able to do this without regressing performance, I don't object to it being added |
Describe the bug
JSON objects must have string keys. So when parsing to maps with non-string key types, the string must be coerced into the type. For example, using a schema with a Map(Int32, Int32) type parses rows like
{"map_i32_i32": {"1": 2}}
into a RecordBatch with a single column with a Map(Int32, Int32) type.However, this does not happen for bool key maps. If you try to use a schema with a Map(Boolean, Int32) type, parsing
{"map_bool_i32": {"true": 1}}
raises an error:To Reproduce
Gist https://gist.github.com/jagill/1dd71ff6413002a042089feca89c6eab has a minimal reproduction.
Expected behavior
map_bool_i32
should parse to a Map(Boolean, Int32) array.Additional context
Similar bug report for serde_json, with a link to the fix PR: serde-rs/json#1054
The text was updated successfully, but these errors were encountered: