-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Int with bitWidth=64 not supported in parquet reader ? #4788
Comments
Looks like this is unsigned, 64-bit. Supporting it would require reading into a Java |
That said, I think compromised support is better than no support. @jonathandune do you have a preference for your use case? |
https://guava.dev/releases/20.0/api/docs/com/google/common/primitives/UnsignedLong.html is a 3rd party option. |
@jonathandune Can you point us explicitly to the specific example parquet file you seem to be referencing in the description? Thanks! |
I'll give some context here in case it helps. Our engine is implemented in Java, which makes it hard for us to support unsigned types in general. This is possible, but is not something we currently support and would require considerable effort. We are not alone here btw, eg, this is a common theme for Java based data engines: https://issues.apache.org/jira/browse/SPARK-34816. Depending on your use case, a short term option is to load the data to a different type supporting the full range (eg, BigInteger or BigDecimal). We may be able to help with that, depending on your actual use case (can you provide an example file?). Also I want to mention, if your data is using unsigned as a matter of semantics but you happen to know from your use case that you are not using the upper range of unsigned (ie, the unsigned values that won't fit on the signed type of the same bit width), then a much simpler alternative is to read the data as signed and just operate on it as that, signed. YMMV here depending on use case. If you do use the upper range, a different alternative to BigInteger or Bigdecimal would be to still load as signed but consistently use the Guava library's UnsignedLong (linked by Ryan in an earlier comment) to wrap all operations (eg, if you need a column operation adding two values, it would need to be wrapped through calls to functions in that library) and presentation (eg, you would need to convert to String for presentation using that library). This would be a lot more "heavy lifting" and a tax on everything you do for a column of this type, so this path is not advisable unless you really need that upper range and you can't live with just the positive range of the signed type. |
Ooops sorry I forgot to share the url of the parquet file !! |
The context of the use case is blockchain data in parquet files. |
Maybe crazy idea: if those long bit width integers (eg 256 as your example above) are used as IDs and not to actually do math on them, you may be better off loading them in DH as a byte array, or even more manageable, a hex string (or even shorter, base64 string). I think as a (sane, uniquely defined for this) hex or base64 string, both equality and inequality/range operations would work, which should suffice for id like data. 🤷♂️ To do this on the 64 bit case, you would first load the data as signed, and then use java code that is careful to look at the column data "as unsigned" to compute a new table with the new column. But what I think is interesting here is, if you already have to deal with 256 bit integers, and since parquet doesn't support that, maybe you are already dealing with those 256 bit integers in some other way. Whatever you are doing there potentially would make sense to do for 64 bit integers too, for consistency. |
Hi @jonathandune, tagging you here so we don't lose track of the conversation. |
There are cases where arithmetic will be required on those types so I think going with something standard like BigInteger might make sense ? |
Hi @jonathandune, thanks for the reply. I will check internally and share the next steps with you soon. |
I think BigInteger should be the default for this case; ideally, I'd like the ability to configure on a column-by-column basis to change the desired behavior. I could imagine some users preferring to keep as |
Here's a public dataset that demonstrates the same issue: from deephaven import parquet
from deephaven.experimental import s3
from datetime import timedelta
amazon_reviews = parquet.read(
"s3://datasets-documentation/amazon_reviews/",
special_instructions=s3.S3Instructions(
"eu-west-3",
anonymous_access=True,
read_ahead_count=8,
fragment_size=65536,
read_timeout=timedelta(seconds=10),
),
) |
Description
The parquet reader in io.deephaven.parquet.table.ParquetSchemaReader doesn't seem to handle certain integer types.
Steps to reproduce
ERROR r-Scheduler-Serial-1 | .c.ConsoleServiceGrpcImpl | Error running script: java.lang.RuntimeException: Error in Python interpreter:
Type: <class 'deephaven.dherror.DHError'>
Value: failed to read parquet data. : RuntimeError: io.deephaven.UncheckedDeephavenException: Unable to read column [gas_used]: IntLogicalType, isSigned=false, bitWidth=64 not supported
Traceback (most recent call last):
File "/opt/deephaven/venv/lib/python3.10/site-packages/deephaven/parquet.py", line 121, in read
return Table(j_table=_JParquetTools.readTable(path))
RuntimeError: io.deephaven.UncheckedDeephavenException: Unable to read column [gas_used]: IntLogicalType, isSigned=false, bitWidth=64 not supported
at io.deephaven.parquet.table.ParquetSchemaReader.lambda$readParquetSchema$2(ParquetSchemaReader.java:274)
at java.base/java.util.Optional.orElseThrow(Optional.java:403)
at io.deephaven.parquet.table.ParquetSchemaReader.readParquetSchema(ParquetSchemaReader.java:268)
at io.deephaven.parquet.table.ParquetTools.convertSchema(ParquetTools.java:842)
at io.deephaven.parquet.table.ParquetTools.readTableInternal(ParquetTools.java:551)
at io.deephaven.parquet.table.ParquetTools.readTable(ParquetTools.java:80)
Expected results
I expected the parquet file to be read
Versions
The text was updated successfully, but these errors were encountered: