Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Int with bitWidth=64 not supported in parquet reader ? #4788

Open
jonathandune opened this issue Nov 7, 2023 · 13 comments
Open

Int with bitWidth=64 not supported in parquet reader ? #4788

jonathandune opened this issue Nov 7, 2023 · 13 comments
Assignees
Labels
bug Something isn't working core Core development tasks parquet Related to the Parquet integration

Comments

@jonathandune
Copy link

Description

The parquet reader in io.deephaven.parquet.table.ParquetSchemaReader doesn't seem to handle certain integer types.

Steps to reproduce

  1. Download the example parquet file
  2. Execute source = read('/tmp/file.parquet')
  3. Logs will show :

ERROR r-Scheduler-Serial-1 | .c.ConsoleServiceGrpcImpl | Error running script: java.lang.RuntimeException: Error in Python interpreter:
Type: <class 'deephaven.dherror.DHError'>
Value: failed to read parquet data. : RuntimeError: io.deephaven.UncheckedDeephavenException: Unable to read column [gas_used]: IntLogicalType, isSigned=false, bitWidth=64 not supported
Traceback (most recent call last):
File "/opt/deephaven/venv/lib/python3.10/site-packages/deephaven/parquet.py", line 121, in read
return Table(j_table=_JParquetTools.readTable(path))
RuntimeError: io.deephaven.UncheckedDeephavenException: Unable to read column [gas_used]: IntLogicalType, isSigned=false, bitWidth=64 not supported
at io.deephaven.parquet.table.ParquetSchemaReader.lambda$readParquetSchema$2(ParquetSchemaReader.java:274)
at java.base/java.util.Optional.orElseThrow(Optional.java:403)
at io.deephaven.parquet.table.ParquetSchemaReader.readParquetSchema(ParquetSchemaReader.java:268)
at io.deephaven.parquet.table.ParquetTools.convertSchema(ParquetTools.java:842)
at io.deephaven.parquet.table.ParquetTools.readTableInternal(ParquetTools.java:551)
at io.deephaven.parquet.table.ParquetTools.readTable(ParquetTools.java:80)

Expected results

I expected the parquet file to be read

Versions

  • Deephaven: latest docker image
  • OS: mac M1
@jonathandune jonathandune added bug Something isn't working triage labels Nov 7, 2023
@rcaudy rcaudy added core Core development tasks parquet Related to the Parquet integration and removed triage labels Nov 7, 2023
@rcaudy
Copy link
Member

rcaudy commented Nov 7, 2023

Looks like this is unsigned, 64-bit. Supporting it would require reading into a Java BigInteger or some third-party type, both unappealing options for a high performance library. The alternative of reading into a signed long seems just plain incorrect.

@rcaudy
Copy link
Member

rcaudy commented Nov 7, 2023

That said, I think compromised support is better than no support. @jonathandune do you have a preference for your use case? BigInteger is probably the most natural answer here.

@rcaudy
Copy link
Member

rcaudy commented Nov 7, 2023

@jcferretti
Copy link
Member

@jonathandune Can you point us explicitly to the specific example parquet file you seem to be referencing in the description? Thanks!

@jcferretti
Copy link
Member

jcferretti commented Nov 7, 2023

I'll give some context here in case it helps. Our engine is implemented in Java, which makes it hard for us to support unsigned types in general. This is possible, but is not something we currently support and would require considerable effort. We are not alone here btw, eg, this is a common theme for Java based data engines: https://issues.apache.org/jira/browse/SPARK-34816.

Depending on your use case, a short term option is to load the data to a different type supporting the full range (eg, BigInteger or BigDecimal). We may be able to help with that, depending on your actual use case (can you provide an example file?). Also I want to mention, if your data is using unsigned as a matter of semantics but you happen to know from your use case that you are not using the upper range of unsigned (ie, the unsigned values that won't fit on the signed type of the same bit width), then a much simpler alternative is to read the data as signed and just operate on it as that, signed. YMMV here depending on use case.

If you do use the upper range, a different alternative to BigInteger or Bigdecimal would be to still load as signed but consistently use the Guava library's UnsignedLong (linked by Ryan in an earlier comment) to wrap all operations (eg, if you need a column operation adding two values, it would need to be wrapped through calls to functions in that library) and presentation (eg, you would need to convert to String for presentation using that library). This would be a lot more "heavy lifting" and a tax on everything you do for a column of this type, so this path is not advisable unless you really need that upper range and you can't live with just the positive range of the signed type.

@jonathandune
Copy link
Author

Ooops sorry I forgot to share the url of the parquet file !!
https://filebin.net/2fer0fw8fvp8041c

@jonathandune
Copy link
Author

The context of the use case is blockchain data in parquet files.
Large unsigned numbers are common in the blockchain space (256 bit integers are the norm for some standard datasets)

@jcferretti
Copy link
Member

Maybe crazy idea: if those long bit width integers (eg 256 as your example above) are used as IDs and not to actually do math on them, you may be better off loading them in DH as a byte array, or even more manageable, a hex string (or even shorter, base64 string). I think as a (sane, uniquely defined for this) hex or base64 string, both equality and inequality/range operations would work, which should suffice for id like data. 🤷‍♂️

To do this on the 64 bit case, you would first load the data as signed, and then use java code that is careful to look at the column data "as unsigned" to compute a new table with the new column.

But what I think is interesting here is, if you already have to deal with 256 bit integers, and since parquet doesn't support that, maybe you are already dealing with those 256 bit integers in some other way. Whatever you are doing there potentially would make sense to do for 64 bit integers too, for consistency.

@malhotrashivam
Copy link
Contributor

Hi @jonathandune, tagging you here so we don't lose track of the conversation.
Do you have a preference on what type the unsigned 64 bit ints should map to? Ryan and Cristian have suggested some approaches above. We can provide any further clarifications if you need any.

@jonathandune
Copy link
Author

There are cases where arithmetic will be required on those types so I think going with something standard like BigInteger might make sense ?
The parquet file I've provided has been generated using https://github.com/paradigmxyz/cryo there are some details on types used there and seems multiple fields would be impacted

@malhotrashivam
Copy link
Contributor

Hi @jonathandune, thanks for the reply. I will check internally and share the next steps with you soon.

@devinrsmith
Copy link
Member

I think BigInteger should be the default for this case; ideally, I'd like the ability to configure on a column-by-column basis to change the desired behavior. I could imagine some users preferring to keep as long and if needed, work with special math like com.google.common.primitives.UnsignedLongs.

@devinrsmith
Copy link
Member

Here's a public dataset that demonstrates the same issue:

from deephaven import parquet
from deephaven.experimental import s3

from datetime import timedelta

amazon_reviews = parquet.read(
    "s3://datasets-documentation/amazon_reviews/",
    special_instructions=s3.S3Instructions(
        "eu-west-3",
        anonymous_access=True,
        read_ahead_count=8,
        fragment_size=65536,
        read_timeout=timedelta(seconds=10),
    ),
)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working core Core development tasks parquet Related to the Parquet integration
Projects
None yet
Development

No branches or pull requests

5 participants