Write ORC from arrow recordbatches #15

Open
19 of 52 tasks
Jefffrey opened this issue Nov 3, 2023 · 2 comments
Labels
enhancement New feature or request

@Jefffrey
Collaborator

Jefffrey commented Nov 3, 2023

Not a focus now, just raising issue here for tracking

Currently in progress.

Initial support

Tracked by initial-write-support branch

  • Merged to main, further development done directly to main

Checklist:

  • High level ArrowWriter synchronous interface (accepts RecordBatches to write)
  • Basic configuration via builder
  • Stripe writer
  • Metadata writer
  • Value encoding
    • Integer RLEv2
      • Short repeat
      • Direct
      • Delta
      • Patched base
    • Base 128 varint
    • Byte RLE
  • Encode nullability
  • Float/Double array
  • Short/Int/Long array
  • String/Binary array
  • Boolean array
  • Byte array
  • Basic struct array support (for root)
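For the "Base 128 varint" item above, here is a minimal sketch of the encoding as the ORC specification describes it: unsigned values are emitted in 7-bit groups, least significant first, with the high bit set on every byte except the last, and signed values are zigzag-mapped first. The function names are hypothetical and are not taken from this crate's code.

```rust
/// Append `value` to `out` as a base 128 varint:
/// 7 bits per byte, least significant group first,
/// high bit set on every byte except the last.
fn encode_varint_u64(mut value: u64, out: &mut Vec<u8>) {
    loop {
        let low7 = (value & 0x7f) as u8;
        value >>= 7;
        if value == 0 {
            out.push(low7); // final byte: continuation bit clear
            return;
        }
        out.push(low7 | 0x80); // more bytes follow
    }
}

/// Zigzag-map a signed value so that small magnitudes
/// (positive or negative) become small unsigned values.
fn zigzag_encode(value: i64) -> u64 {
    ((value << 1) ^ (value >> 63)) as u64
}

/// Signed varint: zigzag first, then the unsigned encoding.
fn encode_varint_i64(value: i64, out: &mut Vec<u8>) {
    encode_varint_u64(zigzag_encode(value), out);
}

/// Decode one unsigned varint from the start of `bytes`.
/// Returns the value and the number of bytes consumed, or `None`
/// if the input ends early or runs past 10 bytes.
fn decode_varint_u64(bytes: &[u8]) -> Option<(u64, usize)> {
    let mut result = 0u64;
    for (i, &b) in bytes.iter().enumerate() {
        if i >= 10 {
            return None; // a u64 never needs more than 10 groups
        }
        result |= u64::from(b & 0x7f) << (7 * i);
        if b & 0x80 == 0 {
            return Some((result, i + 1));
        }
    }
    None
}
```

Under this scheme 128 encodes as the two bytes `0x80 0x01`, and the signed value -1 encodes as the single byte `0x01`.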

Once complete, I will raise a PR for all of the above to provide a complete and usable writer (though lacking in features; see below).

Subsequent features

The following items will be added in smaller PRs once the base code of the writer is merged to main.

  • Asynchronous interface
  • Compression
    • Zlib
    • Snappy
    • Lzo
    • Lz4
    • Zstd
  • Statistics
    • Int
    • Double
    • String
    • Bucket
    • Decimal
    • Date
    • Binary
    • Timestamp
  • Dictionary array
  • Run length array
  • Decimal array
  • Date array
  • Timestamp array
  • Compound array
    • Union array
    • Map array
    • List array
    • Struct array
  • Index streams
    • Row group index
    • Bloom filters
  • Extension configuration (see Java config for examples)
  • User metadata
  • Arrow type hint (when writing with this Arrow -> ORC writer, encode the original Arrow type in metadata so that when reading, we can recreate the original Arrow array)
  • TODO: other Arrow types
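To illustrate the "Arrow type hint" item above, here is a rough sketch of how a hint could round-trip through name/bytes pairs, which is the shape of ORC user metadata. Everything here is an assumption for illustration: the key prefix, the `ArrowTypeHint` enum and both helper functions are hypothetical, and a plain `BTreeMap` stands in for the file's metadata.

```rust
use std::collections::BTreeMap;

/// Hypothetical key prefix; not a convention defined by ORC or this crate.
const TYPE_HINT_PREFIX: &str = "arrow.type_hint.";

/// A few Arrow types that have no exact ORC counterpart
/// (illustrative subset only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ArrowTypeHint {
    UInt32,
    UInt64,
    LargeUtf8,
}

impl ArrowTypeHint {
    fn as_str(self) -> &'static str {
        match self {
            ArrowTypeHint::UInt32 => "UInt32",
            ArrowTypeHint::UInt64 => "UInt64",
            ArrowTypeHint::LargeUtf8 => "LargeUtf8",
        }
    }

    fn parse(s: &str) -> Option<Self> {
        match s {
            "UInt32" => Some(ArrowTypeHint::UInt32),
            "UInt64" => Some(ArrowTypeHint::UInt64),
            "LargeUtf8" => Some(ArrowTypeHint::LargeUtf8),
            _ => None,
        }
    }
}

/// Writer side: record the original Arrow type for `column`.
fn write_type_hint(
    metadata: &mut BTreeMap<String, Vec<u8>>,
    column: &str,
    hint: ArrowTypeHint,
) {
    let key = format!("{TYPE_HINT_PREFIX}{column}");
    metadata.insert(key, hint.as_str().as_bytes().to_vec());
}

/// Reader side: recover the hint, if one was written and is recognised.
fn read_type_hint(
    metadata: &BTreeMap<String, Vec<u8>>,
    column: &str,
) -> Option<ArrowTypeHint> {
    let key = format!("{TYPE_HINT_PREFIX}{column}");
    let bytes = metadata.get(&key)?;
    ArrowTypeHint::parse(std::str::from_utf8(bytes).ok()?)
}
```

A reader that finds no hint, or one it does not recognise, would fall back to the default ORC -> Arrow type mapping.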
@Jefffrey
Collaborator Author

Beginning work on this.

I'll be committing to the initial-write-support branch.

Will want to get a minimal end-to-end version working before merging into main (so the resulting PR might be big), supporting basic types such as string, integer and float.

@Jefffrey
Collaborator Author

datafusion-contrib/datafusion-orc#122

PR for initial support, without boolean and string (will be done subsequently since PR is already quite large)

@waynexia waynexia transferred this issue from datafusion-contrib/datafusion-orc Oct 30, 2024