POC/Arrow evaluation #61

Closed
8 tasks done
chrisbc opened this issue Apr 12, 2024 · 1 comment


chrisbc commented Apr 12, 2024

As we delve deeper into the EPIC #50, it is becoming apparent that big-data tech like Arrow might help. So, can we do this...

Basic questions:

  • convert THS objects into an Arrow/Parquet dataset that can be worked on easily using just regular FileSystemLike storage (including local and S3)
  • compare performance when querying and processing a large task (e.g. hazard aggregation in THP)
  • enumerate the pros/cons
  • is Parquet the preferred serialisation format?

Sub questions:

  • can we use Arrow's in-memory features and/or IPC techniques to boost performance and minimise file IO (Plasma?)
  • can we do partitioning in Arrow (not just Parquet), and how does that work? see
  • can we easily reshape datasets to optimise for different use-cases (3rd party, internal heavy compute)
  • can we use SQL-like queries
    also here SELECT ...

Future possibly

@chrisbc chrisbc self-assigned this Apr 12, 2024
@chrisbc chrisbc changed the title POC/Arrow evaluatoin POC/Arrow evaluation Apr 12, 2024

chrisbc commented May 27, 2024

completed in #62

@chrisbc chrisbc closed this as completed May 27, 2024