Follow up work for Iceberg Writing #5989 #6418

Open
malhotrashivam opened this issue Nov 22, 2024 · 1 comment
Comments

@malhotrashivam
Contributor

malhotrashivam commented Nov 22, 2024

The following is a list of follow-up tasks for #5989, to be done as and when needed:

  • Add support for the writeDataFiles API in Python.
  • Add support for an overwrite API in Java and Python, once the impact of concurrent writes is fully understood.
  • Add support for a create+append API that performs both tasks in a single transaction.
  • Improve support for adding a single partition value to a table being written out. (Check the TODO in the source code for more details.)
@malhotrashivam added the feature request and iceberg labels Nov 22, 2024
@malhotrashivam added this to the Backlog milestone Nov 22, 2024
@malhotrashivam self-assigned this Nov 22, 2024
@malhotrashivam
Contributor Author

Feedback that we received on our community channel (link to Slack discussion):

  • Partition handling needs some work to properly support time/date partitions, as well as the various partition transforms such as day, year, etc. listed here: https://iceberg.apache.org/spec/#partition-transforms
  • Docs could use some updates to point out a few things that took me by surprise:
    • Needing to specify s3.S3Instructions multiple times, e.g. at catalog creation and then again at table writer creation. (This is more about missing docs: since this wasn't on the docs site, I was going off pydoc and missed that the data_instructions were needed.)
    • Clarifying where and when s3.S3Instructions are used versus the configuration supplied in properties would be helpful, especially for folks who need to use a provider other than AWS.
    • Clarify that custom JARs are needed to use anything but the rest/glue catalogs in AWS.
    • Clarify that when you pass partition_paths into the writer instance, it actually generates new columns based on those paths, and that those columns should be removed from the table before calling write.
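
For reference, the time-based partition transforms mentioned in the first bullet can be sketched in plain Python. This is only an illustration of what the Iceberg spec describes (year, month, day, and hour values counted from the 1970-01-01 epoch). The function names below are made up for this sketch and are not part of the Deephaven or Iceberg APIs.

```python
from datetime import date, datetime, timedelta, timezone

# Epoch anchors used by the time-based transforms in the Iceberg spec.
EPOCH_DATE = date(1970, 1, 1)
EPOCH_TS = datetime(1970, 1, 1, tzinfo=timezone.utc)


def year_transform(d: date) -> int:
    """Years since 1970 (hypothetical helper name)."""
    return d.year - 1970


def month_transform(d: date) -> int:
    """Months since 1970-01 (hypothetical helper name)."""
    return (d.year - 1970) * 12 + (d.month - 1)


def day_transform(d: date) -> int:
    """Days since 1970-01-01 (hypothetical helper name)."""
    return (d - EPOCH_DATE).days


def hour_transform(ts: datetime) -> int:
    """Hours since 1970-01-01T00:00Z; expects a timezone-aware timestamp."""
    # Floor division of timedeltas also rounds toward negative infinity
    # for pre-epoch timestamps.
    return (ts - EPOCH_TS) // timedelta(hours=1)


if __name__ == "__main__":
    d = date(2024, 11, 22)
    print(year_transform(d), month_transform(d), day_transform(d))
```

A writer that supported these transforms would derive the partition value from a source date or timestamp column rather than requiring the partition column to be materialized by hand; for example, 1970-01-02 maps to day 1 under the day transform.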
