Support Parquet files in ShardedDataSource #764
Conversation
Looks good, thanks!
tests/test_sharded_dataset.py (Outdated)
    table = pa.Table.from_pydict(data)
    pq.write_table(table, f.name)

    try:
Why not put this within the context manager and set delete to True?
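For reference, a minimal sketch of what that suggestion looks like (the data dict and suffix here are illustrative, not the PR's actual test code):

```python
import tempfile

import pyarrow as pa
import pyarrow.parquet as pq

# Illustrative payload; the real test builds its own data dict.
data = {"text": ["hello", "world"]}

with tempfile.NamedTemporaryFile(suffix=".parquet", delete=True) as f:
    table = pa.Table.from_pydict(data)
    pq.write_table(table, f.name)
    # Exercise the data source against f.name while the file still exists;
    # cleanup happens automatically when the with-block exits.
```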
Done!
thanks for doing this!
@@ -238,6 +243,11 @@ def open_shard_at_row(self, shard_name: str, row: int) -> Iterator[str]:
                data = json.load(f)
                for doc in data[row:]:
                    yield doc[self.text_key]
            case ".parquet":
                table = pq.read_table(f)
I think read_table is expensive in the general case, and it would be better to look at the Parquet metadata to figure out which row group to start on and then use read_row_group.
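A rough sketch of that metadata-based approach (the function name and iteration details are illustrative; the actual change landed in the follow-up, which may differ):

```python
import pyarrow.parquet as pq

def iter_rows_from(path, row: int):
    """Yield rows starting at `row`, skipping whole row groups via metadata."""
    pf = pq.ParquetFile(path)
    remaining = row
    for rg in range(pf.metadata.num_row_groups):
        n_rows = pf.metadata.row_group(rg).num_rows
        if remaining >= n_rows:
            remaining -= n_rows  # this row group ends before `row`; skip it entirely
            continue
        table = pf.read_row_group(rg)  # only read the groups we actually need
        yield from table.slice(remaining).to_pylist()
        remaining = 0  # subsequent groups are read from their first row
```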
Thanks for suggesting this, incorporated this in #766
@@ -417,6 +427,24 @@ def open_shard_at_row(self, shard_name: str, row: int) -> Iterator[dict]:
        return iter(data[row:])


class ParquetDataSource(ShardedDataSource[dict]):
Ideally, the TextUrlDataSource would also work with parquet files. That is what we typically use for training configs.
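As a rough illustration, a text-oriented parquet branch could look something like this (the function name and the text_key default are placeholders; the actual TextUrlDataSource integration is handled separately):

```python
import pyarrow.parquet as pq

def iter_text_from_parquet(f, row: int, text_key: str = "text"):
    # Read only the text column, then skip ahead to `row` before yielding strings.
    table = pq.read_table(f, columns=[text_key])
    yield from table.column(text_key).slice(row).to_pylist()
```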
Makes sense, I added a new test for this in #766
This PR creates a ParquetDataSource class to support loading .parquet files. Closes #763
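For readers skimming the thread, here is a self-contained sketch in the spirit of the PR (the class name, constructor, and shard naming are simplified stand-ins; the real ParquetDataSource subclasses ShardedDataSource[dict] in the codebase):

```python
from typing import Iterator, Sequence

import pyarrow.parquet as pq

class SimpleParquetSource:
    """Toy stand-in for a parquet-backed sharded source: one parquet file per shard."""

    def __init__(self, paths: Sequence[str]):
        self.paths = list(paths)

    @property
    def shard_names(self) -> Sequence[str]:
        return [str(i) for i in range(len(self.paths))]

    def open_shard_at_row(self, shard_name: str, row: int) -> Iterator[dict]:
        path = self.paths[int(shard_name)]
        table = pq.read_table(path)  # full read; see the row-group note above for a cheaper variant
        return iter(table.slice(row).to_pylist())
```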