Support Parquet files in ShardedDataSource #764

nikil-ravi · 2024-10-13T22:59:24Z

This PR creates a ParquetDataSource class to support loading .parquet files.
Closes #763

abhinavg4

Looks good, thanks

percyliang · 2024-10-14T01:52:21Z

tests/test_sharded_dataset.py

+        table = pa.Table.from_pydict(data)
+        pq.write_table(table, f.name)
+
+    try:


Why not put this within the context manager and set delete to True?

dlwh

thanks for doing this!

dlwh · 2024-10-14T06:19:38Z

src/levanter/data/sharded_datasource.py

@@ -238,6 +243,11 @@ def open_shard_at_row(self, shard_name: str, row: int) -> Iterator[str]:
                    data = json.load(f)
                    for doc in data[row:]:
                        yield doc[self.text_key]
+                case ".parquet":
+                    table = pq.read_table(f)


read_table is expensive in the general case. It would be better to look at the metadata to figure out which row group to start on and then use read_row_group, I think.

nikil-ravi

Thanks for suggesting this, incorporated this in #766

dlwh · 2024-10-14T06:20:26Z

src/levanter/data/sharded_datasource.py

@@ -417,6 +427,24 @@ def open_shard_at_row(self, shard_name: str, row: int) -> Iterator[dict]:
            return iter(data[row:])


+class ParquetDataSource(ShardedDataSource[dict]):


Ideally the TextUrlDataSource would also work with parquet files. That is what we typically use for training configs.

nikil-ravi

Makes sense, I added a new test for this in #766


nikil-ravi added 4 commits October 13, 2024 15:01

add parquet support

52bff4f

lint, shard name fix

af78281

pre-commit

8d09cfd

read as binary file

50715e9

abhinavg4 approved these changes Oct 13, 2024


nikil-ravi requested a review from dlwh October 13, 2024 23:30

percyliang reviewed Oct 14, 2024


percyliang approved these changes Oct 14, 2024


simplify test

3fe8995

nikil-ravi merged commit fc26c74 into main Oct 14, 2024
8 checks passed

nikil-ravi deleted the nikil/parquet branch October 14, 2024 02:19

dlwh reviewed Oct 14, 2024


This was referenced Oct 15, 2024

support loading parquet files in addition to jsonl.gz #763

Closed

Changes in how parquet is read #766

Merged

dlwh pushed a commit that referenced this pull request Oct 15, 2024

Changes in how parquet is read (#766)

877ca7e

Closes #763 and addresses David's comments in #764
