Add Backblaze data set mirrors
szarnyasg committed Jan 10, 2025
1 parent 4972b9a commit 2d9f20e
Showing 1 changed file with 1 addition and 1 deletion.
_posts/2025-01-10-union-by-name.md (2 changes: 1 addition & 1 deletion)
@@ -205,7 +205,7 @@ The benchmark requires reading 16 GB of CSV files stored on S3 that have changin
The intent behind it is to process large datasets on small commodity hardware (which is a use case where we want to see DuckDB be helpful!).
The original post uses Linode, but for this post we selected the most similar AWS instance having the same amount of memory (c5d.large).

- The CSV files are sources from [this Backblaze dataset](https://www.backblaze.com/cloud-storage/resources/hard-drive-test-data#downloadingTheRawTestData) and loaded into an S3 bucket.
+ We use two quarters' of CSV files from the [Backblaze dataset](https://www.backblaze.com/cloud-storage/resources/hard-drive-test-data#downloadingTheRawTestData) ([2023 Q2](https://blobs.duckdb.org/data/backblaze-data-2023-Q2.zip) and [2023 Q3](https://blobs.duckdb.org/data/backblaze-data-2023-Q3.zip)), which are placed in an S3 bucket.

I modified the query [from here](https://dataengineeringcentral.substack.com/i/141997113/duckdb-reading-gb-from-s-on-a-gb-machine) very slightly to remove the `ignore_errors = true` option.
The benchmark continued to use Python, but I'm just showing the SQL here for better syntax highlighting:
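For context, the changed paragraph describes reading the Backblaze CSVs directly from S3 with DuckDB; the post shows the actual SQL. The sketch below is only an illustration of a query of that shape, not the post's query: the bucket path and table name are placeholders, and S3 credential/region setup is omitted.

```sql
-- Illustrative sketch only, not the query from the post.
-- Assumes the Backblaze CSVs were uploaded to a hypothetical bucket path.
INSTALL httpfs;  -- S3 access (credential/region configuration omitted)
LOAD httpfs;

CREATE TABLE drive_stats AS
SELECT *
FROM read_csv(
    's3://my-bucket/backblaze/*.csv',
    union_by_name = true  -- align the files' changing schemas by column name
);
-- The query linked in the post originally also passed ignore_errors = true,
-- which the benchmark run described here removes.
```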
