doc edits and bump to 0.3 (#153)
kylebarron authored Jan 16, 2025
1 parent 53911fa commit 9666866
Showing 5 changed files with 104 additions and 30 deletions.
2 changes: 1 addition & 1 deletion Cargo.lock

Some generated files are not rendered by default.

18 changes: 8 additions & 10 deletions README.md
@@ -13,19 +13,17 @@

Simple, fast integration with object storage services like Amazon S3, Google Cloud Storage, Azure Blob Storage, and S3-compliant APIs like Cloudflare R2.

- Sync and async API.
- Streaming downloads with configurable chunking.
- Streaming uploads from async or sync iterators.
- Streaming `list`, with no need to paginate.
- Sync and async API with **full type hinting**.
- **Streaming downloads** with configurable chunking.
- **Streaming uploads** from async or sync iterators.
- **Streaming list**, with no need to paginate.
- Automatically uses [**multipart uploads**](https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html) for large file objects.
- Support for **conditional put** ("put if not exists"), as well as custom tags and attributes.
- Optionally return list results as [Arrow](https://arrow.apache.org/), which is faster than materializing Python `dict`s.
- File-like object API and [fsspec](https://github.com/fsspec/filesystem_spec) integration.
- Support for conditional put ("put if not exists"), as well as custom tags and attributes.
- Automatically uses [multipart uploads](https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html) under the hood for large file objects.
- Optionally return list results as [Arrow](https://arrow.apache.org/), which is faster than materializing Python `dict`/`list` objects.
- Easy to install with no required Python dependencies.
- The [underlying Rust library](https://docs.rs/object_store) is production quality and used in large scale production systems, such as the Rust package registry [crates.io](https://crates.io/).
- Zero-copy data exchange between Rust and Python in `get_range`, `get_ranges`, `GetResult.bytes`, and `put` via the Python [buffer protocol](https://jakevdp.github.io/blog/2014/05/05/introduction-to-the-python-buffer-protocol/).
- Simple API with static type checking.
- Helpers for constructing from environment variables and `boto3.Session` objects
- Zero-copy data exchange between Rust and Python via the [buffer protocol](https://jakevdp.github.io/blog/2014/05/05/introduction-to-the-python-buffer-protocol/).

<!-- For Rust developers looking to add object_store support to their Python packages, refer to pyo3-object_store. -->
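
As a quick, hedged illustration of the zero-copy claim in the list above (this example is not part of the commit; `get_object_store()` is a placeholder like the one used in the cookbook below), the buffer returned by `GetResult.bytes` can be wrapped by NumPy without copying:

```py
import numpy as np
import obstore as obs

store = get_object_store()  # placeholder for any configured store

# GetResult.bytes() returns an object implementing the buffer protocol,
# so NumPy can view the bytes without an extra copy.
buffer = obs.get(store, "data/file1").bytes()
arr = np.frombuffer(buffer, dtype=np.uint8)
print(len(arr))
```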

107 changes: 92 additions & 15 deletions docs/cookbook.md
@@ -65,13 +65,13 @@ for record_batch in stream:

The Arrow record batch looks like the following:

| path | last_modified | size | e_tag | version |
|:--------------------------------------------------------------------|:--------------------------|---------:|:-------------------------------------|:----------|
| sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/AOT.tif | 2020-09-30 20:25:56+00:00 | 50510 | "2e24c2ee324ea478f2f272dbd3f5ce69" | |
| sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/B01.tif | 2020-09-30 20:22:48+00:00 | 1455332 | "a31b78e96748ccc2b21b827bef9850c1" | |
| sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/B02.tif | 2020-09-30 20:23:19+00:00 | 38149405 | "d7a92f88ad19761081323165649ce799-5" | |
| sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/B03.tif | 2020-09-30 20:23:52+00:00 | 38123224 | "4b938b6969f1c16e5dd685e6599f115f-5" | |
| sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/B04.tif | 2020-09-30 20:24:21+00:00 | 39033591 | "4781b581cd32b2169d0b3d22bf40a8ef-5" | |
| path | last_modified | size | e_tag | version |
| :------------------------------------------------------------------ | :------------------------ | -------: | :----------------------------------- | :------ |
| sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/AOT.tif | 2020-09-30 20:25:56+00:00 | 50510 | "2e24c2ee324ea478f2f272dbd3f5ce69" | |
| sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/B01.tif | 2020-09-30 20:22:48+00:00 | 1455332 | "a31b78e96748ccc2b21b827bef9850c1" | |
| sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/B02.tif | 2020-09-30 20:23:19+00:00 | 38149405 | "d7a92f88ad19761081323165649ce799-5" | |
| sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/B03.tif | 2020-09-30 20:23:52+00:00 | 38123224 | "4b938b6969f1c16e5dd685e6599f115f-5" | |
| sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/B04.tif | 2020-09-30 20:24:21+00:00 | 39033591 | "4781b581cd32b2169d0b3d22bf40a8ef-5" | |
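
As a hedged sketch (not part of this commit), such record batches can be aggregated with `pyarrow.compute` directly, without converting rows to Python dicts; `stream` is assumed to be the Arrow-producing list stream from the loop above:

```py
import pyarrow.compute as pc

total_rows = 0
total_bytes = 0
for record_batch in stream:
    total_rows += record_batch.num_rows
    # Sum the "size" column on the Arrow data itself
    total_bytes += pc.sum(record_batch.column("size")).as_py()

print(f"{total_rows} objects totalling {total_bytes} bytes")
```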

## Fetch objects

@@ -109,6 +109,21 @@ for chunk in stream:
assert total_buffer_len == meta.size
```

### Download to disk

Using the response as an iterator ensures that we don't buffer the entire file
into memory.

```py
import obstore as obs

resp = obs.get(store, path)

with open("output/file", "wb") as f:
for chunk in resp:
f.write(chunk)
```

## Put object

Use the [`obstore.put`][] function to atomically write data. `obstore.put` will automatically use [multipart uploads](https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html) for large input data.
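
As a minimal sketch (not part of this diff; `get_object_store()` is the same placeholder used elsewhere on this page), a small in-memory payload can be written like so:

```py
import obstore as obs

store = get_object_store()  # placeholder for any configured store

# Small payloads go up in a single request; large inputs are split into
# multipart uploads automatically.
obs.put(store, "data/file1", b"hello world")
```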
@@ -160,7 +175,6 @@ content = bytes_iter()
obs.put(store, path, content)
```


Or async iterables:

```py
@@ -178,10 +192,55 @@

## Copy objects from one store to another

Perhaps you have data in AWS S3 that you need to copy to Google Cloud Storage. It's easy to **stream** a `get` from one store directly to the `put` of another.
Perhaps you have data in one store, say AWS S3, that you need to copy to another, say Google Cloud Storage.

### In memory

Download the file, collect its bytes in memory, then upload it. Note that this will materialize the entire file in memory.

```py
import obstore as obs

store1 = get_object_store()
store2 = get_object_store()

path1 = "data/file1"
path2 = "data/file2"

buffer = obs.get(store1, path1).bytes()
obs.put(store2, path2, buffer)
```

### Local file

First download the file to disk, then upload it.

```py
from pathlib import Path
import obstore as obs

store1 = get_object_store()
store2 = get_object_store()

path1 = "data/file1"
path2 = "data/file2"

resp = obs.get(store1, path1)

with open("temporary_file", "wb") as f:
for chunk in resp:
f.write(chunk)

# Upload the path
obs.put(store2, path2, Path("temporary_file"))
```
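
A hedged variant of the sketch above (not part of this commit) uses a temporary directory so the scratch file is cleaned up automatically:

```py
import tempfile
from pathlib import Path
import obstore as obs

store1 = get_object_store()
store2 = get_object_store()

with tempfile.TemporaryDirectory() as tmpdir:
    scratch = Path(tmpdir) / "scratch"

    # Stream the download to disk, then upload from the local path
    resp = obs.get(store1, "data/file1")
    with open(scratch, "wb") as f:
        for chunk in resp:
            f.write(chunk)

    obs.put(store2, "data/file2", scratch)
```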

### Streaming

It's easy to **stream** a download from one store directly as the upload to another. Only the data currently in flight is buffered in memory, not the whole file.

!!! note
Using the async API is required for this.
Using the async API is currently required to use streaming copies.

```py
import obstore as obs
@@ -190,16 +249,34 @@ store1 = get_object_store()
store2 = get_object_store()

path1 = "data/file1"
path2 = "data/file1"
path2 = "data/file2"

# This only constructs the stream; it doesn't materialize the data in memory
resp = await obs.get_async(store1, path1, timeout="2min")
resp = await obs.get_async(store1, path1)
# A streaming upload is created to copy the file to path2
await obs.put_async(store2, path2, resp, chunk_size=chunk_size)
```

Or, by customizing the chunk size and the upload concurrency, you can control memory overhead.

```py
resp = await obs.get_async(store1, path1)
chunk_size = 5 * 1024 * 1024 # 5MB
stream = resp.stream(min_chunk_size=chunk_size)

# A streaming upload is created to copy the file to path2
await obs.put_async(store2, path2)
await obs.put_async(
store2,
path2,
stream,
chunk_size=chunk_size,
max_concurrency=12
)
```

This will start up to 12 concurrent uploads, each with around 5MB chunks, giving a total memory usage of up to _roughly_ 60MB for this copy.

!!! note
You may need to increase the download timeout for large source files. The timeout defaults to 30 seconds, which may not be long enough to upload the file to the destination.

You may set the [`timeout` parameter][obstore.store.ClientConfig] in the `client_options` passed to the initial `get_async` call.
You may set the [`timeout` parameter][obstore.store.ClientConfig] in the `client_options` passed when creating the store.
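
For example (a hedged sketch, not part of this commit: the store class, bucket name, and timeout value are illustrative), the timeout is set via `client_options` when constructing the store:

```py
from obstore.store import S3Store

# "timeout" accepts humantime strings such as "30s" or "10min"
store1 = S3Store("my-source-bucket", client_options={"timeout": "10min"})
```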
2 changes: 1 addition & 1 deletion obstore/Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "obstore"
version = "0.3.0-beta.11"
version = "0.3.0"
authors = { workspace = true }
edition = { workspace = true }
description = "A Python interface to the Rust object_store crate, providing a uniform API for interacting with object storage services and local files."
5 changes: 2 additions & 3 deletions obstore/python/obstore/_get.pyi
@@ -223,9 +223,8 @@ class BytesStream:
}
```
To fix this, set the `timeout` parameter in the `client_options` passed to the
initial `get` or `get_async` call. See
[ClientConfig][obstore.store.ClientConfig].
To fix this, set the `timeout` parameter in the
[`client_options`][obstore.store.ClientConfig] passed when creating the store.
"""

def __aiter__(self) -> BytesStream:
