doc edits and bump to 0.3 (#153)
kylebarron authored Jan 16, 2025
1 parent 53911fa commit 9666866
Showing 5 changed files with 104 additions and 30 deletions.
2 changes: 1 addition & 1 deletion Cargo.lock

Some generated files are not rendered by default.

18 changes: 8 additions & 10 deletions README.md
@@ -13,19 +13,17 @@

Simple, fast integration with object storage services like Amazon S3, Google Cloud Storage, Azure Blob Storage, and S3-compliant APIs like Cloudflare R2.

- Sync and async API.
- Streaming downloads with configurable chunking.
- Streaming uploads from async or sync iterators.
- Streaming `list`, with no need to paginate.
- Sync and async API with **full type hinting**.
- **Streaming downloads** with configurable chunking.
- **Streaming uploads** from async or sync iterators.
- **Streaming list**, with no need to paginate.
- Automatically uses [**multipart uploads**](https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html) for large file objects.
- Support for **conditional put** ("put if not exists"), as well as custom tags and attributes.
- Optionally return list results as [Arrow](https://arrow.apache.org/), which is faster than materializing Python `dict`s.
- File-like object API and [fsspec](https://github.com/fsspec/filesystem_spec) integration.
- Support for conditional put ("put if not exists"), as well as custom tags and attributes.
- Automatically uses [multipart uploads](https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html) under the hood for large file objects.
- Optionally return list results as [Arrow](https://arrow.apache.org/), which is faster than materializing Python `dict`/`list` objects.
- Easy to install with no required Python dependencies.
- The [underlying Rust library](https://docs.rs/object_store) is production quality and used in large scale production systems, such as the Rust package registry [crates.io](https://crates.io/).
- Zero-copy data exchange between Rust and Python in `get_range`, `get_ranges`, `GetResult.bytes`, and `put` via the Python [buffer protocol](https://jakevdp.github.io/blog/2014/05/05/introduction-to-the-python-buffer-protocol/).
- Simple API with static type checking.
- Helpers for constructing from environment variables and `boto3.Session` objects
- Zero-copy data exchange between Rust and Python via the [buffer protocol](https://jakevdp.github.io/blog/2014/05/05/introduction-to-the-python-buffer-protocol/).

<!-- For Rust developers looking to add object_store support to their Python packages, refer to pyo3-object_store. -->
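
As a quick, hedged illustration of the zero-copy claim in the list above (this example is not part of the commit; `get_object_store()` is a placeholder like the one used in the cookbook below), the buffer returned by `GetResult.bytes` can be wrapped by NumPy without copying:

```py
import numpy as np
import obstore as obs

store = get_object_store()  # placeholder for any configured store

# GetResult.bytes() returns an object implementing the buffer protocol,
# so NumPy can view the bytes without an extra copy.
buffer = obs.get(store, "data/file1").bytes()
arr = np.frombuffer(buffer, dtype=np.uint8)
print(len(arr))
```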

107 changes: 92 additions & 15 deletions docs/cookbook.md
@@ -65,13 +65,13 @@ for record_batch in stream:

The Arrow record batch looks like the following:

| path | last_modified | size | e_tag | version |
|:--------------------------------------------------------------------|:--------------------------|---------:|:-------------------------------------|:----------|
| sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/AOT.tif | 2020-09-30 20:25:56+00:00 | 50510 | "2e24c2ee324ea478f2f272dbd3f5ce69" | |
| sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/B01.tif | 2020-09-30 20:22:48+00:00 | 1455332 | "a31b78e96748ccc2b21b827bef9850c1" | |
| sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/B02.tif | 2020-09-30 20:23:19+00:00 | 38149405 | "d7a92f88ad19761081323165649ce799-5" | |
| sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/B03.tif | 2020-09-30 20:23:52+00:00 | 38123224 | "4b938b6969f1c16e5dd685e6599f115f-5" | |
| sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/B04.tif | 2020-09-30 20:24:21+00:00 | 39033591 | "4781b581cd32b2169d0b3d22bf40a8ef-5" | |
| path | last_modified | size | e_tag | version |
| :------------------------------------------------------------------ | :------------------------ | -------: | :----------------------------------- | :------ |
| sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/AOT.tif | 2020-09-30 20:25:56+00:00 | 50510 | "2e24c2ee324ea478f2f272dbd3f5ce69" | |
| sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/B01.tif | 2020-09-30 20:22:48+00:00 | 1455332 | "a31b78e96748ccc2b21b827bef9850c1" | |
| sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/B02.tif | 2020-09-30 20:23:19+00:00 | 38149405 | "d7a92f88ad19761081323165649ce799-5" | |
| sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/B03.tif | 2020-09-30 20:23:52+00:00 | 38123224 | "4b938b6969f1c16e5dd685e6599f115f-5" | |
| sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/B04.tif | 2020-09-30 20:24:21+00:00 | 39033591 | "4781b581cd32b2169d0b3d22bf40a8ef-5" | |
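
As a hedged sketch (not part of this commit), such record batches can be aggregated with `pyarrow.compute` directly, without converting rows to Python dicts; `stream` is assumed to be the Arrow-producing list stream from the loop above:

```py
import pyarrow.compute as pc

total_rows = 0
total_bytes = 0
for record_batch in stream:
    total_rows += record_batch.num_rows
    # Sum the "size" column on the Arrow data itself
    total_bytes += pc.sum(record_batch.column("size")).as_py()

print(f"{total_rows} objects totalling {total_bytes} bytes")
```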

## Fetch objects

@@ -109,6 +109,21 @@ for chunk in stream:
assert total_buffer_len == meta.size
```

### Download to disk

Using the response as an iterator ensures that we don't buffer the entire file
into memory.

```py
import obstore as obs

resp = obs.get(store, path)

with open("output/file", "wb") as f:
for chunk in resp:
f.write(chunk)
```

## Put object

Use the [`obstore.put`][] function to atomically write data. `obstore.put` will automatically use [multipart uploads](https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html) for large input data.
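
As a minimal sketch (not part of this diff; `get_object_store()` is the same placeholder used elsewhere on this page), a small in-memory payload can be written like so:

```py
import obstore as obs

store = get_object_store()  # placeholder for any configured store

# Small payloads go up in a single request; large inputs are split into
# multipart uploads automatically.
obs.put(store, "data/file1", b"hello world")
```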
@@ -160,7 +175,6 @@ content = bytes_iter()
obs.put(store, path, content)
```


Or async iterables:

```py
@@ -178,10 +192,55 @@

## Copy objects from one store to another

Perhaps you have data in AWS S3 that you need to copy to Google Cloud Storage. It's easy to **stream** a `get` from one store directly to the `put` of another.
Perhaps you have data in one store, say AWS S3, that you need to copy to another, say Google Cloud Storage.

### In memory

Download the file, collect its bytes in memory, then upload it. Note that this will materialize the entire file in memory.

```py
import obstore as obs

store1 = get_object_store()
store2 = get_object_store()

path1 = "data/file1"
path2 = "data/file2"

buffer = obs.get(store1, path1).bytes()
obs.put(store2, path2, buffer)
```

### Local file

First download the file to disk, then upload it.

```py
from pathlib import Path
import obstore as obs

store1 = get_object_store()
store2 = get_object_store()

path1 = "data/file1"
path2 = "data/file2"

resp = obs.get(store1, path1)

with open("temporary_file", "wb") as f:
for chunk in resp:
f.write(chunk)

# Upload the path
obs.put(store2, path2, Path("temporary_file"))
```
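
A hedged variant of the sketch above (not part of this commit) uses a temporary directory so the scratch file is cleaned up automatically:

```py
import tempfile
from pathlib import Path
import obstore as obs

store1 = get_object_store()
store2 = get_object_store()

with tempfile.TemporaryDirectory() as tmpdir:
    scratch = Path(tmpdir) / "scratch"

    # Stream the download to disk, then upload from the local path
    resp = obs.get(store1, "data/file1")
    with open(scratch, "wb") as f:
        for chunk in resp:
            f.write(chunk)

    obs.put(store2, "data/file2", scratch)
```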

### Streaming

It's easy to **stream** a download from one store directly as the upload to another. Only the data currently in flight is buffered in memory, not the whole file.

!!! note
Using the async API is required for this.
Using the async API is currently required to use streaming copies.

```py
import obstore as obs
@@ -190,16 +249,34 @@ store1 = get_object_store()
store2 = get_object_store()

path1 = "data/file1"
path2 = "data/file1"
path2 = "data/file2"

# This only constructs the stream; it doesn't materialize the data in memory
resp = await obs.get_async(store1, path1, timeout="2min")
resp = await obs.get_async(store1, path1)
# A streaming upload is created to copy the file to path2
await obs.put_async(store2, path2, resp, chunk_size=chunk_size)
```

Or, by customizing the chunk size and the upload concurrency, you can control memory overhead.

```py
resp = await obs.get_async(store1, path1)
chunk_size = 5 * 1024 * 1024 # 5MB
stream = resp.stream(min_chunk_size=chunk_size)

# A streaming upload is created to copy the file to path2
await obs.put_async(store2, path2)
await obs.put_async(
store2,
path2,
stream,
chunk_size=chunk_size,
max_concurrency=12
)
```

This will start up to 12 concurrent uploads, each with around 5MB chunks, giving a total memory usage of up to _roughly_ 60MB for this copy.

!!! note
You may need to increase the download timeout for large source files. The timeout defaults to 30 seconds, which may not be long enough to upload the file to the destination.

You may set the [`timeout` parameter][obstore.store.ClientConfig] in the `client_options` passed to the initial `get_async` call.
You may set the [`timeout` parameter][obstore.store.ClientConfig] in the `client_options` passed when creating the store.
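
For example (a hedged sketch, not part of this commit: the store class, bucket name, and timeout value are illustrative), the timeout is set via `client_options` when constructing the store:

```py
from obstore.store import S3Store

# "timeout" accepts humantime strings such as "30s" or "10min"
store1 = S3Store("my-source-bucket", client_options={"timeout": "10min"})
```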
2 changes: 1 addition & 1 deletion obstore/Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "obstore"
version = "0.3.0-beta.11"
version = "0.3.0"
authors = { workspace = true }
edition = { workspace = true }
description = "A Python interface to the Rust object_store crate, providing a uniform API for interacting with object storage services and local files."
5 changes: 2 additions & 3 deletions obstore/python/obstore/_get.pyi
@@ -223,9 +223,8 @@ class BytesStream:
}
```
To fix this, set the `timeout` parameter in the `client_options` passed to the
initial `get` or `get_async` call. See
[ClientConfig][obstore.store.ClientConfig].
To fix this, set the `timeout` parameter in the
[`client_options`][obstore.store.ClientConfig] passed when creating the store.
"""

def __aiter__(self) -> BytesStream:
