Add getting started and cookbook pages to docs (#147)
* Add getting started and cookbook pages to docs
* update lockfile
* Update changelog
* Fix Buffer -> Bytes in type hints
* Improved put mode docstring
* Finish cookbook
* bump python beta
* fix sentence
1 parent 6f0460f, commit 53911fa. Showing 11 changed files with 406 additions and 133 deletions.
@@ -0,0 +1 @@
../CHANGELOG.md
@@ -0,0 +1,205 @@
# Cookbook

## List objects

Use the [`obstore.list`][] method.

```py
import obstore as obs

# Create a Store
store = get_object_store()

# Recursively list all files below the 'data' path.
# 1. On AWS S3 this would be the 'data/' prefix
# 2. On a local filesystem, this would be the 'data' directory
prefix = "data"

# Get a stream of metadata objects:
list_stream = obs.list(store, prefix)

# Print info
for batch in list_stream:
    for meta in batch:
        print(f"Name: {meta.path}, size: {meta.size}")
```
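
If you just want the full listing in memory, the batches above can be flattened into a plain list. A minimal sketch, reusing the same hypothetical `get_object_store()` helper:

```py
import obstore as obs

store = get_object_store()

# Flatten the stream of batches into one list of object paths
all_paths = [meta.path for batch in obs.list(store, "data") for meta in batch]
print(len(all_paths), "objects found")
```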

## List objects as Arrow

The default `list` behavior creates many small Python `dict`s. When listing a large bucket, generating these Python objects can add up to a lot of overhead.

Instead, you may consider passing `return_arrow=True` to [`obstore.list`][] to return each chunk of list results as an [Arrow](https://arrow.apache.org/) [`RecordBatch`][arro3.core.RecordBatch]. This can be much faster than materializing Python objects for each row because the Arrow data can be shared zero-copy between Rust and Python.

This Arrow integration requires the [`arro3-core` dependency](https://kylebarron.dev/arro3/latest/), a lightweight Arrow implementation. You can hand the emitted `RecordBatch` to [`pyarrow`](https://arrow.apache.org/docs/python/index.html) (zero-copy) via [`pyarrow.record_batch`][] or to [`polars`](https://pola.rs/) (also zero-copy) via `polars.DataFrame`.

```py
import obstore as obs

# Create a Store
store = get_object_store()

# Get a stream of Arrow RecordBatches of metadata
list_stream = obs.list(store, prefix="data", return_arrow=True)
for record_batch in list_stream:
    print(record_batch.num_rows)
```
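
The pyarrow path is shown in the next example; for polars, a minimal sketch (assuming a recent `polars` release with Arrow PyCapsule support and the same hypothetical `get_object_store()` helper) looks like:

```py
import obstore as obs
import polars as pl

store = get_object_store()

for record_batch in obs.list(store, prefix="data", return_arrow=True):
    # polars ingests the RecordBatch through the Arrow PyCapsule interface,
    # so the data is not copied
    df = pl.DataFrame(record_batch)
    print(df.height)
    break
```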

Here's a working example with the [`sentinel-cogs` bucket](https://registry.opendata.aws/sentinel-2-l2a-cogs/) in AWS Open Data:

```py
import obstore as obs
import pandas as pd
import pyarrow as pa
from obstore.store import S3Store

store = S3Store("sentinel-cogs", region="us-west-2", skip_signature=True)
stream = obs.list(store, chunk_size=20, return_arrow=True)

for record_batch in stream:
    # Convert to pyarrow (zero-copy), then to pandas for easy export to a
    # Markdown table
    df = pa.record_batch(record_batch).to_pandas()
    print(df.iloc[:5].to_markdown(index=False))
    break
```

The Arrow record batch looks like the following:

| path | last_modified | size | e_tag | version |
|:-----|:--------------|-----:|:------|:--------|
| sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/AOT.tif | 2020-09-30 20:25:56+00:00 | 50510 | "2e24c2ee324ea478f2f272dbd3f5ce69" | |
| sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/B01.tif | 2020-09-30 20:22:48+00:00 | 1455332 | "a31b78e96748ccc2b21b827bef9850c1" | |
| sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/B02.tif | 2020-09-30 20:23:19+00:00 | 38149405 | "d7a92f88ad19761081323165649ce799-5" | |
| sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/B03.tif | 2020-09-30 20:23:52+00:00 | 38123224 | "4b938b6969f1c16e5dd685e6599f115f-5" | |
| sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/B04.tif | 2020-09-30 20:24:21+00:00 | 39033591 | "4781b581cd32b2169d0b3d22bf40a8ef-5" | |

## Fetch objects

Use the [`obstore.get`][] function to fetch data bytes from remote storage or files in the local filesystem.

```py
import obstore as obs

# Create a Store
store = get_object_store()

# Retrieve a specific file
path = "data/file01.parquet"

# Fetch just the file metadata
meta = obs.head(store, path)
print(meta)

# Fetch the object including metadata
result = obs.get(store, path)
assert result.meta == meta

# Buffer the entire object in memory
buffer = result.bytes()
assert len(buffer) == meta.size

# Alternatively stream the bytes from object storage
stream = obs.get(store, path).stream()

# We can now iterate over the stream
total_buffer_len = 0
for chunk in stream:
    total_buffer_len += len(chunk)

assert total_buffer_len == meta.size
```
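
Building on the stream above, the chunks can be written straight to a local file without buffering the whole object. A minimal sketch (the local output filename is illustrative):

```py
import obstore as obs

store = get_object_store()
path = "data/file01.parquet"

# Write the object to disk chunk by chunk
with open("file01.parquet", "wb") as f:
    for chunk in obs.get(store, path).stream():
        f.write(chunk)
```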

## Put object

Use the [`obstore.put`][] function to atomically write data. `obstore.put` will automatically use [multipart uploads](https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html) for large input data.

```py
import obstore as obs

store = get_object_store()
path = "data/file1"
content = b"hello"
obs.put(store, path, content)
```

You can also upload local files:

```py
from pathlib import Path
import obstore as obs

store = get_object_store()
path = "data/file1"
content = Path("path/to/local/file")
obs.put(store, path, content)
```

Or file-like objects:

```py
import obstore as obs

store = get_object_store()
path = "data/file1"
with open("path/to/local/file", "rb") as content:
    obs.put(store, path, content)
```

Or iterables:

```py
import obstore as obs

def bytes_iter():
    for i in range(5):
        yield b"foo"

store = get_object_store()
path = "data/file1"
content = bytes_iter()
obs.put(store, path, content)
```
Or async iterables:

```py
import obstore as obs

async def bytes_stream():
    for i in range(5):
        yield b"foo"

store = get_object_store()
path = "data/file1"
content = bytes_stream()
obs.put(store, path, content)
```

## Copy objects from one store to another

Perhaps you have data in AWS S3 that you need to copy to Google Cloud Storage. It's easy to **stream** a `get` from one store directly to the `put` of another.

!!! note
    Using the async API is required for this.

```py
import obstore as obs

store1 = get_object_store()
store2 = get_object_store()

path1 = "data/file1"
path2 = "data/file1"

# This only constructs the stream, it doesn't materialize the data in memory
resp = await obs.get_async(store1, path1, timeout="2min")

# A streaming upload is created to copy the file to path2
await obs.put_async(store2, path2, resp)
```

!!! note
    You may need to increase the download timeout for large source files. The timeout defaults to 30 seconds, which may not be long enough to upload the file to the destination.

    You may set the [`timeout` parameter][obstore.store.ClientConfig] in the `client_options` passed to the initial `get_async` call.
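
If you aren't already inside an async context, a minimal way to drive the copy above from synchronous code (a sketch, reusing the same hypothetical `get_object_store()` helper) is:

```py
import asyncio

import obstore as obs

async def copy_object(store1, store2, path1, path2):
    # Stream the source object and hand the response straight to put_async
    resp = await obs.get_async(store1, path1)
    await obs.put_async(store2, path2, resp)

asyncio.run(copy_object(get_object_store(), get_object_store(), "data/file1", "data/file1"))
```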