Add getting started and cookbook pages to docs (#147)
* Add getting started and cookbook pages to docs

* Update lockfile

* Update changelog

* Fix `Buffer` -> `Bytes` in type hints

* Improved `put` mode docstring

* Finish cookbook

* Bump Python beta

* Fix sentence
kylebarron authored Jan 16, 2025
1 parent 6f0460f commit 53911fa
Showing 11 changed files with 406 additions and 133 deletions.
47 changes: 44 additions & 3 deletions CHANGELOG.md
@@ -1,10 +1,51 @@
# Changelog

## [0.3.0] - 2025-01-16

### New Features :magic_wand:

- **Streaming uploads**. `obstore.put` now supports iterable input, and `obstore.put_async` now supports async iterable input. This means you can pass the output of `obstore.get_async` directly into `obstore.put_async`. by @kylebarron in https://github.com/developmentseed/obstore/pull/54
- **Allow passing config options directly** as keyword arguments. Previously, you had to pass all options as a `dict` into the `config` parameter. Now you can pass the elements directly to the store constructor. by @kylebarron in https://github.com/developmentseed/obstore/pull/144
- **Readable file-like objects**. Open a readable file-like object with `obstore.open` and `obstore.open_async`. by @kylebarron in https://github.com/developmentseed/obstore/pull/33
- **Fsspec integration** by @martindurant in https://github.com/developmentseed/obstore/pull/63
- Prefix store by @kylebarron in https://github.com/developmentseed/obstore/pull/117
- Python 3.13 wheels by @kylebarron in https://github.com/developmentseed/obstore/pull/95
- Support python timedelta objects as duration config values by @kylebarron in https://github.com/developmentseed/obstore/pull/146
- Add class constructors for store builders. Each store now has an `__init__` method, for easier construction (see the example below). by @kylebarron in https://github.com/developmentseed/obstore/pull/141
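
  A minimal sketch of the new construction ergonomics (bucket name and option values are illustrative, not taken from this release's docs):

  ```py
  from datetime import timedelta

  from obstore.store import S3Store

  # Config options passed directly as keyword arguments, with a timedelta
  # used as a duration value (assumes `timeout` accepts a timedelta here).
  store = S3Store(
      "bucket-name",
      region="us-east-1",
      client_options={"timeout": timedelta(minutes=2)},
  )
  ```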

### Breaking changes :wrench:

- `get_range`, `get_range_async`, `get_ranges`, and `get_ranges_async` now use **start/end** instead of **offset/length**. This is for consistency with the `range` option of `obstore.get`. by @kylebarron in https://github.com/developmentseed/obstore/pull/71
- Return `Bytes` from `GetResult.bytes()` by @kylebarron in https://github.com/developmentseed/obstore/pull/134

### Bug fixes :bug:

- boto3 region name can be None by @kylebarron in https://github.com/developmentseed/obstore/pull/59
- add missing py.typed file by @gruebel in https://github.com/developmentseed/obstore/pull/115

### Documentation :book:

- FastAPI/Starlette example by @kylebarron in https://github.com/developmentseed/obstore/pull/145
- Add conda installation doc to README by @kylebarron in https://github.com/developmentseed/obstore/pull/78
- Document suggested lifecycle rules for aborted multipart uploads by @kylebarron in https://github.com/developmentseed/obstore/pull/139
- Add type hint and documentation for requester pays by @kylebarron in https://github.com/developmentseed/obstore/pull/131
- Add note that S3Store can be constructed without boto3 by @kylebarron in https://github.com/developmentseed/obstore/pull/108
- HTTP Store usage example by @kylebarron in https://github.com/developmentseed/obstore/pull/142

### What's Changed

- Improved docs for from_url by @kylebarron in https://github.com/developmentseed/obstore/pull/138
- Implement read_all for async iterable by @kylebarron in https://github.com/developmentseed/obstore/pull/140

### New Contributors

- @willemarcel made their first contribution in https://github.com/developmentseed/obstore/pull/64
- @martindurant made their first contribution in https://github.com/developmentseed/obstore/pull/63
- @norlandrhagen made their first contribution in https://github.com/developmentseed/obstore/pull/107
- @gruebel made their first contribution in https://github.com/developmentseed/obstore/pull/115

**Full Changelog**: https://github.com/developmentseed/obstore/compare/py-v0.2.0...py-v0.3.0

## [0.2.0] - 2024-10-25

2 changes: 1 addition & 1 deletion Cargo.lock


113 changes: 2 additions & 111 deletions README.md
@@ -23,7 +23,7 @@ Simple, fast integration with object storage services like Amazon S3, Google Cloud Storage
- Optionally return list results as [Arrow](https://arrow.apache.org/), which is faster than materializing Python `dict`/`list` objects.
- Easy to install with no required Python dependencies.
- The [underlying Rust library](https://docs.rs/object_store) is production quality and used in large scale production systems, such as the Rust package registry [crates.io](https://crates.io/).
- Zero-copy data exchange between Rust and Python in `get_range`, `get_ranges`, `GetResult.bytes`, and `put` via the Python buffer protocol.
- Zero-copy data exchange between Rust and Python in `get_range`, `get_ranges`, `GetResult.bytes`, and `put` via the Python [buffer protocol](https://jakevdp.github.io/blog/2014/05/05/introduction-to-the-python-buffer-protocol/).
- Simple API with static type checking.
- Helpers for constructing from environment variables and `boto3.Session` objects.

@@ -47,113 +47,4 @@ conda install -c conda-forge obstore

[Full documentation is available on the website](https://developmentseed.org/obstore).

## Usage

### Constructing a store

Classes to construct a store are exported from the `obstore.store` submodule:

- [`S3Store`](https://developmentseed.org/obstore/latest/api/store/aws/): Configure a connection to Amazon S3.
- [`GCSStore`](https://developmentseed.org/obstore/latest/api/store/gcs/): Configure a connection to Google Cloud Storage.
- [`AzureStore`](https://developmentseed.org/obstore/latest/api/store/azure/): Configure a connection to Microsoft Azure Blob Storage.
- [`HTTPStore`](https://developmentseed.org/obstore/latest/api/store/http/): Configure a connection to a generic HTTP server.
- [`LocalStore`](https://developmentseed.org/obstore/latest/api/store/local/): Local filesystem storage providing the same object store interface.
- [`MemoryStore`](https://developmentseed.org/obstore/latest/api/store/memory/): A fully in-memory implementation of ObjectStore.

Additionally, some middlewares exist:

- [`PrefixStore`](https://developmentseed.org/obstore/latest/api/store/middleware/#obstore.store.PrefixStore): Store wrapper that applies a constant prefix to all paths handled by the store.

#### Example

```py
from obstore.store import S3Store

store = S3Store("bucket-name", region="us-east-1")
```

#### Configuration

Each store class above has its own configuration, accessible through the `config` named parameter. The available options are covered in each store's documentation, and the permitted string literals are included in the type hints.

Additional [HTTP client configuration](https://developmentseed.org/obstore/latest/api/store/config/) is available via the `client_options` named parameter.
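
For example, a sketch with one `config` entry and a longer HTTP timeout (the key names here are illustrative assumptions; consult each store's reference for the exact options):

```py
from obstore.store import S3Store

store = S3Store(
    "bucket-name",
    config={"aws_region": "us-east-1"},
    client_options={"timeout": "60s"},
)
```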

### Interacting with a store

All methods for interacting with a store are exported as **top-level functions** (not methods on the `store` object):

- [`copy`](https://developmentseed.org/obstore/latest/api/copy/): Copy an object from one path to another in the same object store.
- [`delete`](https://developmentseed.org/obstore/latest/api/delete/): Delete the object at the specified location.
- [`get`](https://developmentseed.org/obstore/latest/api/get/): Return the bytes that are stored at the specified location.
- [`head`](https://developmentseed.org/obstore/latest/api/head/): Return the metadata for the specified location.
- [`list`](https://developmentseed.org/obstore/latest/api/list/): List all the objects with the given prefix.
- [`put`](https://developmentseed.org/obstore/latest/api/put/): Save the provided buffer to the specified location.
- [`rename`](https://developmentseed.org/obstore/latest/api/rename/): Move an object from one path to another in the same object store.

There are a few additional APIs useful for specific use cases:

- [`get_range`](https://developmentseed.org/obstore/latest/api/get/#obstore.get_range): Get a specific byte range from a file.
- [`get_ranges`](https://developmentseed.org/obstore/latest/api/get/#obstore.get_ranges): Get multiple byte ranges from a single file.
- [`list_with_delimiter`](https://developmentseed.org/obstore/latest/api/list/#obstore.list_with_delimiter): List objects within a specific directory.
- [`sign`](https://developmentseed.org/obstore/latest/api/sign/): Create a signed URL (example below).
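
For instance, creating a presigned URL might look like the following sketch (the bucket, path, and expiry are placeholders, and signing requires credentials that permit it):

```py
from datetime import timedelta

import obstore as obs
from obstore.store import S3Store

store = S3Store("bucket-name", region="us-east-1")

# Presigned GET URL, valid for 30 minutes
url = obs.sign(store, "GET", "data/file.txt", timedelta(minutes=30))
print(url)
```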

File-like object support is also provided:

- [`open`](https://developmentseed.org/obstore/latest/api/file/#obstore.open): Open a remote object as a Python file-like object (see the sketch below).
- [`AsyncFsspecStore`](https://developmentseed.org/obstore/latest/api/fsspec/#obstore.fsspec.AsyncFsspecStore) adapter for use with [`fsspec`](https://github.com/fsspec/filesystem_spec).
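
A minimal sketch of the readable file API, using an in-memory store so it runs standalone:

```py
import obstore as obs
from obstore.store import MemoryStore

store = MemoryStore()
obs.put(store, "data.txt", b"hello world!")

# File-like: supports read/seek/tell
f = obs.open(store, "data.txt")
print(f.read(5))
```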

All methods have a comparable async method with the same name plus an `_async` suffix.

#### Example

```py
import obstore as obs

store = obs.store.MemoryStore()

obs.put(store, "file.txt", b"hello world!")
response = obs.get(store, "file.txt")
response.meta
# {'path': 'file.txt',
# 'last_modified': datetime.datetime(2024, 10, 21, 16, 19, 45, 102620, tzinfo=datetime.timezone.utc),
# 'size': 12,
# 'e_tag': '0',
# 'version': None}
assert response.bytes() == b"hello world!"

byte_range = obs.get_range(store, "file.txt", start=0, end=5)
assert byte_range == b"hello"

obs.copy(store, "file.txt", "other.txt")
assert obs.get(store, "other.txt").bytes() == b"hello world!"
```

All of these methods also have `async` counterparts, suffixed with `_async`.

```py
import obstore as obs

store = obs.store.MemoryStore()

await obs.put_async(store, "file.txt", b"hello world!")
response = await obs.get_async(store, "file.txt")
response.meta
# {'path': 'file.txt',
# 'last_modified': datetime.datetime(2024, 10, 21, 16, 20, 36, 477418, tzinfo=datetime.timezone.utc),
# 'size': 12,
# 'e_tag': '0',
# 'version': None}
assert await response.bytes_async() == b"hello world!"

byte_range = await obs.get_range_async(store, "file.txt", start=0, end=5)
assert byte_range == b"hello"

await obs.copy_async(store, "file.txt", "other.txt")
resp = await obs.get_async(store, "other.txt")
assert await resp.bytes_async() == b"hello world!"
```

## Comparison to object-store-python

[Read a detailed comparison](https://github.com/roeap/object-store-python/issues/24#issuecomment-2422689636) to [`object-store-python`](https://github.com/roeap/object-store-python), a previous Python library that also wraps the same Rust `object_store` crate.
Head to [Getting Started](https://developmentseed.org/obstore/latest/getting-started/) to dig in.
1 change: 1 addition & 0 deletions docs/CHANGELOG.md
205 changes: 205 additions & 0 deletions docs/cookbook.md
@@ -0,0 +1,205 @@
# Cookbook

## List objects

Use the [`obstore.list`][] method.

```py
import obstore as obs

# Create a Store
store = get_object_store()

# Recursively list all files below the 'data' path.
# 1. On AWS S3 this would be the 'data/' prefix
# 2. On a local filesystem, this would be the 'data' directory
prefix = "data"

# Get a stream of metadata objects:
list_stream = obs.list(store, prefix)

# Print info
for batch in list_stream:
    for meta in batch:
        print(f"Name: {meta.path}, size: {meta.size}")
```

## List objects as Arrow

The default `list` behavior creates many small Python `dict`s. When listing a large bucket, generating these Python objects can add up to a lot of overhead.

Instead, you may consider passing `return_arrow=True` to [`obstore.list`][] to return each chunk of list results as an [Arrow](https://arrow.apache.org/) [`RecordBatch`][arro3.core.RecordBatch]. This can be much faster than materializing Python objects for each row because Arrow can be shared zero-copy between Rust and Python.

This Arrow integration requires the [`arro3-core` dependency](https://kylebarron.dev/arro3/latest/), a lightweight Arrow implementation. You can pass the emitted `RecordBatch` to [`pyarrow`](https://arrow.apache.org/docs/python/index.html) (zero-copy) by passing it to [`pyarrow.record_batch`][] or to [`polars`](https://pola.rs/) (also zero-copy) by passing it to `polars.DataFrame`.

```py
import obstore as obs

# Create a Store
store = get_object_store()

# Get a stream of Arrow RecordBatches of metadata
list_stream = obs.list(store, prefix="data", return_arrow=True)
for record_batch in list_stream:
    print(record_batch.num_rows)
```

Here's a working example with the [`sentinel-cogs` bucket](https://registry.opendata.aws/sentinel-2-l2a-cogs/) in AWS Open Data:

```py
import obstore as obs
import pandas as pd
import pyarrow as pa
from obstore.store import S3Store

store = S3Store("sentinel-cogs", region="us-west-2", skip_signature=True)
stream = obs.list(store, chunk_size=20, return_arrow=True)

for record_batch in stream:
    # Convert to pyarrow (zero-copy), then to pandas for easy export to a
    # Markdown table
    df = pa.record_batch(record_batch).to_pandas()
    print(df.iloc[:5].to_markdown(index=False))
    break
```

The Arrow record batch looks like the following:

| path | last_modified | size | e_tag | version |
|:--------------------------------------------------------------------|:--------------------------|---------:|:-------------------------------------|:----------|
| sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/AOT.tif | 2020-09-30 20:25:56+00:00 | 50510 | "2e24c2ee324ea478f2f272dbd3f5ce69" | |
| sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/B01.tif | 2020-09-30 20:22:48+00:00 | 1455332 | "a31b78e96748ccc2b21b827bef9850c1" | |
| sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/B02.tif | 2020-09-30 20:23:19+00:00 | 38149405 | "d7a92f88ad19761081323165649ce799-5" | |
| sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/B03.tif | 2020-09-30 20:23:52+00:00 | 38123224 | "4b938b6969f1c16e5dd685e6599f115f-5" | |
| sentinel-s2-l2a-cogs/1/C/CV/2018/10/S2B_1CCV_20181004_0_L2A/B04.tif | 2020-09-30 20:24:21+00:00 | 39033591 | "4781b581cd32b2169d0b3d22bf40a8ef-5" | |
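
As noted above, the same record batches convert to [`polars`](https://pola.rs/) zero-copy as well (a sketch assuming `polars` is installed, continuing the example above):

```py
import polars as pl

for record_batch in obs.list(store, chunk_size=20, return_arrow=True):
    df = pl.DataFrame(record_batch)
    print(df)
    break
```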

## Fetch objects

Use the [`obstore.get`][] function to fetch data bytes from remote storage or files in the local filesystem.

```py
import obstore as obs

# Create a Store
store = get_object_store()

# Retrieve a specific file
path = "data/file01.parquet"

# Fetch just the file metadata
meta = obs.head(store, path)
print(meta)

# Fetch the object including metadata
result = obs.get(store, path)
assert result.meta == meta

# Buffer the entire object in memory
buffer = result.bytes()
assert len(buffer) == meta.size

# Alternatively stream the bytes from object storage
stream = obs.get(store, path).stream()

# We can now iterate over the stream
total_buffer_len = 0
for chunk in stream:
    total_buffer_len += len(chunk)

assert total_buffer_len == meta.size
```

## Put object

Use the [`obstore.put`][] function to atomically write data. `obstore.put` will automatically use [multipart uploads](https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html) for large input data.

```py
import obstore as obs

store = get_object_store()
path = "data/file1"
content = b"hello"
obs.put(store, path, content)
```
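
`obstore.put` also exposes knobs to tune the multipart behavior. A sketch, assuming `chunk_size` and `max_concurrency` parameters (check the [`obstore.put`][] reference for the exact names and defaults):

```py
import obstore as obs

store = get_object_store()
path = "data/large-file"

# Upload in 10 MiB parts, with up to 8 parts in flight at once
content = b"\x00" * (100 * 1024 * 1024)
obs.put(store, path, content, chunk_size=10 * 1024 * 1024, max_concurrency=8)
```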

You can also upload local files:

```py
from pathlib import Path
import obstore as obs

store = get_object_store()
path = "data/file1"
content = Path("path/to/local/file")
obs.put(store, path, content)
```

Or file-like objects:

```py
import obstore as obs

store = get_object_store()
path = "data/file1"
with open("path/to/local/file", "rb") as content:
obs.put(store, path, content)
```

Or iterables:

```py
import obstore as obs

def bytes_iter():
    for i in range(5):
        yield b"foo"

store = get_object_store()
path = "data/file1"
content = bytes_iter()
obs.put(store, path, content)
```


Or async iterables:

```py
import obstore as obs

async def bytes_stream():
    for i in range(5):
        yield b"foo"

store = get_object_store()
path = "data/file1"
content = bytes_stream()
await obs.put_async(store, path, content)
```

## Copy objects from one store to another

Perhaps you have data in AWS S3 that you need to copy to Google Cloud Storage. It's easy to **stream** a `get` from one store directly to the `put` of another.

!!! note

    Using the async API is required for this.

```py
import obstore as obs

store1 = get_object_store()
store2 = get_object_store()

path1 = "data/file1"
path2 = "data/file1"

# This only constructs the stream; it doesn't materialize the data in memory
resp = await obs.get_async(store1, path1)

# A streaming upload is created to copy the file to path2
await obs.put_async(store2, path2, resp)
```

!!! note

    You may need to increase the download timeout for large source files. The timeout defaults to 30 seconds, which may not be long enough to stream the file through to the destination.

    You may set the [`timeout` parameter][obstore.store.ClientConfig] in the `client_options` passed when constructing the source store.
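
For example, constructing the source store with a longer timeout might look like this sketch (the store class and bucket are placeholders):

```py
from obstore.store import S3Store

store1 = S3Store("source-bucket", client_options={"timeout": "2min"})
```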
