Update and improve documentation #37

Merged
merged 1 commit into from
Oct 11, 2023
6 changes: 5 additions & 1 deletion CHANGES.md
@@ -2,10 +2,11 @@


## in progress
- Update to MLflow 2.7.1
- Update to [MLflow 2.7](https://github.com/mlflow/mlflow/releases/tag/v2.7.0)
- Improve `table_exists()` in `example_merlion.py`
- SQLAlchemy: Use server-side `now()` function for "autoincrement" columns
- Use SQLAlchemy patches and polyfills from `cratedb-toolkit`
- Update and improve documentation

## 2023-09-12 0.1.1
- Documentation: Improve "Container Usage" page
@@ -22,3 +23,6 @@
- Add example program `tracking_dummy.py`, and improve test infrastructure
- Documentation: Add information about how to connect to CrateDB Cloud
- CI: Add GHA workflows to build and publish OCI container images to GHCR
- Tests: Enable code coverage tracking
- Fix SQL DDL files, and add missing columns to make the Models tab load in the UI,
see GH-17. Thanks, @hammerhead.
19 changes: 11 additions & 8 deletions README.md
@@ -12,11 +12,11 @@

## About

An adapter wrapper for [MLflow] to use [CrateDB] as a storage database
for [MLflow Tracking].
An adapter for [MLflow] to use [CrateDB] as a storage database for [MLflow
Tracking]. MLflow is an open source platform to manage the whole ML lifecycle,
including experimentation, reproducibility, deployment, and a central model
registry.

[MLflow] is an open source platform to manage the ML lifecycle, including
experimentation, reproducibility, deployment, and a central model registry.

## Setup

@@ -33,11 +33,14 @@ mlflow-cratedb cratedb --version
```


## Usage
## Documentation

The [MLflow Tracking] subsystem is about recording and querying experiments, across
code, data, config, and results.

The MLflow adapter for CrateDB can be used in different ways. Please refer
to the [handbook] and the documentation about
[container usage].
to the [handbook], the documentation about [container usage], and the
[hands-on guidelines].


## Development
@@ -54,7 +57,6 @@ how to [install a development sandbox].
- [Python Package Index (PyPI)](https://pypi.org/project/mlflow-cratedb/)

### Contributions

This library is an open source project, and is [managed on GitHub].
Every kind of contribution, feedback, or patch is very welcome. [Create an
issue] or submit a patch if you think we should include a new feature, or to
@@ -87,6 +89,7 @@ which is using [Merlion].
[Create an issue]: https://github.com/crate-workbench/mlflow-cratedb/issues
[development sandbox]: https://github.com/crate-workbench/mlflow-cratedb/blob/main/docs/development.md
[handbook]: https://github.com/crate-workbench/mlflow-cratedb/blob/main/docs/handbook.md
[hands-on guidelines]: https://github.com/crate/cratedb-examples/blob/main/framework/mlflow/readme.md
[Harutaka Kawamura]: https://github.com/harupy
[install a development sandbox]: https://github.com/crate-workbench/mlflow-cratedb/blob/main/docs/development.md
[LICENSE]: https://github.com/crate-workbench/mlflow-cratedb/blob/main/LICENSE
33 changes: 26 additions & 7 deletions docs/backlog.md
@@ -1,25 +1,38 @@
# Backlog

## Iteration +2
- Bug: Fix SQL DDL files, see https://github.com/crate-workbench/mlflow-cratedb/issues/17

### General
- FIXME: `testdrive` is hardcoded here
- Apply database schema at connection time already, using `set search_path`.
That way, tables do not need to be addressed in fully qualified notation everywhere.
- Other than the "MLflow Tracking" subsystem, is it sensible to unlock the "MLflow Model
Registry" subsystem as well, when possible at all?
- https://mlflow.org/docs/latest/model-registry.html
- Code: Refactor / break out the generic SQLAlchemy polyfill patches into `crate-python` elegantly
Registry" subsystem as well, when possible at all? See GH-33.
https://mlflow.org/docs/latest/model-registry.html
- UX: Provide a Docker Compose file, which bundles the whole trinity
- Project: Move repository to the `crate` organization

### Documentation
- Docs: Add "About MLflow" section to README
- Docs: Add "What's inside" section to README
- Docs: Demonstrate the "Container Use" with CrateDB Cloud instead of Docker container running locally
- Docs: Demonstrate use of `--backend=databricks`
- Docs: Run an MLflow project from the given URI, using `mlflow run`
- Docs: Demonstrate the "automatic logging" feature
https://mlflow.org/docs/latest/tracking.html#automatic-logging
- UX: Provide a Docker Compose file, which bundles the whole trinity
- Project: Move repository to the `crate` organization
- Project: Release 0.2

## Iteration +3
- Examples: Add more examples, eventually using different ML libraries
- Docs: Explore `mlflow experiments search`
- UX: Set up Conda feedstock repository, corresponding to the upstream one
https://github.com/conda-forge/mlflow-feedstock
- According to @andnig, looking into MLflow, specifically using Ray, makes sense.
- https://github.com/ray-project/ray/blob/f3c86d17/doc/source/ray-air/deployment.rst#L12
- https://github.com/ray-project/ray/blob/f3c86d17/doc/source/tune/tutorials/tune_get_data_in_and_out.md?plain=1#L243
- https://github.com/ray-project/ray/blob/f3c86d17/doc/source/tune/api/logging.rst#L50
- https://github.com/ray-project/ray/blob/f3c86d17/doc/source/ray-core/_examples/datasets_train/datasets_train.py#L32
- https://github.com/ray-project/ray/blob/f3c86d17/doc/source/train/user-guides/experiment-tracking.rst#L196
- https://github.com/ray-project/ray/blob/f3c86d17/python/ray/air/tests/test_integration_mlflow.py#L253

## Iteration +4
- UX: CLI shortcut for `ddl/drop.sql`
@@ -34,3 +47,9 @@
- CI: Do not build OCI images on _each_ PR in the long run, it costs too many
resources. Instead, document how to run OCI builds on demand on specific
branches, when it is needed to ship images for testing purposes.
- Documentation: Write a few words about schema/table structure:
Data tables are in `doc`, while MLflow tables are in `mlflow`.
- Documentation: Write a few words about standalone programs, and the need to import `mlflow_cratedb` there,
see, for example, tracking_dummy.py#L53-L55.
Alternatively, find an "autoload" solution for Python?
- Code: Refactor / break out the generic SQLAlchemy polyfill patches into `cratedb-toolkit`
5 changes: 5 additions & 0 deletions docs/container.md
@@ -18,6 +18,11 @@ For building your own images, see the [development documentation](./development.
On GHCR, other than `latest` images for releases, there are also images for
each PR, as well as `nightly` ones.

On GHCR, you will find the following image tags:
- `latest`: Points to the most recent release.
- `main`: Builds of `main`, when pushing to this branch.
- `nightly`: Builds of `main`, each morning at 4 o'clock CEST.


## Docker and Podman

11 changes: 10 additions & 1 deletion docs/handbook.md
@@ -105,9 +105,18 @@ crash --hosts="${CRATEDB_HTTP_URL}" --schema=mlflow < mlflow_cratedb/adapter/ddl

## Remarks

Please note that you need to invoke the `mlflow-cratedb` command, which
For running the MLflow server, you need to invoke the `mlflow-cratedb` command, which
runs MLflow amalgamated with the necessary changes to support CrateDB.

In the same spirit, when running standalone programs, make sure to import the `mlflow_cratedb`
module, in order to bring in the needed amalgamations to make MLflow work with CrateDB.
See the `examples/tracking_dummy.py` program for an example.
```python
def start_adapter():
    logger.info("Initializing CrateDB adapter")
    import mlflow_cratedb  # noqa: F401
```
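
The same idea can be sketched for a standalone program that should not crash when the adapter is missing. This is only an illustration: the helper name `load_cratedb_adapter` and the guard around the import are assumptions, and are not taken from `tracking_dummy.py`.

```python
import importlib
import importlib.util
import logging

logger = logging.getLogger(__name__)


def load_cratedb_adapter(module_name: str = "mlflow_cratedb") -> bool:
    """
    Import the adapter module for its side effects, before using MLflow.

    Hypothetical helper: returns True when the module was imported, and
    False when it is not installed in the current environment.
    """
    if importlib.util.find_spec(module_name) is None:
        logger.warning("Module %s is not installed", module_name)
        return False
    logger.info("Initializing CrateDB adapter")
    # The import itself is what matters, the module object is not used.
    importlib.import_module(module_name)
    return True
```

Call `load_cratedb_adapter()` once at program start, before any MLflow API is used, so that the patches are in place when the tracking store is created.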

Also note that we recommend using a dedicated schema for storing MLflow's
tables, for example `"mlflow"`. That way, CrateDB's default schema
`"doc"` is not populated by any tables of 3rd-party systems.
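
As a rough sketch of how a dedicated schema could be selected from Python, the helper below composes a tracking URI. The URI layout `crate://<user>@<host>/?schema=<name>`, the default user `crate`, and the function name `cratedb_tracking_uri` are assumptions made for illustration. Please verify them against the handbook before relying on them.

```python
from urllib.parse import urlencode


def cratedb_tracking_uri(
    host: str = "localhost", user: str = "crate", schema: str = "mlflow"
) -> str:
    """Compose a tracking URI that selects a dedicated schema (assumed layout)."""
    query = urlencode({"schema": schema})
    return f"crate://{user}@{host}/?{query}"


print(cratedb_tracking_uri())
```

The resulting value could then be handed to MLflow, for example through the `MLFLOW_TRACKING_URI` environment variable.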
1 change: 1 addition & 0 deletions tests/test_adapter.py
@@ -27,6 +27,7 @@ def store_empty(store):
        session.query(SqlExperiment).delete()
        for mapper in Base.registry.mappers:
            session.query(mapper.class_).delete()
            # FIXME: `testdrive` is hardcoded here.
            sql = f"REFRESH TABLE testdrive.{mapper.class_.__tablename__};"
            session.execute(sa.text(sql))
    yield store
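
One possible way to address the `FIXME` is to build the statement in a small helper that takes the schema as a parameter. The following is a sketch only: the function name `refresh_table_sql` and the identifier check are hypothetical and not part of this change.

```python
import re

# Plain identifiers only, because the names are interpolated into SQL text.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def refresh_table_sql(table: str, schema: str = "testdrive") -> str:
    """Build a `REFRESH TABLE` statement for the given schema and table."""
    for name in (schema, table):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"Invalid SQL identifier: {name!r}")
    return f'REFRESH TABLE "{schema}"."{table}";'
```

In the fixture, the schema could then come from the test configuration instead of being spelled out in the f-string.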