From 89417ed9cca5b7e1c2de3bb56b20aa59179abe93 Mon Sep 17 00:00:00 2001
From: Andreas Motl
Date: Tue, 10 Oct 2023 23:50:34 +0200
Subject: [PATCH] Update and improve documentation

---
 CHANGES.md            |  6 +++++-
 README.md             | 19 +++++++++++--------
 docs/backlog.md       | 33 ++++++++++++++++++++++++++-------
 docs/container.md     |  5 +++++
 docs/handbook.md      | 11 ++++++++++-
 tests/test_adapter.py |  1 +
 6 files changed, 58 insertions(+), 17 deletions(-)

diff --git a/CHANGES.md b/CHANGES.md
index cde9016..6217570 100644
--- a/CHANGES.md
+++ b/CHANGES.md
@@ -2,10 +2,11 @@
 
 ## in progress
 
-- Update to MLflow 2.7.1
+- Update to [MLflow 2.7](https://github.com/mlflow/mlflow/releases/tag/v2.7.0)
 - Improve `table_exists()` in `example_merlion.py`
 - SQLAlchemy: Use server-side `now()` function for "autoincrement" columns
 - Use SQLAlchemy patches and polyfills from `cratedb-toolkit`
+- Update and improve documentation
 
 ## 2023-09-12 0.1.1
 - Documentation: Improve "Container Usage" page
@@ -22,3 +23,6 @@
 - Add example program `tracking_dummy.py`, and improve test infrastructure
 - Documentation: Add information about how to connect to CrateDB Cloud
 - CI: Add GHA workflows to build and publish OCI container images to GHCR
+- Tests: Enable code coverage tracking
+- Fix SQL DDL files, and add missing columns to make the Models tab load in the UI,
+  see GH-17. Thanks, @hammerhead.
diff --git a/README.md b/README.md
index d91609c..0f24cc6 100644
--- a/README.md
+++ b/README.md
@@ -12,11 +12,11 @@
 
 ## About
 
-An adapter wrapper for [MLflow] to use [CrateDB] as a storage database
-for [MLflow Tracking].
+An adapter for [MLflow] to use [CrateDB] as a storage database for [MLflow
+Tracking]. MLflow is an open source platform to manage the whole ML lifecycle,
+including experimentation, reproducibility, deployment, and a central model
+registry.
 
-[MLflow] is an open source platform to manage the ML lifecycle, including
-experimentation, reproducibility, deployment, and a central model registry.
 
 ## Setup
 
@@ -33,11 +33,14 @@ mlflow-cratedb cratedb --version
 ```
 
 
-## Usage
+## Documentation
+
+The [MLflow Tracking] subsystem is about recording and querying experiments, across
+code, data, config, and results.
 
 The MLflow adapter for CrateDB can be used in different ways. Please refer
-to the [handbook] and the documentation about
-[container usage].
+to the [handbook], the documentation about [container usage], and the
+[hands-on guidelines].
 
 
 ## Development
@@ -54,7 +57,6 @@ how to [install a development sandbox].
 - [Python Package Index (PyPI)](https://pypi.org/project/mlflow-cratedb/)
 
 ### Contributions
-
 This library is an open source project, and is [managed on GitHub].
 Every kind of contribution, feedback, or patch, is much welcome. [Create an
 issue] or submit a patch if you think we should include a new feature, or to
@@ -87,6 +89,7 @@ which is using [Merlion].
 [Create an issue]: https://github.com/crate-workbench/mlflow-cratedb/issues
 [development sandbox]: https://github.com/crate-workbench/mlflow-cratedb/blob/main/docs/development.md
 [handbook]: https://github.com/crate-workbench/mlflow-cratedb/blob/main/docs/handbook.md
+[hands-on guidelines]: https://github.com/crate/cratedb-examples/blob/main/framework/mlflow/readme.md
 [Harutaka Kawamura]: https://github.com/harupy
 [install a development sandbox]: https://github.com/crate-workbench/mlflow-cratedb/blob/main/docs/development.md
 [LICENSE]: https://github.com/crate-workbench/mlflow-cratedb/blob/main/LICENSE
diff --git a/docs/backlog.md b/docs/backlog.md
index 3d716db..b10cee1 100644
--- a/docs/backlog.md
+++ b/docs/backlog.md
@@ -1,25 +1,38 @@
 # Backlog
 
 ## Iteration +2
-- Bug: Fix SQL DDL files, see https://github.com/crate-workbench/mlflow-cratedb/issues/17
+
+### General
+- FIXME: `testdrive` is hardcoded here
+- Apply database schema at connection time already, using `set search_path`.
+  In this spirit, tables do not need to be addressed everywhere in fully qualified notation.
 - Other than the "MLflow Tracking" subsystem, is it sensible to unlock the "MLflow Model
-  Registry" subsystem as well, when possible at all?
-- https://mlflow.org/docs/latest/model-registry.html
-- Code: Refactor / break out the generic SQLALchemy polyfill patches into `crate-python` elegantly
+  Registry" subsystem as well, when possible at all? See GH-33.
+  https://mlflow.org/docs/latest/model-registry.html
+- UX: Provide a Docker Compose file, which bundles the whole trinity
+- Project: Move repository to the `crate` organization
+
+### Documentation
+- Docs: Add "About MLflow" section to README
+- Docs: Add "What's inside" section to README
 - Docs: Demonstrate the "Container Use" with CrateDB Cloud instead of Docker container running locally
 - Docs: Demonstrate use of `--backend=databricks`
 - Docs: Run an MLflow project from the given URI, using `mlflow run`
 - Docs: Demonstrate the "automatic logging" feature
   https://mlflow.org/docs/latest/tracking.html#automatic-logging
-- UX: Provide a Docker Compose file, which bundles the whole trinity
-- Project: Move repository to the `crate` organization
-- Project: Release 0.2
 
 ## Iteration +3
 - Examples: Add more examples, eventually using different ML libraries
 - Docs: Explore `mlflow experiments search`
 - UX: Set up Conda feedstock repository, corresponding to the upstream one
   https://github.com/conda-forge/mlflow-feedstock
+- According to @andnig, looking into MLflow, specifically using Ray, makes sense.
+  - https://github.com/ray-project/ray/blob/f3c86d17/doc/source/ray-air/deployment.rst#L12
+  - https://github.com/ray-project/ray/blob/f3c86d17/doc/source/tune/tutorials/tune_get_data_in_and_out.md?plain=1#L243
+  - https://github.com/ray-project/ray/blob/f3c86d17/doc/source/tune/api/logging.rst#L50
+  - https://github.com/ray-project/ray/blob/f3c86d17/doc/source/ray-core/_examples/datasets_train/datasets_train.py#L32
+  - https://github.com/ray-project/ray/blob/f3c86d17/doc/source/train/user-guides/experiment-tracking.rst#L196
+  - https://github.com/ray-project/ray/blob/f3c86d17/python/ray/air/tests/test_integration_mlflow.py#L253
 
 ## Iteration +4
 - UX: CLI shortcut for `ddl/drop.sql`
@@ -34,3 +47,9 @@
 - CI: Do not build OCI images on _each_ PR in the long run, it costs too many resources.
   Instead, document how to run OCI builds on demand on specific branches,
   when it is needed to ship images for testing purposes.
+- Documentation: Write a few words about schema/table structure:
+  Data tables are in `doc`, while MLflow tables are in `mlflow`.
+- Documentation: Write a few words about standalone programs, and the need to import `mlflow_cratedb` there,
+  see, for example, tracking_dummy.py#L53-L55.
+  Alternatively, find an "autoload" solution for Python?
+- Code: Refactor / break out the generic SQLAlchemy polyfill patches into `cratedb-toolkit`
diff --git a/docs/container.md b/docs/container.md
index 0cddaf4..06db5e3 100644
--- a/docs/container.md
+++ b/docs/container.md
@@ -18,6 +18,11 @@ For building your own images, see the [development documentation](./development.
 On GHCR, other than `latest` images for releases, there are also images for each PR,
 as well as `nightly` ones.
 
+On GHCR, you will find the following image tags:
+- `latest`: Points to the most recent release.
+- `main`: Builds of `main`, when pushing to this branch.
+- `nightly`: Builds of `main`, each morning at 4 o'clock CEST.
+
 
 ## Docker and Podman
 
diff --git a/docs/handbook.md b/docs/handbook.md
index 4e90f4e..9919490 100644
--- a/docs/handbook.md
+++ b/docs/handbook.md
@@ -105,9 +105,18 @@ crash --hosts="${CRATEDB_HTTP_URL}" --schema=mlflow < mlflow_cratedb/adapter/ddl
 
 ## Remarks
 
-Please note that you need to invoke the `mlflow-cratedb` command, which
+For running the MLflow server, you need to invoke the `mlflow-cratedb` command, which
 runs MLflow amalgamated with the necessary changes to support CrateDB.
 
+In the same spirit, when running standalone programs, make sure to import the `mlflow_cratedb`
+module, in order to bring in the needed amalgamations to make MLflow work with CrateDB.
+It can be inspected within the `examples/tracking_dummy.py` program.
+```python
+def start_adapter():
+    logger.info("Initializing CrateDB adapter")
+    import mlflow_cratedb  # noqa: F401
+```
+
 Also note that we recommend to use a dedicated schema for storing MLflow's
 tables, for example `"mlflow"`. In that spirit, CrateDB's default schema `"doc"`
 is not populated by any tables of 3rd-party systems.
diff --git a/tests/test_adapter.py b/tests/test_adapter.py
index a5875b6..6db8ede 100644
--- a/tests/test_adapter.py
+++ b/tests/test_adapter.py
@@ -27,6 +27,7 @@ def store_empty(store):
         session.query(SqlExperiment).delete()
         for mapper in Base.registry.mappers:
             session.query(mapper.class_).delete()
+            # FIXME: `testdrive` is hardcoded here.
             sql = f"REFRESH TABLE testdrive.{mapper.class_.__tablename__};"
             session.execute(sa.text(sql))
     yield store