Merge pull request #1 from NERC-CEH/data
Data
Showing 15 changed files with 662 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
/config.local | ||
/tmp | ||
/cache |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
[core] | ||
remote = jasmin | ||
['remote "jasmin"'] | ||
url = s3://dvc-test | ||
endpointurl = https://llm-eval-o.s3-ext.jc.rl.ac.uk |
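As a quick sanity check on the remote configuration above, the file can be read back with Python's standard `configparser`. This is only a sketch: DVC uses its own config loader, the config text is pasted inline rather than read from `.dvc/config`, and `read_remotes` is a hypothetical helper name.

```python
import configparser

# Config text copied from the diff above (inline here so the sketch is self-contained).
CONFIG_TEXT = """\
[core]
remote = jasmin
['remote "jasmin"']
url = s3://dvc-test
endpointurl = https://llm-eval-o.s3-ext.jc.rl.ac.uk
"""


def read_remotes(text):
    """Return (default remote name, {remote name: settings}) from DVC-style config text."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    remotes = {}
    for section in parser.sections():
        # Remote sections are written as 'remote "<name>"', including the single quotes.
        name = section.strip("'")
        if name.startswith('remote "') and name.endswith('"'):
            remotes[name[len('remote "'):-1]] = dict(parser[section])
    return parser.get("core", "remote", fallback=None), remotes


default_remote, remotes = read_remotes(CONFIG_TEXT)
print(default_remote, remotes[default_remote]["url"])
```

The default remote named in `[core]` has to match one of the remote sections; a mismatch is an easy mistake to make when editing the file by hand.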
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
# Add patterns of files dvc should ignore, which could improve | ||
# the performance. Learn more at | ||
# https://dvc.org/doc/user-guide/dvcignore |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,20 @@ | ||
name: CML | ||
on: [push] | ||
jobs: | ||
  train-and-report: | ||
    runs-on: ubuntu-latest | ||
    container: docker://ghcr.io/iterative/cml:0-dvc2-base1 | ||
    steps: | ||
      - uses: actions/checkout@v3 | ||
      - name: Train model | ||
        env: | ||
          REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }} | ||
        run: | | ||
          pip install -r requirements.txt | ||
          python dummy-evaluation.py | ||
          # Create CML report | ||
          cat metrics.txt >> report.md | ||
          echo '![](./metrics.png "Violin Plot of Metrics")' >> report.md | ||
          cml comment create report.md | ||
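The report-building step in the workflow above appends the metrics text and an image link to `report.md`. A minimal Python sketch of the same assembly follows, in case the step is ever moved out of the shell; `build_report` is a hypothetical helper, and the demo uses a made-up metric value in a temporary directory.

```python
import tempfile
from pathlib import Path


def build_report(metrics_path, image_path, title, report_path):
    """Append the metrics text and a Markdown image link to the report file."""
    metrics_text = Path(metrics_path).read_text(encoding="utf-8")
    with open(report_path, "a", encoding="utf-8") as report:
        report.write(metrics_text)  # same effect as: cat metrics.txt >> report.md
        report.write(f'![]({image_path} "{title}")\n')  # same as the echo line
    return Path(report_path)


# Demo in a throwaway directory with a made-up metric value.
with tempfile.TemporaryDirectory() as tmp:
    metrics_file = Path(tmp) / "metrics.txt"
    metrics_file.write_text("answer_relevancy: 0.5\n", encoding="utf-8")
    report_file = build_report(
        metrics_file, "./metrics.png", "Violin Plot of Metrics", Path(tmp) / "report.md"
    )
    report_text = report_file.read_text(encoding="utf-8")
    print(report_text)
```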
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,2 +1,9 @@ | ||
# llm-eval | ||
Scripts and data for LLM evaluation. | ||
|
||
This repository is set up to work with [DVC](https://dvc.org/) backed by a [JASMIN object store](https://help.jasmin.ac.uk/docs/short-term-project-storage/using-the-jasmin-object-store/). Please follow the instructions in [`dvc.md`](dvc.md) to get up and running. | ||
|
||
## DVC and CML | ||
Notes on the use of Data Version Control and Continuous Machine Learning: | ||
- [DVC](dvc.md) | ||
- [CML](cml.md) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,38 @@ | ||
# CML | ||
|
||
## Self-hosted runner | ||
To set up a self-hosted runner that runs GitHub Actions with CML on a local machine with GPU access, make sure that `docker` and `node` are installed, then run: | ||
```shell | ||
$ sudo npm install --location=global @dvcorg/cml | ||
``` | ||
### Starting the runner | ||
To start the runner you will need to create a GitHub access token with `repo` and `workflow` permissions. Then run: | ||
```shell | ||
$ cml runner launch \ | ||
--repo=$REPO_URL \ | ||
--token=$ACCESS_TOKEN \ | ||
--labels="cml,gpu" \ | ||
--idle-timeout=3000 | ||
``` | ||
replacing `REPO_URL` with your GitHub repository URL and `ACCESS_TOKEN` with your GitHub access token. | ||
|
||
The runner should print a confirmation message when it starts. You can also check that it is available to your repository in GitHub under `<REPOSITORY> > Settings > Actions > Runners`, where the runner should be listed. | ||
|
||
### GH Action | ||
With the runner available you should now be able to create a workflow to utilise it. Below is an example of a basic action to use the local runner and print the available GPU spec and details: | ||
```yaml | ||
name: test_gpu | ||
on: [push] | ||
jobs: | ||
  run: | ||
    runs-on: [self-hosted,cml,gpu] | ||
    steps: | ||
      - uses: actions/checkout@v3 | ||
      - uses: iterative/setup-cml@v1 | ||
      - name: Check GPU spec | ||
        env: | ||
          REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }} | ||
        run: | | ||
          nvidia-smi | ||
``` | ||
Save this into `.github/workflows/test_gpu.yaml` and push it to the repository. The action should execute, and its output should list details of the available GPU. |
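The GPU check in the workflow above can also be run from Python, for example at the start of an evaluation script on the runner. The sketch below assumes `nvidia-smi` is on the `PATH` when a GPU is present; `gpu_summary` is a hypothetical helper that returns `None` rather than failing when the tool is missing.

```python
import shutil
import subprocess


def gpu_summary():
    """Return one 'name, total memory' line per GPU, or None if nvidia-smi is unavailable."""
    executable = shutil.which("nvidia-smi")
    if executable is None:
        return None
    result = subprocess.run(
        [executable, "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return None
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


summary = gpu_summary()
print(summary if summary is not None else "No GPU details available")
```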
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
outs: | ||
- md5: 9f50d9dbc781216d5aac93d599e190d7.dir | ||
  size: 376640 | ||
  nfiles: 3 | ||
  hash: md5 | ||
  path: data |
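The `md5` field above is a content hash that DVC uses to identify the tracked data. For intuition only, the sketch below computes a plain MD5 of a single file with `hashlib`. Whether this matches DVC's value for a given file depends on the DVC version, and the `.dir` suffix marks a directory entry whose hash is assumed to be derived from a listing of the directory's files, which is not reproduced here. `file_md5` is a hypothetical helper name.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a throwaway file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_bytes(b"hello")
    sample_md5 = file_md5(sample)
    print(sample_md5)
```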
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,20 @@ | ||
import numpy as np | ||
import pandas as pd | ||
import plotly.graph_objects as go | ||
import plotly.io as pio | ||
|
||
metrics = ["answer_relevancy", "answer_correctness", "context_precision", "context_recall"]  # list keeps column order stable | ||
dummy_data = {metric: np.random.rand(100) for metric in metrics} | ||
df = pd.DataFrame(dummy_data) | ||
|
||
with open("metrics.txt", "w") as f: | ||
    for col in df: | ||
        f.write(f"{col}: {df[col].mean()}\n") | ||
|
||
pio.templates.default = "gridon" | ||
fig = go.Figure() | ||
metrics = [metric for metric in df.columns.to_list() if metric not in ["question", "ground_truth", "answer", "contexts"]] | ||
for metric in metrics: | ||
    fig.add_trace(go.Violin(y=df[metric], name=metric, points="all", box_visible=True, meanline_visible=True)) | ||
fig.update_yaxes(range=[-0.02,1.02]) | ||
fig.write_image("metrics.png") |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,108 @@ | ||
# DVC Notes | ||
Notes on how to accomplish various operations using [DVC](https://dvc.org/). Most of these are just distilled notes from the [DVC documentation](https://dvc.org/doc). | ||
|
||
This document guides you through working with the `llm-eval` repo and with DVC backed by a [JASMIN object store](https://help.jasmin.ac.uk/docs/short-term-project-storage/using-the-jasmin-object-store/). The instructions can then be applied to any other repository that you want to set up to work with DVC. | ||
|
||
# Working with this repository | ||
## Setup | ||
### Clone this repo | ||
```shell | ||
$ git clone git@github.com:NERC-CEH/llm-eval.git | ||
$ cd llm-eval | ||
``` | ||
|
||
### Installing DVC | ||
DVC can be installed using `pip`, which provides the basic CLI needed to run DVC commands. The recommended way when working with this repository is to create a new Python virtual environment and then install the appropriate DVC packages via the `requirements.txt` file: | ||
```shell | ||
$ python -m venv .venv | ||
$ source .venv/bin/activate | ||
$ pip install -r requirements.txt | ||
``` | ||
If you are working on a different repository, the packages can be installed separately: | ||
```shell | ||
$ pip install dvc | ||
$ pip install "dvc[s3]" | ||
``` | ||
> DVC remotes backed by various other technologies (besides s3) can be used. See the [DVC documentation](https://dvc.org/doc/user-guide/data-management/remote-storage#supported-storage-types) for details. | ||
### Connecting to JASMIN object storage | ||
#### Request access to JASMIN object store | ||
This repository has a corresponding object store on JASMIN `llm-eval-o`. To work with the data in this repository managed by DVC you must request access to the object store from the object store manager [Matt Coole](mailto:[email protected]). | ||
|
||
Once you have been granted `USER` access, log in through the [JASMIN Object Store Portal](https://s3-portal.jasmin.ac.uk/object-store/) and create an access key for the `llm-eval-o` object store. Instructions for creating keys can be found in the [JASMIN Documentation](https://help.jasmin.ac.uk/docs/short-term-project-storage/using-the-jasmin-object-store/#creating-an-access-key-and-secret). | ||
> Make sure you store your secret somewhere safe as you will not be able to view it again after the initial creation of your key. | ||
#### Configure Credentials | ||
Once you have access to the object store and have created a key, you will need to set up your credentials: | ||
```shell | ||
$ dvc remote modify --local jasmin access_key_id <ACCESS_KEY_ID> | ||
$ dvc remote modify --local jasmin secret_access_key <KEY_SECRET> | ||
``` | ||
> Note: the configuration for DVC is tracked in `.dvc/config`, but your credentials should be stored in a separate file (`.dvc/config.local`), which should not be tracked by version control to avoid leaking secrets. Make sure to use `--local` when configuring credentials. | ||
## Pulling data | ||
Assuming that configuration and credentials have been set up correctly you should now be able to pull the data that is tracked by DVC from the JASMIN object store. This is done using the `dvc pull` command. | ||
```shell | ||
$ dvc pull | ||
``` | ||
You should now be able to see the `data` folder and contents: | ||
``` | ||
data | ||
├── evaluation-sets | ||
│   ├── eidc-eval.csv | ||
│   └── eidc-eval-sample.csv | ||
└── synthetic-datasets | ||
    └── eidc_rag_test_set.csv | ||
``` | ||
|
||
## Making changes | ||
To make changes to your data use `dvc add` on the local file and then use `dvc push` to push to the remote store. It is then important to commit the `.dvc` files to git as well e.g. | ||
```shell | ||
$ dvc add my-data-file.csv | ||
$ dvc push | ||
$ git add my-data-file.csv.dvc .gitignore | ||
$ git commit -m 'Updated data file' | ||
``` | ||
`my-data-file.csv.dvc` is a placeholder that DVC creates to record the files/folders being tracked. This placeholder is tracked by git, and the actual data is tracked by DVC. | ||
|
||
DVC should also automatically add the file/directory to `.gitignore` so it won't end up being accidentally tracked in git as well. | ||
|
||
### Moving data from git to DVC | ||
Any files or folders that you add to DVC must not be tracked by git. To switch from tracking a file with git to DVC, first untrack it with git: | ||
```shell | ||
$ git rm --cached data-file-in-git.csv | ||
``` | ||
then follow the steps [above](#making-changes) to add the file(s) to be tracked by DVC. | ||
|
||
> Note: Whilst `dvc` commands seem to somewhat mirror `git` commands, there doesn't seem to be quite the same concept of a staging area. I would suggest that `dvc add` is more like an amalgamation (in DVC) of `git add` + `git commit`. | ||
## Checking out versions | ||
To switch between versions of your data tracked by DVC, use `git checkout` as you normally would to check out a particular version of the code, then run `dvc checkout` to check out the corresponding version of the data, e.g. | ||
```shell | ||
$ git checkout c474fcc | ||
$ dvc checkout | ||
``` | ||
|
||
# Setting up a new repository | ||
Up to this point it was assumed that you were working with a repository already set up with DVC. To set up DVC on your own git repository there are just a few initial steps: | ||
## Initialise | ||
To set up your own git repository to track any data files using DVC use `dvc init` in the repository's directory. | ||
```shell | ||
$ dvc init | ||
``` | ||
You will see a `.dvc/` directory and a `.dvcignore` file, which you should add to your version-controlled files. | ||
|
||
## Add remote | ||
Now add a bucket to be used as a remote (make sure you create the bucket in your object store first): | ||
```shell | ||
$ dvc remote add jasmin s3://test-dvc | ||
``` | ||
This will initially set up the remote to use the `test-dvc` bucket (check the config in `.dvc/config`). | ||
|
||
Next, add the endpoint URL. If you are using a JASMIN object store, this should look something like: | ||
```shell | ||
$ dvc remote modify jasmin endpointurl https://my-test-store-o.s3-ext.jc.rl.ac.uk | ||
``` | ||
where your object store is called `my-test-store-o`. | ||
|
||
## Configure Credentials | ||
Finally, you can configure your credentials as described [above](#connecting-to-jasmin-object-storage). |
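The remote setup steps in `dvc.md` can be collected into a small helper that builds the two `dvc remote` commands with a consistent remote name. This is a sketch under assumptions: the endpoint pattern `https://<store>.s3-ext.jc.rl.ac.uk` is inferred from the two URLs shown in this commit and should be checked against the JASMIN documentation, and `remote_setup_commands` is a hypothetical name. The commands are only printed, not executed.

```python
def remote_setup_commands(remote_name, bucket, store_name):
    """Build the `dvc remote add` and `dvc remote modify` commands as argument lists."""
    if not remote_name or not bucket or not store_name:
        raise ValueError("remote_name, bucket and store_name must all be non-empty")
    # Assumed endpoint pattern, inferred from the URLs in this commit.
    endpoint = f"https://{store_name}.s3-ext.jc.rl.ac.uk"
    return [
        ["dvc", "remote", "add", remote_name, f"s3://{bucket}"],
        ["dvc", "remote", "modify", remote_name, "endpointurl", endpoint],
    ]


commands = remote_setup_commands("jasmin", "test-dvc", "my-test-store-o")
for command in commands:
    print(" ".join(command))
```

Each argument list could be passed to `subprocess.run` once the bucket exists. Credentials are deliberately left out so that they are only ever set with `--local`.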