Merge pull request #1 from NERC-CEH/data
Data
matthewcoole authored Sep 11, 2024
2 parents bd4f992 + ae7b049 commit 9c75cb2
Showing 15 changed files with 662 additions and 0 deletions.
3 changes: 3 additions & 0 deletions .dvc/.gitignore
@@ -0,0 +1,3 @@
/config.local
/tmp
/cache
5 changes: 5 additions & 0 deletions .dvc/config
@@ -0,0 +1,5 @@
[core]
    remote = jasmin
['remote "jasmin"']
    url = s3://dvc-test
    endpointurl = https://llm-eval-o.s3-ext.jc.rl.ac.uk
3 changes: 3 additions & 0 deletions .dvcignore
@@ -0,0 +1,3 @@
# Add patterns of files dvc should ignore, which could improve
# the performance. Learn more at
# https://dvc.org/doc/user-guide/dvcignore
20 changes: 20 additions & 0 deletions .github/workflows/cml.yaml
@@ -0,0 +1,20 @@
name: CML
on: [push]
jobs:
  train-and-report:
    runs-on: ubuntu-latest
    container: docker://ghcr.io/iterative/cml:0-dvc2-base1
    steps:
      - uses: actions/checkout@v3
      - name: Train model
        env:
          REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          pip install -r requirements.txt
          python dummy-evaluation.py
          # Create CML report
          cat metrics.txt >> report.md
          echo '![](./metrics.png "Violin Plot of Metrics")' >> report.md
          cml comment create report.md
5 changes: 5 additions & 0 deletions .gitignore
@@ -160,3 +160,8 @@ cython_debug/
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/

metrics.txt
metrics.png
/data
gdrive-oauth.txt
7 changes: 7 additions & 0 deletions README.md
@@ -1,2 +1,9 @@
# llm-eval
Scripts and data for LLM evaluation.

This repository is set up to work with [DVC](https://dvc.org/) backed by a [JASMIN object store](https://help.jasmin.ac.uk/docs/short-term-project-storage/using-the-jasmin-object-store/). Please follow the instructions in [`dvc.md`](dvc.md) to get up and running.

## DVC and CML
Notes on the use of Data Version Control and Continuous Machine Learning:
- [DVC](dvc.md)
- [CML](cml.md)
38 changes: 38 additions & 0 deletions cml.md
@@ -0,0 +1,38 @@
# CML

## Self-hosted runner
To set up a self-hosted runner that performs GitHub Actions with CML on a local machine with GPU access, make sure that `docker` and `node` are installed, then run:
```shell
$ sudo npm install --location=global @dvcorg/cml
```
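Before installing, it can help to confirm that the prerequisites are actually on the `PATH`. The following is a minimal sketch using only the Python standard library; the helper name `missing_tools` is hypothetical and is not part of CML, and including `npm` in the default list is an assumption based on the install command above.

```python
import shutil


def missing_tools(tools=("docker", "node", "npm")):
    """Return the names in `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing prerequisites:", ", ".join(missing))
    else:
        print("All prerequisites were found on the PATH")
```

An empty list means every tool was found; anything returned needs installing before the runner can be launched.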
### Starting the runner
To start the runner you will need to create a GitHub access token with `repo` and `workflow` permissions. Then run:
```shell
$ cml runner launch \
--repo=$REPO_URL \
--token=$ACCESS_TOKEN \
--labels="cml,gpu" \
--idle-timeout=3000
```
replacing `REPO_URL` with your GitHub repository URL and `ACCESS_TOKEN` with your GitHub access token.

The runner should print a confirmation message when it starts. You can also check that it is available to your repository by going to `<REPOSITORY> > Settings > Actions > Runners` on GitHub, where the runner should be listed.

### GH Action
With the runner available you should now be able to create a workflow to utilise it. Below is an example of a basic action to use the local runner and print the available GPU spec and details:
```yaml
name: test_gpu
on: [push]
jobs:
  run:
    runs-on: [self-hosted,cml,gpu]
    steps:
      - uses: actions/checkout@v3
      - uses: iterative/setup-cml@v1
      - name: Check GPU spec
        env:
          REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          nvidia-smi
```
Save this into `.github/workflows/test_gpu.yaml` and open a pull request. The action should execute and the output should provide you details about the available GPU.
6 changes: 6 additions & 0 deletions data.dvc
@@ -0,0 +1,6 @@
outs:
- md5: 9f50d9dbc781216d5aac93d599e190d7.dir
  size: 376640
  nfiles: 3
  hash: md5
  path: data
Binary file added docs/img/ragas_eval.png
20 changes: 20 additions & 0 deletions dummy-evaluation.py
@@ -0,0 +1,20 @@
import numpy as np
import pandas as pd
import plotly.graph_objects as go
import plotly.io as pio

# A list (rather than a set) keeps the column order the same on every run.
metrics = ["answer_relevancy", "answer_correctness", "context_precision", "context_recall"]
dummy_data = {metric: np.random.rand(100) for metric in metrics}
df = pd.DataFrame(dummy_data)

with open("metrics.txt", "w") as f:
    for col in df:
        f.write(f"{col}: {df[col].mean()}\n")

pio.templates.default = "gridon"
fig = go.Figure()
metrics = [metric for metric in df.columns.to_list() if metric not in ["question", "ground_truth", "answer", "contexts"]]
for metric in metrics:
    fig.add_trace(go.Violin(y=df[metric], name=metric, points="all", box_visible=True, meanline_visible=True))
fig.update_yaxes(range=[-0.02,1.02])
fig.write_image("metrics.png")
108 changes: 108 additions & 0 deletions dvc.md
@@ -0,0 +1,108 @@
# DVC Notes
Notes on how to accomplish various operations using [DVC](https://dvc.org/). Most of these are just distilled notes from the [DVC documentation](https://dvc.org/doc).

This document guides you through working with the `llm-eval` repo and with DVC backed by a [JASMIN object store](https://help.jasmin.ac.uk/docs/short-term-project-storage/using-the-jasmin-object-store/). The instructions can then be applied to any other repository that you may want to set up to work with DVC.

# Working with this repository
## Setup
### Clone this repo
```shell
$ git clone git@github.com:NERC-CEH/llm-eval.git
$ cd llm-eval
```

### Installing DVC
DVC can be installed using `pip`, which provides the basic CLI needed to run DVC commands. The recommended way to work with this repository is to create a new Python virtual environment and then install the appropriate DVC packages via the `requirements.txt` file:
```shell
$ python -m venv .venv
$ source .venv/bin/activate
$ pip install -r requirements.txt
```
If you are working on a different repository the packages can be installed separately:
```shell
$ pip install dvc
$ pip install "dvc[s3]"
```
> DVC remotes backed by various other technologies (besides s3) can be used. See the [DVC documentation](https://dvc.org/doc/user-guide/data-management/remote-storage#supported-storage-types) for details.
### Connecting to JASMIN object storage
#### Request access to JASMIN object store
This repository has a corresponding object store on JASMIN, `llm-eval-o`. To work with the DVC-managed data in this repository you must request access to the object store from the object store manager, [Matt Coole](mailto:[email protected]).

Once you have been granted `USER` access, log in through the [JASMIN Object Store Portal](https://s3-portal.jasmin.ac.uk/object-store/) and create an access key for the `llm-eval-o` object store. Instructions for creating keys can be found in the [JASMIN Documentation](https://help.jasmin.ac.uk/docs/short-term-project-storage/using-the-jasmin-object-store/#creating-an-access-key-and-secret).
> Make sure you store your secret somewhere safe as you will not be able to view it again after the initial creation of your key.
#### Configure Credentials
Once you have access to the object store and have created a key you will need to set up your credentials:
```shell
$ dvc remote modify --local jasmin access_key_id <ACCESS_KEY_ID>
$ dvc remote modify --local jasmin secret_access_key <KEY_SECRET>
```
> Note: the DVC configuration is tracked in `.dvc/config`, but your credentials should be stored in a separate file (`.dvc/config.local`) that is not tracked by version control, to avoid leaking secrets. Make sure to use `--local` when configuring credentials.
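
To confirm that the credentials ended up in the local file without printing them, the two config files can be read and merged. This is a minimal sketch, assuming that `.dvc/config.local` uses the same `['remote "jasmin"']` section layout shown in `.dvc/config` and that the remote is named `jasmin`; the helper names `remote_settings` and `has_credentials` are hypothetical and not part of DVC.

```python
import configparser
from pathlib import Path


def remote_settings(repo_dir=".", remote="jasmin"):
    """Merge the settings for `remote` from .dvc/config and .dvc/config.local."""
    # Assumption: the section header is written as ['remote "jasmin"'],
    # so the parsed section name keeps its surrounding single quotes.
    section = f"'remote \"{remote}\"'"
    merged = {}
    for name in ("config", "config.local"):  # later files override earlier ones
        path = Path(repo_dir) / ".dvc" / name
        if not path.exists():
            continue
        parser = configparser.ConfigParser(interpolation=None)
        parser.read(path)
        if parser.has_section(section):
            merged.update(parser[section])
    return merged


def has_credentials(settings):
    """Return True when both S3 credential keys are present and non-empty."""
    return all(settings.get(key) for key in ("access_key_id", "secret_access_key"))
```

Calling `has_credentials(remote_settings())` from the repository root should return `True` once both `dvc remote modify --local` commands have been run; the secret values themselves are never displayed.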
## Pulling data
Assuming that configuration and credentials have been set up correctly you should now be able to pull the data that is tracked by DVC from the JASMIN object store. This is done using the `dvc pull` command.
```shell
$ dvc pull
```
You should now be able to see the `data` folder and contents:
```
data
├── evaluation-sets
│   ├── eidc-eval.csv
│   └── eidc-eval-sample.csv
└── synthetic-datasets
    └── eidc_rag_test_set.csv
```

## Making changes
To make changes to your data use `dvc add` on the local file and then use `dvc push` to push to the remote store. It is then important to commit the `.dvc` files to git as well, e.g.
```shell
$ dvc add my-data-file.csv
$ dvc push
$ git add my-data-file.csv.dvc .gitignore
$ git commit -m 'Updated data file'
```
`my-data-file.csv.dvc` is a placeholder that DVC creates to tell it about the files/folders being tracked. This placeholder will be tracked by git and the actual data tracked by DVC.

DVC should also automatically add the file/directory to `.gitignore` so it won't end up being accidentally tracked in git as well.
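
The placeholder identifies the data by a content hash (the `md5` field visible in `data.dvc`). As an illustration of how such a value can be computed for a single file, here is a minimal sketch assuming that the recorded value is the MD5 of the file's raw bytes, as indicated by `hash: md5`; directory entries (those with a `.dir` suffix) are hashed differently and are not covered. The helper name `file_md5` is hypothetical and not part of DVC.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Comparing `file_md5("my-data-file.csv")` with the `md5` entry in `my-data-file.csv.dvc` is one way to check whether the local file still matches the tracked version.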

### Moving data from git to DVC
Any files or folders that you add to DVC must not be tracked by git. To switch from tracking a file with git to DVC, first untrack it with git:
```shell
$ git rm --cached data-file-in-git.csv
```
then follow the steps [above](#making-changes) to add the file(s) to be tracked by DVC.

> Note: Whilst `dvc` commands seem to somewhat mirror `git` commands, there doesn't seem to be quite the same concept of a staging area. I would suggest that `dvc add` is more like an amalgamation (in DVC) of `git add` + `git commit`.
## Checking out versions
To switch between versions of your data tracked by DVC, use `git checkout` as you typically would to check out a particular version of the code, then follow this up with `dvc checkout` to check out the corresponding version of the data, e.g.
```shell
$ git checkout c474fcc
$ dvc checkout
```
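Because the two commands must always run in this order, they can be wrapped in a small helper. The following is a minimal sketch that simply shells out to the `git` and `dvc` CLIs; the function names `checkout_commands` and `checkout_version` are hypothetical, and the `run` parameter exists only so the command runner can be swapped out.

```python
import subprocess


def checkout_commands(revision):
    """Return the commands that switch code and data to `revision`, in order."""
    return [["git", "checkout", revision], ["dvc", "checkout"]]


def checkout_version(revision, run=subprocess.run):
    """Check out the code at `revision`, then the matching DVC-tracked data."""
    for command in checkout_commands(revision):
        # check=True raises CalledProcessError if a command fails, so
        # `dvc checkout` is never run after a failed `git checkout`.
        run(command, check=True)
```

For example, `checkout_version("c474fcc")` is equivalent to the two shell commands above.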

# Setting up a new repository
Up until here it was assumed you were working with a repository already set up with DVC, but to set up DVC on your own git repository there are just a few initial steps to configure:
## Initialise
To set up your own git repository to track any data files using DVC use `dvc init` in the repository's directory.
```shell
$ dvc init
```
You will see a `.dvc/` directory and a `.dvcignore` file, which you should add to your version-controlled files.

## Add remote
Now add a bucket to be used as a remote (make sure you create the bucket in your object store first):
```shell
$ dvc remote add -d jasmin s3://test-dvc
```
This will initially set up a remote named `jasmin` that uses the `test-dvc` bucket and, because of `-d`, make it the default remote (check the config in `.dvc/config`).

Next you need to add the endpoint URL. If you are using a JASMIN object store this should look something like this:
```shell
$ dvc remote modify jasmin endpointurl https://my-test-store-o.s3-ext.jc.rl.ac.uk
```
Where your object store is called `my-test-store-o`.
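
The endpoint pattern can be captured in a small helper. This is a minimal sketch, assuming that JASMIN object stores all follow the `https://<store-name>.s3-ext.jc.rl.ac.uk` pattern used in this document; the function name `jasmin_endpoint` is hypothetical.

```python
def jasmin_endpoint(store_name):
    """Build the external S3 endpoint URL for a JASMIN object store name."""
    if not store_name or "/" in store_name:
        raise ValueError(f"invalid object store name: {store_name!r}")
    # Assumption: the host is the store name followed by the fixed JASMIN suffix.
    return f"https://{store_name}.s3-ext.jc.rl.ac.uk"
```

For the store used by this repository, `jasmin_endpoint("llm-eval-o")` gives the `endpointurl` value recorded in `.dvc/config`.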

## Configure Credentials
Finally you can configure your credentials as described [above](#connecting-to-jasmin-object-storage).