[WIP] Add new GPU runner for E2E job, incorporate unit tests into existing runner #71

Status: Closed (wants to merge 1 commit)
205 changes: 205 additions & 0 deletions .github/workflows/e2e-nvidia-a10g-x1.yml
@@ -0,0 +1,205 @@
# SPDX-License-Identifier: Apache-2.0

name: E2E (NVIDIA A10G x1)

on:
  workflow_dispatch:
    inputs:
      pr_or_branch:
        description: 'pull request number or branch name'
        required: true
        default: 'main'

jobs:
  start-runner:
    name: Start external EC2 runner
    runs-on: ubuntu-latest
    outputs:
      label: ${{ steps.start-ec2-runner.outputs.label }}
      ec2-instance-id: ${{ steps.start-ec2-runner.outputs.ec2-instance-id }}
    steps:
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@e3dd6a429d7300a6a4c196c26e071d42e0343502 # v4.0.2
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ secrets.AWS_REGION }}
      - name: Start EC2 runner
        id: start-ec2-runner
        uses: machulav/ec2-github-runner@fcfb31a5760dad1314a64a0e172b78ec6fc8a17e # v2.3.6
        with:
          mode: start
          github-token: ${{ secrets.GH_PERSONAL_ACCESS_TOKEN }}
          ec2-image-id: ami-00c51d9c1374eda97
          ec2-instance-type: g5.2xlarge
          subnet-id: subnet-02d230cffd9385bd4
          security-group-id: sg-06300447c4a5fbef3
          iam-role-name: instructlab-ci-runner
          aws-resource-tags: >
            [
              {"Key": "Name", "Value": "instructlab-ci-github-runner"},
              {"Key": "GitHubRepository", "Value": "${{ github.repository }}"}
            ]

  e2e:
    name: E2E Test
    needs: start-runner
    runs-on: ${{ needs.start-runner.outputs.label }}

    permissions:
      pull-requests: write

    steps:
      - name: Checkout instructlab/eval
        uses: actions/checkout@692973e3d937129bcbf40652eb9f2f61becf3332 # v4.1.7
        with:
          # https://github.com/actions/checkout/issues/249
          fetch-depth: 0

      - name: Checkout instructlab/instructlab
        uses: actions/checkout@692973e3d937129bcbf40652eb9f2f61becf3332 # v4.1.7
        with:
          repository: "instructlab/instructlab"
          path: "instructlab"
          fetch-depth: 0

      - name: Determine if pr_or_branch is a PR number
        id: check_pr
        run: |
          nvidia-smi
          if [[ "${{ github.event.inputs.pr_or_branch }}" =~ ^[0-9]+$ ]]; then
            echo "is_pr=true" >> "$GITHUB_OUTPUT"
          else
            echo "is_pr=false" >> "$GITHUB_OUTPUT"
          fi

      - name: Check if gh cli is installed
        id: gh_cli
        run: |
          if command -v gh &> /dev/null ; then
            echo "gh_cli_installed=true" >> "$GITHUB_OUTPUT"
          else
            echo "gh_cli_installed=false" >> "$GITHUB_OUTPUT"
          fi

      - name: Install gh CLI
        if: steps.gh_cli.outputs.gh_cli_installed == 'false'
        run: |
          sudo dnf install 'dnf-command(config-manager)' -y
          sudo dnf config-manager --add-repo https://cli.github.com/packages/rpm/gh-cli.repo
          sudo dnf install gh --repo gh-cli -y

      - name: test gh CLI
        run: |
          gh --version

      - name: set default repo
        run: |
          gh repo set-default ${{ github.server_url }}/${{ github.repository }}
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}

      - name: Add comment to PR
        if: steps.check_pr.outputs.is_pr == 'true'
        run: |
          gh pr comment "${{ github.event.inputs.pr_or_branch }}" -b "${{ github.workflow }} workflow launched on this PR: [View run](${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }})"
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}

      - name: Fetch and checkout PR
        if: steps.check_pr.outputs.is_pr == 'true'
        run: |
          gh pr checkout ${{ github.event.inputs.pr_or_branch }}
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}

      - name: Checkout branch
        if: steps.check_pr.outputs.is_pr == 'false'
        run: |
          git checkout ${{ github.event.inputs.pr_or_branch }}

      - name: Install Packages
        run: |
          cat /etc/os-release
          sudo dnf install -y gcc gcc-c++ make git python3.11 python3.11-devel

      - name: Install ilab
        run: |
          export CUDA_HOME="/usr/local/cuda"
          export LD_LIBRRY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64"

[Review comment, Contributor, marked resolved]: LIBRRY :)
          export PATH="$PATH:$CUDA_HOME/bin"
          python3.11 -m venv venv
          . venv/bin/activate
          nvidia-smi
          # Strip extras like [cuda] so requirements.txt can be used as a pip constraints file
          sed 's/\[.*\]//' requirements.txt > constraints.txt
          python3.11 -m pip cache remove llama_cpp_python
          CMAKE_ARGS="-DLLAMA_CUBLAS=on" python3.11 -m pip install --force-reinstall --no-binary llama_cpp_python -c constraints.txt llama_cpp_python
          python3.11 -m pip install bitsandbytes

          # TODO: this should be added to instructlab-training
          python3.11 -m pip install packaging wheel

          python3.11 -m pip install instructlab-training[cuda]

          # Install the local version of eval before installing the CLI so PR changes are included
          python3.11 -m pip install .

          python3.11 -m pip install instructlab
[Review comment, Contributor]: (Question) AFAIU this assumes that the instructlab package never caps / pins the eval library in an incompatible way. Otherwise, this line could revert the eval package to the one from PyPI. Is that an acceptable assumption?
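
One hedged way to guard against that (a sketch, not part of this PR - the reinstall-afterwards order and the pip check step are assumptions):

    # Install the CLI first, then force the local eval checkout back on top
    # without letting the resolver replace it with the PyPI release.
    python3.11 -m pip install instructlab
    python3.11 -m pip install --no-deps --force-reinstall .
    # Surface any version-cap conflict instead of silently hiding it
    python3.11 -m pip check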


      - name: Run e2e test
        run: |
          nvidia-smi
          # This env variable is used on GPUs with less vRAM. It only allows CUDA to allocate small chunks of vRAM at a time, which usually helps avoid OOM errors.
          # This is not a good solution for production code, as setting env variables for users isn't best practice. However, it is a helpful manual workaround.
          export PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:128
          . venv/bin/activate
          # TODO: for some reason we need to reinstall DeepSpeed in order to get fused Adam support.
          # This means we need to manually rm and re-install a bunch of packages. Investigate why this is.
          python3.11 -m pip uninstall -y deepspeed

          python3.11 -m pip cache purge

          DS_BUILD_CPU_ADAM=1 BUILD_UTILS=1 python3.11 -m pip install deepspeed

          nvidia-smi

          python3.11 -m pip show nvidia-nccl-cu12

          cd instructlab
          ./scripts/basic-workflow-tests.sh -em

      - name: Add comment to PR if the workflow failed
        if: failure() && steps.check_pr.outputs.is_pr == 'true'
        run: |
          gh pr comment "${{ github.event.inputs.pr_or_branch }}" -b "e2e workflow failed on this PR: [View run](${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}), please investigate."
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}

      - name: Add comment to PR if the workflow succeeded
        if: success() && steps.check_pr.outputs.is_pr == 'true'
        run: |
          gh pr comment "${{ github.event.inputs.pr_or_branch }}" -b "e2e workflow succeeded on this PR: [View run](${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}), congrats!"
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}

  stop-runner:
    name: Stop external EC2 runner
    needs:
      - start-runner
      - e2e
    runs-on: ubuntu-latest
    if: ${{ always() }}
    steps:
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@e3dd6a429d7300a6a4c196c26e071d42e0343502 # v4.0.2
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ secrets.AWS_REGION }}
      - name: Stop EC2 runner
        uses: machulav/ec2-github-runner@fcfb31a5760dad1314a64a0e172b78ec6fc8a17e # v2.3.6
        with:
          mode: stop
          github-token: ${{ secrets.GH_PERSONAL_ACCESS_TOKEN }}
          label: ${{ needs.start-runner.outputs.label }}
          ec2-instance-id: ${{ needs.start-runner.outputs.ec2-instance-id }}
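
For reference, one way to dispatch this workflow from the gh CLI once it lands (the PR number below is illustrative):

    # Trigger the E2E run against a PR number or a branch name
    gh workflow run 'E2E (NVIDIA A10G x1)' -f pr_or_branch=71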
108 changes: 108 additions & 0 deletions .github/workflows/test.yml
@@ -0,0 +1,108 @@
# SPDX-License-Identifier: Apache-2.0

name: Test

on:
  push:
    branches:
      - "main"
      - "release-**"
    paths:
      - '**.py'
      - 'pyproject.toml'
      - 'requirements*.txt'
      - '.github/workflows/test.yml'
  pull_request:
    branches:
      - "main"
      - "release-**"
    paths:
      - '**.py'
      - 'pyproject.toml'
      - 'requirements*.txt'
      - '.github/workflows/test.yml'

concurrency:
  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
  cancel-in-progress: true

jobs:
  unit:
    runs-on: ubuntu-gpu
    steps:
      # No step-security/harden-runner since this is a self-hosted runner
      - name: Checkout instructlab/eval
        uses: actions/checkout@692973e3d937129bcbf40652eb9f2f61becf3332 # v4.1.7
        with:
          # https://github.com/actions/checkout/issues/249
          fetch-depth: 0

      # this is needed for branch tests
      - name: Checkout instructlab/taxonomy
        uses: actions/checkout@692973e3d937129bcbf40652eb9f2f61becf3332 # v4.1.7
        with:
          repository: "instructlab/taxonomy"
          path: "taxonomy"
          fetch-depth: 0

      # this is needed for judge_answer tests
      - name: Checkout instructlab/instructlab
        uses: actions/checkout@692973e3d937129bcbf40652eb9f2f61becf3332 # v4.1.7
        with:
          repository: "instructlab/instructlab"
          path: "instructlab"
          fetch-depth: 0

      - name: Install system packages
        run: |
          sudo apt-get install -y cuda-toolkit git cmake build-essential virtualenv
          nvidia-smi
          sudo ls -l /dev/nvidia*

      - name: Setup Python 3.11
        uses: actions/setup-python@82c7e631bb3cdc910f68e0081d67478d79c6982d # v5.1.0
        with:
          python-version: 3.11
          cache: pip
          cache-dependency-path: |
            **/pyproject.toml
            **/requirements*.txt

      - name: Remove llama-cpp-python from cache
        run: |
          pip cache remove llama_cpp_python

      - name: Start inference server
        run: |
          export PATH="/home/runner/.local/bin:/usr/local/cuda/bin:$PATH"
          cd instructlab
          python3.11 -m venv cli_venv
          . cli_venv/bin/activate
          # Strip extras like [cuda] so requirements.txt can be used as a pip constraints file
          sed 's/\[.*\]//' requirements.txt > constraints.txt
          python3.11 -m pip cache remove llama_cpp_python
          CMAKE_ARGS="-DLLAMA_CUBLAS=on" python3.11 -m pip install --no-binary llama_cpp_python -c constraints.txt llama_cpp_python
          # needed for --4-bit-quant option to ilab train
          python3.11 -m pip install bitsandbytes
          # install instructlab
          python3.11 -m pip install .
          # start llama-cpp server
          ilab model download --repository instructlab/granite-7b-lab-GGUF --filename granite-7b-lab-Q4_K_M.gguf
[Review comment, Contributor]: Since what we want to do with ilab is pretty basic, I'd suggest we just install instructlab from PyPI.

[Reply, Member (author)]: I would agree, except the current PyPI package is pretty out-of-date - using it now is just going to require us to change a bunch of stuff when the next release comes out - but once 0.18.0 lands that'll make sense.

[Reply, Member]: I agree we should install from PyPI; installing instructlab from PyPI won't change the ilab model download command.

[Reply, Member]: @nathan-weinberg we can install the beta releases with --pre and then remove that once 0.18.0 lands.
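
A minimal sketch of that suggestion (assuming the beta releases are published to PyPI under the instructlab name):

    # Pull the latest pre-release of the CLI instead of building from source
    python3.11 -m pip install --pre instructlab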

          ilab model serve --model-path /home/runner/.local/share/instructlab/models/granite-7b-lab-Q4_K_M.gguf
[Review comment, Contributor]: In the functional test scripts we shut down the server to clean up. I wonder if we should split all of this into a shell script that manages ilab installation, server startup, pytest running, and server shutdown.
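
A rough sketch of such a wrapper (hypothetical - the script name, model path, and warm-up wait are assumptions, not part of this PR):

    #!/usr/bin/env bash
    # run-tests-with-server.sh: start the llama-cpp server, run pytest, always clean up.
    set -euo pipefail

    MODEL_PATH="$HOME/.local/share/instructlab/models/granite-7b-lab-Q4_K_M.gguf"

    # Start the inference server in the background and remember its PID
    ilab model serve --model-path "$MODEL_PATH" &
    SERVER_PID=$!

    # Shut the server down on exit, whether pytest passes or fails
    trap 'kill "$SERVER_PID" 2>/dev/null || true' EXIT

    # Give the server a moment to come up before the tests hit it
    sleep 30

    python3.11 -m pytest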

[Reply, Member (author)]: Really, we shouldn't need ilab at all to run unit or functional tests for this library - it should operate independently of the CLI.

[Reply, Member (author)]: I'm gonna look into some other possible approaches.

[Reply, Member]: > Really, we shouldn't need ilab at all to run unit or functional tests for this library - it should operate independently of the CLI

I think this makes sense for unit tests, but functional ones should probably still use a server from ilab etc.


      - name: Install dependencies
        run: |
          python3.11 -m venv venv
          . venv/bin/activate
          python3.11 -m pip install .
          python3.11 -m pip install pytest

      - name: Run unit tests
        run: |
          export INSTRUCTLAB_EVAL_FIRST_N_QUESTIONS=5
          . venv/bin/activate
          python3.11 -m pytest

      - name: Remove llama-cpp-python from cache
        if: always()
        run: |
          pip cache remove llama_cpp_python
1 change: 1 addition & 0 deletions .gitignore
@@ -53,6 +53,7 @@ coverage.xml
.hypothesis/
.pytest_cache/
cover/
eval_output/
[Review comment, Contributor]: taxonomy too?


# Translations
*.mo
1 change: 1 addition & 0 deletions README.md
@@ -1,6 +1,7 @@
# eval

![Lint](https://github.com/instructlab/eval/actions/workflows/lint.yml/badge.svg?branch=main)
![Test](https://github.com/instructlab/eval/actions/workflows/test.yml/badge.svg?branch=main)
[Review comment, Contributor]: @nathan-weinberg on line 38, 39 of this file can you mention how to run the tests with pytest individually and as a group?

[Reply, Member (author)]: Yes!
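
For reference, a sketch of what that README addition might look like (the file and function names match the tests in this PR; the exact wording is an assumption):

    # Run the full test suite
    python3.11 -m pytest

    # Run a single test file
    python3.11 -m pytest tests/test_branch_gen_answers.py

    # Run a single test function
    python3.11 -m pytest tests/test_branch_judge_answers.py::test_branch_judge_answers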

![Build](https://github.com/instructlab/eval/actions/workflows/pypi.yaml/badge.svg?branch=main)
![Release](https://img.shields.io/github/v/release/instructlab/eval)
![License](https://img.shields.io/github/license/instructlab/eval)
16 changes: 9 additions & 7 deletions tests/test_branch_gen_answers.py
@@ -1,10 +1,12 @@
 # First Party
 from instructlab.eval.mt_bench import MTBenchBranchEvaluator

-mt_bench_branch = MTBenchBranchEvaluator(
-    "instructlab/granite-7b-lab",
-    "instructlab/granite-7b-lab",
-    "../taxonomy",
-    "main",
-)
-mt_bench_branch.gen_answers("http://localhost:8000/v1")
+
+def test_branch_gen_answers():
+    mt_bench_branch = MTBenchBranchEvaluator(
+        "instructlab/granite-7b-lab",
+        "instructlab/granite-7b-lab",
+        "taxonomy",
+        "main",
+    )
+    mt_bench_branch.gen_answers("http://localhost:8000/v1")
40 changes: 21 additions & 19 deletions tests/test_branch_judge_answers.py
@@ -4,24 +4,26 @@
 # First Party
 from instructlab.eval.mt_bench import MTBenchBranchEvaluator

-mt_bench_branch = MTBenchBranchEvaluator(
-    "instructlab/granite-7b-lab",
-    "instructlab/granite-7b-lab",
-    "../taxonomy",
-    "main",
-)
-qa_pairs, error_rate = mt_bench_branch.judge_answers("http://localhost:8000/v1")
-print(f"Error Rate: {error_rate}")
-print(f"QA Pair 0:")
-pprint.pprint(qa_pairs[0])
-
-print(f"qa_pairs length: {len(qa_pairs)}")
-
-for qa_pair in qa_pairs:
-    question_id = qa_pair.get("question_id")
-    assert question_id is not None
-    assert qa_pair.get("score") is not None
-    assert qa_pair.get("category") is not None
-    assert qa_pair.get("question") is not None
-    assert qa_pair.get("answer") is not None
-    assert qa_pair.get("qna_file") is not None
+
+def test_branch_judge_answers():
+    mt_bench_branch = MTBenchBranchEvaluator(
+        "instructlab/granite-7b-lab",
+        "instructlab/granite-7b-lab",
+        "taxonomy",
+        "main",
+    )
+    qa_pairs, error_rate = mt_bench_branch.judge_answers("http://localhost:8000/v1")
+    print(f"Error Rate: {error_rate}")
+    print(f"QA Pair 0:")
+    pprint.pprint(qa_pairs[0])
+
+    print(f"qa_pairs length: {len(qa_pairs)}")
+
+    for qa_pair in qa_pairs:
+        question_id = qa_pair.get("question_id")
+        assert question_id is not None
+        assert qa_pair.get("score") is not None
+        assert qa_pair.get("category") is not None
+        assert qa_pair.get("question") is not None
+        assert qa_pair.get("answer") is not None
+        assert qa_pair.get("qna_file") is not None