A quick CLI tool to test whether one LLM outperforms another, based on the LLM-as-a-judge paper's method.
- Install the dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Set up the .env file:

  ```bash
  cp .env.sample .env
  ```

  By default it uses OLLAMA to run the judge models (a hypothetical example of the resulting .env is sketched after these steps).
- Run the code:

  ```bash
  python main.py
  ```
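For reference, a plausible .env might look like the sketch below. Both variable names are assumptions rather than confirmed contents (OLLAMA_HOST is Ollama's standard host variable; MODELS is inferred from the jury step below); the real keys come from .env.sample:

```dotenv
# Hypothetical example — the actual keys are defined in .env.sample
OLLAMA_HOST=http://localhost:11434  # assumed: where Ollama serves the judge models
MODELS=llama3,mistral,phi3          # assumed: comma-separated list of jury models
```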
The CLI will prompt for a question and for the responses from LLM A and LLM B, then run the benchmark using the models listed in MODELS as the jury.
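To make the jury step concrete, here is a minimal sketch of the LLM-as-a-judge loop — an illustration of the pattern, not the repo's actual main.py. It assumes the judge models are served by Ollama's local REST API and that MODELS is a comma-separated list of model names:

```python
# Minimal LLM-as-a-judge jury sketch (illustrative, not the repo's main.py).
# Assumes Ollama is running locally and MODELS is a comma-separated env var.
import os
import requests

OLLAMA_URL = os.environ.get("OLLAMA_HOST", "http://localhost:11434")
JURY = os.environ.get("MODELS", "llama3,mistral").split(",")

PROMPT = """You are an impartial judge. Given a question and two answers,
reply with exactly "A" or "B" for the better answer, or "TIE".

Question: {question}

Answer A: {answer_a}

Answer B: {answer_b}
"""

def ask_judge(model: str, question: str, answer_a: str, answer_b: str) -> str:
    """Ask one jury model for a verdict via Ollama's /api/chat endpoint."""
    resp = requests.post(
        f"{OLLAMA_URL}/api/chat",
        json={
            "model": model,
            "messages": [{"role": "user", "content": PROMPT.format(
                question=question, answer_a=answer_a, answer_b=answer_b)}],
            "stream": False,
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["message"]["content"].strip().upper()

def run_jury(question: str, answer_a: str, answer_b: str) -> dict:
    """Collect one vote per jury model and tally the results."""
    votes = {"A": 0, "B": 0, "TIE": 0}
    for model in JURY:
        verdict = ask_judge(model.strip(), question, answer_a, answer_b)
        # Count anything that isn't a clear A/B verdict as a tie.
        votes[verdict if verdict in votes else "TIE"] += 1
    return votes

if __name__ == "__main__":
    q = input("Question: ")
    a = input("LLM A's response: ")
    b = input("LLM B's response: ")
    print(run_jury(q, a, b))
```

A fuller implementation would typically also swap the order of the two answers on a second pass, since the LLM-as-a-judge paper reports position bias as a common judge failure mode.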