Eval assist documentation #1537

Merged 2 commits on Jan 21, 2025
105 changes: 54 additions & 51 deletions docs/docs/llm_as_judge.rst
@@ -48,12 +48,12 @@ An LLM as a Judge metric consists of several essential components:
Many of these model and platform combinations are already predefined in our catalog. The metric names are prefixed by metrics.llm_as_judge.direct, followed by the platform and the model name.
For instance, metrics.llm_as_judge.direct.rits.llama3_1_70b refers to the Llama 3.1 70B model served through the RITS deployment service.

3. The criterion or criteria used to evaluate the model's response. There are predefined criteria in the catalog, and the user can also define a custom criterion.
Each criterion specifies fine-grained options that help steer the judge model to evaluate the response more precisely.
For instance, the criterion "metrics.llm_as_judge.direct.criterias.answer_relevance" quantifies how relevant the model's response is to the user's question.
It has four options the model can choose from: excellent, acceptable, could be improved, and bad. Each option also has a description and an associated score.
The judge model uses these descriptions to identify which option the given response is closest to and returns it.
The user can also specify their own custom criterion. An example of this is included under the section **Creating a custom criterion**.
The user can also specify more than one criterion; this is illustrated in the **End to end example** section.
4. The context fields are the additional fields, beyond the evaluated response, that are passed to the LLM judge. These could include the reference answer, the question, or the context provided to the model.
In the example below, the question that was input to the model is passed as a context field.
@@ -72,88 +72,88 @@ To accomplish this evaluation, we require the following:

1. The questions that were input to the model
2. The judge model and its deployment platform
3. The predefined criterion, which in this case is metrics.llm_as_judge.direct.criterias.answer_relevance.

We pass the criterion to the judge metric as the criteria argument and the question as a context field.

.. code-block:: python

    # Imports used throughout this section
    from unitxt import create_dataset, evaluate

    data = [
        {"question": "Who is Harry Potter?"},
        {"question": "How can I protect myself from the wind while walking outside?"},
        {"question": "What is a good low cost of living city in the US?"},
    ]

    criteria = "metrics.llm_as_judge.direct.criterias.answer_relevance"
    metrics = [
        f"metrics.llm_as_judge.direct.rits.llama3_1_70b[criteria={criteria}, context_fields=[question]]"
    ]

Once the metric is defined, a dataset is created for the appropriate task.

.. code-block:: python

    dataset = create_dataset(
        task="tasks.qa.open", test_set=data, metrics=metrics, split="test"
    )

The model's responses are then evaluated by the judge model as follows:

.. code-block:: python

    predictions = [
        """Harry Potter is a young wizard who becomes famous for surviving an attack by the dark wizard Voldemort, and later embarks on a journey to defeat him and uncover the truth about his past.""",
        """You can protect yourself from the wind by wearing windproof clothing, layering up, and using accessories like hats, scarves, and gloves to cover exposed skin.""",
        """A good low-cost-of-living city in the U.S. is San Francisco, California, known for its affordable housing and budget-friendly lifestyle.""",
    ]

    results = evaluate(predictions=predictions, data=dataset)

    print("Global Scores:")
    print(results.global_scores.summary)

    print("Instance Scores:")
    print(results.instance_scores.summary)
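
Beyond the printed summaries, the scores can also be read programmatically. The snippet below is a sketch only: it assumes global_scores supports dictionary-style access and that the aggregate judge score is stored under the key "score"; inspect the printed summary to confirm the exact key names in your setup.

.. code-block:: python

    # Sketch only: dictionary-style access to the aggregated judge score.
    # The key name "score" is an assumption; results.global_scores.summary
    # shows the exact keys reported for your metric.
    overall = results.global_scores["score"]
    print(f"Aggregate answer_relevance score: {overall}")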


Positional Bias
--------------------------------------------
Positional bias occurs when the judge model favors an option because of its placement within the list of available options rather than its intrinsic merit.
Unitxt reports whether the judge model exhibited positional bias in the instance-level summary.
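
The sketch below is one possible way to surface this information programmatically, reusing the results object from the example above. It assumes that instance_scores behaves like a list of per-instance score dictionaries and that positional-bias indicators appear under keys containing "positional_bias"; both are assumptions to verify against the printed instance-level summary.

.. code-block:: python

    # Sketch only: scan each instance's score dictionary for positional-bias fields.
    # The key naming convention is an assumption; check the printed summary for
    # the exact fields your judge metric reports.
    for i, instance in enumerate(results.instance_scores):
        bias_fields = {k: v for k, v in instance.items() if "positional_bias" in k}
        print(f"instance {i}: {bias_fields}")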

Creating a custom criterion
-------------------------------------
As described above, the user can either choose a predefined criterion from the catalog or define their own. Below is an example of how a user can define a custom criterion.
The criterion must have options, each with a description, for the judge model to choose from.
In this example, the user mandates that the model respond with the temperature in both Celsius and Fahrenheit. The various possibilities are described in the options, and each option is associated with a score specified in the option map.

.. code-block:: python

    from unitxt.llm_as_judge_constants import CriteriaWithOptions

    criteria = CriteriaWithOptions.from_obj(
        {
            "name": "Temperature in Fahrenheit and Celsius",
            "description": "In the response, if there is a numerical temperature present, is it denominated in both Fahrenheit and Celsius?",
            "options": [
                {
                    "name": "Correct",
                    "description": "The temperature reading is provided in both Fahrenheit and Celsius.",
                },
                {
                    "name": "Partially Correct",
                    "description": "The temperature reading is provided either in Fahrenheit or Celsius, but not both.",
                },
                {
                    "name": "Incorrect",
                    "description": "There is no numerical temperature reading in the response.",
                },
            ],
            "option_map": {"Correct": 1.0, "Partially Correct": 0.5, "Incorrect": 0.0},
        }
    )
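
One way to put this custom criterion to work is to construct the judge metric programmatically instead of referencing a catalog criterion by name. The sketch below is an illustration under stated assumptions: the LLMJudgeDirect class name, its import path, and its constructor arguments are inferred from the string-form metrics used earlier, and the judge model identifier is a placeholder; verify all of them against your installed unitxt version.

.. code-block:: python

    # Sketch only (assumed API): attach the CriteriaWithOptions object defined above
    # to a judge metric and use it when creating a dataset, mirroring the earlier example.
    from unitxt import create_dataset
    from unitxt.inference import CrossProviderInferenceEngine
    from unitxt.llm_as_judge import LLMJudgeDirect  # assumed import path

    judge_metric = LLMJudgeDirect(
        inference_engine=CrossProviderInferenceEngine(
            model="llama-3-1-70b-instruct", provider="watsonx"  # assumed judge model id
        ),
        criteria=criteria,            # the custom criterion defined above
        context_fields=["question"],  # same context field as in the earlier examples
    )

    dataset = create_dataset(
        task="tasks.qa.open", test_set=data, metrics=[judge_metric], split="test"
    )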


End to end example
@@ -162,6 +162,7 @@ Unitxt can also obtain a model's responses for a given dataset and then run LLM-as-a-judge evaluation on them.
Here, we will get Llama 3.2 1B Instruct's responses and then evaluate them for answer relevance, coherence, and conciseness using the llama3_1_70b judge model.

.. code-block:: python

criterias = ["answer_relevance", "coherence", "conciseness"]
metrics = [
"metrics.llm_as_judge.direct.rits.llama3_1_70b"
@@ -181,6 +182,7 @@
We use CrossProviderInferenceEngine for inference.

.. code-block:: python

inference_model = CrossProviderInferenceEngine(
model="llama-3-2-1b-instruct", provider="watsonx"
)
@@ -241,6 +243,7 @@
)

.. code-block:: text

Output with 100 examples

Scores for criteria 'answer_relevance'