From 95398256a955754bdd7071d7d72aa0c2d3b91f69 Mon Sep 17 00:00:00 2001
From: Tejaswini Pedapati
Date: Mon, 20 Jan 2025 11:46:26 -0500
Subject: [PATCH] Updated documentation of LLM-as-Judge to reflect the latest changes. Updated criterion in examples to be more practical

---
 docs/docs/llm_as_judge.rst                     | 460 +++++++-----------
 ...llm_as_judge_direct_predefined_criteria.py  |  27 +-
 ...s_judge_direct_user_criteria_no_catalog.py  |   8 +-
 3 files changed, 185 insertions(+), 310 deletions(-)

diff --git a/docs/docs/llm_as_judge.rst b/docs/docs/llm_as_judge.rst
index 09a3adc6d8..4366d05b7b 100644
--- a/docs/docs/llm_as_judge.rst
+++ b/docs/docs/llm_as_judge.rst
@@ -44,20 +44,17 @@ Overview
 An LLM as a Judge metric consists of several essential components:
 
 1. The judge model, such as *Llama-3-8B-Instruct* or *gpt-3.5-turbo*, which evaluates the performance of other models.
-2. The platform responsible for executing the judge model, such as Huggingface or OpenAI API.
-3. The template used to construct prompts for the judge model. This template should be reflective of the judgment needed and usually incorporates both the input and output of the evaluated model. For instance:
-
-   .. code-block:: text
-
-      Please rate the clarity, coherence, and informativeness of the following summary on a scale of 1 to 10\\n Full text: {model_input}\\nSummary: {model_output}
-
-4. The format in which the judge model expects to receive prompts. For example:
-
-   .. code-block:: text
-
-      {input}
-
-5. Optionally, a system prompt to pass to the judge model. This can provide additional context for evaluation.
+2. The platform responsible for executing the judge model, such as Huggingface, the OpenAI API, or IBM's deployment services WatsonX and RITS.
+   Many of these model and platform combinations are already predefined in our catalog. Their names are prefixed by metrics.llm_as_judge.direct, followed by the platform and the model name.
+   For instance, metrics.llm_as_judge.direct.rits.llama3_1_70b refers to the Llama 3.1 70B model served through the RITS deployment service.
+
+3. The criterion or criteria used to evaluate the model's response. The catalog contains predefined criteria, and the user can also define a custom criterion.
+   Each criterion specifies fine-grained options that help steer the judge model to evaluate the response more precisely.
+   For instance, the criterion "metrics.llm_as_judge.direct.criterias.answer_relevance" quantifies how relevant the model's response is to the user's question.
+   It has four options the judge model can choose from: excellent, acceptable, could be improved, and bad. Each option carries its own description and an associated score.
+   The judge model uses these descriptions to identify the option that the given response is closest to, and returns it.
+   Users can also specify their own custom criteria; an example is included under the section **Creating a custom criterion**.
+   More than one criterion can be specified as well, as illustrated in the **End to end example** section.
 
 Understanding these components is crucial for effectively leveraging LLM as a judge metrics. With this foundation, let's examine how to utilize and create these metrics in the Unitxt package.
 
@@ -67,322 +64,197 @@ Employing a pre-defined LLM as a judge metric is effortlessly achieved within Un
 The Unitxt catalog boasts a variety of preexisting LLM as judges that seamlessly integrate into your workflow.
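+
+To give a quick sense of how these catalog entries are used, the sketch below builds the metric string for the judge and criterion that appear throughout this guide (the rits.llama3_1_70b judge with the answer_relevance criterion). Other provider and model combinations follow the same metrics.llm_as_judge.direct.<platform>.<model> naming pattern; check the catalog of your Unitxt installation for the exact entries available.
+
+.. code-block:: python
+
+    # A catalog judge is referenced as a plain string: the judge entry, plus the
+    # criterion it should apply and the fields it should receive as context.
+    criterion = "metrics.llm_as_judge.direct.criterias.answer_relevance"
+    judge_metric = (
+        "metrics.llm_as_judge.direct.rits.llama3_1_70b"
+        f"[criteria={criterion}, context_fields=[question]]"
+    )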
-Let's consider an example of evaluating a *flan-t5-small* model on the MT-Bench benchmark, specifically utilizing the single model rating evaluation part of the benchmark. In this part, we provide the LLM as a Judge, the input provided to the model and the output it generation. The LLM as Judge is asked to rate how well the output of the model address the request in the input.
+Let's consider an example of evaluating a model's responses for relevance to the questions it was asked. This evaluation requires the following:
 
-1. A Unitxt dataset card containing MT-Bench inputs, which will serve as the input for our evaluated model.
-2. A Unitxt template to be paired with the card. As the MT-Bench dataset already includes full prompts, there is no need to construct one using a template; hence, we'll opt for the *empty* template, which just passes the input prompt from the dataset to the model.
-3. A unitxt format to be utilized with the card. Given that *flan* models do not demand special formatting of the inputs, we'll utilize the *empty* format here as well.
-4. An LLM as a judge metric leveraging the MT-Bench evaluation prompt.
-
-Fortunately, all these components are readily available in the Unitxt catalog, including a judge model based on *Mistral* from Huggingface that employs the MT-Bench format.
-From here, constructing the full unitxt recipe string is standard and straightforward:
+1. The questions that were given to the evaluated model.
+2. The judge model and its deployment platform.
+3. The pre-defined criterion, which in this case is metrics.llm_as_judge.direct.criterias.answer_relevance.
 
-.. code-block:: text
+We pass the criterion to the judge metric through its criteria argument, and the question through its context_fields argument.
 
-    card=cards.mt_bench.generation.english_single_turn,
-    template=templates.empty,
-    format=formats.empty,
-    metrics=[metrics.llm_as_judge.rating.mistral_7b_instruct_v0_2_huggingface_template_mt_bench_single_turn]
+.. code-block:: python
 
-.. note::
+    from unitxt import create_dataset, evaluate
+
+    data = [
+        {"question": "Who is Harry Potter?"},
+        {"question": "How can I protect myself from the wind while walking outside?"},
+        {"question": "What is a good low cost of living city in the US?"},
+    ]
 
-    Pay attention!
-    We are using the mistralai/Mistral-7B-Instruct-v0.2 model from Huggingface. Using this model requires you to agree to the Terms of Use on the model page and set the HUGGINGFACE_TOKEN environment argument. Other platforms might have different requirements. For example if you are using an LLM as judge based on the OpenAI platform, you will need to set your OpenAI api key.
+    criterion = "metrics.llm_as_judge.direct.criterias.answer_relevance"
+    metrics = [
+        f"metrics.llm_as_judge.direct.rits.llama3_1_70b[criteria={criterion}, context_fields=[question]]"
+    ]
 
-The following code performs the desired evaluation:
+Once the metric is defined, we create a dataset for the appropriate task.
 
 .. code-block:: python
 
-    from datasets import load_dataset
-    from unitxt.inference import HFPipelineBasedInferenceEngine
-    from unitxt import evaluate
+    dataset = create_dataset(
+        task="tasks.qa.open", test_set=data, metrics=metrics, split="test"
+    )
+
+The model's responses are then evaluated by the judge model as follows:
 
-    # 1. Create the dataset
-    card = ("card=cards.mt_bench.generation.english_single_turn,"
-            "template=templates.empty,"
-            "format=formats.empty,"
-            "metrics=[metrics.llm_as_judge.rating.mistral_7b_instruct_v0_2_huggingface_template_mt_bench_single_turn]"
-            )
+.. code-block:: python
+
+    predictions = [
+        """Harry Potter is a young wizard who becomes famous for surviving an attack by the dark wizard Voldemort, and later embarks on a journey to defeat him and uncover the truth about his past.""",
+        """You can protect yourself from the wind by wearing windproof clothing, layering up, and using accessories like hats, scarves, and gloves to cover exposed skin.""",
+        """A good low-cost-of-living city in the U.S. is San Francisco, California, known for its affordable housing and budget-friendly lifestyle.""",
+    ]
 
-    dataset = load_dataset("unitxt/data",
-                           card,
-                           split='test')
-    # 2. use inference module to infer based on the dataset inputs.
-    model = HFPipelineBasedInferenceEngine(model_name="google/flan-t5-small", max_new_tokens=32, use_fp16=True)
-    predictions = model(dataset)
+    results = evaluate(predictions=predictions, data=dataset)
 
-    # 3. create a metric and evaluate the results.
-    results = evaluate(predictions=predictions, data=dataset)
+    print("Global Scores:")
+    print(results.global_scores.summary)
 
-    print(results.global_scores.summary)
+    print("Instance Scores:")
+    print(results.instance_scores.summary)
+
+Positional Bias
+--------------------------------------------
+
+Positional bias occurs when the judge model favors an option because of its placement within the list of available options rather than its intrinsic merit.
+Unitxt reports whether the judge model exhibited positional bias for each instance in the instance-level summary.
 
-Creating a new LLM as a Judge Metric
+Creating a custom criterion
 -------------------------------------
-
-To construct a new LLM as a Judge metric, several key components must be defined:
-
-1. **Judge Model**: Select a model that will assess the performance of other models.
-2. **Execution Platform**: Choose the platform responsible for executing the judge model, such as Huggingface or OpenAI API.
-3. **The Judging Task**: This define the inputs the judge model expect to receive and its output. This is coupled with the template. Two common tasks are single model rating we saw above and pairwise model comparison, in which the outputs of two models is compared, to see which better addressed the required input.
-4. **Template**: Develop a template reflecting the criteria for judgment, usually incorporating both the input and output of the evaluated model.
-5. **Format**: Specify the format in which the judge model expects to receive prompts.
-6. **System Prompt (Optional)**: Optionally, include a system prompt to provide additional context for evaluation.
-
-Let's walk through an example of creating a new LLM as a Judge metric, specifically recreating the MT-Bench judge metric single-model-rating evaluation:
-
-1. **Selecting a Judge Model**: We will utilize the *mistralai/Mistral-7B-Instruct-v0.2* model from Huggingface as our judge model.
-2. **Selecting an Execution Platform**: We will opt to execute the model locally using Huggingface.
-
-   For this example, we will use the *HFPipelineBasedInferenceEngine* class:
-
-   ..
code-block:: python - - from unitxt.inference import HFPipelineBasedInferenceEngine - from unitxt.llm_as_judge import LLMAsJudge - - model_id = "mistralai/Mistral-7B-Instruct-v0.2" - inference_model = HFPipelineBasedInferenceEngine(model_name=model_id, max_generated_tokens=256) - - - .. note:: - - If you wish to use a different platform for running your judge model, you can implement - a new `InferenceEngine` class and substitute it with the `HFPipelineBasedInferenceEngine`. - You can find the definition of the `InferenceEngine` abstract class and pre-built inference engines - (e.g., `OpenAiInferenceEngine`) in `src/unitxt/inference.py`. - - -3. **Selecting the Judging Task**: This is a standard Unitxt task that defines the api of the judge model. The task specifies the input fields expected by the judge model, such as "question" and "answer," in the example below, which are utilized in the subsequent template. Additionally, it defines the expected output field as a float type. Another significant field is "metrics," which is utilized for the (meta) evaluation of the judge, as explained in the following section. Currently supported tasks are "rating.single_turn" and "rating.single_turn_with_reference". - - .. code-block:: python - - from unitxt.blocks import Task - from unitxt.catalog import add_to_catalog - - add_to_catalog( - Task( - inputs={"question": "str", "answer": "str"}, - outputs={"rating": "float"}, - metrics=["metrics.spearman"], - ), - "tasks.response_assessment.rating.single_turn", - overwrite=True, - ) - -4. **Define the Template**: We want to construct a template that is identical to the MT-Bench judge metric. Pay attention that this metric have field that are compatible with the task we chose ("question", "answer" and "rating"). - - .. code-block:: python - - from unitxt import add_to_catalog - from unitxt.templates import InputOutputTemplate - - add_to_catalog( - InputOutputTemplate( - instruction="Please act as an impartial judge and evaluate the quality of the response provided" - " by an AI assistant to the user question displayed below. Your evaluation should consider" - " factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of" - " detail of the response. Begin your evaluation by providing a short explanation. Be as" - " objective as possible. After providing your explanation, you must rate the response" - ' on a scale of 1 to 10 by strictly following this format: "[[rating]]", for example:' - ' "Rating: [[5]]".\n\n', - input_format="[Question]\n{question}\n\n" - "[The Start of Assistant's Answer]\n{answer}\n[The End of Assistant's Answer]", - output_format="[[{rating}]]", - postprocessors=[ - r"processors.extract_mt_bench_rating_judgment", - ], - ), - "templates.response_assessment.rating.mt_bench_single_turn", - overwrite=True, - ) - - .. note:: - - Ensure the template includes a postprocessor for extracting the judgment from the judge model output and - passing it as a metric score. In our example, the template specifies for the judge the expected judgment format - ("you must rate the response on a scale of 1 to 10 by strictly following this format: "[[rating]]""), - and such, it also defines the processor for extracting the judgment. (postprocessors=[r"processors.extract_mt_bench_rating_judgment"],). - This processor simply extract the number within [[ ]] and divide it by 10 in order to scale to to [0, 1]. - - -5. **Define Format**: Define the format expected by the judge model for receiving prompts. 
For Mitral models, you can use the format already available in the Unitxt catalog under *"formats.models.mistral.instruction""*.
-
-6. **Define System Prompt**: We will not use a system prompt in this example.
-
-With these components defined, creating a new LLM as a Judge metric is straightforward:
+As described above, the user can either choose a pre-defined criterion from the catalog or define a custom one.
+A criterion must define the options the judge model can choose from, along with a description of each option.
+Below is an example where the user requires that the model report the temperature in both Celsius and Fahrenheit. The possible outcomes are described by the options, and each option is associated with a score specified in the option map.
 
 .. code-block:: python
 
-    from unitxt import add_to_catalog
-    from unitxt.inference import HFPipelineBasedInferenceEngine
-    from unitxt.llm_as_judge import LLMAsJudge
-
-    model_id = "mistralai/Mistral-7B-Instruct-v0.2"
-    format = "formats.models.mistral.instruction"
-    template = "templates.response_assessment.rating.mt_bench_single_turn"
-    task = "rating.single_turn"
-
-    inference_model = HFPipelineBasedInferenceEngine(
-        model_name=model_id, max_new_tokens=256, use_fp16=True
-    )
-    model_label = model_id.split("/")[1].replace("-", "_").replace(".", "_").lower()
-    model_label = f"{model_label}_huggingface"
-    template_label = template.split(".")[-1]
-    metric_label = f"{model_label}_template_{template_label}"
-    metric = LLMAsJudge(
-        inference_model=inference_model,
-        template=template,
-        task=task,
-        format=format,
-        main_score=metric_label,
-    )
-
-    add_to_catalog(
-        metric,
-        f"metrics.llm_as_judge.rating.{model_label}_template_{template_label}",
-        overwrite=True,
-    )
-
-
-.. note::
+    from unitxt.llm_as_judge_constants import CriteriaWithOptions
+
+    criteria = CriteriaWithOptions.from_obj(
+        {
+            "name": "Temperature in Fahrenheit and Celsius",
+            "description": "In the response, if there is a numerical temperature present, is it denominated in both Fahrenheit and Celsius?",
+            "options": [
+                {
+                    "name": "Correct",
+                    "description": "The temperature reading is provided in both Fahrenheit and Celsius.",
+                },
+                {
+                    "name": "Partially Correct",
+                    "description": "The temperature reading is provided either in Fahrenheit or Celsius, but not both.",
+                },
+                {
+                    "name": "Incorrect",
+                    "description": "There is no numerical temperature reading in the response.",
+                },
+            ],
+            "option_map": {"Correct": 1.0, "Partially Correct": 0.5, "Incorrect": 0.0},
+        }
+    )
 
-    The *LLMAsJudge* class can receive the boolean argument *strip_system_prompt_and_format_from_inputs*
-    (defaulting to *True*). When set to *True*, any system prompts or formatting in the inputs received by
-    the evaluated model will be stripped.
 
-Evaluating a LLMaJ metric (Meta-evaluation)
+End to end example
 --------------------------------------------
+
-But wait, we missed a step! We know the LLM as a judge we created worth anything?
-The answer is: You evaluate it like any other model in Unitxt.
-Remember the task we defined in the previous section?
-
-    .. code-block:: python
-
-        from unitxt.blocks import Task
-        from unitxt.catalog import add_to_catalog
-
-        add_to_catalog(
-            Task(
-                inputs={"question": "str", "answer": "str"},
-                outputs={"rating": "float"},
-                metrics=["metrics.spearman"],
-            ),
-            "tasks.response_assessment.rating.single_turn",
-            overwrite=True,
-        )
-
-This task define the (meta) evaluation of our LLMaJ model.
-We will fetch a dataset of MT-Bench inputs and models outputs, together with scores judged by GPT-4.
-We will consider these GPT4 scores as our gold labels and evaluate our LLMaJ model by comparing its score on the model outputs
-to the score of GPT4 using spearman correlation as defined in the task card.
-
-We will create a card, as we do for every other Unitxt scenario:
+Unitxt can also obtain a model's responses for a given dataset and then run LLM-as-a-Judge evaluations on those responses.
+Here, we obtain the responses of Llama 3.2 1B Instruct and then evaluate them for answer relevance, coherence, and conciseness using the llama3_1_70b judge model.
 
 .. code-block:: python
 
-    from unitxt.blocks import (
-        TaskCard,
-    )
-    from unitxt.catalog import add_to_catalog
-    from unitxt.loaders import LoadHF
-    from unitxt.operators import (
-        Copy,
-        FilterByCondition,
-        Rename,
-    )
-    from unitxt.processors import LiteralEval
-    from unitxt.splitters import RenameSplits
-    from unitxt.test_utils.card import test_card
-
-    card = TaskCard(
-        loader=LoadHF(path="OfirArviv/mt_bench_single_score_gpt4_judgement", split="train"),
-        preprocess_steps=[
-            RenameSplits({"train": "test"}),
-            FilterByCondition(values={"turn": 1}, condition="eq"),
-            FilterByCondition(values={"reference": "[]"}, condition="eq"),
-            Rename(
-                field_to_field={
-                    "model_input": "question",
-                    "score": "rating",
-                    "category": "group",
-                    "model_output": "answer",
-                }
-            ),
-            LiteralEval(field="question"),
-            Copy(field="question/0", to_field="question"),
-            LiteralEval(field="answer"),
-            Copy(field="answer/0", to_field="answer"),
-        ],
-        task="tasks.response_assessment.rating.single_turn",
-        templates=["templates.response_assessment.rating.mt_bench_single_turn"],
+    criterias = ["answer_relevance", "coherence", "conciseness"]
+    metrics = [
+        "metrics.llm_as_judge.direct.rits.llama3_1_70b"
+        "[context_fields=[context,question],"
+        f"criteria=metrics.llm_as_judge.direct.criterias.{criteria},"
+        f"score_prefix={criteria}_]"
+        for criteria in criterias
+    ]
+    dataset = load_dataset(
+        card="cards.squad",
+        metrics=metrics,
+        loader_limit=10,
+        max_test_instances=10,
+        split="test",
     )
-    test_card(card, demos_taken_from="test", strict=False)
-    add_to_catalog(
-        card,
-        "cards.mt_bench.response_assessment.rating.single_turn_gpt4_judgement",
-        overwrite=True,
-    )
-
-This is a card for the first turn inputs of the MT-Bench benchmarks (without reference),
-together with the outputs of multiple models to those inputs and the scores of GPT-4
-to those outputs.
-
-Now all we need to do is to load the card, with the template and format the judge model is expected to use,
-and run it.
+We use CrossProviderInferenceEngine to run the evaluated model and generate its responses.
 
 .. code-block:: python
 
+    inference_model = CrossProviderInferenceEngine(
+        model="llama-3-2-1b-instruct", provider="watsonx"
+    )
 
-    from datasets import load_dataset
-    from unitxt.inference import HFPipelineBasedInferenceEngine
-    from unitxt import evaluate
-
-    # 1. Create the dataset
-    card = ("card=cards.mt_bench.response_assessment.rating.single_turn_gpt4_judgement,"
-            "template=templates.response_assessment.rating.mt_bench_single_turn,"
-            "format=formats.models.mistral.instruction")
-
-    dataset = load_dataset("unitxt/data",
-                           card,
-                           split='test')
-    # 2. use inference module to infer based on the dataset inputs.
-    model = HFPipelineBasedInferenceEngine(model_name="mistralai/Mistral-7B-Instruct-v0.2",
-                                           max_new_tokens=256,
-                                           use_fp16=True)
-    predictions = model(dataset)
-    # 3. create a metric and evaluate the results.
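+
+    # Note: "watsonx" is the provider used in this guide. CrossProviderInferenceEngine
+    # is designed so the same code can target a different provider by changing the
+    # `provider` argument; the exact set of supported provider names depends on your
+    # Unitxt version and the credentials you have configured, so check your setup.
+    # infer() returns one prediction (a string) per instance of the dataset, which we
+    # pass to evaluate() below.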
- results = evaluate(predictions=predictions, data=dataset) - - print(results.global_scores.summary) - -The output of this code is: + predictions = inference_model.infer(dataset) -.. code-block:: text + gold_answers = [d[0] for d in dataset["references"]] - ('spearmanr', 0.18328402960291354) - ('score', 0.18328402960291354) - ('score_name', 'spearmanr') - ('score_ci_low', 0.14680574316651868) - ('score_ci_high', 0.23030798909064645) - ('spearmanr_ci_low', 0.14680574316651868) - ('spearmanr_ci_high', 0.23030798909064645) + # Evaluate the predictions using the defined metric. + evaluated_predictions = evaluate(predictions=predictions, data=dataset) + evaluated_gold_answers = evaluate(predictions=gold_answers, data=dataset) -We can see the Spearman correlation is *0.18*, which is considered low. -This means *"mistralai/Mistral-7B-Instruct-v0.2"* is not a good model to act as an LLM as a Judge, -at least when using the MT-Bench template. + print_dict( + evaluated_predictions[0], + keys_to_print=[ + "source", + "score", + ], + ) + print_dict( + evaluated_gold_answers[0], + keys_to_print=[ + "source", + "score", + ], + ) -In order to understand precisely why it is so, examination of the outputs of the model is needed. -In this case, it seems Mistral is having difficulties outputting the scores in the double square brackets format. -An example for the model output is: + for criteria in criterias: + logger.info(f"Scores for criteria '{criteria}'") + gold_answer_scores = [ + instance["score"]["instance"][f"{criteria}_llm_as_a_judge_score"] + for instance in evaluated_gold_answers + ] + gold_answer_position_bias = [ + int(instance["score"]["instance"][f"{criteria}_positional_bias"]) + for instance in evaluated_gold_answers + ] + prediction_scores = [ + instance["score"]["instance"][f"{criteria}_llm_as_a_judge_score"] + for instance in evaluated_predictions + ] + prediction_position_bias = [ + int(instance["score"]["instance"][f"{criteria}_positional_bias"]) + for instance in evaluated_predictions + ] + + logger.info( + f"Scores of gold answers: {statistics.mean(gold_answer_scores)} +/- {statistics.stdev(gold_answer_scores)}" + ) + logger.info( + f"Scores of predicted answers: {statistics.mean(prediction_scores)} +/- {statistics.stdev(prediction_scores)}" + ) + logger.info( + f"Positional bias occurrence on gold answers: {statistics.mean(gold_answer_position_bias)}" + ) + logger.info( + f"Positional bias occurrence on predicted answers: {statistics.mean(prediction_position_bias)}\n" + ) .. code-block:: text - - Rating: 9 - - The assistant's response is engaging and provides a good balance between cultural experiences and must-see attractions in Hawaii. The description of the Polynesian Cultural Center and the Na Pali Coast are vivid and evoke a sense of wonder and excitement. The inclusion of traditional Hawaiian dishes adds depth and authenticity to the post. The response is also well-structured and easy to follow. However, the response could benefit from a few more specific details or anecdotes to make it even more engaging and memorable. 
+ Output with 100 examples + + Scores for criteria 'answer_relevance' + Scores of gold answers: 0.9625 +/- 0.14811526360619054 + Scores of predicted answers: 0.5125 +/- 0.4638102516061385 + Positional bias occurrence on gold answers: 0.03 + Positional bias occurrence on predicted answers: 0.12 + + Scores for criteria 'coherence' + Scores of gold answers: 0.159 +/- 0.15689216524464028 + Scores of predicted answers: 0.066 +/- 0.11121005695384194 + Positional bias occurrence on gold answers: 0.16 + Positional bias occurrence on predicted answers: 0.07 + + Scores for criteria 'conciseness' + Scores of gold answers: 1.0 +/- 0.0 + Scores of predicted answers: 0.34 +/- 0.47609522856952335 + Positional bias occurrence on gold answers: 0.03 + Positional bias occurrence on predicted answers: 0.01 diff --git a/examples/evaluate_llm_as_judge_direct_predefined_criteria.py b/examples/evaluate_llm_as_judge_direct_predefined_criteria.py index a5fe506761..76b530aa2f 100644 --- a/examples/evaluate_llm_as_judge_direct_predefined_criteria.py +++ b/examples/evaluate_llm_as_judge_direct_predefined_criteria.py @@ -4,14 +4,14 @@ logger = get_logger() data = [ - {"question": "How is the weather?"}, - {"question": "How is the weather?"}, - {"question": "How is the weather?"}, + {"question": "Who is Harry Potter?"}, + {"question": "How can I protect myself from the wind while walking outside?"}, + {"question": "What is a good low cost of living city in the US?"}, ] -criteria = "metrics.llm_as_judge.direct.criterias.temperature_in_celsius_and_fahrenheit" +criterion = "metrics.llm_as_judge.direct.criterias.answer_relevance" metrics = [ - f"metrics.llm_as_judge.direct.rits.llama3_1_70b[criteria={criteria}, context_fields=[question]]" + f"metrics.llm_as_judge.direct.rits.llama3_1_70b[criteria={criterion}, context_fields=[question]]" ] dataset = create_dataset( @@ -19,15 +19,18 @@ ) predictions = [ - """On most days, the weather is warm and humid, with temperatures often soaring into the high 80s and low 90s Fahrenheit (around 31-34°C). The dense foliage of the jungle acts as a natural air conditioner, keeping the temperature relatively stable and comfortable for the inhabitants.""", - """On most days, the weather is warm and humid, with temperatures often soaring into the high 80s and low 90s Fahrenheit. The dense foliage of the jungle acts as a natural air conditioner, keeping the temperature relatively stable and comfortable for the inhabitants.""", - """On most days, the weather is warm and humid. The dense foliage of the jungle acts as a natural air conditioner, keeping the temperature relatively stable and comfortable for the inhabitants.""", + """Harry Potter is a young wizard who becomes famous for surviving an attack by the dark wizard Voldemort, and later embarks on a journey to defeat him and uncover the truth about his past.""", + """You can protect yourself from the wind by wearing windproof clothing, layering up, and using accessories like hats, scarves, and gloves to cover exposed skin.""", + """A good low-cost-of-living city in the U.S. 
is San Francisco, California, known for its affordable housing and budget-friendly lifestyle.""", ] results = evaluate(predictions=predictions, data=dataset) +print(results) -print("Global Scores:") -print(results.global_scores.summary) - -print("Instance Scores:") print(results.instance_scores.summary) + +# print("Global Scores:") +# print(results.global_scores.summary) + +# print("Instance Scores:") +# print(results.instance_scores.summary) diff --git a/examples/evaluate_llm_as_judge_direct_user_criteria_no_catalog.py b/examples/evaluate_llm_as_judge_direct_user_criteria_no_catalog.py index 3f13a9e84e..8aeba51309 100644 --- a/examples/evaluate_llm_as_judge_direct_user_criteria_no_catalog.py +++ b/examples/evaluate_llm_as_judge_direct_user_criteria_no_catalog.py @@ -11,19 +11,19 @@ "description": "In the response, if there is a numerical temperature present, is it denominated in both Fahrenheit and Celsius?", "options": [ { - "name": "Yes", + "name": "Correct", "description": "The temperature reading is provided in both Fahrenheit and Celsius.", }, { - "name": "No", + "name": "Partially Correct", "description": "The temperature reading is provided either in Fahrenheit or Celsius, but not both.", }, { - "name": "Pass", + "name": "Incorrect", "description": "There is no numerical temperature reading in the response.", }, ], - "option_map": {"Yes": 1.0, "No": 0.5, "Pass": 0.0}, + "option_map": {"Correct": 1.0, "Partially Correct": 0.5, "Incorrect": 0.0}, } )