Eval assist documentation #1537

Merged 2 commits on Jan 21, 2025
105 changes: 54 additions & 51 deletions docs/docs/llm_as_judge.rst
@@ -48,12 +48,12 @@ An LLM as a Judge metric consists of several essential components:
Many of these model and platform combinations are already predefined in our catalog. The metric names are prefixed by metrics.llm_as_judge.direct, followed by the platform and the model name.
For instance, metrics.llm_as_judge.direct.rits.llama3_1_70b refers to the Llama 3.1 70B model served through the RITS deployment service.

3. The criterion or criteria used to evaluate the model's response. There are predefined criteria in the catalog, and the user can also define a custom criterion.
Each criterion specifies fine-grained options that help steer the judge model to evaluate the response more precisely.
For instance, the criterion "metrics.llm_as_judge.direct.criterias.answer_relevance" quantifies how relevant the model's response is to the user's question.
It has four options the model can choose from: excellent, acceptable, could be improved, and bad. Each option also has a description and an associated score.
The judge model uses these descriptions to identify which option the given response is closest to and returns it.
The user can also specify their own custom criterion. An example of this is included under the section **Creating a custom criterion**.
The user can also specify more than one criterion; this is illustrated in the **End to end example** section.
4. The context fields are the additional fields, beyond the evaluated response, that are passed to the LLM judge. These could include the reference answer, the question, or the context provided to the model.
In the example below, the question that was input to the model is passed as a context field.
@@ -72,88 +72,88 @@ To accomplish this evaluation, we require the following:

1. The questions that were input to the model
2. The judge model and its deployment platform
3. The predefined criterion, which in this case is metrics.llm_as_judge.direct.criterias.answer_relevance.

We pass the criterion to the judge metric as the criteria argument and the question as a context field.

.. code-block:: python

    # Imports used throughout this section
    from unitxt import create_dataset, evaluate

    data = [
        {"question": "Who is Harry Potter?"},
        {"question": "How can I protect myself from the wind while walking outside?"},
        {"question": "What is a good low cost of living city in the US?"},
    ]

    criteria = "metrics.llm_as_judge.direct.criterias.answer_relevance"
    metrics = [
        f"metrics.llm_as_judge.direct.rits.llama3_1_70b[criteria={criteria}, context_fields=[question]]"
    ]

Once the metric is defined, a dataset is created for the appropriate task.

.. code-block:: python

    dataset = create_dataset(
        task="tasks.qa.open", test_set=data, metrics=metrics, split="test"
    )

The model's responses are then evaluated by the judge model as follows:

.. code-block:: python

    predictions = [
        """Harry Potter is a young wizard who becomes famous for surviving an attack by the dark wizard Voldemort, and later embarks on a journey to defeat him and uncover the truth about his past.""",
        """You can protect yourself from the wind by wearing windproof clothing, layering up, and using accessories like hats, scarves, and gloves to cover exposed skin.""",
        """A good low-cost-of-living city in the U.S. is San Francisco, California, known for its affordable housing and budget-friendly lifestyle.""",
    ]

    results = evaluate(predictions=predictions, data=dataset)

    print("Global Scores:")
    print(results.global_scores.summary)

    print("Instance Scores:")
    print(results.instance_scores.summary)
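
Beyond the printed summaries, the scores can also be read programmatically. The snippet below is a sketch only: it assumes global_scores supports dictionary-style access and that the aggregate judge score is stored under the key "score"; inspect the printed summary to confirm the exact key names in your setup.

.. code-block:: python

    # Sketch only: dictionary-style access to the aggregated judge score.
    # The key name "score" is an assumption; results.global_scores.summary
    # shows the exact keys reported for your metric.
    overall = results.global_scores["score"]
    print(f"Aggregate answer_relevance score: {overall}")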


Positional Bias
--------------------------------------------
Positional bias occurs when the judge model favors an option because of its placement within the list of available options rather than its intrinsic merit.
Unitxt reports whether the judge model exhibited positional bias in the instance-level summary.
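
The sketch below is one possible way to surface this information programmatically, reusing the results object from the example above. It assumes that instance_scores behaves like a list of per-instance score dictionaries and that positional-bias indicators appear under keys containing "positional_bias"; both are assumptions to verify against the printed instance-level summary.

.. code-block:: python

    # Sketch only: scan each instance's score dictionary for positional-bias fields.
    # The key naming convention is an assumption; check the printed summary for
    # the exact fields your judge metric reports.
    for i, instance in enumerate(results.instance_scores):
        bias_fields = {k: v for k, v in instance.items() if "positional_bias" in k}
        print(f"instance {i}: {bias_fields}")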

Creating a custom criterion
-------------------------------------
As described above, the user can either choose a predefined criterion from the catalog or define their own. Below is an example of how a user can define a custom criterion.
The criterion must have options, each with a description, for the judge model to choose from.
In this example, the user mandates that the model respond with the temperature in both Celsius and Fahrenheit. The various possibilities are described in the options, and each option is associated with a score specified in the option map.

.. code-block:: python

    from unitxt.llm_as_judge_constants import CriteriaWithOptions

    criteria = CriteriaWithOptions.from_obj(
        {
            "name": "Temperature in Fahrenheit and Celsius",
            "description": "In the response, if there is a numerical temperature present, is it denominated in both Fahrenheit and Celsius?",
            "options": [
                {
                    "name": "Correct",
                    "description": "The temperature reading is provided in both Fahrenheit and Celsius.",
                },
                {
                    "name": "Partially Correct",
                    "description": "The temperature reading is provided either in Fahrenheit or Celsius, but not both.",
                },
                {
                    "name": "Incorrect",
                    "description": "There is no numerical temperature reading in the response.",
                },
            ],
            "option_map": {"Correct": 1.0, "Partially Correct": 0.5, "Incorrect": 0.0},
        }
    )
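
One way to put this custom criterion to work is to construct the judge metric programmatically instead of referencing a catalog criterion by name. The sketch below is an illustration under stated assumptions: the LLMJudgeDirect class name, its import path, and its constructor arguments are inferred from the string-form metrics used earlier, and the judge model identifier is a placeholder; verify all of them against your installed unitxt version.

.. code-block:: python

    # Sketch only (assumed API): attach the CriteriaWithOptions object defined above
    # to a judge metric and use it when creating a dataset, mirroring the earlier example.
    from unitxt import create_dataset
    from unitxt.inference import CrossProviderInferenceEngine
    from unitxt.llm_as_judge import LLMJudgeDirect  # assumed import path

    judge_metric = LLMJudgeDirect(
        inference_engine=CrossProviderInferenceEngine(
            model="llama-3-1-70b-instruct", provider="watsonx"  # assumed judge model id
        ),
        criteria=criteria,            # the custom criterion defined above
        context_fields=["question"],  # same context field as in the earlier examples
    )

    dataset = create_dataset(
        task="tasks.qa.open", test_set=data, metrics=[judge_metric], split="test"
    )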


End to end example
@@ -162,6 +162,7 @@ Unitxt can also obtain a model's responses for a given dataset and then run LLM-as-a-judge evaluation on them.
Here, we will get Llama 3.2 1B Instruct's responses and then evaluate them for answer relevance, coherence, and conciseness using the llama3_1_70b judge model.

.. code-block:: python

criterias = ["answer_relevance", "coherence", "conciseness"]
metrics = [
"metrics.llm_as_judge.direct.rits.llama3_1_70b"
@@ -181,6 +182,7 @@
We use CrossProviderInferenceEngine for inference.

.. code-block:: python

inference_model = CrossProviderInferenceEngine(
model="llama-3-2-1b-instruct", provider="watsonx"
)
@@ -241,6 +243,7 @@
)

.. code-block:: text

Output with 100 examples

Scores for criteria 'answer_relevance'