
# Hallucination Perspective

This directory contains scripts for generating model outputs and evaluating hallucination in both the image-to-text and text-to-image modalities as part of the MMDT project.

## Directory Structure

- `generate_image_to_text.py`: Script to generate text descriptions from images.
- `generate_text_to_image.py`: Script to generate images from text prompts.
- `eval_image_to_text.py`: Script to evaluate the generated text descriptions.
- `eval_text_to_image.py`: Script to evaluate the generated images.
- `utils.py`: Utility functions used across the generation and evaluation scripts.

## Usage

### Generating Results

We provide ready-made scripts that run the full hallucination perspective for the image-to-text and text-to-image modalities:
```
bash scripts/hallucination_i2t.sh <model_id>
bash scripts/hallucination_t2i.sh <model_id>
```

To generate results for a single scenario and task, we recommend the main MMDT entry point, `mmdt/main.py`, where you specify the modality, model, scenario, and task to run:

```
python mmdt/main.py --modality image_to_text --model_id <model_id> --perspectives hallucination --scenario <scenario> --task <task>
python mmdt/main.py --modality text_to_image --model_id <model_id> --perspectives hallucination --scenario <scenario> --task <task>
```

Alternatively, you can run the following scripts to generate outputs for the hallucination perspective only, on the image-to-text and text-to-image modalities:
```
python generate_image_to_text.py --model_id <model_id> --scenario <scenario> --task <task>
python generate_text_to_image.py --model_id <model_id> --scenario <scenario> --task <task>
```

The full list of scenarios and tasks for `image_to_text` is shown below (scenario: tasks):

```
natural: attribute, count, identification, spatial, action
distraction: attribute, count, identification, spatial, action
counterfactual: attribute, count, identification, spatial, action
cooccurrence: attribute, count, identification, spatial, action
misleading: attribute, count, identification, spatial, action
ocr: contradictory, cooccur, doc, scene
```

The full list of scenarios and tasks for `text_to_image` is shown below (scenario: tasks):

```
natural: attribute, count, identification, spatial
distraction: attribute, count, identification, spatial
counterfactual: attribute, count, identification, spatial
cooccurrence: attribute, count, identification, spatial
misleading: attribute, count, identification, spatial
ocr: complex, contradictory, distortion, misleading
```
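
To cover every scenario and task pair listed above, the `mmdt/main.py` command can be generated in a loop. The following is a minimal sketch, not part of MMDT: the helper name `build_commands` and the two dictionaries are hypothetical, and the script only builds the command strings rather than running them.

```python
import shlex

# Scenario -> tasks, copied from the two lists above.
COMMON_I2T = ["attribute", "count", "identification", "spatial", "action"]
COMMON_T2I = ["attribute", "count", "identification", "spatial"]
SCENARIOS = ["natural", "distraction", "counterfactual", "cooccurrence", "misleading"]

I2T_TASKS = {s: COMMON_I2T for s in SCENARIOS}
I2T_TASKS["ocr"] = ["contradictory", "cooccur", "doc", "scene"]

T2I_TASKS = {s: COMMON_T2I for s in SCENARIOS}
T2I_TASKS["ocr"] = ["complex", "contradictory", "distortion", "misleading"]

TASKS_BY_MODALITY = {"image_to_text": I2T_TASKS, "text_to_image": T2I_TASKS}


def build_commands(modality, model_id):
    """Return one mmdt/main.py command line per (scenario, task) pair."""
    commands = []
    for scenario, tasks in TASKS_BY_MODALITY[modality].items():
        for task in tasks:
            args = [
                "python", "mmdt/main.py",
                "--modality", modality,
                "--model_id", model_id,
                "--perspectives", "hallucination",
                "--scenario", scenario,
                "--task", task,
            ]
            # shlex.join quotes any argument that needs it for the shell.
            commands.append(shlex.join(args))
    return commands


if __name__ == "__main__":
    # "my-model" is a placeholder model ID.
    for cmd in build_commands("image_to_text", "my-model"):
        print(cmd)
```

The printed lines can be pasted into a shell or passed to a job scheduler, which keeps the sketch free of side effects.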

### Evaluating Results

To evaluate the outputs generated by the scripts above, run the following commands with the same `<model_id>`:

```
python eval_image_to_text.py --model_id <model_id> --scenario <scenario> --task <task>
python eval_text_to_image.py --model_id <model_id> --scenario <scenario> --task <task>
```

#### Arguments

- `--model_id`: Model ID whose results are to be evaluated (required).
- `--scenario`: Scenario type, defaults to 'natural'.
- `--task`: Type of task to be evaluated, defaults to 'identification'.
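
The arguments above map naturally onto an `argparse` parser. The following is a minimal sketch of a parser with the documented defaults. It is an illustration only: the actual evaluation scripts may define their arguments differently, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Parser mirroring the documented arguments (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Evaluate hallucination results for one scenario and task."
    )
    parser.add_argument("--model_id", required=True,
                        help="Model ID whose results are to be evaluated")
    parser.add_argument("--scenario", default="natural",
                        help="Scenario type")
    parser.add_argument("--task", default="identification",
                        help="Type of task to be evaluated")
    return parser


# Only --model_id is required; the other two fall back to their defaults.
args = build_parser().parse_args(["--model_id", "my-model"])
print(args.model_id, args.scenario, args.task)
```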


### Example of the Summarized Output

Aggregated results:
```json
{
    "adv": null,
    "fairness": null,
    "hallucination": {
        "image-to-text": {
            "InternVL2-8B": {
                "cooccurrence": 0.4076190476190476,
                "counterfactual": 0.382,
                "distraction": 0.5619999999999999,
                "misleading": 0.782,
                "natural": 0.186,
                "ocr": 0.19399999999999998
            },
            "llava-hf_llava-v1.6-mistral-7b-hf": {
                "cooccurrence": 0.39238095238095233,
                "counterfactual": 0.19066666666666668,
                "distraction": 0.45933333333333337,
                "misleading": 0.608,
                "natural": 0.11066666666666666,
                "ocr": 0.096
            }
        },
        "text-to-image": {
            "stable-diffusion-2": {
                "cooccurrence": 0.2634436649754194,
                "counterfactual": 0.14338333333333333,
                "distraction": 0.3107047619047619,
                "misleading": 0.2786666666666667,
                "natural": 0.16740000000000002,
                "ocr": 0.06333333333333332
            }
        }
    },
    "ood": null,
    "privacy": null,
    "safety": null
}
```
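
The aggregated JSON nests scores as perspective, modality, model, then scenario, with `null` for perspectives that have no results. The following is a minimal sketch of flattening that structure into rows for tabulation. The function name `flatten_scores` is hypothetical, and the JSON is parsed from an inline excerpt of the example above because the location of the output file is not specified here.

```python
import json


def flatten_scores(results):
    """Turn the nested result dict into (perspective, modality, model, scenario, score) rows."""
    rows = []
    for perspective, by_modality in results.items():
        if by_modality is None:  # null perspectives carry no scores
            continue
        for modality, by_model in by_modality.items():
            for model_id, by_scenario in by_model.items():
                for scenario, score in by_scenario.items():
                    rows.append((perspective, modality, model_id, scenario, score))
    return rows


# Trimmed excerpt of the aggregated example above.
example = json.loads("""
{
    "adv": null,
    "hallucination": {
        "image-to-text": {
            "InternVL2-8B": {"natural": 0.186, "ocr": 0.19399999999999998}
        },
        "text-to-image": {
            "stable-diffusion-2": {"natural": 0.16740000000000002}
        }
    },
    "safety": null
}
""")

rows = flatten_scores(example)
for row in rows:
    print(row)
```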

Breakdown results:
```json
{
    "adv": null,
    "fairness": null,
    "hallucination": {
        "image-to-text": {
            "InternVL2-8B": {
                "cooccurrence": {
                    "action": 0.14285714285714285,
                    "attribute": 0.42857142857142855,
                    "count": 0.2857142857142857,
                    "identification": 0.4666666666666667,
                    "spatial": 0.7142857142857143
                },
                "counterfactual": {
                    "attribute": 0.184,
                    "count": 0.608,
                    "identification": 0.576,
                    "spatial": 0.16
                },
                "distraction": {
                    "action": 0.44,
                    "attribute": 0.57,
                    "count": 0.63,
                    "identification": 0.71,
                    "spatial": 0.46
                },
                "misleading": {
                    "action": 0.5800000000000001,
                    "attribute": 0.94,
                    "count": 0.8200000000000001,
                    "identification": 0.81,
                    "spatial": 0.76
                },
                "natural": {
                    "action": 0.07,
                    "attribute": 0.07,
                    "count": 0.5,
                    "identification": 0.18,
                    "spatial": 0.11
                },
                "ocr": {
                    "contradictory": 0.256,
                    "cooccur": 0.07199999999999995,
                    "doc": 0.21599999999999997,
                    "scene": 0.23199999999999998
                }
            },
            "llava-hf_llava-v1.6-mistral-7b-hf": {
                "cooccurrence": {
                    "action": 0.2857142857142857,
                    "attribute": 0.2857142857142857,
                    "count": 0.14285714285714285,
                    "identification": 0.5333333333333333,
                    "spatial": 0.7142857142857143
                },
                "counterfactual": {
                    "attribute": 0.04,
                    "count": 0.6186666666666667,
                    "identification": 0.104,
                    "spatial": 0.0
                },
                "distraction": {
                    "action": 0.44,
                    "attribute": 0.52,
                    "count": 0.6266666666666667,
                    "identification": 0.71,
                    "spatial": 0.0
                },
                "misleading": {
                    "action": 0.43999999999999995,
                    "attribute": 0.81,
                    "count": 0.38,
                    "identification": 0.85,
                    "spatial": 0.56
                },
                "natural": {
                    "action": 0.06,
                    "attribute": 0.01,
                    "count": 0.41333333333333333,
                    "identification": 0.07,
                    "spatial": 0.0
                },
                "ocr": {
                    "contradictory": 0.128,
                    "cooccur": 0.0,
                    "doc": 0.09599999999999997,
                    "scene": 0.16000000000000003
                }
            }
        },
        "text-to-image": {
            "stable-diffusion-2": {
                "cooccurrence": {
                    "attribute": 0.48549421152160877,
                    "count": 0.0,
                    "identification": 0.3939885093822475,
                    "spatial": 0.17429193899782133
                },
                "counterfactual": {
                    "attribute": 0.048,
                    "count": 0.15466666666666665,
                    "identification": 0.36286666666666667,
                    "spatial": 0.008
                },
                "distraction": {
                    "attribute": 0.504,
                    "count": 0.19199999999999998,
                    "identification": 0.5308190476190476,
                    "spatial": 0.016
                },
                "misleading": {
                    "attribute": 0.45866666666666667,
                    "count": 0.06666666666666665,
                    "identification": 0.5173333333333333,
                    "spatial": 0.07199999999999995
                },
                "natural": {
                    "attribute": 0.216,
                    "count": 0.11466666666666667,
                    "identification": 0.3389333333333334,
                    "spatial": 0.0
                },
                "ocr": {
                    "complex": 0.07466666666666666,
                    "contradictory": 0.053333333333333344,
                    "distortion": 0.06399999999999995,
                    "misleading": 0.06133333333333335
                }
            }
        }
    },
    "ood": null,
    "privacy": null,
    "safety": null
}
```
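
In the example outputs above, each aggregated scenario score equals the unweighted mean of that scenario's per-task scores in the breakdown. For InternVL2-8B on `counterfactual`, (0.184 + 0.608 + 0.576 + 0.16) / 4 = 0.382. The following is a minimal sketch of that relationship. That MMDT computes the aggregate exactly this way is an assumption inferred from the numbers, and `aggregate` is a hypothetical helper.

```python
import math


def aggregate(breakdown):
    """Average the per-task scores of each scenario (unweighted mean)."""
    return {
        scenario: sum(task_scores.values()) / len(task_scores)
        for scenario, task_scores in breakdown.items()
    }


# Per-task scores for InternVL2-8B (image-to-text), copied from the breakdown above.
internvl_breakdown = {
    "counterfactual": {
        "attribute": 0.184,
        "count": 0.608,
        "identification": 0.576,
        "spatial": 0.16,
    },
    "natural": {
        "action": 0.07,
        "attribute": 0.07,
        "count": 0.5,
        "identification": 0.18,
        "spatial": 0.11,
    },
}

scores = aggregate(internvl_breakdown)
# Compare with the aggregated example: counterfactual 0.382, natural 0.186.
assert math.isclose(scores["counterfactual"], 0.382)
assert math.isclose(scores["natural"], 0.186)
```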