The full pipeline for creating the UHGEval hallucination dataset

IAAR-Shanghai/UHGEval-dataset

1. Collect the raw news

  • Status: Full data; Available.
  • Data location: ./sources/xinhua/raw/
  • Number: 75 txt files, 737,766 news articles in total
  • Note: These data belong to Xinhua News Agency and are used for research purposes only.

2. Preprocess the raw news

  • Status: No data included; it must be generated with the script.
  • Script: ./sources/xinhua/preprocessor.py
  • Data location: ./sources/xinhua/processed (populated by running the script)
  • Number: Retained 25,005 news articles (constituting 3.39% of the raw news).
  • Filtering settings:
    • Only includes news categories such as: '政治', '法律', '军事', '教育', '体育', '经济', '市场', '科学', '技术', '医疗', '卫生', '社会', '文化', '艺术', '娱乐', '天气', '环保', '灾害', '事故' ('Politics', 'Law', 'Military', 'Education', 'Sports', 'Economics', 'Market', 'Science', 'Technology', 'Medical', 'Health', 'Society', 'Culture', 'Art', 'Entertainment', 'Weather', 'Environmental Protection', 'Disaster', 'Accident').
    • The combined length of newsBeginning and newsRemainder is in [630, 870].
    • newsBeginning contains between 2 and 5 sentences. Note: sentence-ending symbols include "。;:?!"
    • The length of newsBeginning is in [80, 120].
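
The length and sentence-count filters above can be sketched as a single predicate. This is a minimal illustration, not the code in preprocessor.py: the function names are hypothetical, lengths are assumed to be measured in characters, and the full-width punctuation forms are added as an assumption. The category filter is left out because the source does not say how categories are matched.

```python
import re

# Sentence-ending symbols as listed in the filtering settings. The full-width
# forms (；：？！) are an added assumption, since Chinese news text usually
# uses them.
SENTENCE_ENDINGS = "。;:?!；：？！"


def count_sentences(text: str) -> int:
    """Count non-empty segments separated by sentence-ending symbols."""
    parts = re.split("[" + re.escape(SENTENCE_ENDINGS) + "]+", text)
    return sum(1 for part in parts if part.strip())


def keep_article(news_beginning: str, news_remainder: str) -> bool:
    """Hypothetical helper: apply the three length/sentence filters.

    Assumes lengths are character counts (len() on a Python str).
    """
    total_length = len(news_beginning) + len(news_remainder)
    return (
        630 <= total_length <= 870
        and 80 <= len(news_beginning) <= 120
        and 2 <= count_sentences(news_beginning) <= 5
    )
```

An article passes only if all three conditions hold, so a beginning with a single long sentence is rejected even when both length ranges are satisfied.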

3. Generate candidates

  • Status: No data included; it must be generated with the script.
  • Script: ./gen_candidates.py
  • Data location: ./candidates/
  • Number: Retained 17,503 news articles (constituting 70.00% of the preprocessed news).
  • Filtering settings:
    • keywordPrecision is in (0, 1); values in (0.2, 0.6) are generally preferred.
    • candidateHallucinatedContinuation consists of only 1 sentence.
    • The length of candidateHallucinatedContinuation is in [20, 70].
    • appearedKeywords has at least 2 keywords.
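
As a rough sketch, the candidate filters above could be combined as follows. The field names come from the list; everything else is an assumption for illustration: that a candidate is a dict, that appearedKeywords is a list, that lengths are character counts, and that "1 sentence" is judged with the same sentence-ending symbols as in step 2 (plus their full-width forms). This is not the logic of gen_candidates.py.

```python
import re

# Assumed to be the same symbols as in step 2, plus full-width forms.
SENTENCE_ENDINGS = "。;:?!；：？！"


def is_single_sentence(text: str) -> bool:
    """True if the text contains exactly one non-empty sentence segment."""
    parts = re.split("[" + re.escape(SENTENCE_ENDINGS) + "]+", text)
    return sum(1 for part in parts if part.strip()) == 1


def keep_candidate(candidate: dict) -> bool:
    """Hypothetical helper: apply the four candidate filters to one record."""
    continuation = candidate["candidateHallucinatedContinuation"]
    return (
        0 < candidate["keywordPrecision"] < 1      # open interval (0, 1)
        and is_single_sentence(continuation)
        and 20 <= len(continuation) <= 70          # assumed character count
        and len(candidate["appearedKeywords"]) >= 2
    )
```

The sketch enforces only the hard bound (0, 1) on keywordPrecision; the narrower (0.2, 0.6) range is described as a general guideline, so it is not applied as a filter here.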

4. Automatic labelling

5. Use Label Studio to enable human rechecking

Label Studio is a multi-type data labeling and annotation tool with a standardized output format.

Relevant files can be found in ./label_studio_annotations/.

5.1 Prepare Label Studio Pre-annotations
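
For orientation, Label Studio accepts pre-annotations as tasks that carry a "data" object and a "predictions" list. The sketch below builds one such task. The task structure follows Label Studio's import format, but the data keys, the control and object tag names ("hallucination", "continuation"), the label values, and the model version string are all hypothetical and would need to match the labeling configuration actually used in this repository.

```python
import json


def to_label_studio_task(news_beginning: str, continuation: str,
                         predicted_label: str, score: float) -> dict:
    """Hypothetical helper: wrap one automatically labelled candidate as a task."""
    return {
        # Keys under "data" are referenced by the labeling configuration.
        "data": {
            "newsBeginning": news_beginning,
            "continuation": continuation,
        },
        "predictions": [
            {
                "model_version": "auto-label-v0",  # hypothetical identifier
                "score": score,
                "result": [
                    {
                        "from_name": "hallucination",  # assumed control tag name
                        "to_name": "continuation",     # assumed object tag name
                        "type": "choices",
                        "value": {"choices": [predicted_label]},
                    }
                ],
            }
        ],
    }


if __name__ == "__main__":
    task = to_label_studio_task("新闻开头。", "候选续写。", "hallucinated", 0.8)
    # ensure_ascii=False keeps Chinese text readable in the exported file.
    print(json.dumps([task], ensure_ascii=False, indent=2))
```

A JSON file containing a list of such tasks can then be imported into a Label Studio project, where annotators see the predicted label and confirm or correct it.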

5.2 Set up the labeling configuration and begin human rechecking

5.3 Export Label Studio JSON annotations

6. Get the final hallucination dataset