
[LLM Summarization] Test-time compute to reduce hallucinations in summarization #1881

Open
jucor opened this issue Jan 22, 2025 · 0 comments
Labels
feature-request For new feature suggestions

jucor commented Jan 22, 2025

In #1880 I discussed applying test-time compute to topic categorization to reduce hallucinations, in particular using multi-sample generation and/or semantic entropy (I’m drafting another issue about other approaches). Obviously, we would want to get similar benefits for summarization too!

At first glance, summarization seems quite different from topic categorization: much longer answers, with multiple clauses, as opposed to a single precise answer from a pre-defined set of topics. The good news is that we can apply many of the same ideas by decomposing the summary into multiple simple clauses, as (Farquhar et al. 2024) explain:

Naively, one might simply regenerate each sentence (conditioned on the text so far) and then compute semantic entropy over these regenerations. However, the resampled sentences often target different aspects of the biography: for example, one time describing family and the next time profession. This is analogous to the original problem semantic entropy was designed to resolve: the model is uncertain about the right ordering of facts, not about the facts themselves. To address this, we break down the entire paragraph into factual claims and reconstruct questions which might have been answered by those claims. Only then do we apply semantic entropy.

In more visual form, (Farquhar et al. 2024) proceed as follows:

  1. First, they decompose the full paragraph into a set of factoids.
  2. Then, for each factoid:
    1. They use an LLM to generate 3 questions about that factoid.
      1. Then, for each question (and passing the paragraph as context), they use the LLM to generate multiple answers.
      2. Then they use semantic entropy to check those answers for hallucinations.

or, even more visually, see Figure 1 of their article (image not reproduced here).
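To make that loop concrete, here is a minimal TypeScript sketch of the per-factoid check. Everything LLM-facing (`generateQuestions`, `sampleAnswers`, `semanticallyEquivalent`) is a hypothetical helper supplied by the caller, the sample count of 10 and the max-aggregation over questions are my own placeholders, and the entropy is the simple discrete cluster-frequency variant; none of this is claimed to be the paper's exact implementation.

```typescript
// Sketch only: all helpers are assumed LLM-backed functions, not existing APIs.
type Factoid = { text: string };

type Helpers = {
  generateQuestions: (factoid: Factoid, n: number) => Promise<string[]>;
  sampleAnswers: (question: string, context: string, k: number) => Promise<string[]>;
  semanticallyEquivalent: (a: string, b: string) => Promise<boolean>;
};

// Find an existing cluster whose representative is semantically equivalent to the answer.
async function findCluster(clusters: string[][], answer: string, h: Helpers) {
  for (const cluster of clusters) {
    if (await h.semanticallyEquivalent(cluster[0], answer)) return cluster;
  }
  return undefined;
}

// Cluster answers by semantic equivalence (e.g. bidirectional entailment), then
// compute the entropy of the empirical distribution over clusters.
async function semanticEntropy(answers: string[], h: Helpers): Promise<number> {
  const clusters: string[][] = [];
  for (const answer of answers) {
    const match = await findCluster(clusters, answer, h);
    if (match) match.push(answer);
    else clusters.push([answer]);
  }
  return clusters
    .map((c) => c.length / answers.length)
    .reduce((acc, p) => acc - p * Math.log(p), 0);
}

// A factoid whose questions yield high-entropy answer distributions is flagged
// as a likely hallucination.
async function checkFactoid(factoid: Factoid, paragraph: string, h: Helpers): Promise<number> {
  const questions = await h.generateQuestions(factoid, 3);
  let worst = 0;
  for (const question of questions) {
    const answers = await h.sampleAnswers(question, paragraph, 10);
    worst = Math.max(worst, await semanticEntropy(answers, h));
  }
  return worst;
}
```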

It gets even better for our case. The JSON format created by @colin and @tim for summarization explicitly separates the summary into clauses:

<jsonSchema>
{
  "id": "string e.g. 'subtask_uncertainty'",
  "title": "string e.g. 'Uncertainty Analysis'",
  "paragraphs": [
    {
      "id": "string e.g. 'uncertainty_overview'",
      "title": "string e.g. 'Differences in Uncertainty Between Groups'",
      "sentences": [
        {
          "clauses": [
            {
              "text": "string e.g. 'The uncertainty of group A is higher than group B'",
              "citations": [123]
            }
          ]
        }
      ]
    }
  ]
}
</jsonSchema>

And the corresponding TypeScript types:
<report_experimentalTypescriptTypesReference>
type Citation = {
  commentId: number;
};
type Clause = {
  text: string;
  citations: Citation[];
};
type Sentence = {
  clauses: Clause[];
};
type Paragraph = {
  id: string; // e.g. "uncertainty_overview"
  title: string;
  sentences: Sentence[];
};
type Subtask = {
  id: string;
  title: string;
  paragraphs: Paragraph[];
};
</report_experimentalTypescriptTypesReference>

This simplifies the separation into factoids: each clause (and the associated comments) is a factoid!
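As a rough illustration using the experimental types above (plus an assumed `commentsById` lookup from comment id to comment text, which does not exist in the repo as far as I know), extracting the factoids amounts to a simple traversal:

```typescript
// Sketch: each clause becomes a factoid, bundled with the comments it cites.
type FactoidWithSources = {
  clause: Clause;
  sourceComments: string[];
};

function extractFactoids(
  subtask: Subtask,
  commentsById: Map<number, string>
): FactoidWithSources[] {
  const factoids: FactoidWithSources[] = [];
  for (const paragraph of subtask.paragraphs) {
    for (const sentence of paragraph.sentences) {
      for (const clause of sentence.clauses) {
        factoids.push({
          clause,
          sourceComments: clause.citations
            .map((c) => commentsById.get(c.commentId))
            .filter((text): text is string => text !== undefined),
        });
      }
    }
  }
  return factoids;
}
```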

A few questions remain open that I have not yet fully answered:

  • I need to double-check some of the details: for example, in step 2.a.i, do we pass the original generated paragraph along with the question when generating the answers, or not? These are quite different exercises, and amount to checking the entropy of two different conditional distributions (the small prompt sketch after this list illustrates the two variants). I need to think more about this.
  • Unlike topic categorization, if I understand correctly, this algorithm only detects hallucinations; it does not generate a new correct summary. It is a procedure to check the clauses and knock out wrong ones, not to generate multiple entire summaries. Could we extend it to get better clauses?
    • Maybe by using the multiple answers to the multiple generated questions on which we check entropy, and recombining them into a new clause to replace the hallucinated one?
    • But there might then be issues with keeping coherence with the surrounding clauses. Worth investigating nevertheless!
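Here is the prompt sketch mentioned in the first bullet. Both builders are hypothetical; the function names and prompt wording are mine and are not claimed to be what (Farquhar et al. 2024) or our pipeline actually use, they only make the two conditionings explicit:

```typescript
// (a) Condition the answers on the generated paragraph: this measures whether
// the model is self-consistent about what its own summary claims.
function answerPromptWithParagraph(question: string, paragraph: string): string {
  return `Context:\n${paragraph}\n\nAnswer briefly: ${question}`;
}

// (b) Condition the answers only on the cited source comments: this measures
// whether the claim is supported by the underlying data, independently of the summary.
function answerPromptFromSources(question: string, comments: string[]): string {
  return `Comments:\n${comments.join("\n")}\n\nAnswer briefly: ${question}`;
}
```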

I think this is a promising direction to keep digging :)

Reference:

Farquhar, Sebastian, Jannik Kossen, Lorenz Kuhn, and Yarin Gal. 2024. ‘Detecting Hallucinations in Large Language Models Using Semantic Entropy’. Nature 630 (8017): 625–30. https://doi.org/10.1038/s41586-024-07421-0.
