
[LLM Summarization] Test-time compute to reduce hallucinations in summarization #1881

Open
jucor opened this issue Jan 22, 2025 · 0 comments
Labels
feature-request For new feature suggestions

jucor commented Jan 22, 2025

In #1880 I discussed applying test-time compute to topic categorization to reduce hallucinations, in particular using multi-sample generation and/or semantic entropy (I’m drafting another issue about other approaches). Obviously, we would want to get similar benefits for summarization too!

At first glance, summarization seems quite different from topic categorization: much longer answers, with multiple clauses, as opposed to a single precise answer from a pre-defined set of topics. The good news is that we can apply many of the same ideas by decomposing the summary into multiple simple clauses, as (Farquhar et al. 2024) explain:

Naively, one might simply regenerate each sentence (conditioned on the text so far) and then compute semantic entropy over these regenerations. However, the resampled sentences often target different aspects of the biography: for example, one time describing family and the next time profession. This is analogous to the original problem semantic entropy was designed to resolve: the model is uncertain about the right ordering of facts, not about the facts themselves. To address this, we break down the entire paragraph into factual claims and reconstruct questions which might have been answered by those claims. Only then do we apply semantic entropy.

In more visual form, (Farquhar et al. 2024) proceed as follows:

  1. First, they decompose the full paragraph into a set of factoids.
  2. Then, for each factoid:
    1. They use an LLM to generate 3 questions about that factoid.
      1. Then, for each question (and passing the paragraph as context), they use the LLM to generate multiple answers.
      2. Then they use semantic entropy to check those answers for hallucinations.

or, even more visually, see Figure 1 of their article (image not reproduced here).
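To make that loop concrete, here is a minimal TypeScript sketch of the per-factoid check. Everything LLM-facing (`generateQuestions`, `sampleAnswers`, `semanticallyEquivalent`) is a hypothetical helper supplied by the caller, the sample count of 10 and the max-aggregation over questions are my own placeholders, and the entropy is the simple discrete cluster-frequency variant; none of this is claimed to be the paper's exact implementation.

```typescript
// Sketch only: all helpers are assumed LLM-backed functions, not existing APIs.
type Factoid = { text: string };

type Helpers = {
  generateQuestions: (factoid: Factoid, n: number) => Promise<string[]>;
  sampleAnswers: (question: string, context: string, k: number) => Promise<string[]>;
  semanticallyEquivalent: (a: string, b: string) => Promise<boolean>;
};

// Find an existing cluster whose representative is semantically equivalent to the answer.
async function findCluster(clusters: string[][], answer: string, h: Helpers) {
  for (const cluster of clusters) {
    if (await h.semanticallyEquivalent(cluster[0], answer)) return cluster;
  }
  return undefined;
}

// Cluster answers by semantic equivalence (e.g. bidirectional entailment), then
// compute the entropy of the empirical distribution over clusters.
async function semanticEntropy(answers: string[], h: Helpers): Promise<number> {
  const clusters: string[][] = [];
  for (const answer of answers) {
    const match = await findCluster(clusters, answer, h);
    if (match) match.push(answer);
    else clusters.push([answer]);
  }
  return clusters
    .map((c) => c.length / answers.length)
    .reduce((acc, p) => acc - p * Math.log(p), 0);
}

// A factoid whose questions yield high-entropy answer distributions is flagged
// as a likely hallucination.
async function checkFactoid(factoid: Factoid, paragraph: string, h: Helpers): Promise<number> {
  const questions = await h.generateQuestions(factoid, 3);
  let worst = 0;
  for (const question of questions) {
    const answers = await h.sampleAnswers(question, paragraph, 10);
    worst = Math.max(worst, await semanticEntropy(answers, h));
  }
  return worst;
}
```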

It gets even better for our case. The JSON format created by @colin and @tim for summarization explicitly separates the summary into clauses:

<jsonSchema>
{
  "id": "string e.g. 'subtask_uncertainty'",
  "title": "string e.g. 'Uncertainty Analysis'",
  "paragraphs": [
    {
      "id": "string e.g. 'uncertainty_overview'",
      "title": "string e.g. 'Differences in Uncertainty Between Groups'",
      "sentences": [
        {
          "clauses": [
            {
              "text": "string e.g. 'The uncertainty of group A is higher than group B'",
              "citations": [123]
            }
          ]
        }
      ]
    }
  ]
}
</jsonSchema>

And the corresponding TypeScript types:
<report_experimentalTypescriptTypesReference>
type Citation = {
  commentId: number;
};
type Clause = {
  text: string;
  citations: Citation[];
};
type Sentence = {
  clauses: Clause[];
};
type Paragraph = {
  id: string; // e.g. "uncertainty_overview"
  title: string;
  sentences: Sentence[];
};
type Subtask = {
  id: string;
  title: string;
  paragraphs: Paragraph[];
};
</report_experimentalTypescriptTypesReference>

This simplifies the separation into factoids: each clause (and the associated comments) is a factoid!
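As a rough illustration using the experimental types above (plus an assumed `commentsById` lookup from comment id to comment text, which does not exist in the repo as far as I know), extracting the factoids amounts to a simple traversal:

```typescript
// Sketch: each clause becomes a factoid, bundled with the comments it cites.
type FactoidWithSources = {
  clause: Clause;
  sourceComments: string[];
};

function extractFactoids(
  subtask: Subtask,
  commentsById: Map<number, string>
): FactoidWithSources[] {
  const factoids: FactoidWithSources[] = [];
  for (const paragraph of subtask.paragraphs) {
    for (const sentence of paragraph.sentences) {
      for (const clause of sentence.clauses) {
        factoids.push({
          clause,
          sourceComments: clause.citations
            .map((c) => commentsById.get(c.commentId))
            .filter((text): text is string => text !== undefined),
        });
      }
    }
  }
  return factoids;
}
```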

A few questions remain open that I have not yet fully answered:

  • I need to double-check some of the details: for example, in step 2.a.i, do we pass the original generated paragraph along with the question when generating the answers, or not? These are quite different exercises, and amount to checking the entropy of two different conditional distributions (the small prompt sketch after this list illustrates the two variants). I need to think more about this.
  • Unlike topic categorization, if I understand correctly, this algorithm only detects hallucinations; it does not generate a new correct summary. It is a procedure to check the clauses and knock out wrong ones, not to generate multiple entire summaries. Could we extend it to get better clauses?
    • Maybe by using the multiple answers to the multiple generated questions on which we check entropy, and recombining them into a new clause to replace the hallucinated one?
    • But there might then be issues with keeping coherence with the surrounding clauses. Worth investigating nevertheless!
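Here is the prompt sketch mentioned in the first bullet. Both builders are hypothetical; the function names and prompt wording are mine and are not claimed to be what (Farquhar et al. 2024) or our pipeline actually use, they only make the two conditionings explicit:

```typescript
// (a) Condition the answers on the generated paragraph: this measures whether
// the model is self-consistent about what its own summary claims.
function answerPromptWithParagraph(question: string, paragraph: string): string {
  return `Context:\n${paragraph}\n\nAnswer briefly: ${question}`;
}

// (b) Condition the answers only on the cited source comments: this measures
// whether the claim is supported by the underlying data, independently of the summary.
function answerPromptFromSources(question: string, comments: string[]): string {
  return `Comments:\n${comments.join("\n")}\n\nAnswer briefly: ${question}`;
}
```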

I think this is a promising direction to keep digging :)

Reference:

Farquhar, Sebastian, Jannik Kossen, Lorenz Kuhn, and Yarin Gal. 2024. ‘Detecting Hallucinations in Large Language Models Using Semantic Entropy’. Nature 630 (8017): 625–30. https://doi.org/10.1038/s41586-024-07421-0.
