As we are doing “human” evals of automated topic models and automated summaries, we are eliciting a whole slew of manual sanity checks. See for example polis/server/src/report_experimental/README.md (lines 31 to 84 at c1afc69):
## Checking Your Report during development as a human evaluator
### Purpose
The report generation system is designed to create accurate, evidence-based narratives from Pol.is conversations that fairly represent all participant viewpoints, maintain precision without sacrificing accessibility, support every claim with proper citations, avoid interpretation beyond what the data shows, and present patterns rather than individual statements. When reviewing a generated report, focus on avoiding common LLM pitfalls like hallucinations or misrepresentation.
### 1. Citation Integrity
- Each clause should have 1-5 supporting citations
- Citations should directly support the claims made
- Check for over-citation (using multiple citations when one would suffice)
- Verify citations aren't being used out of context
### 2. Statement Accuracy
- **Descriptive Statements**: When the report describes what "participants discussed", verify the topics exist in the source material
- **Stance Attribution**: When the report claims participants "emphasized" or "agreed", verify both:
  - The content matches what participants actually said
  - The voting patterns support the claimed stance
- **Flag Complete Fabrications**: Identify any statements about things not present in the source material
- **Check for Misrepresentation**: Look for subtle inaccuracies in how statements are characterized
### 3. Voting Pattern Verification
- When "consensus" is claimed, verify agreement across all groups
- For "broad agreement" claims, check if true group-informed consensus exists
- Verify any claimed differences between groups match actual voting patterns
- Watch for cases where disagreement is reported but voting shows agreement (or vice versa)
### 4. Group Dynamic Accuracy
- For group-specific claims (e.g., "Group A showed higher agreement"), verify:
  - The actual voting patterns within that group
  - That the comparison to other groups is accurate
- Check that group characterizations are supported by multiple data points
- Ensure minority viewpoints aren't misrepresented when discussing consensus
### 5. Narrative Flow & Truthfulness
- Does the report read like a natural story while staying true to the data?
- Are we jumping between topics in a way that makes sense, or does it feel forced?
- When we group comments into themes, are we being consistent or getting sloppy?
- If we say "participants generally felt X", can we back that up with multiple comments/votes?
- Are we drawing conclusions that actually match what people said and how they voted?
- Are we implying X caused Y without solid evidence?
### Common Red Flags
- Statements without citations
- Overly broad generalizations from limited data
- Single citations used to support multiple unrelated claims
- Unsupported claims about group differences
- Mischaracterized voting patterns
- Solutions or recommendations not present in data
Wouldn’t it be nice to automate those sanity checks? They’re also a perfect (and interpretable!) stepping stone to more advanced metrics, for example for topics in #1866.
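Many of those checks are mechanical once we have per-group vote tallies. For instance, the “consensus” checks in section 3 above could be verified directly against the votes; a minimal TypeScript sketch, where the shapes and the 60% threshold are placeholders rather than the actual Pol.is math output:

```ts
// Minimal sketch of a coded voting-pattern check for "consensus" claims.
// The shapes and the 60% threshold are assumptions, not the actual Pol.is
// math output or its definition of group-informed consensus.

interface GroupVotes {
  groupId: string;
  agrees: number;
  disagrees: number;
  passes: number;
}

// A "consensus" claim about a comment only holds if every opinion group
// agreed with it, here at 60%+ of its non-pass votes.
function isGroupInformedConsensus(
  votesByGroup: GroupVotes[],
  threshold = 0.6,
): boolean {
  return votesByGroup.every((g) => {
    const decided = g.agrees + g.disagrees;
    return decided > 0 && g.agrees / decided >= threshold;
  });
}
```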
Also, these will be extremely useful for the Test-Time Compute approaches we are discussing to improve LLM performance, involving various forms of Best-of-N or accept/reject (see an upcoming issue I’m currently drafting).
I believe a lot of those sanity checks can be done either with exhaustive hand-coded checks, or with simple enough LLM prompting, as long as we have simple clauses that lend themselves to straightforward verification. For example:
For topics, each comment’s assignment to a topic can be checked (sketched below): we leverage the fact that checking something is easier than generating it.
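A minimal sketch of what such a check could look like, where the types and the `askLlm` wrapper are hypothetical placeholders for whatever LLM client we end up using:

```ts
// Minimal sketch of an LLM-based topic-assignment check. The shapes and the
// `askLlm` wrapper are hypothetical: the caller plugs in whatever LLM client
// we end up using.

interface TopicAssignment {
  topicName: string;
  topicDescription: string;
  commentText: string;
}

async function checkTopicAssignment(
  assignment: TopicAssignment,
  askLlm: (prompt: string) => Promise<string>,
): Promise<boolean> {
  // Verification is framed as a single yes/no question, which is much easier
  // for the model than generating the topic model in the first place.
  const prompt = [
    `Topic: "${assignment.topicName}" (${assignment.topicDescription})`,
    `Comment: "${assignment.commentText}"`,
    `Does this comment belong to this topic? Answer YES or NO only.`,
  ].join("\n");

  const answer = (await askLlm(prompt)).trim().toUpperCase();
  return answer.startsWith("YES");
}
```

Run over every (comment, topic) pair, the disagreement rate is itself a simple, interpretable metric.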
See also #1878 section “How” and “Data”.
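On the coded-check side, the citation-integrity rules above (1-5 citations per clause, every citation resolving to a real comment) are straightforward to enforce mechanically; a minimal sketch, again with assumed shapes rather than the real report_experimental types:

```ts
// Minimal sketch of a purely coded citation-integrity check, assuming the
// report has already been parsed into clauses with numeric comment citations
// (these shapes are hypothetical, not the real report_experimental types).

interface Clause {
  text: string;
  citations: number[]; // ids of the comments cited for this clause
}

function checkCitationIntegrity(
  clauses: Clause[],
  knownCommentIds: Set<number>,
): string[] {
  const issues: string[] = [];
  for (const clause of clauses) {
    if (clause.citations.length === 0) {
      issues.push(`uncited clause: "${clause.text}"`);
    } else if (clause.citations.length > 5) {
      issues.push(`possible over-citation (${clause.citations.length} citations): "${clause.text}"`);
    }
    for (const id of clause.citations) {
      if (!knownCommentIds.has(id)) {
        issues.push(`citation ${id} does not match any comment: "${clause.text}"`);
      }
    }
  }
  return issues;
}
```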
Of course there is the question of human-evaluating the LLM evaluator versus trusting it blindly. For simple enough checks, it can at least give us a sense of confidence, and we can spot-check it. While some might say this gives a false sense of security, pragmatically it is still better than blindly trusting the much bigger output! (And if needed, those basic checks can themselves be reinforced with the same test-time-compute techniques we describe for the larger tasks, in an even simpler way.)
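As a sketch of that last point: reinforcing a cheap yes/no check can be as simple as sampling it a few times and taking a majority vote (a hypothetical helper, reusing the `checkTopicAssignment` sketch above):

```ts
// Sketch: reinforce a cheap yes/no LLM check by sampling it a few times and
// taking a majority vote (a tiny-scale version of the Best-of-N ideas above).
// `check` is any async boolean check, e.g. the checkTopicAssignment sketch.

async function majorityVote(
  check: () => Promise<boolean>,
  n = 5,
): Promise<boolean> {
  const votes = await Promise.all(Array.from({ length: n }, () => check()));
  return votes.filter(Boolean).length > n / 2;
}
```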