
[Topics and LLM summaries] Automate sanity checks #1879

Open
jucor opened this issue Jan 21, 2025 · 0 comments
Labels
feature-request For new feature suggestions


jucor commented Jan 21, 2025

As we are doing “human” evals of automated topic models and automated summaries, we are eliciting a whole slew of manual sanity checks. See for example:

  • @akonya's comment in [Topic Models] Evaluating topic models #1866 (comment),
  • @colinmegill's instructions for manual reviewers:
    ## Checking Your Report during development as a human evaluator
    ### Purpose
    The report generation system is designed to create accurate, evidence-based narratives from Pol.is conversations that fairly represent all participant viewpoints, maintain precision without sacrificing accessibility, support every claim with proper citations, avoid interpretation beyond what the data shows, and present patterns rather than individual statements. When reviewing a generated report, focus on avoiding common LLM pitfalls like hallucinations or misrepresentation.
    ### 1. Citation Integrity
    - Each clause should have 1-5 supporting citations
    - Citations should directly support the claims made
    - Check for over-citation (using multiple citations when one would suffice)
    - Verify citations aren't being used out of context
    ### 2. Statement Accuracy
    - **Descriptive Statements**: When the report describes what "participants discussed", verify the topics exist in the source material
    - **Stance Attribution**: When the report claims participants "emphasized" or "agreed", verify both:
      - The content matches what participants actually said
      - The voting patterns support the claimed stance
    - **Flag Complete Fabrications**: Identify any statements about things not present in the source material
    - **Check for Misrepresentation**: Look for subtle inaccuracies in how statements are characterized
    ### 3. Voting Pattern Verification
    - When "consensus" is claimed, verify agreement across all groups
    - For "broad agreement" claims, check if true group-informed consensus exists
    - Verify any claimed differences between groups match actual voting patterns
    - Watch for cases where disagreement is reported but voting shows agreement (or vice versa)
    ### 4. Group Dynamic Accuracy
    - For group-specific claims (e.g., "Group A showed higher agreement"), verify:
      - The actual voting patterns within that group
      - The comparison to other groups is accurate
    - Check that group characterizations are supported by multiple data points
    - Ensure minority viewpoints aren't misrepresented when discussing consensus
    ### 5. Narrative Flow & Truthfulness
    - Does the report read like a natural story while staying true to the data?
    - Are we jumping between topics in a way that makes sense, or does it feel forced?
    - When we group comments into themes, are we being consistent or getting sloppy?
    - If we say "participants generally felt X", can we back that up with multiple comments/votes?
    - Are we drawing conclusions that actually match what people said and how they voted?
    - Are we implying X caused Y without solid evidence?
    ### Common Red Flags
    - Statements without citations
    - Overly broad generalizations from limited data
    - Single citations used to support multiple unrelated claims
    - Unsupported claims about group differences
    - Mischaracterized voting patterns
    - Solutions or recommendations not present in data
  • the brainstorm we did back in early December 2024 and just publicly documented in [LLM] Basic evaluations of LLM outputs #1878.

Wouldn’t it be nice to automate those sanity checks? They’re also a perfect (and interpretable!) stepping stone to more advanced metrics, for example for topics in #1866.
These checks will also be extremely useful for the test-time-compute techniques involving various forms of Best-of-N or accept/reject sampling that we are discussing to improve LLM performance (see an upcoming issue I’m currently drafting).

I believe a lot of these sanity checks can be done either with exhaustive, manually coded checks, or with simple enough LLM prompting, as long as we have simple clauses that lend themselves to straightforward verification. For example:
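To make this concrete, here is a minimal sketch of two hard-coded checks (citation integrity, and consensus claims), assuming hypothetical `Clause` and `Conversation` structures and an arbitrary placeholder agreement threshold; the actual report and conversation representations will differ, so this only illustrates the shape such checks could take.

```python
# Minimal sketch, not Polis's actual data model: `Clause` and `Conversation`
# are hypothetical stand-ins for however report clauses, comments, and
# per-group voting statistics are actually represented.
from dataclasses import dataclass, field


@dataclass
class Clause:
    text: str
    citations: list[int]            # comment IDs cited in support of the clause
    claims_consensus: bool = False  # does the clause assert cross-group consensus?


@dataclass
class Conversation:
    comments: dict[int, str]        # comment ID -> comment text
    # group name -> (comment ID -> agreement rate in [0, 1])
    group_agree_rates: dict[str, dict[int, float]] = field(default_factory=dict)


def check_citation_integrity(clause: Clause, convo: Conversation) -> list[str]:
    """Check 1 above: 1-5 citations per clause, and every citation must resolve."""
    problems = []
    if not 1 <= len(clause.citations) <= 5:
        problems.append(f"expected 1-5 citations, found {len(clause.citations)}")
    for cid in clause.citations:
        if cid not in convo.comments:
            problems.append(f"cited comment {cid} does not exist in the conversation")
    return problems


def check_consensus_claim(clause: Clause, convo: Conversation,
                          threshold: float = 0.6) -> list[str]:
    """Check 3 above: a 'consensus' claim should hold across *all* groups.

    The 0.6 threshold is an arbitrary placeholder, not a Polis convention.
    """
    problems = []
    if not clause.claims_consensus:
        return problems
    for cid in clause.citations:
        for group, rates in convo.group_agree_rates.items():
            rate = rates.get(cid, 0.0)
            if rate < threshold:
                problems.append(
                    f"comment {cid}: group {group} agreement {rate:.2f} "
                    f"is below the consensus threshold {threshold}"
                )
    return problems
```

Each check returns a list of human-readable problems, so failures can be surfaced to a reviewer directly or fed into the accept/reject loop mentioned above.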

See also #1878, sections “How” and “Data”.
Of course, there is the question of human-evaluating the LLM evaluators, versus trusting them blindly. For simple enough checks, they can at least give us a sense of confidence, and we can spot-check them. While some might say this gives a false sense of security, pragmatically it is still better than blindly trusting the much bigger output! (And if needed, those basic checks can themselves be reinforced with the same test-time-compute techniques we describe for the larger tasks, in an even simpler way.)
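For the LLM-prompting side, a similarly hypothetical sketch: the judge only answers YES/NO for one clause/citation pair at a time, which is what keeps it cheap to spot-check. The `llm_call` parameter is a stand-in for whatever completion function we end up using; nothing below assumes a specific provider, model, or prompt format.

```python
# Hypothetical sketch of a prompt-based check for a single clause/citation pair.
# `llm_call` is a placeholder for whatever completion function is available.
from typing import Callable


def build_citation_check_prompt(clause_text: str, comment_text: str) -> str:
    return (
        "You are verifying one clause of a report against the comment it cites.\n"
        f"Clause: {clause_text}\n"
        f"Cited comment: {comment_text}\n"
        "Answer exactly YES if the comment directly supports the clause, "
        "otherwise answer exactly NO."
    )


def llm_citation_supported(clause_text: str, comment_text: str,
                           llm_call: Callable[[str], str]) -> bool:
    """Return True iff the (spot-checkable) LLM judge answers YES."""
    answer = llm_call(build_citation_check_prompt(clause_text, comment_text))
    return answer.strip().upper().startswith("YES")
```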
