As we are doing “human” evals of automated topic models and automated summaries, we are eliciting a whole slew of manual sanity checks. See for example polis/server/src/report_experimental/README.md (lines 31 to 84 at c1afc69):
## Checking Your Report during development as a human evaluator
### Purpose
The report generation system is designed to create accurate, evidence-based narratives from Pol.is conversations that fairly represent all participant viewpoints, maintain precision without sacrificing accessibility, support every claim with proper citations, avoid interpretation beyond what the data shows, and present patterns rather than individual statements. When reviewing a generated report, focus on avoiding common LLM pitfalls like hallucinations or misrepresentation.
### 1. Citation Integrity
- Each clause should have 1-5 supporting citations
- Citations should directly support the claims made
- Check for over-citation (using multiple citations when one would suffice)
- Verify citations aren't being used out of context
### 2. Statement Accuracy
- **Descriptive Statements**: When the report describes what "participants discussed", verify the topics exist in the source material
- **Stance Attribution**: When the report claims participants "emphasized" or "agreed", verify both:
  - The content matches what participants actually said
  - The voting patterns support the claimed stance
- **Flag Complete Fabrications**: Identify any statements about things not present in the source material
- **Check for Misrepresentation**: Look for subtle inaccuracies in how statements are characterized
### 3. Voting Pattern Verification
- When "consensus" is claimed, verify agreement across all groups
- For "broad agreement" claims, check if true group-informed consensus exists
- Verify any claimed differences between groups match actual voting patterns
- Watch for cases where disagreement is reported but voting shows agreement (or vice versa)
### 4. Group Dynamic Accuracy
- For group-specific claims (e.g., "Group A showed higher agreement"), verify:
  - The actual voting patterns within that group
  - That the comparison to other groups is accurate
- Check that group characterizations are supported by multiple data points
- Ensure minority viewpoints aren't misrepresented when discussing consensus
### 5. Narrative Flow & Truthfulness
- Does the report read like a natural story while staying true to the data?
- Are we jumping between topics in a way that makes sense, or does it feel forced?
- When we group comments into themes, are we being consistent or getting sloppy?
- If we say "participants generally felt X", can we back that up with multiple comments/votes?
- Are we drawing conclusions that actually match what people said and how they voted?
- Are we implying X caused Y without solid evidence?
### Common Red Flags
- Statements without citations
- Overly broad generalizations from limited data
- Single citations used to support multiple unrelated claims
- Unsupported claims about group differences
- Mischaracterized voting patterns
- Solutions or recommendations not present in data
Wouldn’t it be nice to automate those sanity checks? They’re also a perfect (and interpretable!) stepping stone to more advanced metrics, for example for topics in #1866.
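Many of those checks are mechanical once we have per-group vote tallies. For instance, the “consensus” checks in section 3 above could be verified directly against the votes; a minimal TypeScript sketch, where the shapes and the 60% threshold are placeholders rather than the actual Pol.is math output:

```ts
// Minimal sketch of a coded voting-pattern check for "consensus" claims.
// The shapes and the 60% threshold are assumptions, not the actual Pol.is
// math output or its definition of group-informed consensus.

interface GroupVotes {
  groupId: string;
  agrees: number;
  disagrees: number;
  passes: number;
}

// A "consensus" claim about a comment only holds if every opinion group
// agreed with it, here at 60%+ of its non-pass votes.
function isGroupInformedConsensus(
  votesByGroup: GroupVotes[],
  threshold = 0.6,
): boolean {
  return votesByGroup.every((g) => {
    const decided = g.agrees + g.disagrees;
    return decided > 0 && g.agrees / decided >= threshold;
  });
}
```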
Also, these will be extremely useful for the Test-Time Compute approaches we are discussing to improve LLM performance, involving various forms of Best-of-N or accept/reject (see an upcoming issue I’m currently drafting).
I believe a lot of those sanity checks can be done either with exhaustive hand-coded checks, or with simple enough LLM prompting, as long as we have simple clauses that lend themselves to straightforward verification. For example:
For topics, each comment’s assignment to a topic can be checked (sketched below): we leverage the fact that checking something is easier than generating it.
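A minimal sketch of what such a check could look like, where the types and the `askLlm` wrapper are hypothetical placeholders for whatever LLM client we end up using:

```ts
// Minimal sketch of an LLM-based topic-assignment check. The shapes and the
// `askLlm` wrapper are hypothetical: the caller plugs in whatever LLM client
// we end up using.

interface TopicAssignment {
  topicName: string;
  topicDescription: string;
  commentText: string;
}

async function checkTopicAssignment(
  assignment: TopicAssignment,
  askLlm: (prompt: string) => Promise<string>,
): Promise<boolean> {
  // Verification is framed as a single yes/no question, which is much easier
  // for the model than generating the topic model in the first place.
  const prompt = [
    `Topic: "${assignment.topicName}" (${assignment.topicDescription})`,
    `Comment: "${assignment.commentText}"`,
    `Does this comment belong to this topic? Answer YES or NO only.`,
  ].join("\n");

  const answer = (await askLlm(prompt)).trim().toUpperCase();
  return answer.startsWith("YES");
}
```

Run over every (comment, topic) pair, the disagreement rate is itself a simple, interpretable metric.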
See also #1878 section “How” and “Data”.
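On the coded-check side, the citation-integrity rules above (1-5 citations per clause, every citation resolving to a real comment) are straightforward to enforce mechanically; a minimal sketch, again with assumed shapes rather than the real report_experimental types:

```ts
// Minimal sketch of a purely coded citation-integrity check, assuming the
// report has already been parsed into clauses with numeric comment citations
// (these shapes are hypothetical, not the real report_experimental types).

interface Clause {
  text: string;
  citations: number[]; // ids of the comments cited for this clause
}

function checkCitationIntegrity(
  clauses: Clause[],
  knownCommentIds: Set<number>,
): string[] {
  const issues: string[] = [];
  for (const clause of clauses) {
    if (clause.citations.length === 0) {
      issues.push(`uncited clause: "${clause.text}"`);
    } else if (clause.citations.length > 5) {
      issues.push(`possible over-citation (${clause.citations.length} citations): "${clause.text}"`);
    }
    for (const id of clause.citations) {
      if (!knownCommentIds.has(id)) {
        issues.push(`citation ${id} does not match any comment: "${clause.text}"`);
      }
    }
  }
  return issues;
}
```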
Of course there is the question of human-evaluating the LLM evaluator versus trusting it blindly. For simple enough checks, it can at least give us a sense of confidence, and we can spot-check it. While some might say this gives a false sense of security, pragmatically it is still better than blindly trusting the much bigger output! (And if needed, those basic checks can themselves be reinforced with the same test-time-compute techniques we describe for the larger tasks, in an even simpler way.)
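As a sketch of that last point: reinforcing a cheap yes/no check can be as simple as sampling it a few times and taking a majority vote (a hypothetical helper, reusing the `checkTopicAssignment` sketch above):

```ts
// Sketch: reinforce a cheap yes/no LLM check by sampling it a few times and
// taking a majority vote (a tiny-scale version of the Best-of-N ideas above).
// `check` is any async boolean check, e.g. the checkTopicAssignment sketch.

async function majorityVote(
  check: () => Promise<boolean>,
  n = 5,
): Promise<boolean> {
  const votes = await Promise.all(Array.from({ length: n }, () => check()));
  return votes.filter(Boolean).length > n / 2;
}
```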