This repository has been archived by the owner on May 17, 2024. It is now read-only.

Human evaluation should be applied to a portion of the dataset #55

Open
yujonglee opened this issue Aug 29, 2023 · 0 comments

Comments

@yujonglee
Owner

  • Conflict resolver - if consensus fails, we should collect those cases and hand them to a human
  • Vibe check https://www.latent.space/p/mosaic-mpt-7b

    The vibe-based eval cannot be underrated. … One of our evals was just having a bunch of prompts and watching the answers as the models trained and see if they change. Honestly, I don’t really believe that any of these eval metrics capture what we care about. One of our prompts was “suggest games for a 3-year-old and a 7-year-old to play” and that was a lot more valuable to see how the answer changed during the course of training. — Jonathan Frankle

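A minimal sketch of the conflict-resolver idea above, assuming majority voting over automated judge labels (all names here, e.g. `resolve` and `human_queue`, are hypothetical, not part of this repo):

```python
# Hypothetical sketch: accept unanimous judge labels, route the rest to humans.
from collections import Counter

def resolve(item, judge_labels, human_queue):
    """Return the consensus label if all judges agree; otherwise
    append the item to human_queue for manual evaluation."""
    counts = Counter(judge_labels)
    label, votes = counts.most_common(1)[0]
    if votes == len(judge_labels):  # unanimous consensus
        return label
    human_queue.append(item)        # consensus failed -> hand to human
    return None

queue = []
print(resolve("q1", ["pass", "pass", "pass"], queue))  # pass
print(resolve("q2", ["pass", "fail", "pass"], queue))  # None
print(queue)                                           # ['q2']
```

The same loop could double as the vibe check: run a fixed set of prompts at each checkpoint and surface the disagreeing (or changing) answers for a human to eyeball.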