User statistics: proper scoring rule #32
Comments
I would also like this.
Is summing the logs of every probability better than using only the last probability? If you estimate an event 5 days out, then 4 days out, etc., each of those should pay out, not just the last one. I may be displaying my ignorance here…
Are any of these interesting enough that you can turn them into a good idea?
I find it more natural to use log score with base-2 logs so scores can be interpreted as "bits", and a scoring function of 1 + log_2(p) for correct and 1 + log_2(1-p) for incorrect. This is equivalent to subtracting the "blind guesser's" score in the scoring function. It also has the nice behaviour that higher is better and positive means doing better than "random". (Interestingly, the score here is also the number of bits better than a "null compressor" with which you could encode the outcomes.) We have also used Matt's idea (4) above as an ancillary measure we called boldness, the interpretation being that a user with a positive boldness is, on average, tipping with higher probabilities than they should for an optimal score. We have been using this for tipping football results for quite some time.
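A minimal Haskell sketch of that base-2 scoring function (the name `bitsScore` is mine, not anything from drpowell's system):

```haskell
-- Per-prediction score in bits, relative to a blind 50% guesser:
-- 0 for a 50% prediction, positive if better than random, negative if worse.
bitsScore :: Double -> Bool -> Double
bitsScore p correct = 1 + logBase 2 (if correct then p else 1 - p)
```

For example, `bitsScore 0.9 True` is about 0.85 bits, while `bitsScore 0.9 False` is about -2.32 bits.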
I like the idea.
I agree that this proposed scoring function with binary logs seems more natural. In addition to what has been proposed, I think it may be interesting to compare user performance to that of a generalized guesser that computes a probability g based on some information available to it, giving the scoring rule log_2(p) - log_2(g) for a correct prediction and log_2(1-p) - log_2(1-g) for an incorrect one. A particular class of functions that I have in mind for g is one which combines the probability estimates given by other users. In other words, g = f(q_1, ..., q_n), where (q_1, ..., q_n) is a vector of probability estimates given by other users and f is some probability aggregation function. A quick search yields a highly cited review on combining probability distributions by Clemen and Winkler (1999), with section 2.2.1 being relevant (they give more details in their 1990 paper). There is no universally superior function; the simplest is a weighted average, which they call the Bernoulli model. By restricting (q_1, ..., q_n) to the probabilities given by other users before a specific user has given his estimate, g becomes a guesser that takes into account only information available to that user as he makes his prediction.

Another idea is to scale the scoring function based on some measure of the difficulty of the prediction. Again, the obvious data to use are the predictions by other users. I haven't yet looked at the literature, so I don't know what sort of function would be reasonable. (Intuitively, correctly anticipating an event that others have predicted with probabilities {0.1, 0.2, 0.15} is more impressive than one with {0.8, 0.9, 0.85}. But, hmm... this is sort of taken into account by what I proposed above.)
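To make the aggregation idea concrete, here is a rough Haskell sketch of the simplest aggregator above (the weighted average, i.e. the Bernoulli model) together with the relative scoring rule; all names and the (weight, probability) representation are my own illustration:

```haskell
-- Weighted average of other users' estimates ("Bernoulli model").
-- Assumes weights are non-negative and sum to 1.
aggregate :: [(Double, Double)] -> Double   -- (weight, probability) pairs
aggregate = sum . map (uncurry (*))

-- Score relative to the aggregated guesser g, in bits:
-- log_2(p) - log_2(g) if the event occurred, complements otherwise.
relativeScore :: Double -> Double -> Bool -> Double
relativeScore p g occurred
  | occurred  = logBase 2 p - logBase 2 g
  | otherwise = logBase 2 (1 - p) - logBase 2 (1 - g)
```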
Google+ discussion: https://plus.google.com/u/0/103530621949492999968/posts/AVk4tGYibVP
I don't really know. Isn't PB currently using only the last prediction?
Well, one could say that of calibration too: with every prediction that isn't perfectly calibrated, you are losing calibration points. (If you blow one 0/100% prediction, no number of predictions will restore your original perfect calibration.)
Sure. But again, this is also true of calibration - all you have to do is not be over- or underconfident, and you need some information in order to be calibrated for any decile other than 50%! This is why I included the random guesser: to provide a nicely increasing number people can feel happy or sad about, and one which gives some sort of comparability across users.
I don't know how that would work... you mean, take every prediction of yours by decile and compare it against a random predictor with the base rate of that decile? Not sure that's legit.

drpowell: interesting, I didn't know that was what it was equivalent to. Does using base-2 with those scores have a name, or is it just generally understood by stats/information-theory folks that that is what one is supposed to be doing? 'Boldness' sounds kind of useful, but it's a more advanced metric than PB currently has, so I think it'd be better to start with something more immediate. (Ditto for vyu's suggestion.)
Hmm… good stuff here: http://www.csse.monash.edu.au/~footy/about.shtml
I would like user pages to include a more precise estimate of a user's quality - using a proper scoring rule. The simplest is the log scoring rule, which is very easy to implement. Here is a version in Haskell, excerpted from my Nootropics essay where I am judging my Adderall predictions:
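A minimal version consistent with the description below (the essay's exact excerpt may have differed):

```haskell
-- Log scoring rule: earn log p when the prediction came true,
-- log (1 - p) when it did not; sum over all predictions.
score :: [(Double, Bool)] -> Double
score = sum . map (\(p, correct) -> if correct then log p else log (1 - p))
```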
you 'earn' the logarithm of the probability if you were right, and the logarithm of the negation if you were wrong; he who racks up the fewest negative points wins. We feed in a list and get back a number:
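For example, with illustrative data (not the essay's actual predictions):

```haskell
ghci> score [(0.9, True), (0.7, True), (0.8, False), (0.6, True)]
-- evaluates to about -2.58
```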
In this case, a blind guesser would guess 50% every time (roughly half the days were Adderall and roughly half were not) so the question is, did the 50% guesser beat me?
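Concretely, the blind guesser's total over n predictions is just n * log 0.5; for the four illustrative predictions above:

```haskell
ghci> 4 * log 0.5   -- the blind 50% guesser's score for four predictions
-2.772588722239781
```

Since -2.58 is closer to zero than -2.77, the example predictor beats the blind guesser.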
So I had a palpable edge over the random guesser, although the sample size is not fantastic.
The best way would be to divide the log score of the equivalent number of 50% guesses - since every prediction on PB.com is a binary prediction - by the user's actual log score. If you scored, say, -15 and the random guesser scored -20, then -20/-15 ≈ 1.33. Higher is better; if later the random guesser has -25 and you have -17, you did even better, since now you earn 25/17 ≈ 1.47.
Scaling this up to PB users seems easy: for the n judged predictions, sum the log scores of each prediction's last-assigned probability (log p if it came true, log(1-p) if not), and then divide n * log 0.5 by that sum.
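A sketch of that metric as a function (the name `pbScore` is mine; `preds` stands in for however PB would supply the (final probability, outcome) pairs of a user's judged predictions):

```haskell
-- Ratio of the blind 50% guesser's log score to the user's log score
-- over the same n predictions; above 1 means better than chance.
pbScore :: [(Double, Bool)] -> Double
pbScore preds = (fromIntegral (length preds) * log 0.5) / score preds
  where
    score = sum . map (\(p, correct) -> if correct then log p else log (1 - p))
```

With the example numbers from above, this gives -20 / -15 ≈ 1.33.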
Display-wise, this is easy: just report the number. In a user page like http://predictionbook.com/users/gwern one can probably just tack on an additional column to the row: like