Out of Sample Prediction for Wordfish #5

muhark · 2019-07-25T19:44:22Z

OOS Predict Wordfish

Hi, two (related) questions. It says in the documentation for textmodel_wordfish that out-of-sample prediction is not currently supported; does this mean that the feature may be added in the future? If so, I'd be happy to submit a pull request/get involved in implementing it.
Second question: does fitting wordfish require that no feature has zero occurrences? If we could fit the model on the union of the training and prediction corpora, could the prediction task then be as trivial as "plugging in" beta and psi then recovering theta (and alpha) accordingly?

Sorry if this question has been asked before/if it's silly; I've only just started working with quanteda.

kbenoit · 2019-07-26T15:18:17Z

Wordfish cannot fit zero-occurrence features, as these are undefined. (Otherwise there would be an infinite number of zero-occurrence features that could be counted.) But an OOS prediction method could do the same as the predict methods for other textmodel_*() functions, and make the newdata dfm conform to that from the fitted model. (This would mean treating features present in x but not in newdata as occurring zero times in newdata, and dropping features present in newdata but not in x.)
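
To make the conforming step concrete, here is a minimal sketch in plain Python rather than quanteda code. The function name `conform_counts` and the example feature names are hypothetical, and a document is represented as a simple dict of feature counts. It only illustrates the rule described above: features the model knows but the new document lacks get a count of zero, and features the model never saw are dropped.

```python
def conform_counts(new_counts, model_features):
    """Align one new document's counts to the fitted model's feature order.

    new_counts: dict mapping feature -> count for the new document.
    model_features: list of features the model was fitted on.
    Features missing from the new document become 0; features the
    model never saw are silently dropped.
    """
    return [new_counts.get(feature, 0) for feature in model_features]


# Hypothetical example data, not taken from any real corpus.
model_features = ["tax", "welfare", "defence", "immigration"]
new_doc = {"tax": 3, "defence": 1, "brexit": 4}

conformed = conform_counts(new_doc, model_features)
print(conformed)  # [3, 0, 1, 0]
```

Note that "brexit" disappears entirely, so a new document that shares few features with the fitted model would be scaled from very little information.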

I think @conjugateprior has already implemented a predict method for wordfish. Thoughts, Will?

conjugateprior · 2019-08-08T22:43:39Z

Getting point estimates might well be 'as trivial as "plugging in" beta and psi then recovering theta (and alpha) accordingly', although you'd probably want to flip to multinomial form first.
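
A rough sketch of what that plug-in estimate could look like, in plain Python and not taken from any existing implementation. It assumes the usual Wordfish form log(lambda_ij) = alpha_i + psi_j + beta_j * theta_i, holds beta and psi fixed at their fitted values, and assumes the new document's counts have already been matched to the model's features. Conditioning on document length gives the multinomial form, in which alpha drops out and theta can be found by Newton's method on a concave log-likelihood. The name `predict_position` and the toy parameter values are hypothetical.

```python
import math


def _log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def predict_position(counts, beta, psi, max_iter=100, tol=1e-10):
    """Plug-in MLE of (theta, alpha) for one new document.

    Multinomial form: p_j(theta) is proportional to exp(psi_j + beta_j * theta).
    The score is sum_j y_j * beta_j - N * E_p[beta] and the negative
    Hessian is N * Var_p[beta], which gives the Newton update below.
    """
    n_tokens = sum(counts)
    if n_tokens <= 0:
        raise ValueError("document has no tokens in the model's feature set")
    weighted_beta = sum(y * b for y, b in zip(counts, beta))
    theta = 0.0
    for _ in range(max_iter):
        logits = [p + b * theta for p, b in zip(psi, beta)]
        log_norm = _log_sum_exp(logits)
        probs = [math.exp(l - log_norm) for l in logits]
        mean_beta = sum(p * b for p, b in zip(probs, beta))
        var_beta = sum(p * (b - mean_beta) ** 2 for p, b in zip(probs, beta))
        if var_beta <= 0.0:
            raise ValueError("beta has no variance; theta is not identified")
        step = (weighted_beta - n_tokens * mean_beta) / (n_tokens * var_beta)
        step = max(-1.0, min(1.0, step))  # crude damping, an ad hoc choice
        theta += step
        if abs(step) < tol:
            break
    # Recover alpha so that the expected counts sum to the document length.
    logits = [p + b * theta for p, b in zip(psi, beta)]
    alpha = math.log(n_tokens) - _log_sum_exp(logits)
    return theta, alpha


# Toy item parameters (made up for illustration).
beta = [-1.0, -0.5, 0.0, 0.5, 1.0]
psi = [0.2, 0.0, 0.5, -0.1, 0.3]

# Build noise-free expected counts for a document at a known position,
# so the estimate can be checked against the truth.
theta_true, doc_length = 0.7, 100.0
logits = [p + b * theta_true for p, b in zip(psi, beta)]
norm = _log_sum_exp(logits)
counts = [doc_length * math.exp(l - norm) for l in logits]

theta_hat, alpha_hat = predict_position(counts, beta, psi)
print(theta_hat, alpha_hat)
```

Because beta and psi are held fixed, the recovered theta sits on the scale of the originally fitted model. This gives a point estimate only and says nothing about its uncertainty.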

There's a bit more work to be done about getting prediction intervals though. The easiest asymptotic standard errors for item parameters would assume that the ideal points are perfectly measured (which seems not altogether unreasonable for very large vocabularies) and could be constructed numerically, then you could sample to get uncertainty around the ideal points for OOS docs.

The practical issue would be deciding what to do if the two dfms had different preprocessing.

kbenoit transferred this issue from quanteda/quanteda Feb 12, 2020
