Out of Sample Prediction for Wordfish #5

Open
muhark opened this issue Jul 25, 2019 · 2 comments

Comments


muhark commented Jul 25, 2019

OOS Predict Wordfish

Hi, two (related) questions. It says in the documentation for textmodel_wordfish that out-of-sample prediction is not currently supported; does this mean that the feature may be added in the future? If so, I'd be happy to submit a pull request/get involved in implementing it.
Second question: does fitting wordfish require that no feature has zero occurrences? I imagine that if we can fit the model on the union of the corpora from the training and prediction sets, then the prediction task might be as trivial as "plugging in" beta and psi then recovering theta (and alpha) accordingly?

Sorry if this question has been asked before/if it's silly; I've only just started working with quanteda.

Contributor

kbenoit commented Jul 26, 2019

Wordfish cannot fit zero-occurrence features, as they are undefined. (Otherwise there would be an infinite number of zero-occurrence features that could be counted.) But an OOS prediction method could do the same as the predict methods for other textmodel_*() functions, and make the newdata dfm conform to the dfm from the fitted model. (This would mean treating features present in x but not in newdata as occurring zero times in newdata, and dropping features present in newdata but not in x.)
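
The conforming step described above can be sketched in a few lines. This is only an illustration in plain Python, not quanteda code: the function name `conform_counts` and the example features are made up for the sketch. In quanteda itself, `dfm_match()` is the function intended for this kind of alignment.

```python
def conform_counts(new_counts, model_features):
    """Align one new document's feature counts to the fitted model's features.

    new_counts:     dict mapping feature -> count for the new document
    model_features: list of features the model was fitted on, in model order
    """
    # Features the model knows but the new document lacks count as zero;
    # features the model never saw are silently dropped.
    return [new_counts.get(feature, 0) for feature in model_features]


# Hypothetical example data, for illustration only.
model_features = ["tax", "welfare", "border", "climate"]
new_doc = {"tax": 3, "climate": 1, "blockchain": 2}

print(conform_counts(new_doc, model_features))  # [3, 0, 0, 1]
```

Note that zero counts are harmless at prediction time: the restriction applies to estimating beta and psi for a feature, not to scoring a document that happens not to use it.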

I think @conjugateprior has already implemented a predict method for wordfish. Thoughts, Will?


conjugateprior commented Aug 8, 2019

Getting point estimates might well be 'as trivial as "plugging in" beta and psi then recovering theta (and alpha) accordingly', although you'd probably want to flip to multinomial form first.
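
A minimal sketch of that plug-in step, assuming the usual wordfish parameterisation in which the count of feature j in document i is Poisson with log mean alpha_i + psi_j + beta_j * theta_i. With beta and psi held fixed, maximising over alpha first gives alpha = log(N) - log(sum_j exp(psi_j + beta_j * theta)), and what remains is a multinomial log-likelihood in theta alone, which is concave and can be solved by Newton's method. The function name `estimate_theta` and the example numbers are hypothetical, and this is not the quanteda implementation.

```python
import math


def estimate_theta(y, psi, beta, iters=100, tol=1e-10):
    """Point estimates (theta, alpha) for one new document, item parameters fixed.

    y, psi, beta are equal-length lists over the model's features.
    """
    n = sum(y)
    theta = 0.0
    for _ in range(iters):
        # Multinomial form: feature probabilities are a softmax of psi + beta * theta.
        eta = [p + b * theta for p, b in zip(psi, beta)]
        top = max(eta)  # subtract the max for numerical stability
        w = [math.exp(e - top) for e in eta]
        total = sum(w)
        prob = [wi / total for wi in w]
        mean_b = sum(q * b for q, b in zip(prob, beta))
        var_b = sum(q * (b - mean_b) ** 2 for q, b in zip(prob, beta))
        # Score: sum_j y_j beta_j - N * E[beta]; information: N * Var[beta].
        score = sum(c * b for c, b in zip(y, beta)) - n * mean_b
        step = score / (n * var_b)
        theta += step
        if abs(step) < tol:
            break
    # Recover alpha from the document length once theta is known.
    log_z = math.log(sum(math.exp(p + b * theta) for p, b in zip(psi, beta)))
    alpha = math.log(n) - log_z
    return theta, alpha


# Hypothetical item parameters and a synthetic document whose "counts" are the
# exact Poisson means at theta = 0.5, alpha = 1.0, so the estimates should
# recover those values.
psi = [0.0, 0.5, -0.5, 1.0]
beta = [-1.0, 0.5, 1.0, -0.5]
y = [math.exp(1.0 + p + b * 0.5) for p, b in zip(psi, beta)]

theta_hat, alpha_hat = estimate_theta(y, psi, beta)
print(theta_hat, alpha_hat)
```

The sketch assumes the document has counts on features with differing beta values; a document whose counts all fall on a single extreme feature has no finite estimate.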

There's a bit more work to be done about getting prediction intervals though. The easiest asymptotic standard errors for item parameters would assume that the ideal points are perfectly measured (which seems not altogether unreasonable for very large vocabularies) and could be constructed numerically, then you could sample to get uncertainty around the ideal points for OOS docs.
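
One way the sampling idea could look, as a rough sketch rather than a worked-out method. It assumes standard errors for beta and psi are already available (the `se_psi` and `se_beta` values below are invented), treats the item parameters as independent normals, which ignores any covariance between them, and adds the document's own sampling noise through a normal approximation based on the multinomial information. All names are hypothetical.

```python
import math
import random


def fit_theta(y, psi, beta, iters=100, tol=1e-10):
    """Return (theta_hat, conditional_sd) for one document, item parameters fixed."""
    n = sum(y)
    theta = 0.0
    for _ in range(iters):
        eta = [p + b * theta for p, b in zip(psi, beta)]
        top = max(eta)
        w = [math.exp(e - top) for e in eta]
        total = sum(w)
        prob = [wi / total for wi in w]
        mean_b = sum(q * b for q, b in zip(prob, beta))
        var_b = sum(q * (b - mean_b) ** 2 for q, b in zip(prob, beta))
        step = (sum(c * b for c, b in zip(y, beta)) - n * mean_b) / (n * var_b)
        theta += step
        if abs(step) < tol:
            break
    # Asymptotic sd from the information N * Var[beta] at the last iterate.
    return theta, 1.0 / math.sqrt(n * var_b)


def theta_interval(y, psi, beta, se_psi, se_beta, draws=2000, seed=1):
    """Approximate 95% interval for theta by simulating item parameter draws."""
    rng = random.Random(seed)
    sims = []
    for _ in range(draws):
        # ASSUMPTION: independent normal draws around the item estimates.
        psi_star = [rng.gauss(p, s) for p, s in zip(psi, se_psi)]
        beta_star = [rng.gauss(b, s) for b, s in zip(beta, se_beta)]
        theta_star, sd_star = fit_theta(y, psi_star, beta_star)
        # Add the document-level sampling noise conditional on this draw.
        sims.append(rng.gauss(theta_star, sd_star))
    sims.sort()
    return sims[int(0.025 * draws)], sims[int(0.975 * draws) - 1]


# Hypothetical item estimates, standard errors, and one conformed new document.
psi = [0.0, 0.5, -0.5, 1.0]
beta = [-1.0, 0.5, 1.0, -0.5]
se_psi = [0.05, 0.05, 0.05, 0.05]
se_beta = [0.05, 0.05, 0.05, 0.05]
y = [12, 30, 9, 25]

theta_hat, _ = fit_theta(y, psi, beta)
lower, upper = theta_interval(y, psi, beta, se_psi, se_beta)
print(theta_hat, lower, upper)
```

Dropping the final `rng.gauss` draw would give an interval that reflects item parameter uncertainty only, which is narrower and closer to what the perfectly-measured assumption alone implies.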

The main practical issue would be deciding what to do if the two dfms had been preprocessed differently.

@kbenoit kbenoit transferred this issue from quanteda/quanteda Feb 12, 2020