SMEP D: Data Structures for Input to Models
Status: Discussion
from mailing list Tue, May 22, 2012
I'm still getting familiar with Stata; an example is survival analysis with Kaplan-Meier.
Before some estimation commands can be used in Stata, some properties of the dataset have to be declared:
- stset -- Declare data to be survival-time data
- tsset -- Declare data to be time-series data
These could define which variable is time, which are groups, which variable is the censoring indicator, and so on.
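To make the idea of a declaration concrete, here is a minimal sketch of what an stset-like object could look like in Python. Everything in it is an assumption for illustration only: `SurvivalSpec`, its field names, the `validate` method, and the toy column names are hypothetical and are not part of statsmodels or Stata.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SurvivalSpec:
    """Hypothetical stset-style declaration: records roles of columns only."""

    time: str                    # name of the duration / time variable
    event: str                   # censoring indicator (assumed: 1 = event, 0 = censored)
    group: Optional[str] = None  # optional grouping variable

    def validate(self, columns):
        """Check that every declared column name exists in the dataset."""
        declared = (self.time, self.event, self.group)
        missing = [c for c in declared if c is not None and c not in columns]
        if missing:
            raise KeyError(f"columns not in dataset: {missing}")
        return self


# Toy dataset as a plain dict of columns (illustrative values only).
data = {
    "weeks": [6, 7, 10],
    "relapsed": [1, 0, 1],
    "arm": ["drug", "drug", "placebo"],
}

spec = SurvivalSpec(time="weeks", event="relapsed", group="arm").validate(data)
```

The declaration holds only metadata, so an estimator could receive the dataset and the spec separately and look up the columns it needs.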
At some point (i.e. end of summer) we need to agree on how we want to handle this in statsmodels.
Another place to look for inspiration for this would be the plm
package for R (note: this is not necessarily an endorsement):
http://cran.r-project.org/web/packages/plm/index.html
http://rgm2.lab.nig.ac.jp/RGM2/func.php?rd_id=plm:plm.data
plm is definitely on the wishlist, but I haven't looked at the details
yet, and didn't know about plm.data
Also xtset and svyset. I think it will be handled much as we're
handling the time series now. I.e., every time series model takes a
dates parameter. Or if a pandas DataFrame is given we try to do some
magic and infer the dates. I've only thought about panel data and I
imagine this will work much the same. For survey data etc. it's often
just the structure of the covariance that changes but the model
estimation is all the same, so we can reuse the estimator classes;
we'll just want to subclass them with light wrappers that set more
meta-data. It would be nice to be able to do something like
Survey(OLS(...), clusters='blah')
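
A rough sketch of that wrapper idea, under heavy assumptions: `SlopeOLS` is a hypothetical one-regressor, no-intercept stand-in for an estimator class (not the statsmodels `OLS`), and `Survey` here takes a sequence of cluster labels rather than a column name. The point estimate is reused unchanged; only the variance is replaced by a cluster-robust sandwich with no small-sample correction.

```python
from collections import defaultdict


class SlopeOLS:
    """Hypothetical stand-in estimator: y = beta * x + u, no intercept."""

    def __init__(self, y, x):
        self.y = list(y)
        self.x = list(x)

    def fit(self):
        self.sxx = sum(xi * xi for xi in self.x)
        sxy = sum(xi * yi for xi, yi in zip(self.x, self.y))
        self.beta = sxy / self.sxx
        self.resid = [yi - self.beta * xi for xi, yi in zip(self.x, self.y)]
        return self


class Survey:
    """Hypothetical light wrapper: same estimation, different covariance."""

    def __init__(self, model, clusters):
        self.model = model
        self.clusters = list(clusters)  # meta-data: one label per observation

    def fit(self):
        m = self.model.fit()  # reuse the wrapped estimator as-is
        # Sum the scores x_i * u_i within each cluster.
        scores = defaultdict(float)
        for g, xi, ui in zip(self.clusters, m.x, m.resid):
            scores[g] += xi * ui
        meat = sum(s * s for s in scores.values())
        self.beta = m.beta
        self.var_beta = meat / m.sxx ** 2  # sandwich: bread * meat * bread
        return self


y = [1.0, 2.1, 2.9, 4.2]
x = [1.0, 2.0, 3.0, 4.0]
res = Survey(SlopeOLS(y, x), clusters=["a", "a", "b", "b"]).fit()
```

Composition is used here only to mirror the `Survey(OLS(...), ...)` call shape; subclassing the estimator, as suggested above, would be an equally plausible design.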