SMEP D: Data Structures for Input to Models

josef-pkt edited this page Jun 20, 2012 · 2 revisions

Status: Discussion

Discussion: structured data (time series, survival, panel)

From the mailing list, Tue, May 22, 2012

josef:

I'm still getting familiar with Stata, for example survival analysis with Kaplan-Meier.

Before some estimation commands can be used in Stata, some properties of the dataset have to be declared:

  • stset -- Declare data to be survival-time data
  • tsset -- Declare data to be time-series data

These declarations can specify which variable is the time variable, which variables define groups, which variable is the censoring indicator, and so on.
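To make the idea concrete, here is a minimal sketch of what such a declaration could carry in Python. It assumes nothing about statsmodels: `SurvivalSpec`, `TimeSeriesSpec`, and `missing_columns` are hypothetical names invented for illustration. They only record which column plays which role, loosely mirroring stset and tsset.

```python
from dataclasses import asdict, dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class SurvivalSpec:
    """Hypothetical analogue of Stata's stset: records column roles only."""
    time: str                      # column holding the survival time
    censoring: str                 # column holding the censoring indicator
    groups: Optional[str] = None   # optional grouping column


@dataclass(frozen=True)
class TimeSeriesSpec:
    """Hypothetical analogue of Stata's tsset: names the time column."""
    time: str


def missing_columns(spec, columns: Iterable[str]) -> List[str]:
    """Return the declared column names that are absent from `columns`."""
    available = set(columns)
    # Roles left as None (e.g. no grouping column) are simply not declared.
    declared = [name for name in asdict(spec).values() if name is not None]
    return [name for name in declared if name not in available]


# Usage: declare the roles once, then check them against a dataset's columns.
spec = SurvivalSpec(time="t", censoring="died")
print(missing_columns(spec, ["t", "age"]))
```

Keeping the declaration separate from the data is only one possible design. It mirrors the Stata workflow, where the declaration is made once and later estimation commands read it.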

At some point (say, by the end of summer) we need to agree on how we want to handle this in statsmodels.

VincentAB:

Another place to look for inspiration for this would be the plm package for R (note: this is not necessarily an endorsement):

http://cran.r-project.org/web/packages/plm/index.html

josef:

http://rgm2.lab.nig.ac.jp/RGM2/func.php?rd_id=plm:plm.data

plm is definitely on the wishlist, but I haven't looked at the details yet, and didn't know about plm.data

Skipper:

Also xtset and svyset. I think it will be handled much as we're handling time series now, i.e., every time series model takes a dates parameter, or if a pandas DataFrame is given we try to do some magic and infer the dates. I've only thought about panel data, and I imagine this will work much the same.

For survey data etc., it's often just the structure of the covariance that changes while the model estimation stays the same, so we can reuse the estimator classes; we'll just want to subclass them with light wrappers that set more metadata. It would be nice to be able to do something like:

Survey(OLS(...), clusters='blah')
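A rough sketch of how such a light wrapper might behave is given below. This is not statsmodels code: `ToyOLS` is a stand-in estimator written only for this example, and the `Survey` class here is a hypothetical illustration of the "reuse the estimator, add metadata" idea. It also assumes `clusters` is a sequence of labels, one per observation, rather than a column name as in the line above.

```python
class ToyOLS:
    """Stand-in estimator (not statsmodels' OLS): fits y = a + b*x."""

    def __init__(self, endog, exog):
        self.endog = list(endog)
        self.exog = list(exog)

    def fit(self):
        n = len(self.endog)
        mean_x = sum(self.exog) / n
        mean_y = sum(self.endog) / n
        sxx = sum((x - mean_x) ** 2 for x in self.exog)
        sxy = sum((x - mean_x) * (y - mean_y)
                  for x, y in zip(self.exog, self.endog))
        slope = sxy / sxx
        return {"intercept": mean_y - slope * mean_x, "slope": slope}


class Survey:
    """Hypothetical wrapper: delegates estimation, attaches cluster metadata."""

    def __init__(self, model, clusters):
        clusters = list(clusters)
        if len(clusters) != len(model.endog):
            raise ValueError("need one cluster label per observation")
        self.model = model
        self.clusters = clusters

    def fit(self):
        # Point estimates come unchanged from the wrapped estimator.
        result = dict(self.model.fit())
        # A real implementation would use the clusters to compute a
        # cluster-aware covariance; this sketch only records the metadata.
        result["n_clusters"] = len(set(self.clusters))
        return result


# Usage: four observations in two clusters.
res = Survey(ToyOLS([1, 3, 5, 7], [0, 1, 2, 3]),
             clusters=["a", "a", "b", "b"]).fit()
print(res)
```

Composition is used here instead of subclassing purely to keep the sketch short. Either approach leaves the estimation code untouched and changes only what is layered on top of it.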