As far as I'm aware, there's no easy way to extract the names of the original columns used in a patsy formula; see these open tickets here and here. So you have to rely on regular expressions for now.
However, the current code does not capture all the complex scenarios that can occur in formulas, which leads to errors in marginaleffects.
I try to illustrate that in the code below and suggest a potential alternative (which I'm currently relying on): checking, for each data column, whether its name occurs in the model formula surrounded by word boundaries. It's still not perfect, as it can capture non-model terms (such as Treatment, Good, minimum, df, constraints, center for the example below, if these exist as columns in the data), but at least it won't miss any of the predictors.
import re
import numpy as np
import pandas as pd
import polars as pl
import statsmodels.formula.api as smf
from marginaleffects import predictions
from marginaleffects.sanitize_model import sanitize_model

diamonds = pd.read_csv("https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/ggplot2/diamonds.csv")

# Complex formula with interaction term only, categorical with custom reference level, and spline
model = smf.ols("price ~ depth:color + C(cut, Treatment('Good')) + cr(np.minimum(carat, 0.8), df=5, constraints='center')", data=diamonds).fit()

# Fails: ValueError: There is no valid column name in `variables`.
predictions(model, newdata=diamonds, by="cut")

# Create ModelStatsmodels object
self = sanitize_model(model)

# Variable list shows up empty
self.get_variables_names()

# Current code: Lines 53-56 in model_statsmodels.py
variables = self.model.model.exog_names
variables = [re.sub("\[.*\]", "", x) for x in variables]
variables = [x for x in variables if x in self.modeldata.columns]
variables = pl.Series(variables).unique().to_list()
# []

# Proposed code
formula = self.formula
columns = self.modeldata.columns
variables = list({var for var in columns if re.search(rf"\b{re.escape(var)}\b", formula)})
# ['price', 'carat', 'cut', 'color', 'depth']
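
For anyone who wants to try the two approaches without fitting a model, here is a minimal standalone sketch that only needs `re`. The helper names `columns_via_exog_names` and `columns_in_formula` are hypothetical, and the `exog_names` list is written by hand in the style patsy uses for term names, so treat it as an illustration rather than the exact output of the model above.

```python
import re

formula = (
    "price ~ depth:color + C(cut, Treatment('Good')) "
    "+ cr(np.minimum(carat, 0.8), df=5, constraints='center')"
)
# Column names of the diamonds data, without the CSV row-index column
columns = ["carat", "cut", "color", "clarity", "depth", "table", "price", "x", "y", "z"]

# ASSUMPTION: hand-written examples of design-matrix term names, not real model output
exog_names = [
    "Intercept",
    "C(cut, Treatment('Good'))[T.Fair]",
    "depth:color[D]",
    "cr(np.minimum(carat, 0.8), df=5, constraints='center')[0]",
]


def columns_via_exog_names(names, columns):
    """Mimic the current approach: strip the [...] suffix, keep exact column matches."""
    stripped = [re.sub(r"\[.*\]", "", name) for name in names]
    return [name for name in stripped if name in columns]


def columns_in_formula(formula, columns):
    """Proposed approach: keep every column whose name appears as a whole word."""
    return [col for col in columns if re.search(rf"\b{re.escape(col)}\b", formula)]


old = columns_via_exog_names(exog_names, columns)
new = columns_in_formula(formula, columns)
# The caveat from above: a data column that happens to be called "df" is also picked up
with_df = columns_in_formula(formula, columns + ["df"])

print(old)      # []
print(new)      # ['carat', 'cut', 'color', 'depth', 'price']
print(with_df)  # ['carat', 'cut', 'color', 'depth', 'price', 'df']
```

The first helper comes back empty because none of the stripped term names is a bare column name, while the second finds every predictor at the cost of possible false positives such as `df`.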