Is this the expected behavior of patsy when building a design matrix of a two-level categorical variable without an intercept? #80
Comments
Apparently a duplicate of #60?
It's intentional, yes, because this is what you want for the most common types of models -- if you have a categorical variable in your formula, then patsy interprets that to mean that you want the design matrix to allow the different levels of that category to have different values. In a regular linear regression model, the design matrix you describe would encode the assumption that items falling into the base category have an expected outcome of exactly zero, which is almost never what you want.

The canonical documentation on how patsy interprets formulas, and why, is here (though you don't have to read it, it's just FYI).

Can you elaborate on what you're trying to do and why? (There's a comment on #60 mentioning one possible use case for things like this -- Cox proportional hazards models, where the intercept is unidentifiable -- but I think the right solution for the Cox case is something involving the model itself telling patsy to ensure that the design matrix's output span always has the intercept removed; I don't know whether that would be relevant to your case or not.)
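To make the point above concrete, here is a small numpy-only sketch (toy data, not from this thread) showing that regressing on a single 0/1 dummy with no intercept pins the fitted mean of the base category at exactly zero:

```python
import numpy as np

# Toy data: base category coded 0 (true mean ~2), other category coded 1
# (true mean ~5).
dummy = np.array([0, 1, 0, 1, 0, 1], dtype=float)
y = np.array([2.0, 5.1, 1.9, 4.9, 2.1, 5.0])

# The (N, 1) design matrix the questioner expected: one dummy, no intercept.
X = dummy.reshape(-1, 1)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta

# Rows in the base category have an all-zero design row, so their fitted
# value is forced to 0, far from their actual mean of ~2.
print(fitted[dummy == 0])  # [0. 0. 0.]
print(beta[0])             # ~5.0, the mean of the non-base category
```

Swapping the first column for an intercept (or letting patsy keep both dummy columns) removes this implicit zero-mean constraint.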
Yes, you've got it. I ran into this behavior while trying to fit a proportional hazards model. My completely uninformed expectation was that the explicit "no intercept" term in the formula would only remove the intercept, even if the result is not generally useful. It seems like the two-dummy, singular case could be a specific coding scheme. Would a pull request for a coding scheme that does the right thing for PHReg be valuable?
The way I'd like it to work: I don't think it should be too complicated -- basically, the bits of the patsy internals that handle the intercept should check for this flag, and if it's set, skip coding the intercept but still add the intercept to the data structure that says which columns have already been coded. Somewhere in the
Also, +1 for having an option to exclude all reference categories (as suggested in #60). I understand the reasoning behind wanting to return the incremental contribution of each term, but the fact that numeric and categorical variables are handled inconsistently is what threw me for a loop (until I read the docs). It would be useful to have a flag that allows interaction terms to be computed naively, on the dummy-coded factors, and lets the user take responsibility for dealing with multicollinearity.

My use case, for what it's worth, is that the columns of the DesignMatrix are used as inputs to Bayesian hierarchical models. Overparameterization is not a problem in the Bayesian setting, but for interpretive purposes I definitely want to see posterior estimates for [a1b1, a1b2, a2b1, a2b2], and not [intercept, a2, a1b2, a2b2]. For my purposes it's trivial to explicitly remove the intercept in the formula and then add it back later, but it would be nice not to have to handle this as a special case. If nothing else, it might be helpful to point out in the docs that one can get the naive parameterization by explicitly removing the intercept (i.e.,
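For what it's worth, the cell-means parameterization described in this comment can already be spelled in a patsy formula by dropping both the intercept and the main effects and keeping only the interaction. A sketch with made-up 2x2 data (the DataFrame and level names are illustrative, not from this thread):

```python
import numpy as np
import pandas as pd
from patsy import dmatrix

df = pd.DataFrame({
    "a": ["a1", "a1", "a2", "a2"],
    "b": ["b1", "b2", "b1", "b2"],
})

# With no intercept and no main effects, patsy codes the interaction
# full-rank: one 0/1 column per cell of the a-by-b table.
cells = dmatrix("0 + a:b", df)
m = np.asarray(cells)

print(cells.design_info.column_names)
print(m)  # each row is one-hot: exactly one cell per observation
```

This gives the [a1b1, a1b2, a2b1, a2b2]-style columns directly, at the cost of handling multicollinearity yourself if you later add an intercept.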
Would this still be useful? I think I managed to add this feature to

However, I also see that statsmodels has now added its own workaround to create underparametrized design matrices without an intercept: see statsmodels/statsmodels#3095. So maybe this isn't really needed by anyone anymore (in which case perhaps this issue should be closed?). My personal opinion is that the statsmodels change is a little hacky and that this functionality should be implemented in patsy; that way other downstream libraries would also have access to it.
@vnavkal: I guess we could ask @josef-pkt or @kshedden if statsmodels would benefit from patsy having direct support for intercept-less models like Cox, but my guess is it would be useful. (Not everyone who uses patsy is using statsmodels, for one thing.)
Yes, the approach we used is not nice, but it seems to work. I'm sure we would refactor to use a native patsy solution to this if one were available.
OK, thanks for the responses! I'll plan to open a pull request with this change.
@njsmith: Sorry to necrobump this old thread, but are there any plans to merge #107? I think it'll solve another problem I just encountered that seems to be a longstanding issue over at statsmodels. I'm trying to fit a logistic regression, and the LHS of my formula is a binary category. The LHS design matrix that patsy generates is an (N, 2) array of dummy codes, when what I need is an (N, 1) array of reduced-rank coding sans the reference level. When I asked about this on Stack Overflow I was pointed to statsmodels/issues/2733. So now that's two ways this would be useful to statsmodels. I dunno how many other use cases this would support, but it seems fairly valuable already.
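In the meantime, a common workaround for this LHS problem (an assumed sketch, not a patsy feature) is to recode the binary outcome to 0/1 before writing the formula, so patsy treats the left-hand side as numeric and leaves it as a single column:

```python
import pandas as pd

df = pd.DataFrame({"outcome": ["no", "yes", "yes", "no"],
                   "x": [1.0, 2.0, 3.0, 4.0]})

# Pick the event of interest yourself; here "yes" is the event.
df["outcome01"] = (df["outcome"] == "yes").astype(int)

# A formula such as "outcome01 ~ x" now yields an (N, 1) LHS matrix
# instead of the (N, 2) dummy expansion a string column would get.
print(df["outcome01"].tolist())  # [0, 1, 1, 0]
```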
@njsmith, in reference to this design matrix,
you write:
I don't understand why this is true. Assuming there are two regression coefficients in the model, the first coefficient will estimate the mean of category A. In the matrix posted, rows 1, 3 and 5 belong to this category, and they all get assigned a 1 in the first column. The difference between this design matrix and one where the first column is replaced with an intercept is that the second coefficient represents the difference between category A and category B, rather than the absolute mean of category B. No?
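The equivalence being debated here is easy to check numerically. A plain numpy sketch with made-up values for the six rows mentioned (rows 1, 3, 5 in category A):

```python
import numpy as np

# Rows 1, 3, 5 are category A; rows 2, 4, 6 are category B.
is_b = np.array([0, 1, 0, 1, 0, 1], dtype=float)
is_a = 1.0 - is_b
y = np.array([2.1, 5.0, 1.9, 5.2, 2.0, 4.8])

X_dummies = np.column_stack([is_a, is_b])          # coefficients: mean(A), mean(B)
X_intercept = np.column_stack([np.ones(6), is_b])  # coefficients: mean(A), mean(B) - mean(A)

b1, *_ = np.linalg.lstsq(X_dummies, y, rcond=None)
b2, *_ = np.linalg.lstsq(X_intercept, y, rcond=None)

# Same column space, so identical fitted values; only the coefficient
# interpretation differs, exactly as this comment says.
print(np.allclose(X_dummies @ b1, X_intercept @ b2))  # True
print(np.allclose(b1[1], b2[0] + b2[1]))              # True: mean(B) = intercept + diff
```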
@sammosummo Not sure if it's still relevant, but I think the design matrix they were talking about was the two-column one, with one indicator column per level (`category[a]` and `category[b]`) and no separate intercept.
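For reference, the behavior under discussion can be reproduced directly (a hedged sketch: the alternating toy data is illustrative, and patsy is assumed to be installed):

```python
import pandas as pd
from patsy import dmatrix

data = pd.DataFrame({"category": ["a", "b", "a", "b", "a", "b"]})

# Without the intercept, patsy codes BOTH levels, giving the two
# complementary dummy columns discussed in this thread.
full = dmatrix("0 + category", data)
print(full.design_info.column_names)   # ['category[a]', 'category[b]']

# With the implicit intercept, patsy switches to reduced-rank
# (treatment) coding for the categorical.
reduced = dmatrix("category", data)
print(reduced.design_info.column_names)  # ['Intercept', 'category[T.b]']
```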
cloned from stackoverflow: http://stackoverflow.com/questions/35944841/is-this-the-expected-behavior-of-patsy-when-building-a-design-matrix-of-a-two-le
(patsy v0.4.1, python 3.5.0)
I would like to use patsy (ideally through statsmodels) to build a design matrix for regression.
The patsy-style formula that I would like to fit is of the form `0 + category`, where `category` is a two-level categorical variable. The `0 + ...` is supposed to indicate that I do not want the implicit intercept term. The design matrix that I expect has a single column of zeros and ones indicating whether `category` has the base level (0) or the other level (1).

Instead, building the design matrix for `0 + category` outputs a matrix with one dummy column per level -- two complementary 0/1 columns -- which is singular and not what I want.

When I instead build the matrix for `category` alone, the output has an intercept column plus a single dummy column, which is correct for the model that includes an intercept, but still not what I want.
Is the output without an intercept the intended behavior? If so, why?
Am I just confused about how this design matrix is supposed to work with standard coding?
I know that I can edit the design matrix to make my regression work the way I intend, but if this is a bug I'd like to see it fixed in patsy.
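Until patsy supports this directly, one workaround (a sketch with toy data, not an official patsy recipe) is to build the with-intercept matrix and then slice off the intercept column, leaving the single reduced-rank dummy:

```python
import numpy as np
import pandas as pd
from patsy import dmatrix

data = pd.DataFrame({"category": ["a", "b", "a", "b"]})

# Build with the implicit intercept, then drop its column. Note that
# slicing returns a plain array and loses the design_info metadata that
# patsy would otherwise carry for prediction on new data.
m = dmatrix("category", data)  # columns: Intercept, category[T.b]
single = np.asarray(m)[:, 1:]

print(single.ravel().tolist())  # [0.0, 1.0, 0.0, 1.0]
```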