Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Is this the expected behavior of patsy when building a design matrix of a two-level categorical variable without an intercept? #80

Open
bsmith89 opened this issue Mar 11, 2016 · 12 comments

Comments

@bsmith89
Copy link

cloned from stackoverflow: http://stackoverflow.com/questions/35944841/is-this-the-expected-behavior-of-patsy-when-building-a-design-matrix-of-a-two-le

(patsy v0.4.1, python 3.5.0)

I would like to use patsy (ideally through statsmodels) to build a design matrix for regression.

The patsy-style formula that I would like to fit is

response ~ 0 + category

where category is a two-level categorical variable. The 0 + ... is supposed to indicate that I do not want the implicit intercept term.

The design matrix that I expect has a single column with zeros and ones indicating whether category has the base-level (0) or the other level (1).

The following code:

import pandas as pd
import patsy

df = pd.DataFrame({'category': ['A', 'B'] * 3})

patsy.dmatrix('0 + category', data=df)

Outputs:

DesignMatrix with shape (6, 2)
  category[A]  category[B]
            1            0
            0            1
            1            0
            0            1
            1            0
            0            1
  Terms:
    'category' (columns 0:2)

which is singular and not what I want.

When I instead run

import pandas as pd
import patsy

df = pd.DataFrame({'category': ['A', 'B'] * 3})

patsy.dmatrix('category', data=df)

the output is

DesignMatrix with shape (6, 2)
  Intercept  category[T.B]
          1              0
          1              1
          1              0
          1              1
          1              0
          1              1
  Terms:
    'Intercept' (column 0)
    'category' (column 1)

which is correct for the model which includes an intercept, but still not what I want.

Is the output without an intercept the intended behavior? If so, why?
Am I just confused about how this design matrix is supposed to work with standard coding?

I know that I can edit the design matrix to make my regression work the way I intend, but if this is a bug I'd like to see it fixed in patsy.

@bsmith89
Copy link
Author

Apparently a duplicate of #60 ?

@njsmith
Copy link
Member

njsmith commented Mar 12, 2016

It's intentional, yes, because this is always what you want for most common types of models -- if you have a categorical variable in your formula, then patsy interprets that to mean that you want the design matrix to allow for the different levels of that category to have different values. In a regular linear regression model, the design matrix you describe would encode the assumption that items that fall into category A should always be predicted to have an exactly zero response, while response for items in category B will be estimated from the data, which is a bit odd :-).

The canonical documentation on how patsy interprets formulas and why is here (though you don't have read this, it's just FYI).

Can you elaborate on what you're trying to do and why?

(There's a comment on #60 mentioning one possible use case for things like this -- Cox proportional hazards models where the intercept is unidentifiable, but I think the right solution for the Cox case is something involving the model itself telling patsy to ensure that the design matrix's output span should always have the intercept removed; I don't know if that would be relevant to your case or not.)

@bsmith89
Copy link
Author

Yes, you've got it. I found this feature trying to fit a proportional hazards model.

My completely uninformed expectation was that the explicit "no intercept" term in the formula would only remove the intercept, even if the result is not generally useful. Seems like the two-dummy, singular case could be a specific coding scheme.

Would a pull request for a coding scheme which does the right thing for PHReg be valuable?

@njsmith
Copy link
Member

njsmith commented Mar 12, 2016

The way I'd like it to work is, design_matrix_builders should take some argument specifying that the model it is creating a DesignInfo for has an implicit intercept. Setting this flag should be the responsibility of the PHReg code, not something that users should think about, so it should be done at this level, not as part of something the user has to specify as part of the formula.

I don't think it should be too complicated -- basically, the bits of the patsy internals that handle the intercept should check for this flag, and if its set, then they skip coding the intercept, but then they still add the intercept to the data structure that says which columns have already been coded. Somewhere in the patsy.build.design_matrix_builders / patsy.build._make_subterm_info interaction is where you'd want to look (plus you'd also want to expose it in at the high-level dmatrix level -- but the highlevel functions are just simple wrappers around design_matrix_builders).

@tyarkoni
Copy link

Also +1 for having an option to exclude all reference categories (as suggested in #60). I understand the reasoning behind wanting to return the incremental contribution of each term, but the fact that numeric and categoricals are handled inconsistently is what threw me for a loop (until I read the docs). It would be useful to have a flag that allows interaction terms to be computed naively, on the dummy-coded factors, and lets the user take responsibility for dealing with multicollinearity.

My use case, for what it's worth, is that the columns of the DesignMatrix are used as inputs to Bayesian hierarchical models. Overparameterization is not a problem in the Bayesian setting, but for interpretative purposes, I definitely want to see posterior estimates for [a1b1, a1b2, a2b1, a2b2], and not [intercept, a2, a1b2, a2b2]. For my purposes it's trivial to explicitly remove the intercept in the formula and then add it later, but it would be nice not to have to handle this as a special case.

If nothing else, it might be helpful to point out in the docs that one can get the naive parameterization by explicitly removing the intercept (i.e., 0 + a * b). I'm happy to open a PR for this if you like.

@vnavkal
Copy link

vnavkal commented Apr 30, 2017

The way I'd like it to work is, design_matrix_builders should take some argument specifying that the model it is creating a DesignInfo for has an implicit intercept.

Would this still be useful? I think I managed to add this feature to design_matrix_builders, and I'm happy to clean it up and submit a pull request.

However, I also see that statsmodels has now added their own workaround to create underparametrized design matrices without intercept: see statsmodels/statsmodels#3095. So maybe this isn't really needed by anyone anymore (in which case perhaps this issue should be closed?).

My personal opinion is that the statsmodels change is a little hacky and this functionality should be implemented by patsy; that way other downstream libraries would also have access to it.

@njsmith
Copy link
Member

njsmith commented May 1, 2017

@vnavkal: I guess we could ask @josef-pkt or @kshedden if statsmodels would benefit from patsy having direct support for intercept-less models like Cox, but my guess is it would be useful. (Not everyone who uses patsy is using statsmodels, for one thing.)

@kshedden
Copy link

kshedden commented May 1, 2017

Yes, the approach we used is not nice, but it seems to work:

https://github.com/statsmodels/statsmodels/blob/master/statsmodels/duration/hazard_regression.py#L407-L419

I'm sure we would refactor to use a native Patsy solution to this if it were available.

@vnavkal
Copy link

vnavkal commented May 1, 2017

OK, thanks for the responses! I'll plan to open a pull request with this change.

@tmurph
Copy link

tmurph commented Jun 26, 2018

@njsmith: Sorry to necrobump this old thread, but are there any plans to merge #107 ? I think it’ll solve another problem I just encountered that seems to be a longstanding issue over with statsmodels.

I’m trying to fit a logistic regression and the LHS of my formula is a binary category. The LHS design matrix that Patsy generates is an (N,2) array of dummy codes, when what I need is an (N,1) array of reduced-rank coding sans reference level.

When I asked about this on stackoverflow I was pointed to statsmodels/issues/2733.

So now that’s two ways this would be useful to statsmodels. I dunno how many other use cases this would support, but it seems fairly valuable already.

@sammosummo
Copy link

sammosummo commented Jul 31, 2019

@njsmith, in reference to this design matrix,

DesignMatrix with shape (6, 2)
  category[A]  category[B]
            1            0
            0            1
            1            0
            0            1
            1            0
            0            1
  Terms:
    'category' (columns 0:2)

you write:

In a regular linear regression model, the design matrix you describe would encode the assumption that items that fall into category A should always be predicted to have an exactly zero response, while response for items in category B will be estimated from the data,

I don't understand why this is true. Assuming there are two regression coefficients in the model, the first coefficient will estimate the mean of category A. In the matrix posted, rows 1, 3 and 5 belong to this category. They all get assigned 1, so will contribute to the estimation of the value of that coefficient. So it doesn't assume these "have exactly 0", it assumes they equal whatever the values of the first coefficient turns out to be. The second coefficient will estimate category B in the same way.

The difference between this design matrix and one where the the first column is replaced with an intercept is that the second coefficient represents the difference between category A and category B, rather than the absolute values of category B. No?

@jianghaochu
Copy link

@sammosummo Not sure if it's still relevant. But I think the design matrix they were talking about was the following:

The design matrix that I expect has a single column with zeros and ones indicating whether category has the base-level (0) or the other level (1).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants