Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wrong type inference for square terms #244

Open
githubtomtom opened this issue Oct 8, 2021 · 3 comments
Open

wrong type inference for square terms #244

githubtomtom opened this issue Oct 8, 2021 · 3 comments

Comments

@githubtomtom
Copy link

please read this.

Currently, formulas like @formula(y ~ 1 + a * (a + b) + b * b) would fail. We need to explicitly state the square terms like @formula(y ~ 1 + a + (a ^ 2) + (a & b) + b + (b ^ 2)).

It's because now term like a & a would has type InteractionTerm{ContinuousTerm{Float64}} rather than InteractionTerm{Tuple{ContinuousTerm{Float64}, ContinuousTerm{Float64}}}.

Note that currently InteractionTerm{ContinuousTerm{Float64}} can not be properly handled.

@kleinschmidt
Copy link
Member

Thanks for the report and the careful investigation! I think there's two things going on here.

  1. As you rightly point out, statsmodels chokes on a & a. That's an important edge case to consider and I think it is handled appropriately in FunctionTerm is dead, long live FunctionTerm #183.
  2. @formula(y ~ 1 + a * (a + b) + b * b) is not and will likely never be equivalent to @formula(y ~ 1 + a + (a ^ 2) + (a & b) + b + (b ^ 2)), and that's because * has a non-standard interpretation in the wilkinson notation that statsmodels.jl is based on. a * b is equivalent to a + b + a&b.

Now that I think about it a bit more, I think what you're trying to do is pun on a & b being data[:a] .* data[:b] when both a and b are continuous, in which case a & a would be teh same as data[:a] .^ 2. But this is only true when modelcols creates a single column for whatever kind of term a becomes. In the other obvious case where a is a categorical term, you're going to end up with a row-wise kronecker product of the contrast coded data, which is going to produce a wildly overcomplete model matrix that GLM etc. will choke on. So in the spirit of taking away as many footguns as possible I think this kind of punning is a bad idea and a&a should always become a.

@githubtomtom
Copy link
Author

githubtomtom commented Oct 8, 2021

I'm considering only continuous data. So, a & a means a ^ 2 to me.
The current situation is, for continuous data, a * (a + b) => a*a + a*b => a + a&a + a + b + a&b => a + b + a&a + a&b, and it is exactly what I want! But unfortunately a&a cannot be handled correctly right now.

I see your point that a&a should be translated into a. As a cross-check, I do the following in R:

> df = data.frame(a = 1:3, b = 2:4, y = 3:5)
> lm(y ~ 1 + a * (a + b), df)

Call:
lm(formula = y ~ 1 + a * (a + b), data = df)

Coefficients:
(Intercept)            a            b          a:b  
  2.000e+00    1.000e+00           NA    2.719e-16  

... and the result is: NO a&a term in R.....

so, maybe the "correct" way to specify a square term is to explicitly write it as a ^ 2 .... (disappointing though...)

@kleinschmidt
Copy link
Member

kleinschmidt commented Oct 8, 2021

Yup, if you want a .^ 2 then a ^ 2 is the way to do it. If you want both a and a^2, that starts to sound more like a polynomial basis set...the idea with StatsModels is that users/developers can create their own packages that, when loaded, provide special syntax to do things like that, and a few people have kicked around the idea of making a "basis functions" package that provides that interface.

As a toy example, you might also be interested in looking at the example from the docs about how to create a "poly term" that generates [a, a^2, a^3, ..., a^n] from @formula(0 ~ poly(a, n)), and lets you do things like

julia> fit(LinearModel, @formula(y ~ 1 + poly(a,2) + poly(b,2)), sim_dat)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

y ~ 1 + poly(a, 2) + poly(b, 2)

Coefficients:
──────────────────────────────────────────────────────────────────────────
                 Coef.  Std. Error      t  Pr(>|t|)   Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept)   0.89288    0.181485    4.92    <1e-5    0.532586    1.25317
a^1           2.73324    0.349194    7.83    <1e-11   2.04001     3.42648
a^2          -1.0114     1.34262    -0.75    0.4531  -3.67684     1.65404
b^1           0.214424   0.136868    1.57    0.1205  -0.0572944   0.486142
b^2           3.15133    0.0811794  38.82    <1e-59   2.99016     3.31249
──────────────────────────────────────────────────────────────────────────

or b * poly(a, 2) -> poly(a,2) + b + poly(a,2) & b which would be equivalent to (b * (a + a^2))

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants