Understanding how Anova relates to regression

Analysis of variance (Anova) models are a special case of multilevel regression models, but Anova, the procedure, has something extra: structure on the regression coefficients.
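To make the "special case" part concrete, here is a minimal sketch in Python (simulated data, using statsmodels; not from the paper) showing that a classical one-way Anova table is just a summary of the corresponding dummy-coded regression, and that both give the same F test:

```python
# Minimal sketch (simulated data): one-way Anova is the same fit as a linear
# regression on group indicators, so the F tests agree.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "group": np.repeat(["a", "b", "c"], 20),
    "y": rng.normal(loc=np.repeat([0.0, 0.5, 1.0], 20), scale=1.0),
})

fit = smf.ols("y ~ C(group)", data=df).fit()  # regression with group dummies
print(sm.stats.anova_lm(fit, typ=1))          # classical Anova table
print(fit.fvalue, fit.f_pvalue)               # same F test, regression framing
```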

As I put it in the rejoinder for my 2005 discussion paper:

ANOVA is more important than ever because we are fitting models with many parameters, and these parameters can often usefully be structured into batches. The essence of “ANOVA” (as we see it) is to compare the importance of the batches and to provide a framework for efficient estimation of the individual parameters and related summaries such as comparisons and contrasts. . . .

A statistical model is usually taken to be summarized by a likelihood, or a likelihood and a prior distribution, but we go an extra step by noting that the parameters of a model are typically batched, and we take this batching as an essential part of the model. . . .

A key technical contribution of our paper is to disentangle modeling and inferential summaries. A single multilevel model can yield both finite-population and superpopulation inferences. . . .

I summarize:

First, if you are already fitting a complicated model, your inferences can be better understood using the structure of that model. Second, if you have a complicated data structure and are trying to set up a model, it can help to use multilevel modeling—not just a simple units-within-groups structure but a more general approach with crossed factors where appropriate. . . .
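As a rough illustration of the crossed-factors point (simulated data, hypothetical variable names; this is one way to express such a model in statsmodels, not the setup from the paper), two non-nested grouping factors can each be given their own batch of varying intercepts by treating the whole dataset as a single group and passing each factor as a variance component:

```python
# Sketch of crossed varying intercepts (simulated data): each factor gets its
# own variance component, i.e., its own batch of effects with a common scale.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
rows, cols = 8, 6
row_eff = rng.normal(0.0, 0.8, rows)
col_eff = rng.normal(0.0, 0.3, cols)
df = pd.DataFrame([(i, j) for i in range(rows) for j in range(cols)] * 4,
                  columns=["row", "col"])
df["y"] = row_eff[df["row"]] + col_eff[df["col"]] + rng.normal(0.0, 1.0, len(df))
df["const"] = 1  # single group, so the two factors are crossed, not nested

vc = {"row": "0 + C(row)", "col": "0 + C(col)"}  # one batch per factor
fit = smf.mixedlm("y ~ 1", data=df, groups=df["const"],
                  vc_formula=vc, re_formula="0").fit()
print(fit.summary())  # the variance lines for row and col are the batch scales
```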

I’m sharing this with you now because Josh Miller pointed me to this webpage by Jonas Kristoffer Lindeløv entitled “Common statistical tests are linear models (or: how to teach stats).”

Lindeløv’s explanations are good, and I do think it’s useful for students and practitioners to understand that all these statistical procedures are based on the same class of underlying model. He also notes that the Wilcoxon rank test can be formulated approximately as a linear model on ranks, a point that we put in BDA and which I’ve occasionally blogged (see here and here). It’s good to see these ideas being rediscovered: they’re useful enough that they shouldn’t be trapped within a single book and a few old blog entries.
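A quick way to check the rank-test claim for yourself (simulated data; the equivalence is approximate, not exact): compare the Mann-Whitney/Wilcoxon rank-sum p-value with the p-value from an ordinary linear model fit to the ranks.

```python
# Sketch (simulated data): the Wilcoxon rank-sum test is approximately a linear
# model on the ranks of the outcome, so the two p-values are close.
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 50)
y = rng.normal(0.4, 1.0, 50)

u = stats.mannwhitneyu(x, y, alternative="two-sided")  # rank-based test

df = pd.DataFrame({
    "group": np.repeat([0, 1], 50),
    "r": stats.rankdata(np.concatenate([x, y])),       # ranks of all outcomes
})
lm = smf.ols("r ~ group", data=df).fit()               # linear model on ranks
print(u.pvalue, lm.pvalues["group"])                   # approximately equal
```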

The point of my post today is to emphasize that it’s not just what model you fit, it’s also how you summarize it. To put it another way, I think the unification of statistical comparisons is taught to everyone in econometrics 101, and indeed this is a key theme of my book with Jennifer, in that we use regression as an organizing principle for applied statistics. (Just to be clear, I’m not claiming that we discovered this. Quite the opposite. I’m saying that we constructed our book in large part based on the understanding we’d gathered from basic ideas in statistics and econometrics that we felt had not fully been integrated into how this material was taught.)

So, it’s well known that all these models are special cases of regression, and that’s why a good econometrics class won’t bother teaching Anova, chi-squared tests, and so forth; it just does regression. My Anova paper demonstrates that the concept of Anova has value, not just from the model (which is just straightforward multilevel linear regression) but from the structured way the fits are summarized.
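The following sketch (simulated data, statsmodels' MixedLM; the paper itself works with posterior simulations rather than point estimates) illustrates what a batch-level summary adds: the same multilevel fit yields a superpopulation standard deviation for the group effects and a rough finite-population standard deviation of the estimated effects for the groups actually in the data.

```python
# Sketch (simulated data): one multilevel fit, two batch-level summaries.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
groups = np.repeat(np.arange(10), 30)
effects = rng.normal(0.0, 0.7, 10)                 # true group effects
y = effects[groups] + rng.normal(0.0, 1.0, groups.size)
df = pd.DataFrame({"y": y, "g": groups})

fit = smf.mixedlm("y ~ 1", data=df, groups=df["g"]).fit()
sup_sd = np.sqrt(fit.cov_re.iloc[0, 0])            # superpopulation sd of effects
fin_sd = np.std([re.iloc[0] for re in fit.random_effects.values()], ddof=1)
print(sup_sd, fin_sd)                              # two summaries, one model
```

The finite-population number here is computed from shrunken point estimates, so it is only a rough stand-in for the simulation-based summaries discussed in the paper.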

For more, go to my Anova article or, for something quicker, these old blog posts:
Anova for economists
A psychology researcher asks: Is Anova dead?
Anova is great—if you interpret it as a way of structuring a model, not if you focus on F tests.

I think these are important points: the connection between the statistical models, and also the extra understanding that arises from batching and summarizing by batch.



from Statistical Modeling, Causal Inference, and Social Science https://ift.tt/2uCmk8t
via IFTTT
