How to approach a social science research problem when you have data and a couple different ways you could proceed?

tl;dr: Someone asks me a question, I can’t really tell what he’s talking about, so I offer some generic advice.

Joe Hoover writes:

An issue has come up in my subsequent analyses, which use my MrsP estimates to explore the relationship between county-level moral values and the county-level distribution of hate groups, as defined by the SPLC.

Setting aside issues of spatial auto-correlation, control variables, measurement, and all other potential complications, I want to explore the US county-level association between a county mean outcome X and the county-level distribution of rare-event Y (N Y = 0 is about 2800, N Y > 0 is about 250).

My initial analytical plan included two analyses:

1. Model Y as some zero-inflated function of X. I tried this and observed a lot of noise (small effects estimated with low uncertainty).

2. Employ a case-control design that includes all hate group counties + a random sample of counties without hate groups. This design is based on a recent paper that investigated the county-level distribution of hate groups. When I tried this approach, estimation uncertainty decreased and the effects were in the hypothesized direction (how convenient!).

My issue now is that I have two very different sets of results that rely on two very different designs. It seems to me that they address two different questions, but I am not entirely sure what question the second analysis really addresses:

1. If we know X for a given county, does that tell us anything about the expected rate of hate groups in that county? Answer: no.

2. Among counties that…mostly have at least one hate group, does knowing X tell us anything about the expected rate of hate groups in that county? Answer: yes?

Part of my confusion about how to work with these results derives from the complexity of the DGP: there are probably many counties that would be nice places to start a hate group, but maybe…there are no self-motivated bigots there. Or, the bigots there are introverted and don’t like to be in groups, etc.

I guess I’m thinking of these factors as something analogous to epidemiological exposure. For example, perhaps county-level population density increases the risk of contracting a virus at the county level. But, if the virus is rare, estimating a model that includes every county won’t reveal this relationship because most counties were never exposed.

This kind of epidemiological reasoning makes sense to me, but it is outside of my areas of expertise. And, I am also aware that it is probably not a coincidence that the reasoning which justifies the ‘good’ results ‘makes sense’ to me.

Accordingly, I would like to place myself on firmer ground by better understanding the precedents for these different analytical approaches. Specifically, I would like to know if it ever makes sense to use a case-control approach if you have data for the entire world (i.e. in my case, case-control requires throwing out observations, which feels strange). Also, I would like to have a better idea of how to interpret these kinds of results.

My reply:

I’m getting confused on the details here so let me try to step back and answer in the abstract. He’s fitting two completely different models to the same data . . . hmmmm, not quite the same data, more like two takes on the same problem.

Thinking about fundamentals . . . I was taught that, when stuck, we should think about statistical problems as prediction problems, with causal inference corresponding to prediction under various potential outcomes. So that’s what I’d do here. Instead of saying that you want to “explore the relationship between county-level moral values and the county-level distribution of hate groups,” try to define a more precise question (WWJD), then some of the answers will flow.
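One way to make the case-control question concrete is a small simulation. The sketch below is not Hoover's analysis and uses none of his data: it invents about 3050 counties with a standardized predictor X, generates a rare binary outcome (roughly 250 counties with at least one event) from a plain logistic model rather than a zero-inflated count model, and fits the same model to the full data and to a subsample made of all cases plus an equal number of randomly drawn controls. The intercept, slope, seeds, and function names are all hypothetical. Under these assumptions, with controls sampled independently of X, both fits target the same slope, and the main difference is an intercept shift of roughly the log of the control sampling fraction, so the subsample changes what the fitted probabilities mean rather than what the slope means.

```python
import math
import random


def simulate_counties(n=3050, intercept=-2.5, slope=0.5, seed=0):
    """Simulate a predictor x and a rare binary outcome y for n invented counties.

    All parameter values are hypothetical, chosen only so that roughly
    250 of the counties end up with y = 1.
    """
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = []
    for x in xs:
        p = 1.0 / (1.0 + math.exp(-(intercept + slope * x)))
        ys.append(1 if rng.random() < p else 0)
    return xs, ys


def fit_logistic(xs, ys, iters=25):
    """Fit logit P(y = 1) = a + b * x by Newton-Raphson; return (a, b)."""
    a, b = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            w = p * (1.0 - p)
            g0 += y - p          # gradient with respect to a
            g1 += (y - p) * x    # gradient with respect to b
            h00 += w             # entries of the 2x2 information matrix
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


def case_control_sample(xs, ys, controls_per_case=1, seed=1):
    """Keep every case (y = 1) plus a random sample of controls (y = 0)."""
    rng = random.Random(seed)
    cases = [i for i, y in enumerate(ys) if y == 1]
    controls = [i for i, y in enumerate(ys) if y == 0]
    k = min(len(controls), controls_per_case * len(cases))
    keep = cases + rng.sample(controls, k)
    return [xs[i] for i in keep], [ys[i] for i in keep]


if __name__ == "__main__":
    xs, ys = simulate_counties()
    cc_xs, cc_ys = case_control_sample(xs, ys)
    a_full, b_full = fit_logistic(xs, ys)
    a_cc, b_cc = fit_logistic(cc_xs, cc_ys)
    print(f"full data:    n={len(xs)}, intercept={a_full:.2f}, slope={b_full:.2f}")
    print(f"case-control: n={len(cc_xs)}, intercept={a_cc:.2f}, slope={b_cc:.2f}")
```

Changing `controls_per_case` or the seeds shows how much the subsample slope moves from draw to draw. If the two designs give very different slopes on real data, that gap is something to explain, for example by a model that does not fit or by controls that were not sampled independently of X.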



from Statistical Modeling, Causal Inference, and Social Science https://ift.tt/2FJEnjq
via IFTTT