
DUIW 420: offering up 20 paper ideas pre-approved for prestigious journals

First, you have to read to the end for the 20 paper ideas. And if you're wondering about the acronym, it stands for Driving Under the Influence of Weed on 420 Day, which, as I learned from Andrew Gelman's blog, is a day of celebration of cannabis. Andrew's blog post is about the exemplary work done by Sam Harper and Adam Palayew, debunking a highly publicized JAMA study that claimed that 420 Day is responsible for a 12 percent increase in fatal car crashes.

The discussion provides great fodder for examining how to investigate observational data, which is what most of Big Data is about. It is a cautionary tale of what not to do.

***

The blog begins with Harper/Palayew channeling Staples/Redelmeier, the authors of the study: "fatal motor vehicle crashes increase by 12% after 4:20 pm on April 20th (an annual cannabis celebration)."

This short sentence captures the gist of the original study but it omits an important detail: relative to what is the increase measured?

If we ran an experiment, we would recruit a group of drivers, and select half of them at random to smoke weed on April 20. Then, we would count what proportion of drivers in each group suffered fatal car crashes after 4:20 pm. The analysis would be straightforward: what's the difference in proportions between the two groups? With such an experiment, it is possible to draw a causal conclusion.

Alternatively, we could conduct a case-control study. The cases are the drivers who suffered fatal car crashes on April 20. We collect demographic data on these drivers. Then, we define a set of "controls": drivers who did not suffer car crashes on April 20 but who, on average, have the same demographic characteristics as the cases. Next, we need data on cannabis consumption, preferably on April 20. We want to show that the level of cannabis consumption is significantly higher for cases than for controls.

(For further discussion of these analysis designs, see Chapter 2 of Numbers Rule Your World (link).)

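
The experimental comparison described above boils down to a difference between two proportions. Here is a minimal sketch of that calculation using a standard pooled two-proportion z-test. The function name and the counts in the usage lines are hypothetical, made up purely for illustration; no such experiment was run.

```python
import math


def two_proportion_z(x_treat, n_treat, x_ctrl, n_ctrl):
    """Compare event proportions between a treated and a control group.

    Returns (difference in proportions, z statistic, two-sided p-value),
    using the pooled standard error under the null of no difference.
    """
    p_treat = x_treat / n_treat
    p_ctrl = x_ctrl / n_ctrl
    diff = p_treat - p_ctrl

    # Pooled proportion: the best single estimate if both groups share one rate.
    pooled = (x_treat + x_ctrl) / (n_treat + n_ctrl)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_treat + 1 / n_ctrl))
    if se == 0:
        # No events at all (or all events): nothing to distinguish the groups.
        return diff, 0.0, 1.0

    z = diff / se
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return diff, z, p_value


# Hypothetical counts, for illustration only.
diff, z, p_value = two_proportion_z(x_treat=30, n_treat=1000, x_ctrl=20, n_ctrl=1000)
print(f"difference={diff:.3f}, z={z:.2f}, p={p_value:.3f}")
```

Because drivers are assigned to the groups at random, a difference this test flags can be attributed to the treatment, which is exactly what the found-data analysis cannot offer.
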
The actual study was neither experiment nor case-control. It was a piece of pure data analysis, based on "found data". I like to call this "adapted data," the "A" in my OCCAM framework for Big Data: data collected for other purposes that the researcher has adapted for his/her own objectives. In this study, the adapted data come from a database of fatal car crashes.

So how were the adapted data analyzed? Harper/Palayew answer this question in their second description of the research: "Over 25 years from 1992-2016, excess cannabis consumption after 4:20 pm on 4/20 increased fatal traffic crashes by 12% relative to fatal crashes that occurred one week before and one week after."

The cases are the fatal car crashes that occurred after 4:20 pm on 420 Day. The comparison isn't to the drivers who did not suffer crashes on the same day. The reference group consists of fatal car crashes that occurred after 4:20 pm on 4/13 and 4/27. The difference in the average number of crashes is taken to result from "excess cannabis consumption".

Notice that such a conclusion requires a strong assumption. We must believe that, absent 420 Day, 4/13, 4/20 and 4/27 ought to have the same fatal crash frequencies.

***

You hopefully recognize that the analysis design for adapted data is on much shakier ground than either an experiment or a case-control study.

Harper/Palayew's initial debunking focused on one issue: what's so special about April 20? To answer that, they repeated the same analysis on every day of the year. The following pretty chart summarizes their finding:

The red line is the line of no difference (between the analyzed day and the two reference days from the week before/after). Each vertical line is the range of estimates of the difference for a specific day of the year. The range for 4/20 is highlighted, and several other days with elevated fatal crash counts are labeled.

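
To make the "repeat the analysis on every day of the year" step concrete, here is a rough sketch under simplifying assumptions. It takes one aggregated crash count per calendar day, divides each day's count by the average of the counts one week before and one week after, and wraps around the year end. The function name, the wraparound rule and the randomly generated counts are all assumptions for illustration; this is not Harper/Palayew's code or data, and their estimates came from a more careful model.

```python
import random


def crash_ratios(daily_counts, lag=7):
    """Ratio of each day's count to the mean of the days `lag` before and after.

    `daily_counts` holds one aggregated count per calendar day. Indexing wraps
    around the end of the year, which is a simplification made for this sketch.
    """
    n = len(daily_counts)
    ratios = []
    for day in range(n):
        before = daily_counts[(day - lag) % n]
        after = daily_counts[(day + lag) % n]
        reference = (before + after) / 2
        # Guard against an empty reference window.
        ratios.append(daily_counts[day] / reference if reference > 0 else float("nan"))
    return ratios


# Synthetic counts (NOT real crash data): 365 days of pure noise.
rng = random.Random(420)
synthetic_counts = [rng.randint(1800, 2200) for _ in range(365)]
ratios = crash_ratios(synthetic_counts)

APRIL_20 = 109  # zero-based day index of April 20 in a non-leap year
print(f"ratio for 4/20 on synthetic data: {ratios[APRIL_20]:.3f}")
```

Even with pure noise as input, the ratios scatter around 1.0, which is why a single day's ratio has to be judged against the spread of all the others.
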
The chart was originally published here, with the following commentary: "There is quite a lot of noise in these daily crash rate ratios, and few that appear reliably above or below the rates +/- one week."

Andrew adds: "Nothing so exciting is happening on 20 Apr, which makes sense given that total accident rates are affected by so many things, with cannabis consumption being a very small part."

While the chart looks cool and sophisticated, the following histogram of the same data helps the reader digest the information.

I took the daily estimates of the fatal crash ratios from Harper/Palayew's published data. Each ratio represents the crashes on the analysis day relative to the crashes on the two reference days. The histogram shows the day-to-day variability of the crash ratios, which is what we need to answer the question: how special is 4/20?

The histogram is roughly centered at 1.0, meaning no observed difference. The black vertical line shows the ratio for 4/20. It sits to the right of center; in fact, it is at the 94th percentile. In classical terms, this is a one-sided p-value of about 0.06, marginal at best and short of the conventional 0.05 threshold.

The following 21 days have more extreme ratios than 4/20:

Jul 4, Dec 23, Dec 21, Nov 21, Sep 1, Dec 20, Sep 2, Jul 3, Dec 31, Oct 31, Nov 23, Dec 18, Dec 6, Jul 14, Sep 4, Dec 22, Mar 17, May 25, Apr 1, Mar 7, Dec 19

Will JAMA editors accept one research paper for each of these days? The work is already done; the rest is story time.

P.S. [4/27/2019] Replaced the first chart with a newer version from Harper's site. This version contains the point estimates that the other version did not. Those point estimates are used to generate the histogram.
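
The percentile reading of the histogram can be checked with a few lines of arithmetic. The sketch below counts how many daily ratios exceed the ratio of a chosen day and converts that count to a share of all days. Two assumptions are mine: "more extreme" is taken to mean a larger ratio, and the year is taken to have 365 days. The toy list of ratios is constructed so that exactly 21 days exceed the target, matching the count above; the individual values are placeholders, not the published estimates.

```python
def upper_tail_share(ratios, target_index):
    """Count the days whose ratio exceeds the target day's ratio.

    Returns (number of more extreme days, that number as a share of all days).
    The share works as a rough one-sided empirical p-value.
    """
    target = ratios[target_index]
    n_more_extreme = sum(1 for r in ratios if r > target)
    return n_more_extreme, n_more_extreme / len(ratios)


# Placeholder ratios: 343 unremarkable days, the target day, and 21 days above it.
toy_ratios = [1.0] * 343 + [1.12] + [1.2] * 21
n_above, share = upper_tail_share(toy_ratios, target_index=343)

print(f"{n_above} of {len(toy_ratios)} days exceed the target")
print(f"upper-tail share: {share:.3f}; percentile: {100 * (1 - share):.0f}")
```

With 21 of 365 days above it, the target day lands at roughly the 94th percentile, consistent with the figure quoted for 4/20.
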

from Big Data, Plainly Spoken (aka Numbers Rule Your World) http://bit.ly/2J31CYb
