
The limit of statistical understanding from adapted data

In the previous post, I described how some researchers found insights in a database of fatal car crashes. This dataset has all the markings of OCCAM data, the acronym I use to summarize the characteristics of today's data:

Observational: the data come from reports of crash fatalities, rather than from experiments, surveys, or other data collection methods.

No Controls: the database contains only the cases, i.e. fatalities, but not controls, which in this case would be drivers who did not suffer fatalities. The study design creates a type of control but, as discussed in the previous post, the "controls" are still fatalities that simply happened during different weeks. Such a study design requires the untested assumption that, under normal circumstances, the frequency of fatalities is constant within the three-week window of the study.

Seemingly Complete: it is assumed that all crashes involving fatalities are reported accurately in the database. This assumption is frequently discovered to be wrong when the analyst digs into the data. A recent example is the Tesla auto-pilot analysis: even though in theory Tesla should have data on all its vehicles, the spreadsheet contains a large number of missing values.

Adapted: the fatality data are collected for a number of uses, none of which is to investigate the potential effect of 420 Cannabis Day. Adapted data are sometimes called found data or data exhaust.

Merged: for this analysis, the researchers did not merge datasets. Most of the time, they do. For example, one of the commenters suggests looking at the effect of temperature. Doing that requires merging local temperature data with the fatality data. Merging data creates all kinds of potential data quality issues.

***

In this post, we shall set aside the conclusion of the previous post, that April 20 may not be extraordinary. We accept that April 20 is an unusual day. The first question to ask is: unusual in what way?
Let's look at the histogram again: April 20 is unusual in having a higher number of fatal car crashes compared to the average of April 13 and 27. That is what we learned from the data.

Our next question is: why is April 20 worse? According to the original study, the reason for the excess fatalities is excess cannabis consumption on April 20, because 420 is cannabis celebration day. But at this point, we only have story time. Story time is the spinning of grand stories based on tiny morsels of data. The moment hits you in the second half of a newspaper article or research report, after the author has presented the data analyses, when you realize that story-telling has begun and the report has strayed far from the evidence.

In this case, it's the link between excess fatalities and excess cannabis consumption that is tenuous. The problem goes back to OCCAM data and the lack of proper controls. If we could perform an experiment, the evidence could be interpreted more directly. The database of fatalities does not contain data on cannabis consumption. The original study has some information on "Drug police report", with over 60 percent of the cases listed as "not tested or not reported". This information is not used to argue one way or another about cannabis consumption.

The next step for this type of study is finding corroborating evidence to support the causal story. For example, are more of these accidents occurring around neighborhoods in which 420 Day is being celebrated? Can we find neighborhoods that only started celebrating 420 Day after a certain year, and look at whether a jump in crash fatalities occurred after that year? Do people drive more or less frequently after they smoke weed? Are there proxies for cannabis consumption (for example, maybe cannabis users are more likely to drive certain cars)? And so on. Harper and Palayew looked into whether the crash ratio got worse over time, since cannabis consumption may have increased over time. They did not find such a trend, which weakens the conclusion.
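To make the comparison behind the histogram concrete, here is a minimal sketch of the control-day design described above: the count of fatal crashes on April 20 is divided by the average count on April 13 and April 27. The counts are made-up placeholders, not figures from the study or the fatality database, and the function name `excess_ratio` is hypothetical.

```python
from statistics import mean


def excess_ratio(case_count, control_counts):
    """Ratio of the case-day count to the average of the control-day counts."""
    baseline = mean(control_counts)
    if baseline <= 0:
        raise ValueError("control days must have a positive average count")
    return case_count / baseline


# Hypothetical counts, for illustration only (NOT taken from the database).
april_20 = 12
controls = [9, 11]  # April 13 and April 27, one week before and after

ratio = excess_ratio(april_20, controls)
print(f"April 20 vs. control-day average: {ratio:.2f}")
```

A ratio above 1 says only that April 20 had more fatalities than the control days. It says nothing about why, and that gap is where story time begins.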

from Big Data, Plainly Spoken (aka Numbers Rule Your World) http://bit.ly/2JaT3dw
