The limit of statistical understanding from adapted data

In the previous post, I described how some researchers found insights from a database of fatal car crashes. This dataset has all the markings of OCCAM data, my shorthand for the characteristics of today's data:

Observational: the data come from reports of crash fatalities, rather than experiments, surveys, or other data collection methods.

No Controls: the database contains only the cases, i.e. fatalities, but not controls, which in this case would be drivers who did not suffer fatalities. The study design creates a type of control but, as discussed in the previous post, the "controls" are still fatalities that happened during different weeks. Such a study design requires the untested assumption that, under normal circumstances, the frequency of fatalities is constant within the three-week window of the study.

Seemingly Complete: it is assumed that all crashes involving fatalities are reported accurately in the database. This assumption is frequently discovered to be wrong when the analyst digs into the data. A recent example is the Tesla auto-pilot analysis: even though in theory Tesla should have data on all its vehicles, the spreadsheet contains a large number of missing values.

Adapted: the fatality data are collected for a number of uses, none of which is to investigate the potential effect of 420 Cannabis Day. Adapted data are sometimes called found data or data exhaust.

Merged: for this analysis, the researchers did not merge datasets. Most of the time, they do. For example, one of the commenters suggests looking at the effect of temperature. Doing that requires merging local temperature data with the fatality data, and merging data creates all kinds of potential data quality issues.

***

In this post, we shall set aside the conclusion of the previous post, that April 20 may not be extraordinary. We accept that April 20 is an unusual day. The first question to ask is: unusual in what way?
Let's look at the histogram again: April 20 is unusual in having a higher number of fatal car crashes compared to the average of April 13 and 27. That is what we learned from the data.

Our next question is: why is April 20 worse? According to the original study, the reason for the excess fatalities is excess cannabis consumption on April 20, because 420 is cannabis celebration day. But at this point, we only have story time.

Story time is the spinning of grand stories based on tiny morsels of data. The moment hits you in the second half of a newspaper article or research report, after the author presents the data analyses, when you realize that story-telling has begun and the report has strayed far from the evidence.

In this case, it's the link between excess fatalities and excess cannabis consumption that is tenuous. The problem goes back to OCCAM data and the lack of proper controls. If we could perform an experiment, the evidence could be interpreted more directly.

The database of fatalities does not contain data on cannabis consumption. The original study has some information under "Drug police report", with over 60 percent of the cases listed as "not tested or not reported". This information is not used to argue one way or another about cannabis consumption.

The next step for this type of study is finding corroborating evidence to support the causal story. For example: Are more of these accidents occurring around neighborhoods in which 420 Day is being celebrated? Can we find neighborhoods that only started celebrating 420 Day after a certain year, and look at whether a jump in crash fatalities occurred after that year? Do people drive more or less frequently after they smoke weed? Are there proxies for cannabis consumption (for example, maybe cannabis users are more likely to drive certain cars)?

Harper and Palayew looked into whether the crash ratio got worse over time, on the reasoning that cannabis consumption may have increased over time. They did not see such a trend, which weakens the conclusion.
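
To make the "crash ratio over time" check concrete, here is a minimal Python sketch. It assumes the crash ratio means the April 20 count divided by the average of the April 13 and April 27 counts, following the comparison described above. The function names and the counts are made up for illustration; they are not the study's figures, and this is not necessarily how Harper and Palayew ran their analysis.

```python
from statistics import mean


def crash_ratio(apr13, apr20, apr27):
    """April 20 fatal-crash count relative to the average of the two control days."""
    control = mean([apr13, apr27])
    if control == 0:
        raise ValueError("control-day average is zero; ratio undefined")
    return apr20 / control


def ratio_by_year(counts):
    """Map {year: (apr13, apr20, apr27)} to {year: crash ratio}, ordered by year."""
    return {year: crash_ratio(*days) for year, days in sorted(counts.items())}


def linear_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sxx


# Invented counts, purely illustrative (NOT taken from the fatality database).
toy_counts = {
    2001: (100, 110, 100),
    2002: (100, 120, 100),
    2003: (100, 130, 100),
}

ratios = ratio_by_year(toy_counts)
slope = linear_slope(list(ratios.keys()), list(ratios.values()))
print(ratios)
print(f"change in crash ratio per year: {slope:.3f}")
```

A slope near zero would correspond to not seeing the ratio worsen over time. Even a clearly positive slope would only be corroborating evidence, since the data still contain nothing about cannabis consumption.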

from Big Data, Plainly Spoken (aka Numbers Rule Your World) http://bit.ly/2JaT3dw
