How to use Kahneman-Tversky research from the 1970s in the big data era

Just finished reading The Undoing Project by Michael Lewis, his bio of the Kahneman and Tversky duo who made many of the seminal discoveries in behavioral economics. In Chapter 7, Lewis recounts one of their most celebrated experiments, which demonstrated the “base rate fallacy.”

Here is one version of the experiment. The test subjects are asked to make judgments based on a vignette.

Psychologists have administered tests to 100 people, 70 of whom are lawyers and 30 of whom are engineers.

(A) If one person is selected at random from this group, what is the chance that the selected person is a lawyer?

(B) Dick is selected at random from this group. Here is a description of him: “Dick is a 30 year old man. He is married with no children. A man of high ability and high motivation, he promises to be quite successful in his field. He is well liked by his colleagues.” What is the chance that Dick is a lawyer?

Those subjects who answered (A) made the right judgment, in accordance with the base rate of 70 percent. The answer to (B) should be the same, since it shouldn't matter whether the random person is named Dick or not, and the generic description provides no useful information for determining Dick’s occupation. However, those subjects who answered (B) revised the chance down to about 50-50. The experiment showed that access to Dick’s description led people astray: they ignored the base rate. Note that the base rate here is the prior probability.

***

What are the practical applications of the KT experiment for business data analysts?

tl;dr Before throwing the kitchen sink of variables (features) into your statistical (machine learning) models, review the literature on the base rate fallacy, starting with the Kahneman-Tversky experiments.

1. Adding more variables can make your predictions worse

Let's start with what kind of additional information is provided by Dick’s description. The sample size has not changed: it’s still one. The data expanded only in the number of variables (or features). Specifically, these eight additional variables:

X1 = age
X2 = gender
X3 = marital status
X4 = number of children
X5 = ability level
X6 = motivation level
X7 = expected level of success in field
X8 = popularity among colleagues

In today’s age of surveillance data, it is all too easy for any analyst to assemble more variables. The KT experiment shows that having more variables does not imply you have more useful information. Worse, those extra variables may distract you from the base rate, leading to worse predictions.

2. Machines are even more susceptible than humans

If humans are prone to such mistakes, should we use machines instead? Sadly, machines will perform worse. Machines allow us to process even more variables at even greater efficiency. Instead of eight useless variables, you can now add 800 or even 8,000 useless variables about Dick. The machines will then inform you which subset of these variables “pop.” The more useless data you add in, the higher the chance you will encounter an accidental correlation.

Before throwing the kitchen sink of variables (features) into your statistical (machine learning) models, review the literature on the base rate fallacy, starting with the ground-breaking Kahneman-Tversky experiments.
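Returning to the vignette, the claim that the answer to (B) should equal the answer to (A) can be written out with Bayes' rule. This is only a sketch: the symbols L (lawyer), E (engineer) and D (Dick's description) are labels introduced here, and the key assumption, taken from the post's reading of the vignette, is that the description is equally likely for a lawyer and an engineer.

```latex
\[
P(L \mid D) \;=\; \frac{P(D \mid L)\,P(L)}{P(D \mid L)\,P(L) + P(D \mid E)\,P(E)}
\]
% Assumption: the description is uninformative, i.e. P(D | L) = P(D | E),
% so the likelihood terms cancel and only the base rates remain.
\[
P(L \mid D) \;=\; \frac{P(L)}{P(L) + P(E)} \;=\; \frac{0.7}{0.7 + 0.3} \;=\; 0.7
\]
```

Under that assumption the posterior equals the prior, so moving to 50-50 amounts to discarding the base rate rather than updating on evidence.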
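The point about accidental correlations can be illustrated with a small simulation. This is a rough sketch and not part of the original analysis: the function names are hypothetical, the "useless variables" are assumed to be independent Gaussian noise, and the 100 people with a 70/30 split mirror the vignette. It reports the strongest absolute correlation found between any noise variable and the lawyer/engineer label as the number of variables grows.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def max_spurious_correlation(n_people, n_features, seed=0):
    """Largest |correlation| between pure-noise features and the label.

    Assumption: every feature is independent Gaussian noise, so any
    correlation with the label is accidental by construction.
    """
    rng = random.Random(seed)
    # 1 = lawyer, 0 = engineer; 70 percent lawyers, as in the vignette.
    n_lawyers = round(0.7 * n_people)
    labels = [1 if i < n_lawyers else 0 for i in range(n_people)]
    best = 0.0
    for _ in range(n_features):
        feature = [rng.gauss(0.0, 1.0) for _ in range(n_people)]
        best = max(best, abs(pearson(feature, labels)))
    return best


if __name__ == "__main__":
    for k in (8, 800, 8000):
        print(k, "useless variables -> strongest |r| =",
              round(max_spurious_correlation(100, k), 3))
```

Because the same seed is reused, each larger run contains the smaller run's features as a prefix, so the reported maximum can only stay the same or grow as variables are added. None of these variables carries any information about occupation, yet a search over thousands of them will still surface one that appears to "pop."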

from Big Data, Plainly Spoken (aka Numbers Rule Your World) http://bit.ly/2U1w9XU