DataScience4you2me

Posts

Showing posts with the label Big Data

How to act like a data scientist 6: exit polls prove Sanders turnout success

It's seriously dangerous to send a data scientist the data. (Andrew Gelman has wondered out loud often about why scientific researchers are so reluctant to release their datasets. Most recently, see this incident.) Now that I got a hold of the Super Tuesday data, there is no end in sight of testing all the talking points that the mainstream media has been pushing all day, all night. We were also supposed to believe that the Democrats have achieved a huge surge in turnout compared to 2016, and in particular, such turnout is attributed to Biden, and specifically denied to Sanders's supporters. This conclusion doesn't pass the smell test, because the same pundits told us that Biden achieved his victories on Super Tuesday without campaigning in many of those states (!), plus there is plenty of video evidence of huge crowds at rallies for Sanders. So, I pulled out the data. *** tl;dr: Exit polls prove Sanders was successful at turning out first-time voters. The pundits just ign...

Using data to guess authorship of the Federalist Papers

I was recently reminded of how statistics settled a question of authorship of The Federalist Papers. These were 85 arguments published under a pseudonym in support of the U.S. Constitution. It is now known that they were written by Alexander Hamilton, James Madison and John Jay. For a long time, there was uncertainty about the authorship of twelve of these. Around 1960, two statisticians, Frederick Mosteller and David Wallace, published a celebrated paper that solved the riddle. The key components of the solution include: Noticing that people’s writing styles differ in terms of word preference. Certain writers habitually use certain words more often than others. “Common words” like prepositions are better differentiators than less common words. For one thing, common words are common, and therefore we have more data to establish the base rate of authors. For example, Madison almost never wrote the word “upon” while Hamilton used the word quite often; thus, a document that contains...
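The word-preference idea above can be sketched in a few lines of Python. This is only a toy illustration, not the statisticians' actual method, which was considerably more elaborate. The function name `rate_per_1000` and both sample sentences are made up for this sketch; they are not Federalist text.

```python
import re
from collections import Counter


def rate_per_1000(text, word):
    """Occurrences of `word` per 1,000 words of `text` (case-insensitive)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return 1000 * Counter(tokens)[word] / len(tokens)


# Hypothetical snippets, invented for illustration only.
sample_a = "Upon reflection, the decision rests upon the people and upon their judgment."
sample_b = "On reflection, the decision rests on the people and on their judgment."

rate_a = rate_per_1000(sample_a, "upon")
rate_b = rate_per_1000(sample_b, "upon")
print(f"sample_a: {rate_a:.1f} per 1,000 words")
print(f"sample_b: {rate_b:.1f} per 1,000 words")
```

With real documents, one would compute such rates for many common words on texts of known authorship, then ask which author's rates a disputed text resembles more closely.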

False-positive science

The Atlantic reports on the dynamics of yet another group of scientists coming to grips with having wasted time and resources chasing down a dead end. (link) It's a good read but long. Here is the gist of it: Almost 20 years ago, some researchers made a huge splash by claiming to have discovered the "depression gene". The one gene eventually engendered 450 publications, and when counting related genes, over 1,000 publications. A recent large-scale "validation" study is likely to bring down the entire cottage industry - the depression gene is found to have little explanatory power for depression after all. Gene data is an example of a type of Big Data. Big Data can be big in terms of the number of individuals in the dataset, or the number of measurements per individual. Two decades ago, the scale was attained by virtue of more measurements, not more individuals. The original study looked at about 300 or so individuals but each person's genome is vast. The bas...

Losing control of the data

Another day, another story about Facebook data. Ars Technica reports that Facebook is suing the South Korean app developer Rankwave, claiming the company misused data it received from Facebook. Rankwave creates mobile apps through which it obtained Facebook user data for 10 years. All we know about this situation comes from the Facebook press release, and it's not clear what the offense is. The article cited the violation as using the Facebook data to "create and sell advertising and marketing analytics and models". That's how Facebook itself uses the user data, and it is also why Facebook partners want access to user data. One part of the press release rings very true: Facebook admits that it does not control the data once shared with third parties. Facebook lawyers demanded Rankwave do the following: Provide a full accounting of Facebook user data in its possession; Identify all individuals, organizations, and governmental entities to which it had sold, or otherwise distributed, Fac...

Labeling data: a crucial part of machine learning

Several recent news stories cover the topic of “labeling” data. For example, this Bloomberg article says Amazon is sending voice recordings from its Echo speakers to be heard and transcribed by human listeners. This Reuters article discusses Facebook’s contractors in India, Romania, Philippines, etc. hired to “label” status updates, shared links, event posts, Stories feature uploads, videos and photos. The authors of these articles express genuine shock and awe. They apparently believed that “machine learning” means no humans involved. The tech industry allows this misconception to fester by being opaque about how machine learning works. (The reporters are also dismayed by the privacy invasion. The Echo speakers are constantly recording in users’ households. Facebook did not have explicit permission from users to send their data out for labeling.) *** Humans have always been a part of the machine learning workflow, and will continue to be. Let’s use one of the examples in the Facebo...

Some practice case interview questions for data science

In Part 1 of my KDnuggets article, I explained what hiring managers mean when they look for critical thinking in the arena of data science and analytics. These requirements relate to the nature of data problems found in industry and business settings. The datasets are generally observational, self-selected, non-random, with hidden biases, and increasingly OCCAM (link); the business leaders have high-level objectives ("we want to increase customer loyalty"). The data scientist/analyst is the person in the "middle," trying to figure out how to make the problem precise, and solvable by a systematic analysis of available data. In Part 2, I offer some practice case interview questions, based on three recent news events: the college admissions scandal; IPOs of ride-sharing companies like Lyft and Uber; and the Blue Apron post-IPO doldrums. Long a staple of the management consulting hiring process, the case interview is a free-flowing dialogue between the interviewer and the inte...

The limit of statistical understanding from adapted data

In the previous post, I described how some researchers found insights from a database of fatal car crashes. This dataset has all the markings of OCCAM data, which I use to summarize the characteristics of today's data. Observational: the data come from reports of crash fatalities, rather than experiments, surveys, or other data collection methods. No Controls: the database only contains the cases, i.e. fatalities but not controls, which in this case should be drivers who did not suffer fatalities. The study design creates a type of control but as discussed in the previous post, the "controls" are still fatalities, just that they happened during different weeks. Such a study design requires the untested assumption that under normal circumstances, the frequency of fatalities is constant within the three-week window of the study. Seemingly Complete: it is assumed that all crashes involving fatalities are reported accurately in the database. This assumption is frequently disco...

What is the hardest part of the data science job search?

When I ask job-seekers what their biggest obstacle is to finding a job in data science and analytics, one of the most frequent answers is performing during the interview. Some of them are stumped by technical interviews (coding) while even more are worried about the case interviews. The purpose of the case interview is to test critical thinking. It is as challenging for the job candidate as for the hiring manager! Technical questions have pretty standard answers, and it's easy to score the answers. Case interviews are like essays - the hiring manager has to make judgment calls. My piece on critical thinking is featured at the KDnuggets blog, which I've followed since I was an analyst. In this first part, I explain the two aspects of critical thinking that the case interviewer is typically looking for. There will be a part 2 in which I provide some practice examples. P.S. [5/1/2019] This piece from TED is relevant. from Big Data, Plainly Spoken (aka Numbers Rule Your Worl...

DUIW 420: offering up 20 paper ideas pre-approved for prestigious journals

First, you have to read till the end for the 20 paper ideas. And if you're wondering about the acronym, it's Driving Under the Influence of Weed on 420 Day, which I learned from Andrew Gelman's blog is a day of celebration of cannabis. Andrew's blog post is about the exemplary work done by Sam Harper and Adam Palayew, debunking a highly-publicized JAMA study that claimed that 420 Day is responsible for a 12 percent increase in fatal car crashes. The discussion provides great fodder for examining how to investigate observational data, which is what most of Big Data is about. It is a cautionary tale for what not to do. *** The blog begins with Harper/Palayew channeling Staples/Redelmeier, the authors of the study: "fatal motor vehicle crashes increase by 12% after 4:20 pm on April 20th (an annual cannabis celebration)." This short sentence captures the gist of the original study but it omits an important detail: to what is the increase relative? If we ran an ex...

It's not us; it's the weather

If you are a frequent flier, you already know the gist of this nice article by the BBC: that airlines are allowed to sandbag the flight durations. A flight that takes 60 minutes will be portrayed to fliers as taking twice as long, if not longer. The airlines are even allowed to lie about this practice. When your flight is delayed taking off, the captain claims that s/he will “make up for the delay,” as if the plane could be driven faster on command. (Were they deliberately going slower before?) The truth is that the schedule is padded, so that it can absorb a limited amount of delay. This quote sums the situation up: “By padding, airlines are gaming the system to fool you.” At the very bottom of the article, you’d find the potential motivation – to avoid compensating travelers for long delays, as required by law in some countries. *** The situation here is similar to the road congestion problem discussed in Chapter 1 of Numbers Rule Your World (link). Managing perceived time is as impo...

How to use Kahneman-Tversky research from 1970s in the big data era

Just finished reading The Undoing Project by Michael Lewis, his bio of the Kahneman and Tversky duo who made many of the seminal discoveries in behavioral economics. In Chapter 7, Lewis recounts one of their most celebrated experiments which demonstrated the “base rate fallacy.” Here is one version of the experiment. The test subjects are asked to make judgments based on a vignette. Psychologists have administered tests to 100 people, 70 of whom are lawyers and 30 are engineers. (A) If one person is selected at random from this group, what is the chance that the selected person is a lawyer? (B) Dick is selected at random from this group. Here is a description of him: “Dick is a 30 year old man. He is married with no children. A man of high ability and high motivation, he promises to be quite successful in his field. He is well liked by his colleagues.” What is the chance that Dick is a lawyer? Those subjects who answered (A) made the right judgment, in accordance with the base ra...
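The arithmetic behind the two questions can be sketched with Bayes' rule. This is a minimal illustration, assuming (as a labeled assumption, since the excerpt is cut off) that the description of Dick is equally likely to fit a lawyer or an engineer. The function name and the likelihood values below are hypothetical; only the 70/30 split comes from the vignette.

```python
def posterior_lawyer(prior_lawyer, p_desc_given_lawyer, p_desc_given_engineer):
    """Bayes' rule: P(lawyer | description) for a two-group population."""
    numerator = prior_lawyer * p_desc_given_lawyer
    denominator = numerator + (1 - prior_lawyer) * p_desc_given_engineer
    return numerator / denominator


# Question (A): no description, so the answer is just the base rate.
p_a = 70 / 100

# Question (B), assuming the description fits both professions equally well
# (the 0.5 values are arbitrary; only their equality matters).
p_b = posterior_lawyer(0.7, 0.5, 0.5)

# Hypothetical contrast: a description three times as likely for an engineer.
p_c = posterior_lawyer(0.7, 0.1, 0.3)

print(p_a, round(p_b, 4), round(p_c, 4))
```

Under the equal-likelihood assumption, the description carries no information and the answer to (B) stays at the base rate. Even in the hypothetical contrast case, the base rate still pulls the answer well above what the description alone would suggest.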

Do wearable healthcare devices work?

A report came out from Stanford School of Medicine about a study of Apple Watch's health monitoring features. Some headline writers are proclaiming that "finally, there is proof that these watches benefit our health!" For example, Apple Watch Stanford Study Shows How It Can Save Lives (link). When you read the official story, you will learn the following facts about the study: The research is funded by Apple. It was a purely observational study in which they followed 400,000 people who wear Apple Watches. Participants must own both an Apple Watch and an iPhone to be eligible (plus meeting other criteria). There was no "control" group - they did not follow anyone who did not use Apple Watch or use any other health monitoring wearables. Every participant is self-selected. The device issued warnings to only 0.5 percent of the participants (~ 2,160). Those who received a warning were directed to a video consultation; and the doctor decided whether or not to send the parti...

Too much information as bad as no information

I haven't read Kartik's book but it looks like something I'd enjoy. In this interview with Verge, he cited the following experiment: One problem with grading in college courses is that different TAs are more or less lenient. So they used an algorithm to normalize or modify the grade so that the level of leniency was consistent. Then, one group received minimal information about how the algorithm worked, a second group got some high-level data, and the third got all of the information on how the algorithm worked and the raw data and all the changes made. The result was that the level of trust in the third group was back down to the same level as the group that didn’t receive any information. So it goes to show that if you reveal that much information, it’s as if you reveal nothing. This discussion is about transparency when it comes to algorithms. His recommendation is to provide "just enough" information but not "too much". The test group that received d...

Excel error, but could happen in any tool

The most famous Excel spreadsheet error in recent memory is the one that asserts that countries with high debt-to-GDP ratios experience slow growth. The most recent case of Excel error is the National Highway Traffic Safety Administration's (NHTSA) analysis using data supplied by Tesla to conclude that the "autosteer" feature reduces crash rates by 40%. NHTSA no longer stands behind that analysis, after it was debunked by a consulting firm called Quality Control Systems which spent two years forcing the release of the data. The error can happen not just in Excel but in any analytical tool because it relates to how missing data are treated. First, the analyst has to notice the existence of missing values. Then, the analyst has to recognize - through some further analysis - that the data with missing values are not like the data without missing values. Finally, the analyst decides to treat the missing values in the appropriate manner - sometimes, they can be dropped; other times, i...
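The first two steps described above (notice the missing values, then check whether rows with missing values differ from rows without) can be sketched as follows. The records, field names and numbers are entirely hypothetical and have nothing to do with the NHTSA or Tesla data; the point is only the shape of the check.

```python
# Hypothetical toy records; `exposure` is missing (None) for some rows.
records = [
    {"exposure": 100, "events": 1},
    {"exposure": 200, "events": 2},
    {"exposure": None, "events": 3},
    {"exposure": None, "events": 4},
]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# Step 1: notice the missing values by splitting the rows.
complete = [r for r in records if r["exposure"] is not None]
missing = [r for r in records if r["exposure"] is None]

# Step 2: compare a fully observed field across the two groups.
mean_events_complete = mean([r["events"] for r in complete])
mean_events_missing = mean([r["events"] for r in missing])

print(f"{len(missing)} of {len(records)} rows have missing exposure")
print(f"mean events, complete rows: {mean_events_complete}")
print(f"mean events, missing rows:  {mean_events_missing}")
```

When the two groups differ on the fields that are observed, as they do in this toy data, simply dropping the rows with missing values changes the population being analyzed.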

Is A/B testing that scary?

Reader AR pointed me to this Fast Company article that examines the ethics of A/B testing. The only way to comprehend this point of view is to think of A/B testing not as a scientific experiment but as a decision-making process that involves running an experiment. The researchers are unhappy that A/B tests could lend support to decisions that have an undesirable impact on society. Two such examples are described: (1) Two images are tested for a job ad. During the test, site visitors are shown one of the two images, selected at random. The winner of the test is an image that disproportionately drives male applicants. (2) Separate pricing tests are run in different zip codes. The "winning" prices at the conclusion of these tests are different for different zip codes. Because racial profiles differ by zip code, prices are in effect different for different races. Therefore, the test result leads to race-based discrimination. There are two important questions to discuss here. First, what is...

Regulating data sharing is heating up

With the U.K. report on Facebook, and the stern language within it, the train on regulating data sharing may finally reach the station this year. The FTC is also likely to impose a stiff fine on Facebook for violating a consent decree. So let's learn more about this data sharing business. If you prefer a video, the gist of this post can be heard here. *** First, let's talk about data flows and the "cloud". Data are stored in computers that are called servers. In the cloud computing model, these servers are owned - not by the companies that collect the data - but by large tech companies like Amazon, Google, Microsoft, etc. who are responsible for managing the servers. These servers are geographically dispersed and so when data enter the cloud, they get replicated and spread to many servers. The technical benefit of such replication is recoverability of the data (allowing the use of cheaper, less reliable computers) but now, the data become much harder to delete. Data b...

How scary are the anti-vaxxers?

I don’t agree with Daniel’s conclusions in his article in Slate about the measles “crisis” but he did his research and there is a lot to chew on. You don’t have to agree with him to find this article thought-provoking. There is one paragraph which everyone should read. It’s a celebration of science, and how it saved lives. (Daniel used this story for a different purpose: he argued that we never “eradicated” measles, and therefore, the anti-vaxxers could never have reversed some mythical victory.) During the most recent, major wave of measles infection in the U.S., between 1989 and 1991, close to 56,000 people fell ill and more than 100 people died...The 1989–91 epidemic was large enough and deadly enough to cast light on two pressing problems: First, that a single vaccine dose was not sufficient to protect children, and second, that black and Latino children, especially those living in urban areas, were less likely to be vaccinated, and thus more vulnerable to the disease. Efforts were...

Roundup of AI startups

Business Insider profiles 12 AI startups that their panel of venture capitalists considers likely to succeed in 2019. These startups fall into three categories: A) Enterprise apps: Appzen (auditing), Atrium (legal), FortressIQ (processes), Guru (knowledge management), People.ai (sales) B) Robots: Farmwise (tractor), 6 River Systems (warehouse), Shield.ai (drones in risky places) C) Others: Superhuman (email app), SambaNova (chips), Transfix (marketplace for trucking) *** Of these, I think group B (robots) is the most promising. The self-driving technology is particularly well suited to these settings (farms, warehouses) in which traffic control can be fully centralized. In some cases, the accuracy required is not too high, e.g. the tractor that differentiates weed from not weed need only be moderately accurate. As for the drone company, it's not clear where the AI is. In Group A, I like Appzen, which uses AI to detect fraud in expense reports. It's clearly possible, and has a busine...

When the ATM fails to read my checks

Happy Lunar New Year! And greetings to Orlando people who are coming to my dataviz seminar this morning. *** What’s going on with digit recognition, one of the signature applications of machine learning? Before self-driving cars, before image recognition, before machine translation, there was digit recognition: computers are trained to read and recognize hand-written numbers. This problem shares several of the key components of problems tailor-made for machine learning methods: The correct answer is unambiguous for each item (i.e. image of a digit). The author of the digit has a particular number in mind. The range of possible answers across all items is finite. In a decimal system, each image can only be one of 0, 1, 2, ... , 9. The end-user only cares about how accurately the digit can be predicted. Causality is not of interest here. A massive dataset of labeled images, i.e. images that have been correctly recognized, used to train computers is easily obtained. Live application gener...

Wind chill, and its pointlessness

Slate has this very interesting little essay about the "wind chill factor." For those not in the U.S. (or not living in the cold parts of the U.S.), you may not know about our obsession with this number. Typically, the weather report says the temperature is 25F but it "feels like" 10F (32 Fahrenheit is 0 Celsius). The "feels like" is temperature adjusted by the so-called "wind chill factor." It conveys the idea that keeping temperature constant, it feels colder when there is wind. The Slate article covers a bunch of general issues related to inventing metrics: People love large numbers; in this case, because we are measuring cold temperatures, they like really small numbers. The name of the metric may have little or nothing to do with what is being measured. In seeking to make numbers more palatable to the public, people may choose less precise language that sometimes completely loses the original meaning. For example, "feels like" does...