Skip to main content

Posts

Showing posts with the label Simply Statistics

Research quality data and research quality databases

When you are doing data science, you are doing research. You want to use data to answer a question, identify a new pattern, improve a current product, or come up with a new product. The common factor underlying each of these tasks is that you want to use the data to answer a question that you haven’t answered before. The most effective process we have come up for getting those answers is the scientific research process. That is why the key word in data science is not data, it is science. No matter where you are doing data science - in academia, in a non-profit, or in a company - you are doing research. The data is the substrate you use to get the answers you care about. The first step most people take when using data is to collect the data and store it. This is a data engineering problem and is a necessary first step before you can do data science. But the state and quality of the data you have can make a huge amount of difference in how fast and accurately you can get answers. If the...

I co-founded a company! Meet Problem Forward Data Science

I have some exciting news about something I’ve been working on for the last year or so. I started a company! It’s called Problem Forward data science. I’m pumped about this new startup for a lot of reasons. My co-founder is one of my families closest friends, Jamie McGovern, who has more than 2 decades of experience in the consulting world and who I’ve known for 15 years. We are creating a cool new model of “data scientist as a service” (more on that below) We have a problem forward, not solution backward approach to data science that grew out of the Hopkins philosophy of data science. We are headquartered in East Baltimore and are creating awesome new tech jobs in a place where they haven’t been historically. Problem Forward, Not Solution Backward We have always had a “problem forward, not solution backward” approach to statistics, machine learning and data here at Simply Stats. This has grown out of the Johns Hopkins Biostatistics philosophy of starting with the public heal...

Generative and Analytical Models for Data Analysis

Describing how a data analysis is created is a topic of keen interest to me and there are a few different ways to think about it. Two different ways of thinking about data analysis are what I call the “generative” approach and the “analytical” approach. Another, more informal, way that I like to think about these approaches is as the “biological” model and the “physician” model. Reading through the literature on the process of data analysis, I’ve noticed that many seem to focus on the former rather than the latter and I think that presents an opportunity for new and interesting work. Generative Model The generative approach to thinking about data analysis focuses on the process by which an analysis is created. Developing an understanding of the decisions that are made to move from step one to step two to step three, etc. can help us recreate or reconstruct a data analysis. While reconstruction may not exactly be the goal of studying data analysis in this manner, having a better under...

Interview with Abhi Datta

Editor’s note: This is the next in our series of interviews with early career statisticians and data scientists. Today we are talking to Abhi Datta about his work in large scale spatial analysis and his interest in soccer! Follow him on Twitter at @datta_science . If you have recommendations of an (early career) person in academics or industry you would like to see promoted, reach out to Jeff (@jtleek) on Twitter! SS: Do you consider yourself a statistician, biostatistician, data scientist, or something else? AD: That is a difficult question for me, as I enjoy working on theory, methods and data analysis and have co-authored diverse papers ranging from theoretical expositions to being primarily centered around a complex data analysis. My research interests also span a wide range of areas. A lot of my work on spatial statistics is driven by applications in environmental health and air pollution. Another significant area of my research is developing Bayesian models for epidemiological ...

10 things R can do that might surprise you

Over the last few weeks I’ve had a couple of interactions with folks from the computer science world who were pretty disparaging of the R programming language. A lot of the critism focused on perceived limitations of R to statistical analysis. It’s true, R does have a hugely comprehensive list of analysis packages on CRAN , Bioconductor , Neuroconductor , and ROpenSci as well as great package management. As I was having these conversations I realized that R has grown into a multi-purpose connective language for things beyond just data analysis. But that the functionality isn’t always as well known outside of the R community. So this post is about some of the ridiculously awesome features of R that may or may not be as widely known. Here are 10 things R can do you might not have known about, building on Kara’s great tweet thread about lighthearted things to do with R . 1. You can write reproducible Word or Powerpoint documents from R markdown The rmarkdown package lets you create r...

Open letter to journal editors: dynamite plots must die

Statisticians have been pointing out the problem with dynamite plots, also known as bar and line graphs, for years. Karl Broman lists them as one of the top ten worst graphs . The problem has even been documneted in the peer reviewed literature. For example, this British Journal of Pharmacology paper titled Show the data, don’t conceal them was published in 2011. However, despite all these efforts, dynamite plots continue to be ubiquitous in the scientific literature. Just open the latest issue of Nature, Science or Cell and you will likely see a few. In fact, in this PLOS Biology paper , Tracey Weissgerber and co-authors perform a systmetic review of “top physiology journals” and find that “85.6% of papers included at least one bar graph”. They go on to recommend “training investigators in data presentation, encouraging a more complete presentation of data, and changing journal editorial policies”. In my view, the training will be accelerated if editors implement a policy that requ...

Interview with Stephanie Hicks

Editor’s note: For a while we ran an interview series for statisticians and data scientists, but things have gotten a little hectic around here so we’ve dropped the ball! But we are re-introducing the series, starting with Stephanie Hicks. If you have recommendations of a (junior) person in academics or industry you would like to see promoted, reach out to Jeff (@jtleek) on Twitter! Stephanie Hicks received her PhD in statistics in 2013 at Rice University and has already made major contributions to the analysis of single cell sequencing data and the theory and practice of teaching data science. SS: Do you consider yourself a statistician, biostatistician, data scientist or something else? SH: Fantastic question! I’m a statistician by training, and I work in a department of biostatistics, so I would be remiss if I didn’t answer a statistician. However my interests are at the intersection of statistics, data science, genomics and education. Broadly, my research interests are to levera...

The Tentpoles of Data Science

What makes for a good data scientist? This is a question I asked a long time ago and am still trying to figure out the answer. Seven years ago, I wrote: I was thinking about the people who I think are really good at data analysis and it occurred to me that they were all people I knew. So I started thinking about people that I don’t know (and there are many) but are equally good at data analysis. This turned out to be much harder than I thought. And I’m sure it’s not because they don’t exist, it’s just because I think good data analysis chops are hard to evaluate from afar using the standard methods by which we evaluate people. Now that time has passed and I’ve had an opportunity to see what’s going on in the world of data science, what I think about good data scientists, and what seems to make for good data analysis, I have a few more ideas on what makes for a good data scientist. In particular, I think there are broadly five “tentpoles” for a good data scientist. Each tentpole re...

How Data Scientists Think - A Mini Case Study

In episode 71 of Not So Standard Deviations , Hilary Parker and I inaugurated our first “Data Science Design Challenge” segment where we discussed how we would solve a given problem using data science. The idea with calling it a “design challenge” was to contrast it with common “hackathon” type models where you are presented with an already-collected dataset and then challenged to find something interesting in the data. Here, we wanted to start with a problem and then talk about how data might be collected and analyzed to address the problem. While both approaches might result in the same end-product, they address the various problems you encounter in a data analysis in a different order. In this post, I want to break down our discussion of the challenge and highlight some of the issues that were discussed in framing the problem and in designing the data collection and analysis. I’ll end with some thoughts about generalizing this approach to other problems. You can download an MP3 of...