
Using Mister P to get population estimates from respondent driven sampling

From one of our exams:

A researcher at Columbia University’s School of Social Work wanted to estimate the prevalence of drug abuse problems among American Indians (Native Americans) living in New York City. From the Census, it was estimated that about 30,000 Indians live in the city, and the researcher had a budget to interview 400. She did not have a list of Indians in the city, and she obtained her sample as follows.

She started with a list of 300 members of a local American Indian community organization, and took a random sample of 100 from this list. She interviewed these 100 persons and asked each of these to give her the names of other Indians in the city whom they knew. She asked each respondent to characterize him/herself and also the people on the list on a 0-10 scale, where 10 is “strongly Indian-identified,” 5 is “moderately Indian-identified,” and 0 is “not at all Indian-identified.” Most of the original 100 people sampled characterized themselves near 10 on the scale, which makes sense because they all belong to an Indian community organization. The researcher then took a random sample of 100 people from the combined lists of all the people referred to by the first group, and repeated this process. She repeated the process twice more to obtain 400 people in her sample.

Describe how you would use the data from these 400 people to estimate (and get a standard error for your estimate of) the prevalence of drug abuse problems among American Indians living in New York City. You must account for the bias and dependence of the nonrandom sampling method.

There are different ways to attack this problem, but my preferred solution is to use Mister P:

1. Fit a regression model to estimate p(y|X)—in this case, y represents some measure of drug abuse problems at the individual level, and X includes demographic predictors and also a measure of Indian identification (necessary because the survey design oversamples people who are strongly Indian-identified) and a measure of gregariousness (necessary because the referral design oversamples people with more friends and acquaintances);

2. Estimate the distribution of X in the population (in this case, all American Indian adults living in New York City); and

3. Take the estimates from step 1, and average these over the distribution in step 2, to estimate the distribution of y over the entire population or any subpopulations of interest (a minimal sketch of all three steps follows this list).
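To make the recipe concrete, here is a minimal sketch in Python. This is my addition, not from the post: the data are simulated, every column name is a hypothetical stand-in, and the poststratification table from step 2 is simply assumed rather than estimated.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Fake data standing in for the 400 interviews; every name here is hypothetical.
n = 400
survey = pd.DataFrame({
    "age_group": rng.choice(["18-34", "35-54", "55+"], n),
    "sex": rng.choice(["f", "m"], n),
    "identification": rng.integers(0, 11, n),  # 0-10 Indian-identification score
    "gregariousness": rng.poisson(8.0, n),     # e.g., number of acquaintances named
})
logit_p = -2.0 + 0.15 * survey["identification"] - 0.05 * survey["gregariousness"]
survey["drug_problem"] = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit_p)))

# Step 1: model p(y | X). Identification and gregariousness enter as
# predictors because the design oversamples on both.
fit = smf.logit(
    "drug_problem ~ C(age_group) + C(sex) + identification + gregariousness",
    data=survey,
).fit(disp=False)

# Step 2 (the hard part, assumed solved here): a poststratification table
# with an estimated population count N_j for every cell of X. As a
# placeholder we use the sample cells; real counts would come from a
# population model as discussed below.
poststrat = (
    survey.groupby(["age_group", "sex", "identification", "gregariousness"])
    .size()
    .rename("N")
    .reset_index()
)

# Step 3: average the cell-level predictions over the population cells.
poststrat["p_hat"] = fit.predict(poststrat)
prevalence = np.average(poststrat["p_hat"], weights=poststrat["N"])
print(f"estimated prevalence: {prevalence:.3f}")
```

In a real analysis the outcome model would be multilevel (the “multilevel regression” in MRP), and a Bayesian fit would let you propagate posterior uncertainty through step 3 to get the standard error the exam question asks for.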

The hard part here is step 2, as I’m not aware of many published examples of such things. You have to build a model, and in that model you must account for the sampling bias. It can be done, though; indeed I’d like to do some examples of this to make these ideas more accessible to survey practitioners.
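One hedged possibility for that step-2 model (again, my addition, not from the post): in the RDS literature, the chance of being referred into the sample is often taken as roughly proportional to a person’s network degree d_i—the gregariousness measure above. Under that assumption, Horvitz-Thompson-style weights 1/d_i give a rough estimate of the population frequency of each cell j of X:

$$\hat{N}_j \;\propto\; \sum_{i \,\in\, \text{cell } j} \frac{1}{d_i},$$

summing over sampled people in cell j. This is the same degree correction that underlies the standard RDS estimators; a fuller treatment would model the referral chains directly and fold in known census margins for the demographic predictors.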

There’s some literature on this survey design—it’s called “respondent-driven sampling”—but I don’t think the recommended analysis strategies are very good. MRP should be better, but, again, I should be able to say this with more confidence and authority once I’ve actually done such an analysis for this sort of survey. Right now, I’m just a big talker.




from Statistical Modeling, Causal Inference, and Social Science http://ift.tt/2ihKYoA
