Using Mister P to get population estimates from respondent driven sampling

From one of our exams:

A researcher at Columbia University’s School of Social Work wanted to estimate the prevalence of drug abuse problems among American Indians (Native Americans) living in New York City. From the Census, it was estimated that about 30,000 Indians live in the city, and the researcher had a budget to interview 400. She did not have a list of Indians in the city, and she obtained her sample as follows.

She started with a list of 300 members of a local American Indian community organization, and took a random sample of 100 from this list. She interviewed these 100 persons and asked each of these to give her the names of other Indians in the city whom they knew. She asked each respondent to characterize him/herself and also the people on the list on a 1-10 scale, where 10 is “strongly Indian-identified,” 5 is “moderately Indian-identified,” and 0 is “not at all Indian identified.” Most of the original 100 people sampled characterized themselves near 10 on the scale, which makes sense because they all belong to an Indian community organization. The researcher then took a random sample of 100 people from the combined lists of all the people referred to by the first group, and repeated this process. She repeated the process twice more to obtain 400 people in her sample.

Describe how you would use the data from these 400 people to estimate (and get a standard error for your estimate of) the prevalence of drug abuse problems among American Indians living in New York City. You must account for the bias and dependence of the nonrandom sampling method.

There are different ways to attack this problem but my preferred solution is to use Mister P:

1. Fit a regression model to estimate p(y|X)—in this case, y represents some measure of drug abuse problem at the individual level, and X includes demographic predictors and also a measure of Indian identification (necessary because the survey design oversamples of people who are strongly Indian identified) and a measure of gregariousness (necessary because the referral design oversamples people with more friends and acquaintances);

2. Estimate the distribution of X in the population (in this case, all American Indian adults living in New York City); and

3. Take the estimates from step 1, and average these over the distribution in step 2, to estimate the distribution of y over the entire population or any subpopulations of interest.

The hard part here is step 2, as I’m not aware of many published examples of such things. You have to build a model, and in that model you must account for the sampling bias. It can be done, though; indeed I’d like to do some examples of this to make these ideas more accessible to survey practitioners.

There’s some literature on this survey design—it’s called “respondent driven sampling”—but I don’t think the recommended analysis strategies are very good. MRP should be better, but, again, I should be able say this with more confidence and authority once I’ve actually done such an analysis for this sort of survey. Right now, I’m just a big talker.

The post Using Mister P to get population estimates from respondent driven sampling appeared first on Statistical Modeling, Causal Inference, and Social Science.

from Statistical Modeling, Causal Inference, and Social Science http://ift.tt/2ihKYoA
via IFTTT

DataScience4you2me

Search This Blog

Using Mister P to get population estimates from respondent driven sampling

Labels

Comments

Post a Comment

Popular posts from this blog

Former San Diego mayor joins race for California governor

Using RStudio and LaTeX

Explaining models with Triplot, part 1