Practical reference for survey sampling and analysis

iancgmd · September 9, 2022, 6:40am

Hello everyone! Can anyone recommend a reference for survey sampling, such as multi-stage cluster sampling? I understand the theories of the different sampling designs from general biostatistics textbooks. What I’m looking for is something more practical in applying the techniques such as using Excel or R in selecting clusters, PPS, applying sampling weights, and analysis after data collection. Thank you!

aspina · September 11, 2022, 6:05am

Hey Ian, great question … We are working on a new handbook chapter… But it’s currently a work in progress!

You can check the draft out on GitHub: click on the "Files changed" tab, then scroll down to view the diff on the Rmd file (what shows up in green is the latest version):

github.com/appliedepi/epiRhandbook_eng

Sampling chapter

appliedepi:master ← appliedepi:sampling_chapter

opened 05:08AM - 16 Feb 22 UTC

AlexandreBlake

+1434 -80

@aspina7 I messed up the creation of the branch a bit. My bad, I do not use git so much collaboratively. I will commit and push only to it starting now. There are a few things I might need your feedback on:

- I still have a lot to tweak here and there, and chunks to modify/add. But it should be enough to start getting feedback, so help yourself.
- For now I generate data in my chunks to illustrate my points rather than load a pre-existing data set. I find it more convenient, but it adds code that might not be the main interest of this chapter. I also noticed that in other chapters loading data seems to be the rule. No big deal?
- I assumed that we were focusing on surveys and put the sample size calculation for analytical studies to one side. Did I assume right?

There’s also these conversations with example code

github.com/appliedepi/epiRhandbook_eng

Add to the survey section a component on selection of clusters with probability proportional to size

opened 08:02PM - 19 Jan 22 UTC

pbkeating

At MSF, we have an Excel tool that supports identification of clusters with probability proportional to size, but this can also be done in R.

Here is a first attempt at doing this, with a sample dataset included for testing purposes. I've validated this using 2 datasets: one previously used in MSF activities and one from this WHO doc: https://www.who.int/tb/advisory_bodies/impact_measurement_taskforce/meetings/prevalence_survey/psws_probability_prop_size_bierrenbach.pdf

gen_data: this section is for generating a fake dataset to test out the code.

```{r gen_data}
## Set seed so the fake dataset is reproducible
set.seed(50)

## Number of locations to select from
n <- 20

## Prefix
prefix <- "location "

## Suffix
suffix <- seq_len(n)

## Combine to create basic cluster selection dataset
clusters <- data.frame(
  location_name       = paste0(prefix, suffix),
  location_population = sample(1000:25000, n, replace = TRUE)
)
```

read_data: this section is for importing your actual location and population data.

```{r read_data, warning = FALSE, message = FALSE}
### Read in location and population data ---------------------------------------

## Excel file ------------------------------------------------------------------

## read in location data sheet
# clusters <- rio::import(here::here("03 Sampling files", "cluster_data.xlsx"),
#                         na = ".")
```

identify_clusters: this section is to specify or calculate the following:

- total population in the survey area
- the number of clusters for the survey
- the sampling interval, which is the total population divided by the number of clusters in the survey
- the random starting point

These figures will be combined together in a for loop to obtain a list of the clusters to be surveyed.

```{r identify_clusters}
## Set seed to ensure the random start remains the same each time
set.seed(50)

## Calculate total population
total_pop <- sum(clusters$location_population, na.rm = TRUE)

## Calculate cumulative sum of the population
clusters$cum_sum <- cumsum(clusters$location_population)

## Specify the number of clusters
cluster_number <- 10

## Calculate sampling interval and round it to the nearest whole number
sampling_interval <- round(total_pop / cluster_number, digits = 0)

## Select a random starting point between 1 and the sampling interval
random_start <- sample(1:sampling_interval, 1)

## Create the output columns before filling them row by row
clusters$number_clusters <- NA_integer_
clusters$cum_clusters    <- NA_integer_

## This for loop will identify the locations to survey
for (i in seq_along(clusters$cum_sum)) {

  ## Clusters allocated up to and including location i
  ## (as.integer() drops the fractional part)
  clusters_so_far <- as.integer(
    (clusters$cum_sum[i] - random_start) / sampling_interval + 1
  )

  if (i == 1) {
    clusters$number_clusters[i] <- clusters_so_far
  } else {
    clusters$number_clusters[i] <- clusters_so_far - clusters$cum_clusters[i - 1]
  }

  clusters$cum_clusters[i] <- clusters_so_far
}
```
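The same selection can also be written without a loop. What follows is a minimal base R sketch, not the handbook's code: it assumes a fake frame like the one above, keeps the sampling interval unrounded, and draws a continuous random start with `runif()` instead of an integer one, so it is a variant of the approach rather than a line-for-line equivalent.

```r
## Hypothetical sampling frame (location names and populations are made up)
set.seed(50)
clusters <- data.frame(
  location_name       = paste0("location ", 1:20),
  location_population = sample(1000:25000, 20, replace = TRUE)
)

cluster_number    <- 10
total_pop         <- sum(clusters$location_population)
sampling_interval <- total_pop / cluster_number   # assumption: left unrounded
random_start      <- runif(1, min = 0, max = sampling_interval)

## Selection points: the random start, then one more every sampling interval
selection_points <- random_start + sampling_interval * (0:(cluster_number - 1))

## Each point belongs to the first location whose cumulative population
## reaches it
cum_pop <- cumsum(clusters$location_population)
hits    <- findInterval(selection_points, cum_pop, left.open = TRUE) + 1

## Number of clusters allocated to each location (large ones can get several)
clusters$number_clusters <- tabulate(hits, nbins = nrow(clusters))

## Locations to survey
clusters[clusters$number_clusters > 0, ]
```

Because the interval is not rounded, the allocated clusters always add up to `cluster_number`.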

There’s also already a section in the handbook for doing survey analysis … Doesn’t include all designs yet but might be a good start?

Hope that helps!

iancgmd · September 14, 2022, 4:20am

Thank you for this, Neale! I will look into the Survey Analysis chapter.

iancgmd · September 14, 2022, 7:09am

Hi @neale ! I’m working my way through the survey analysis guide, and in the cleaning data section, the comment line in the last part of the code states “change to dates” but it looks like it converts the yes/no character variables to TRUE/FALSE logical variables instead?

```
## change to dates
survey_data <- survey_data %>%
  mutate(across(all_of(YNVARS),
                str_detect,
                pattern = "yes"))
```
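A small check of what this step does, as a sketch on made-up data: the data frame, the column names `fever` and `cough`, and the contents of `YNVARS` are assumptions for illustration, not the handbook's dataset. It uses the lambda form of `across()`, since passing extra arguments through `...` is deprecated from dplyr 1.1.0 onwards.

```r
library(dplyr)
library(stringr)

## Hypothetical yes/no columns standing in for the handbook's YNVARS
survey_data <- tibble(
  fever = c("yes", "no", NA),
  cough = c("no", "yes", "yes")
)
YNVARS <- c("fever", "cough")

## Convert "yes"/"no" text to TRUE/FALSE (logical values, not dates)
survey_data <- survey_data %>%
  mutate(across(all_of(YNVARS), ~ str_detect(.x, pattern = "yes")))

## Inspect the resulting column classes
sapply(survey_data, class)
```

Missing answers stay `NA` rather than becoming `FALSE`.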

iancgmd · September 15, 2022, 7:42am

Also, part 23.6 starts by joining Kobo datasets at the household and individual level, but the datasets referred to in the example (survey_data_hh and survey_data_indiv) are not available for download in chapter 2. Instead, the joins seem to have already been done in the survey_data file that is provided?
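For anyone wanting to try that step without the files, here is a sketch of such a join on toy data. Only the names survey_data_hh and survey_data_indiv come from the chapter. The contents and the linking column `hh_id` are hypothetical, and a real Kobo export may name its linking columns differently.

```r
library(dplyr)

## Hypothetical household-level data: one row per household
survey_data_hh <- tibble(
  hh_id   = c(1, 2),
  village = c("A", "B")
)

## Hypothetical individual-level data: one row per household member
survey_data_indiv <- tibble(
  hh_id     = c(1, 1, 2),
  age_years = c(34, 5, 61)
)

## Attach the household information to every individual
survey_data <- left_join(survey_data_indiv, survey_data_hh, by = "hh_id")
```

The result has one row per individual, with the household columns repeated for members of the same household.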

neale · September 15, 2022, 2:00pm

Hi Ian,
This chapter is still a work in progress and the datasets are not yet available from the Handbook. You are right that the mutate(across() command appears to change a vector of yes/no columns.
Hopefully we can address these concerns soon.
Neale

iancgmd · September 16, 2022, 7:59am

Thanks again, @neale !

amy.mikhail · November 1, 2022, 5:01pm

Hi @iancgmd,

You may also find the EPIET case study (based on a vaccination coverage survey) useful:

Amy

iancgmd · November 2, 2022, 1:04am

This is a great resource! Thank you, Amy!