How to post an R code question

Please read this before you post

Questions should address how to apply R to solve analytical challenges in applied epidemiology and public health.

Overview of steps

  1. Do your research first
    Search our free Epidemiologist R Handbook, the R for Data Science book, or Stack Overflow.

  2. Summarize your problem so that readers can re-create it on their own computers
    See this video on how to convert your R code to a “minimal, reproducible example”.

    • Do not include sensitive or personal data in your post

    • To include a small portion of code, enclose it between lines of three backticks (a backtick ` is not a quote mark ’):

      ```
      linelist <- linelist_raw %>%
         filter(gender == "Male") %>%
         select(case_id, age, gender, outcome) 
      ```
      
  3. Add “tags” to your post so others can benefit (e.g. data cleaning, R markdown, shiny, etc.)

  4. Thank those who volunteered their time to provide an answer.

How to create a “reproducible example”:

Make it easy for people to help you: give readers a “minimal, reproducible example” that lets them re-create your problem on their own computer.

Watch this instructional video.

  • Be minimal - include only the data and code required to reproduce your problem
  • Be reproducible - include all data and package commands, e.g. library() or p_load()

Below is an example:

1. Decide which data to use

Option 1. Use part of your dataset that is in R

The {datapasta} package converts a small portion of your dataset into R code that re-creates it, so you can share that portion without sharing the original dataset. Readers can then generate the same small dataset on their own computer. See our instructional video.

‼ Think seriously about whether you are allowed to share it. Ensure there is no patient, identifiable, or otherwise sensitive information.

Save a small part of your data as an object in your Environment, with a name (e.g. “demo_data”). Choose only enough data to demonstrate your problem.

demo_data <- my_linelist %>%
   head(5) %>%                          # keep only first 5 rows
   select(case_id, gender, onset_date)  # keep only certain columns
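If a column you need for the example holds real identifiers, one possible approach is to swap them for made-up values before sharing. The sketch below is only an illustration in base R: my_linelist, its column names, and every value in it are invented, and the "case_01" placeholder scheme is an assumption rather than a recommended standard.

```r
# hypothetical stand-in for your real dataset (all values are invented)
my_linelist <- data.frame(
  stringsAsFactors = FALSE,
  case_id    = c("a1", "b2", "c3", "d4", "e5", "f6"),
  name       = c("Ann", "Bo", "Cy", "Di", "Ed", "Flo"),
  gender     = c("m", "f", "f", "f", "f", "m"),
  onset_date = c("11/9/2014", "10/30/2014", "8/16/2014",
                 "8/29/2014", "10/20/2014", "9/1/2014")
)

# keep only the first 5 rows and the columns needed to show the problem
demo_data <- head(my_linelist, 5)[, c("case_id", "gender", "onset_date")]

# replace the original identifiers with sequential placeholders
demo_data$case_id <- sprintf("case_%02d", seq_len(nrow(demo_data)))

demo_data
```

Dropping the name column and renumbering case_id keeps the structure of the data while removing the values that could point back to a person.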

Run the function dpasta() on your object, like this:

dpasta(demo_data)

A command that re-generates demo_data will now appear in your R script. You can paste this code into your “reproducible example” (see below) so that others can re-create and solve your problem.

data.frame(
  stringsAsFactors = FALSE,
           case_id = c("694928","86340d","92d002", "544bd1","6056ba"),
            gender = c("m", "f", "f", "f", "f"),
        onset_date = c("11/9/2014","10/30/2014", "8/16/2014","8/29/2014","10/20/2014")
)

Option 2. Use data copied from Excel

Alternatively, with {datapasta} installed and loaded, go to your raw data (e.g. in an Excel file), copy only the data you want to include, and run the tribble_paste() function in your R script (with empty parentheses). This writes a command into your script that re-creates, in R, the data that was on your clipboard.

tribble_paste()

For example, if you copy a selection of Excel data to your clipboard and then run tribble_paste() in your R script, code like this is produced:

tibble::tribble(
  ~date_hospitalisation, ~date_outcome,       ~hospital,  ~outcome, ~gender, ~age, ~age_unit,
            "11/9/2014",  "11/21/2014",         "Other",        NA,     "m",  23L,   "years",
           "10/31/2014",  "11/15/2014", "Port Hospital", "Recover",     "f",   1L,   "years",
            "8/20/2014",            NA,       "Missing", "Recover",     "f",  16L,   "years",
            "8/30/2014",    "9/2/2014",       "Missing",   "Death",     "f",  10L,   "years",
           "10/21/2014",   "11/5/2014",       "Missing",   "Death",     "f",   0L,   "years",
            "11/1/2014",            NA, "Port Hospital", "Recover",     "f",   8L,   "years",
           "10/10/2014",  "10/12/2014",       "Missing",   "Death",     "f",   7L,   "years",
            "9/22/2014",            NA, "Port Hospital", "Recover",     "m",   4L,   "years",
            "5/11/2014",   "4/30/2014",         "Other",        NA,     "m",  37L,   "years",
            "9/30/2014",   "10/8/2014", "Port Hospital",        NA,     "m",  11L,   "years",
           "11/28/2014",   "12/5/2014", "Port Hospital",   "Death",     "m",  27L,   "years",
           "11/10/2014",  "11/14/2014", "Port Hospital",   "Death",     "f",   6L,   "years"
  )

To name your dataset so that you can reference it in later commands, add an assignment operator at the top of the code, e.g. begin with mydata <- tibble::tribble( instead of tibble::tribble(.
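As a rough sketch of that assignment, the block below names a shortened version of the pasted data and then refers to it by name. The name mydata, the choice of three rows, and the reduced set of columns are only for illustration.

```r
library(tibble)

# assign the pasted tribble to a name (rows and columns trimmed for brevity)
mydata <- tribble(
  ~date_hospitalisation,       ~hospital,  ~outcome, ~age,
            "11/9/2014",         "Other",        NA,  23L,
           "10/31/2014", "Port Hospital", "Recover",   1L,
            "8/30/2014",       "Missing",   "Death",  10L
)

# later commands can now refer to the dataset by name
nrow(mydata)
table(mydata$outcome, useNA = "ifany")
```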

Option 3. Use a public dataset

Another simple and safe option is to pose your question with a publicly available dataset. Any of these will work:

  1. Run data() to see R’s built-in datasets.
  2. Install the {outbreaks} R package and use one of its many datasets, for example by running install.packages("outbreaks") and then outbreaks::fluH7N9_china_2013.
  3. Use one of Applied Epi’s public health datasets.
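For instance, a question could be posed with a built-in dataset along these lines. The choice of airquality, the three columns, and the question about mean() are assumptions made purely for illustration; any built-in dataset that shows your problem would do.

```r
# airquality ships with base R, so readers need no extra files or packages
demo_data <- head(airquality[, c("Month", "Day", "Ozone")], 5)
demo_data

# the question can then be asked about this small, shareable dataset,
# e.g. why mean() returns NA when one of the values is missing
mean(demo_data$Ozone)
mean(demo_data$Ozone, na.rm = TRUE)
```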

2. Now, make an example with the {reprex} package

The {reprex} package can assist you with making a reproducible example:

Write a simple, minimal R script that recreates your problem. Start by loading packages, then create a small demonstration dataset (see the {datapasta} technique above).

In the example below, we load packages with pacman::p_load(), use code generated by dpasta() to re-create a small part of our case linelist, then use mutate() and ymd() to try to convert the column onset_date from character to date class. However, we are confused why all the dates have been converted to NA!

# install and load packages
pacman::p_load(rio, lubridate, datapasta, reprex, tidyverse)

# generate demo dataset
demo_data <- data.frame(
  stringsAsFactors = FALSE,
  case_id = c("694928","86340d","92d002","544bd1","6056ba"),
  gender = c("m", "f", "f", "f", "f"),
  onset_date = c("11/9/2014","10/30/2014","8/16/2014","8/29/2014", "10/20/2014")
)

# check class of date column
class(demo_data$onset_date)

# try to convert column to class "Date"
demo_clean <- demo_data %>% 
  mutate(onset_date = ymd(onset_date))

# check the CLEANED date column class and range
class(demo_clean$onset_date)
range(demo_clean$onset_date)

Now, copy all the relevant code from your script to your clipboard, and run the following command:

reprex(session_info = TRUE)

An HTML output will appear in the RStudio Viewer pane. It will contain all your code and any warnings, errors, or plot outputs. This output is also copied to your clipboard, so you can paste it directly into an Applied Epi Community post or GitHub post.

Someone can copy your example and run it on their own computer. They can then explain that we needed to use the {lubridate} function mdy(), not ymd(), to convert the onset_date column to date class, because the raw dates are written in month-day-year format.
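An answer along those lines might look like the sketch below. It reuses the demo dataset from the question and loads only {dplyr} and {lubridate}, on the assumption that the other packages are not needed for the fix.

```r
# load only the packages the fix needs
library(dplyr)
library(lubridate)

# same demo dataset as in the question
demo_data <- data.frame(
  stringsAsFactors = FALSE,
  case_id = c("694928", "86340d", "92d002", "544bd1", "6056ba"),
  gender = c("m", "f", "f", "f", "f"),
  onset_date = c("11/9/2014", "10/30/2014", "8/16/2014", "8/29/2014", "10/20/2014")
)

# the raw dates are written month/day/year, so parse them with mdy()
demo_clean <- demo_data %>%
  mutate(onset_date = mdy(onset_date))

# check the cleaned column
class(demo_clean$onset_date)   # "Date"
range(demo_clean$onset_date)   # "2014-08-16" "2014-11-09"
```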

You can paste the “reprex” from your clipboard into an Applied Epi Community post, and other people can now re-create your problem, and tell you how to fix it!
