Counting concatenated risk factors in linelist data

sansskin · April 23, 2022, 1:31am

Hello,

I have been tasked with analyzing some linelist data in which a risk factors column holds all of the factors present for a given case, concatenated into a single comma-separated character string. I am interested in producing the frequency with which each risk factor appears in the data but can’t find a simple solution.

Here is an example of what the linelist data looks like:

library(tidyverse)

tibble::tribble(
    ~id, ~age, ~age_unit, ~risk_factors,
    1, 20, "years", "diabetes, hypertension",
    2, 26, "years", "immunocompromised",
    3, 24, "years", "lupus, hypertension, immunocompromised"
  )
#> # A tibble: 3 × 4
#>      id   age age_unit risk_factors                          
#>   <dbl> <dbl> <chr>    <chr>                                 
#> 1     1    20 years    diabetes, hypertension                
#> 2     2    26 years    immunocompromised                     
#> 3     3    24 years    lupus, hypertension, immunocompromised

Created on 2022-04-22 by the reprex package (v2.0.1)

Any help would be appreciated!

Regards,

John

machupovirus · April 23, 2022, 1:50am

Hello John,

There is actually a very handy function in the tidyr package that will help. The function is named separate_rows; it splits the concatenated risk factors string at a given delimiter and creates one row per risk factor, with the values of all other variables repeated on each new row.

Here is a quick example where the linelist data is converted to long data and then the frequencies for each risk factor are produced:

#Loading tidyverse
library(tidyverse)

#Generating linelist data
linelist_data <-
    tibble(
        id = seq_len(length.out = 3),
        age = rpois(n = 3, lambda = 25),
        age_unit = rep(x = "years", 3),
        risk_factors = c(
            "diabetes, hypertension",
            "immunocompromised",
            "lupus, hypertension, immunocompromised"
        )
    )

#Creating long data
long_data <-
    linelist_data |>
    separate_rows(risk_factors, sep = ", ")

#Examining the long data
long_data |>
    slice_head(n = 5)
#> # A tibble: 5 × 4
#>      id   age age_unit risk_factors     
#>   <int> <int> <chr>    <chr>            
#> 1     1    18 years    diabetes         
#> 2     1    18 years    hypertension     
#> 3     2    31 years    immunocompromised
#> 4     3    25 years    lupus            
#> 5     3    25 years    hypertension

#Counting risk factors
long_data |>
    count(risk_factors)
#> # A tibble: 4 × 2
#>   risk_factors          n
#>   <chr>             <int>
#> 1 diabetes              1
#> 2 hypertension          2
#> 3 immunocompromised     2
#> 4 lupus                 1

Created on 2022-04-22 by the reprex package (v2.0.1)

Session info

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.1.3 (2022-03-10)
#>  os       macOS Big Sur/Monterey 10.16
#>  system   x86_64, darwin17.0
#>  ui       X11
#>  language (EN)
#>  collate  en_CA.UTF-8
#>  ctype    en_CA.UTF-8
#>  tz       America/Toronto
#>  date     2022-04-22
#>  pandoc   2.17.1.1 @ /Applications/RStudio.app/Contents/MacOS/quarto/bin/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.1.0)
#>  backports     1.4.1   2021-12-13 [1] CRAN (R 4.1.0)
#>  broom         0.8.0   2022-04-13 [1] CRAN (R 4.1.3)
#>  cellranger    1.1.0   2016-07-27 [1] CRAN (R 4.1.0)
#>  cli           3.2.0   2022-02-14 [1] RSPM (R 4.1.2)
#>  colorspace    2.0-3   2022-02-21 [1] RSPM (R 4.1.2)
#>  crayon        1.5.1   2022-03-26 [1] CRAN (R 4.1.3)
#>  DBI           1.1.2   2021-12-20 [1] CRAN (R 4.1.1)
#>  dbplyr        2.1.1   2021-04-06 [1] CRAN (R 4.1.0)
#>  digest        0.6.29  2021-12-01 [1] CRAN (R 4.1.1)
#>  dplyr       * 1.0.8   2022-02-08 [1] RSPM (R 4.1.2)
#>  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.1.0)
#>  evaluate      0.15    2022-02-18 [1] RSPM (R 4.1.2)
#>  fansi         1.0.3   2022-03-24 [1] CRAN (R 4.1.3)
#>  fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.1.0)
#>  forcats     * 0.5.1   2021-01-27 [1] CRAN (R 4.1.0)
#>  fs            1.5.2   2021-12-08 [1] CRAN (R 4.1.1)
#>  generics      0.1.2   2022-01-31 [1] RSPM (R 4.1.2)
#>  ggplot2     * 3.3.5   2021-06-25 [1] CRAN (R 4.1.0)
#>  glue          1.6.2   2022-02-24 [1] RSPM (R 4.1.2)
#>  gtable        0.3.0   2019-03-25 [1] CRAN (R 4.1.0)
#>  haven         2.5.0   2022-04-15 [1] CRAN (R 4.1.3)
#>  highr         0.9     2021-04-16 [1] CRAN (R 4.1.0)
#>  hms           1.1.1   2021-09-26 [1] CRAN (R 4.1.1)
#>  htmltools     0.5.2   2021-08-25 [1] CRAN (R 4.1.0)
#>  httr          1.4.2   2020-07-20 [1] CRAN (R 4.1.0)
#>  jsonlite      1.8.0   2022-02-22 [1] RSPM (R 4.1.2)
#>  knitr         1.38    2022-03-25 [1] CRAN (R 4.1.3)
#>  lifecycle     1.0.1   2021-09-24 [1] CRAN (R 4.1.1)
#>  lubridate     1.8.0   2021-10-07 [1] CRAN (R 4.1.1)
#>  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.1.3)
#>  modelr        0.1.8   2020-05-19 [1] CRAN (R 4.1.0)
#>  munsell       0.5.0   2018-06-12 [1] CRAN (R 4.1.0)
#>  pillar        1.7.0   2022-02-01 [1] RSPM (R 4.1.2)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.1.0)
#>  purrr       * 0.3.4   2020-04-17 [1] CRAN (R 4.1.0)
#>  R.cache       0.15.0  2021-04-30 [1] CRAN (R 4.1.0)
#>  R.methodsS3   1.8.1   2020-08-26 [1] CRAN (R 4.1.0)
#>  R.oo          1.24.0  2020-08-26 [1] CRAN (R 4.1.0)
#>  R.utils       2.11.0  2021-09-26 [1] CRAN (R 4.1.1)
#>  R6            2.5.1   2021-08-19 [1] CRAN (R 4.1.0)
#>  readr       * 2.1.2   2022-01-30 [1] RSPM (R 4.1.2)
#>  readxl        1.4.0   2022-03-28 [1] CRAN (R 4.1.3)
#>  reprex        2.0.1   2021-08-05 [1] CRAN (R 4.1.0)
#>  rlang         1.0.2   2022-03-04 [1] CRAN (R 4.1.2)
#>  rmarkdown     2.13    2022-03-10 [1] CRAN (R 4.1.2)
#>  rstudioapi    0.13    2020-11-12 [1] CRAN (R 4.1.0)
#>  rvest         1.0.2   2021-10-16 [1] CRAN (R 4.1.1)
#>  scales        1.2.0   2022-04-13 [1] CRAN (R 4.1.3)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.1.1)
#>  stringi       1.7.6   2021-11-29 [1] CRAN (R 4.1.1)
#>  stringr     * 1.4.0   2019-02-10 [1] CRAN (R 4.1.0)
#>  styler        1.7.0   2022-03-13 [1] CRAN (R 4.1.2)
#>  tibble      * 3.1.6   2021-11-07 [1] CRAN (R 4.1.1)
#>  tidyr       * 1.2.0   2022-02-01 [1] RSPM (R 4.1.2)
#>  tidyselect    1.1.2   2022-02-21 [1] RSPM (R 4.1.2)
#>  tidyverse   * 1.3.1   2021-04-15 [1] CRAN (R 4.1.0)
#>  tzdb          0.3.0   2022-03-28 [1] CRAN (R 4.1.3)
#>  utf8          1.2.2   2021-07-24 [1] CRAN (R 4.1.0)
#>  vctrs         0.4.1   2022-04-13 [1] CRAN (R 4.1.3)
#>  withr         2.5.0   2022-03-03 [1] RSPM (R 4.1.2)
#>  xfun          0.30    2022-03-02 [1] RSPM (R 4.1.2)
#>  xml2          1.3.3   2021-11-30 [1] CRAN (R 4.1.1)
#>  yaml          2.3.5   2022-02-21 [1] RSPM (R 4.1.2)
#> 
#>  [1] /Users/timothychisamore/Library/R/x86_64/4.1/library
#>  [2] /Library/Frameworks/R.framework/Versions/4.1/Resources/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────
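
Real linelists are often messier than the example, so here is a minimal sketch of the same separate_rows approach with some cleaning added. It assumes inconsistent spacing and capitalisation, and cases with no recorded risk factor. The messy_linelist data and the n_cases and prop_cases column names are hypothetical and only for illustration. The sketch also reports each risk factor as a proportion of all cases in the linelist.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical linelist with messier entries than the original example
messy_linelist <- tibble::tribble(
  ~id, ~risk_factors,
  1, "diabetes, hypertension",
  2, "Immunocompromised",
  3, "lupus,hypertension , immunocompromised",
  4, NA_character_
)

risk_factor_freq <- messy_linelist |>
  # split on commas, allowing any whitespace around them
  separate_rows(risk_factors, sep = "\\s*,\\s*") |>
  # normalise case and trim stray spaces so variants collapse together
  mutate(risk_factors = str_to_lower(str_trim(risk_factors))) |>
  # drop cases with no recorded risk factor
  filter(!is.na(risk_factors), risk_factors != "") |>
  # count each risk factor at most once per case
  distinct(id, risk_factors) |>
  count(risk_factors, name = "n_cases") |>
  # denominator is every case in the linelist, including those with none recorded
  mutate(prop_cases = n_cases / n_distinct(messy_linelist$id))

risk_factor_freq
```

Whether the denominator should include cases with no recorded risk factor depends on the analysis; swap in the number of cases with at least one factor if that is the quantity of interest.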

All the best,

Tim

neale · April 25, 2022, 2:35am

Hi John,

Thanks for engaging in this forum! You ask a question that is familiar to many visitors here - thank you for your clear example.

Please check out this chapter of the Epi R Handbook on handling “strings” (character values in R).

Tim’s answer is great, and below I offer some additional code using separate() in case you want to split the risk factors into their own columns (not as great for counting, but perhaps useful for other purposes).

# load packages
pacman::p_load(tidyverse)

# create example data
linelist_data <-
  tibble(
    id = seq_len(length.out = 3),
    age = rpois(n = 3, lambda = 25),
    age_unit = rep(x = "years", 3),
    risk_factors = c(
      "diabetes, hypertension",
      "immunocompromised",
      "lupus, hypertension, immunocompromised"
    )
  )

# separate risk factors into columns
# (split on comma plus space so the values carry no leading whitespace)
split_wide <- linelist_data %>% 
  separate(risk_factors, into = c("sym_1", "sym_2", "sym_3"), sep = ", ")
#> Warning: Expected 3 pieces. Missing pieces filled with `NA` in 2 rows [1, 2].

# print wide data
split_wide
#> # A tibble: 3 x 6
#>      id   age age_unit sym_1             sym_2        sym_3            
#>   <int> <int> <chr>    <chr>             <chr>        <chr>            
#> 1     1    25 years    diabetes          hypertension <NA>             
#> 2     2    25 years    immunocompromised <NA>         <NA>             
#> 3     3    26 years    lupus             hypertension immunocompromised

Created on 2022-04-24 by the reprex package (v2.0.1)

Session info

sessioninfo::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value
#>  version  R version 4.1.3 (2022-03-10)
#>  os       Windows 10 x64 (build 22000)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  English_United States.1252
#>  ctype    English_United States.1252
#>  tz       America/New_York
#>  date     2022-04-24
#>  pandoc   2.17.1.1 @ C:/Program Files/RStudio/bin/quarto/bin/ (via rmarkdown)
#> 
#> - Packages -------------------------------------------------------------------
#>  package     * version date (UTC) lib source
#>  assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.1.3)
#>  backports     1.4.1   2021-12-13 [1] CRAN (R 4.1.2)
#>  broom         0.7.12  2022-01-28 [1] CRAN (R 4.1.3)
#>  cellranger    1.1.0   2016-07-27 [1] CRAN (R 4.1.3)
#>  cli           3.2.0   2022-02-14 [1] CRAN (R 4.1.3)
#>  colorspace    2.0-3   2022-02-21 [1] CRAN (R 4.1.3)
#>  crayon        1.5.1   2022-03-26 [1] CRAN (R 4.1.3)
#>  DBI           1.1.2   2021-12-20 [1] CRAN (R 4.1.3)
#>  dbplyr        2.1.1   2021-04-06 [1] CRAN (R 4.1.3)
#>  digest        0.6.29  2021-12-01 [1] CRAN (R 4.1.3)
#>  dplyr       * 1.0.8   2022-02-08 [1] CRAN (R 4.1.3)
#>  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.1.3)
#>  evaluate      0.15    2022-02-18 [1] CRAN (R 4.1.3)
#>  fansi         1.0.3   2022-03-24 [1] CRAN (R 4.1.3)
#>  fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.1.3)
#>  forcats     * 0.5.1   2021-01-27 [1] CRAN (R 4.1.3)
#>  fs            1.5.2   2021-12-08 [1] CRAN (R 4.1.3)
#>  generics      0.1.2   2022-01-31 [1] CRAN (R 4.1.3)
#>  ggplot2     * 3.3.5   2021-06-25 [1] CRAN (R 4.1.3)
#>  glue          1.6.2   2022-02-24 [1] CRAN (R 4.1.3)
#>  gtable        0.3.0   2019-03-25 [1] CRAN (R 4.1.3)
#>  haven         2.4.3   2021-08-04 [1] CRAN (R 4.1.3)
#>  highr         0.9     2021-04-16 [1] CRAN (R 4.1.3)
#>  hms           1.1.1   2021-09-26 [1] CRAN (R 4.1.3)
#>  htmltools     0.5.2   2021-08-25 [1] CRAN (R 4.1.3)
#>  httr          1.4.2   2020-07-20 [1] CRAN (R 4.1.3)
#>  jsonlite      1.8.0   2022-02-22 [1] CRAN (R 4.1.3)
#>  knitr         1.38    2022-03-25 [1] CRAN (R 4.1.3)
#>  lifecycle     1.0.1   2021-09-24 [1] CRAN (R 4.1.3)
#>  lubridate     1.8.0   2021-10-07 [1] CRAN (R 4.1.3)
#>  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.1.3)
#>  modelr        0.1.8   2020-05-19 [1] CRAN (R 4.1.3)
#>  munsell       0.5.0   2018-06-12 [1] CRAN (R 4.1.3)
#>  pacman        0.5.1   2019-03-11 [1] CRAN (R 4.1.3)
#>  pillar        1.7.0   2022-02-01 [1] CRAN (R 4.1.3)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.1.3)
#>  purrr       * 0.3.4   2020-04-17 [1] CRAN (R 4.1.3)
#>  R6            2.5.1   2021-08-19 [1] CRAN (R 4.1.3)
#>  readr       * 2.1.2   2022-01-30 [1] CRAN (R 4.1.3)
#>  readxl        1.4.0   2022-03-28 [1] CRAN (R 4.1.3)
#>  reprex        2.0.1   2021-08-05 [1] CRAN (R 4.1.3)
#>  rlang         1.0.2   2022-03-04 [1] CRAN (R 4.1.3)
#>  rmarkdown     2.13    2022-03-10 [1] CRAN (R 4.1.3)
#>  rstudioapi    0.13    2020-11-12 [1] CRAN (R 4.1.3)
#>  rvest         1.0.2   2021-10-16 [1] CRAN (R 4.1.3)
#>  scales        1.1.1   2020-05-11 [1] CRAN (R 4.1.3)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.1.3)
#>  stringi       1.7.6   2021-11-29 [1] CRAN (R 4.1.2)
#>  stringr     * 1.4.0   2019-02-10 [1] CRAN (R 4.1.3)
#>  tibble      * 3.1.6   2021-11-07 [1] CRAN (R 4.1.3)
#>  tidyr       * 1.2.0   2022-02-01 [1] CRAN (R 4.1.3)
#>  tidyselect    1.1.2   2022-02-21 [1] CRAN (R 4.1.3)
#>  tidyverse   * 1.3.1   2021-04-15 [1] CRAN (R 4.1.3)
#>  tzdb          0.3.0   2022-03-28 [1] CRAN (R 4.1.3)
#>  utf8          1.2.2   2021-07-24 [1] CRAN (R 4.1.3)
#>  vctrs         0.4.0   2022-03-30 [1] CRAN (R 4.1.3)
#>  withr         2.5.0   2022-03-03 [1] CRAN (R 4.1.3)
#>  xfun          0.30    2022-03-02 [1] CRAN (R 4.1.3)
#>  xml2          1.3.3   2021-11-30 [1] CRAN (R 4.1.3)
#>  yaml          2.3.5   2022-02-21 [1] CRAN (R 4.1.2)
#> 
#>  [1] C:/Users/neale/Documents/R/win-library/4.1
#>  [2] C:/Program Files/R/R-4.1.3/library
#> 
#> ------------------------------------------------------------------------------
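
If a wide layout is wanted but counting still matters, one option is a 0/1 indicator column per risk factor instead of positional columns. The following is a minimal sketch, not something from the posts above: it combines separate_rows with pivot_wider, and the names present, indicator_wide and factor_totals are hypothetical.

```r
library(dplyr)
library(tidyr)

# Example data from the original question
linelist_data <- tibble::tribble(
  ~id, ~age, ~age_unit, ~risk_factors,
  1, 20, "years", "diabetes, hypertension",
  2, 26, "years", "immunocompromised",
  3, 24, "years", "lupus, hypertension, immunocompromised"
)

indicator_wide <- linelist_data |>
  # one row per case and risk factor
  separate_rows(risk_factors, sep = ",\\s*") |>
  # flag each factor as present for that case
  mutate(present = 1L) |>
  # one column per risk factor; 0 where a case does not have it
  pivot_wider(
    names_from = risk_factors,
    values_from = present,
    values_fill = 0L
  )

# column sums give the number of cases with each risk factor
factor_totals <- indicator_wide |>
  summarise(across(-c(id, age, age_unit), sum))

indicator_wide
factor_totals
```

This keeps one row per case, so the indicators can go straight into cross-tabulations or models. Cases with a missing risk_factors value would need handling before the pivot, since they would otherwise produce a column named NA.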