I have been tasked with analyzing some linelist data where there is a column for risk factors such that all of the factors present for a given case are concatenated into a character separated by commas. I am interested in producing the frequencies that each risk factor appear in the data but canβt find any simple solution.
Here is an example of what the linelist data looks like:
library(tidyverse)
tibble::tribble(
~id, ~age, ~age_unit, ~risk_factors,
1, 20, "years", "diabetes, hypertension",
2, 26, "years", "immunocompromised",
3, 24, "years", "lupus, hypertension, immunocompromised"
)
#> # A tibble: 3 Γ 4
#> id age age_unit risk_factors
#> <dbl> <dbl> <chr> <chr>
#> 1 1 20 years diabetes, hypertension
#> 2 2 26 years immunocompromised
#> 3 3 24 years lupus, hypertension, immunocompromised
There is actually a very handy function in the tidyr package that will help. The function is named separate_rows and will essentially break the concatenated risk factors character up at a given delimiter and create a row for each with the values for all other variables held constant.
Here is a quick example where the linelist data is converted to long data and then the frequencies for each risk factor are produced:
#Loading tidyverse
library(tidyverse)
#Generating linelist data
linelist_data <-
tibble(
id = seq_len(length.out = 3),
age = rpois(n = 3, lambda = 25),
age_unit = rep(x = "years", 3),
risk_factors = c(
"diabetes, hypertension",
"immunocompromised",
"lupus, hypertension, immunocompromised"
)
)
#Creating long data
long_data <-
linelist_data |>
separate_rows(risk_factors, sep = ", ")
#Examining the long data
long_data |>
slice_head(n = 5)
#> # A tibble: 5 Γ 4
#> id age age_unit risk_factors
#> <int> <int> <chr> <chr>
#> 1 1 18 years diabetes
#> 2 1 18 years hypertension
#> 3 2 31 years immunocompromised
#> 4 3 25 years lupus
#> 5 3 25 years hypertension
#Counting risk factors
long_data |>
count(risk_factors)
#> # A tibble: 4 Γ 2
#> risk_factors n
#> <chr> <int>
#> 1 diabetes 1
#> 2 hypertension 2
#> 3 immunocompromised 2
#> 4 lupus 1
Timβs answer is great, and below I offer some additional code using separate() in case you want to split the symptoms into their own columns (not as great for counting, but perhaps for other purposes).
# load packages
pacman::p_load(tidyverse)
# create example data
linelist_data <-
tibble(
id = seq_len(length.out = 3),
age = rpois(n = 3, lambda = 25),
age_unit = rep(x = "years", 3),
risk_factors = c(
"diabetes, hypertension",
"immunocompromised",
"lupus, hypertension, immunocompromised"
)
)
# separate risk factors into columns
split_wide <- linelist_data %>%
separate(risk_factors, into = c("sym_1", "sym_2", "sym_3"), sep=",")
#> Warning: Expected 3 pieces. Missing pieces filled with `NA` in 2 rows [1, 2].
# print wide data
split_wide
#> # A tibble: 3 x 6
#> id age age_unit sym_1 sym_2 sym_3
#> <int> <int> <chr> <chr> <chr> <chr>
#> 1 1 25 years diabetes " hypertension" <NA>
#> 2 2 25 years immunocompromised <NA> <NA>
#> 3 3 26 years lupus " hypertension" " immunocompromised"