Split variable orderly to create dummy variable

Hello Applied Epi community
I am trying to clean my data and facing issues and not able to find help in online resources. The data is entered such that the variable has several responses entered in each row (separated by commas).
I am trying to split these responses into several columns that should have one response each. But my script is causing a problem.

I need help to move the newly created split variables (after split) orderly in each column.

For example, I need all entries labeled “6”, “7”, or “11” in one respective column each. A column should have 6 only (if it is entered for that subject and the remaining in that column should be NAs). The same goes for the next column of 7 or 11.

Create dummy variables (binary)

The reason I want this is to create dummy variables based on entries (either 6, 7, or 11 in each column).
Is it possible to pipe into a script here so that a 0, 1 (binary/dummy) variable gets created within the same script?

I am sharing the reprex below, if you can look into for help please.

pacman::p_load(rio, lubridate, datapasta, reprex, tidyverse) 

df_test <- tibble::tribble(~study_no, ~table2_major_symptoms,1L,"11",2L,"11",5L,"1,6,11",6L,"6,7,11",7L,"7,8,11",15L,"14 (psoas abscess)")

# Split variable --------------------------------------------------------
df_test <- df_test  %>% separate_wider_delim(`table2_major_symptoms` , 
                                           delim = "," ,
                                           names = c("a" , "b" , "c" , "d") ,
                                           too_few = "align_start",
                                           too_many = "merge",
                                           cols_remove = FALSE
                                           )

Created on 2023-10-14 with reprex v2.0.2

Session info
sessionInfo()
#> R version 4.3.1 (2023-06-16 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19045)
#> 
#> Matrix products: default
#> 
#> 
#> locale:
#> [1] LC_COLLATE=Norwegian Bokmål_Norway.utf8 
#> [2] LC_CTYPE=Norwegian Bokmål_Norway.utf8   
#> [3] LC_MONETARY=Norwegian Bokmål_Norway.utf8
#> [4] LC_NUMERIC=C                            
#> [5] LC_TIME=Norwegian Bokmål_Norway.utf8    
#> 
#> time zone: Europe/Paris
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] forcats_1.0.0   stringr_1.5.0   dplyr_1.1.2     purrr_1.0.1    
#>  [5] readr_2.1.4     tidyr_1.3.0     tibble_3.2.1    ggplot2_3.4.2  
#>  [9] tidyverse_2.0.0 reprex_2.0.2    datapasta_3.1.0 lubridate_1.9.2
#> [13] rio_0.5.29     
#> 
#> loaded via a namespace (and not attached):
#>  [1] utf8_1.2.3        generics_0.1.3    stringi_1.7.12    hms_1.1.3        
#>  [5] digest_0.6.33     magrittr_2.0.3    evaluate_0.21     grid_4.3.1       
#>  [9] timechange_0.2.0  fastmap_1.1.1     cellranger_1.1.0  zip_2.3.0        
#> [13] fansi_1.0.4       scales_1.2.1      cli_3.6.1         rlang_1.1.1      
#> [17] munsell_0.5.0     withr_2.5.0       yaml_2.3.7        tools_4.3.1      
#> [21] tzdb_0.4.0        colorspace_2.1-0  pacman_0.5.1      curl_5.0.1       
#> [25] vctrs_0.6.3       R6_2.5.1          lifecycle_1.0.3   fs_1.6.3         
#> [29] foreign_0.8-84    pkgconfig_2.0.3   pillar_1.9.0      openxlsx_4.2.5.2 
#> [33] gtable_0.3.3      data.table_1.14.8 glue_1.6.2        Rcpp_1.0.11      
#> [37] haven_2.5.3       xfun_0.39         tidyselect_1.2.0  rstudioapi_0.15.0
#> [41] knitr_1.43        htmltools_0.5.5   rmarkdown_2.23    compiler_4.3.1   
#> [45] readxl_1.4.3
1 Like

Hello,

I believe the following approach may be better for creating these binary indictors:

# Loading packages (lubridate is included in tidyverse now)
library(tidyverse)

df_test <- tribble(~study_no, ~table2_major_symptoms, 
                                     1L, "11",
                                     2L, "11",
                                     5L, "1,6,11",
                                     6L, "6,7,11",
                                     7L, "7,8,11",
                                     15L, "14 (psoas abscess)")

# Creating binary indicators for each variable by checking
# if it is detected
df_test |>
  mutate(
    split = str_split(table2_major_symptoms, ","),
    has_6 = str_detect(table2_major_symptoms, "6"),
    has_7 = str_detect(table2_major_symptoms, "7"),
    has_11 = str_detect(table2_major_symptoms, "11")
  ) |>
    select(-split)
#> # A tibble: 6 × 5
#>   study_no table2_major_symptoms has_6 has_7 has_11
#>      <int> <chr>                 <lgl> <lgl> <lgl> 
#> 1        1 11                    FALSE FALSE TRUE  
#> 2        2 11                    FALSE FALSE TRUE  
#> 3        5 1,6,11                TRUE  FALSE TRUE  
#> 4        6 6,7,11                TRUE  TRUE  TRUE  
#> 5        7 7,8,11                FALSE TRUE  TRUE  
#> 6       15 14 (psoas abscess)    FALSE FALSE FALSE

Created on 2023-10-14 with reprex v2.0.2

Session info
sessionInfo()
#> R version 4.3.1 (2023-06-16)
#> Platform: x86_64-apple-darwin20 (64-bit)
#> Running under: macOS Ventura 13.5.2
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: America/Toronto
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] lubridate_1.9.2 forcats_1.0.0   stringr_1.5.0   dplyr_1.1.3    
#>  [5] purrr_1.0.2     readr_2.1.4     tidyr_1.3.0     tibble_3.2.1   
#>  [9] ggplot2_3.4.3   tidyverse_2.0.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] gtable_0.3.4      compiler_4.3.1    reprex_2.0.2      tidyselect_1.2.0 
#>  [5] scales_1.2.1      yaml_2.3.7        fastmap_1.1.1     R6_2.5.1         
#>  [9] generics_0.1.3    knitr_1.44        munsell_0.5.0     R.cache_0.16.0   
#> [13] tzdb_0.4.0        pillar_1.9.0      R.utils_2.12.2    rlang_1.1.1      
#> [17] utf8_1.2.3        stringi_1.7.12    xfun_0.40         fs_1.6.3         
#> [21] timechange_0.2.0  cli_3.6.1         withr_2.5.0       magrittr_2.0.3   
#> [25] digest_0.6.33     grid_4.3.1        rstudioapi_0.15.0 hms_1.1.3        
#> [29] lifecycle_1.0.3   R.methodsS3_1.8.2 R.oo_1.25.0       vctrs_0.6.3      
#> [33] evaluate_0.21     glue_1.6.2        styler_1.10.2     fansi_1.0.4      
#> [37] colorspace_2.1-0  rmarkdown_2.25    tools_4.3.1       pkgconfig_2.0.3  
#> [41] htmltools_0.5.6

Of course, should you have other symptoms to add you could generalize this approach.

All the best,

Tim

1 Like

Hello Tim, @machupovirus
Thank you, very helpful! It really solved what I was looking for.
To adapt, I have two queries below if you can please leave a response!

df_test <- tibble::tribble(~study_no, ~table2_major_symptoms,1L,"11",2L,"11",5L,"1,6,11",6L,"6,7,11",7L,"7,8,11",15L,"14 (psoas abscess)")

Binary_data <-  df_test |>
  mutate(
    split = str_split(table2_major_symptoms, ","),
    has_6 = str_detect(table2_major_symptoms, "6"),
    has_7 = str_detect(table2_major_symptoms, "7"),
    has_11 = str_detect(table2_major_symptoms, "11")
  ) |>
  select(-split)           

# Can you plz share what "select(-split)" would do in the line above? 
# For the steps below, is it possible to do it above suggested solution with fewer codes and as part of the script suggested above while using "mutate"? As I still need to have numeric 1,0. 

# Using ifelse to create numeric (1,0) for True/False
has_6_new <- ifelse(Binary_data$has_6 == 'TRUE', 1, 0)
has_7_new <- ifelse(Binary_data$has_7 == 'TRUE', 1, 0)

# Binding the newly created column with main dataset
Binary_data <- bind_cols(Binary_data , has_6_new , has_7_new)

# Renaming the newly created columns 
names(Binary_data)[10:11] = c("has_6_new" , "has_7_new")
names(Binary_data)

``` r
1 Like

Hello,

select(-split) is just removing the split variable from our data since we no longer need it, rather, it was just a variable we used in order to derive our new indicators. For example, look at the outcome of this code when I remove that part:

# Loading packages (lubridate is included in tidyverse now)
library(tidyverse)

df_test <- tribble(~study_no, ~table2_major_symptoms, 
                                     1L, "11",
                                     2L, "11",
                                     5L, "1,6,11",
                                     6L, "6,7,11",
                                     7L, "7,8,11",
                                     15L, "14 (psoas abscess)")

# Creating binary indicators for each variable by checking
# if it is detected
df_test |>
  mutate(
    split = str_split(table2_major_symptoms, ","),
    has_6 = str_detect(table2_major_symptoms, "6"),
    has_7 = str_detect(table2_major_symptoms, "7"),
    has_11 = str_detect(table2_major_symptoms, "11")
  ) 
#> # A tibble: 6 × 6
#>   study_no table2_major_symptoms split     has_6 has_7 has_11
#>      <int> <chr>                 <list>    <lgl> <lgl> <lgl> 
#> 1        1 11                    <chr [1]> FALSE FALSE TRUE  
#> 2        2 11                    <chr [1]> FALSE FALSE TRUE  
#> 3        5 1,6,11                <chr [3]> TRUE  FALSE TRUE  
#> 4        6 6,7,11                <chr [3]> TRUE  TRUE  TRUE  
#> 5        7 7,8,11                <chr [3]> FALSE TRUE  TRUE  
#> 6       15 14 (psoas abscess)    <chr [1]> FALSE FALSE FALSE

Created on 2023-10-15 with reprex v2.0.2

Session info
sessionInfo()
#> R version 4.3.1 (2023-06-16)
#> Platform: x86_64-apple-darwin20 (64-bit)
#> Running under: macOS Ventura 13.5.2
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: America/Toronto
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] lubridate_1.9.2 forcats_1.0.0   stringr_1.5.0   dplyr_1.1.3    
#>  [5] purrr_1.0.2     readr_2.1.4     tidyr_1.3.0     tibble_3.2.1   
#>  [9] ggplot2_3.4.3   tidyverse_2.0.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] gtable_0.3.4      compiler_4.3.1    reprex_2.0.2      tidyselect_1.2.0 
#>  [5] scales_1.2.1      yaml_2.3.7        fastmap_1.1.1     R6_2.5.1         
#>  [9] generics_0.1.3    knitr_1.44        munsell_0.5.0     R.cache_0.16.0   
#> [13] tzdb_0.4.0        pillar_1.9.0      R.utils_2.12.2    rlang_1.1.1      
#> [17] utf8_1.2.3        stringi_1.7.12    xfun_0.40         fs_1.6.3         
#> [21] timechange_0.2.0  cli_3.6.1         withr_2.5.0       magrittr_2.0.3   
#> [25] digest_0.6.33     grid_4.3.1        rstudioapi_0.15.0 hms_1.1.3        
#> [29] lifecycle_1.0.3   R.methodsS3_1.8.2 R.oo_1.25.0       vctrs_0.6.3      
#> [33] evaluate_0.21     glue_1.6.2        styler_1.10.2     fansi_1.0.4      
#> [37] colorspace_2.1-0  rmarkdown_2.25    tools_4.3.1       pkgconfig_2.0.3  
#> [41] htmltools_0.5.6

With respect to your second question, while logicals are essentially equivalent to a 0/1 indicator, you can easily tell R to force logical variables into integers that are either 0 or 1, as seen below:

# Loading packages (lubridate is included in tidyverse now)
library(tidyverse)

df_test <- tribble(~study_no, ~table2_major_symptoms, 
                                     1L, "11",
                                     2L, "11",
                                     5L, "1,6,11",
                                     6L, "6,7,11",
                                     7L, "7,8,11",
                                     15L, "14 (psoas abscess)")

# Creating binary indicators for each variable by checking
# if it is detected
df_test |>
  mutate(
    split = str_split(table2_major_symptoms, ","),
    has_6 = as.integer(str_detect(table2_major_symptoms, "6")),
    has_7 = as.integer(str_detect(table2_major_symptoms, "7")),
    has_11 = as.integer(str_detect(table2_major_symptoms, "11"))
  ) |>
    select(-split)
#> # A tibble: 6 × 5
#>   study_no table2_major_symptoms has_6 has_7 has_11
#>      <int> <chr>                 <int> <int>  <int>
#> 1        1 11                        0     0      1
#> 2        2 11                        0     0      1
#> 3        5 1,6,11                    1     0      1
#> 4        6 6,7,11                    1     1      1
#> 5        7 7,8,11                    0     1      1
#> 6       15 14 (psoas abscess)        0     0      0

Created on 2023-10-15 with reprex v2.0.2

Session info
sessionInfo()
#> R version 4.3.1 (2023-06-16)
#> Platform: x86_64-apple-darwin20 (64-bit)
#> Running under: macOS Ventura 13.5.2
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: America/Toronto
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] lubridate_1.9.2 forcats_1.0.0   stringr_1.5.0   dplyr_1.1.3    
#>  [5] purrr_1.0.2     readr_2.1.4     tidyr_1.3.0     tibble_3.2.1   
#>  [9] ggplot2_3.4.3   tidyverse_2.0.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] gtable_0.3.4      compiler_4.3.1    reprex_2.0.2      tidyselect_1.2.0 
#>  [5] scales_1.2.1      yaml_2.3.7        fastmap_1.1.1     R6_2.5.1         
#>  [9] generics_0.1.3    knitr_1.44        munsell_0.5.0     R.cache_0.16.0   
#> [13] tzdb_0.4.0        pillar_1.9.0      R.utils_2.12.2    rlang_1.1.1      
#> [17] utf8_1.2.3        stringi_1.7.12    xfun_0.40         fs_1.6.3         
#> [21] timechange_0.2.0  cli_3.6.1         withr_2.5.0       magrittr_2.0.3   
#> [25] digest_0.6.33     grid_4.3.1        rstudioapi_0.15.0 hms_1.1.3        
#> [29] lifecycle_1.0.3   R.methodsS3_1.8.2 R.oo_1.25.0       vctrs_0.6.3      
#> [33] evaluate_0.21     glue_1.6.2        styler_1.10.2     fansi_1.0.4      
#> [37] colorspace_2.1-0  rmarkdown_2.25    tools_4.3.1       pkgconfig_2.0.3  
#> [41] htmltools_0.5.6

All the best,

Tim

2 Likes

That’s awesome, Thank you so much!!

1 Like