Stuck in my de-duplication workflow

Describe your issue

  • Hello! I’m currently collecting data coming from two sources (inpatient/admitted and outpatient/clinic).
  • The variable case_number is unique and is used by the hospital for both inpatient and outpatient thus will primarily be used for deduplication.

What steps have you already taken to find an answer?

  • I am able to merge duplicates of dichotomous 1/0 variables by following the de-duplication chapter, specifically the part on roll-up values and overwrite based on hierarchy.
  • I was also able to re-classify the cases according to tiered case definitions.
  • However, I am unsure how to apply the case_when() function for dates or numbers. Once roll-up is executed, dates or numbers become a string variable separated by β€œ;” such as 2023-01-26; 2023-02-17 or 26; 27. How do I tell case_when() to take the earliest date (as in the variable date_abstracted) or the highest number (as in the variable age_years_mother) in the roll-up string?

Provide an example of your R code

# create demo dataset
demo_raw <- data.frame(
   stringsAsFactors = FALSE,
        case_number = c(101, 101, 102, 102, 103, 103, 104, 104, 104),
    date_abstracted = c("2023-04-01",
                        "2023-04-03","2023-04-02","2023-04-03","2023-04-03",
                        "2023-04-10","2023-04-10","2023-04-20","2023-05-01"),
     ssx_cataract_a = c(0, 1, 0, 0, 0, 0, 1, 1, 1),
      ssx_hearing_a = c(0, 0, 0, 0, 1, 1, 0, 0, 0),
          ssx_chd_a = c(1, 0, 1, 1, 1, 1, 0, 0, 0),
 ssx_microcephaly_b = c(0, 0, 0, 1, 0, 0, 0, 0, 0),
        ssx_delay_b = c(0, 0, 0, 0, 0, 0, 0, 0, 0),
     ssx_jaundice_b = c(0, 0, 0, 0, 0, 0, 0, 0, 0),
               ig_m = c("NA", "NA", "NA", "NA", "NA", "1", "NA", "NA", "NA"),
               ig_g = c("NA", "NA", "NA", "NA", "NA", "1", "NA", "NA", "NA"),
case_classification = c("susp","susp","susp",
                        "susp","prob","conf","susp","susp","susp"),
   age_years_mother = c("26", "26", "35", "35", "NA", "NA", "26", "27", "NA")
)

# roll-up and overwrite with hierarchy

demo_roll <- demo_raw %>%
# Roll-up values into one row and keeping only unique values
group_by(case_number) %>%
summarise(
  across(everything(),                           # apply to all columns
         ~paste0(unique(.x), collapse = "; ")))  # function is defined which combines unique non-NA values

demo_dedup <- demo_roll %>%   
# Replace text with "highest" value present in the string
mutate(across(c(starts_with("ssx_")),                              # operates on specified columns (Y/N/U)
              list(mod = ~case_when(                               # adds suffix "_mod" to new cols; implements case_when()
                
                str_detect(.x, "1")         ~ "1",                 # if "Yes" is detected, then cell value converts to yes
                str_detect(.x, "0")         ~ "0",                 # then, if "No" is detected, then cell value converts to no
                TRUE                        ~ as.character(.x)))), # then, if anything else if it kept as is
       .keep = "unused") %>%                                       # old columns removed, leaving only _mod columns
mutate(across(c(starts_with("ig_")),                               # operates on specified columns (Y/N/U)
              list(mod = ~case_when(                               # adds suffix "_mod" to new cols; implements case_when()
                
                str_detect(.x, "1")         ~ "1",                 # if "Yes" is detected, then cell value converts to yes
                str_detect(.x, "0")         ~ "0",                 # then, if "No" is detected, then cell value converts to no
                TRUE                        ~ as.character(.x)))), # then, if anything else if it kept as is
       .keep = "unused") %>%                                       # old columns removed, leaving only _mod columns
# change type to numeric
mutate(across(.cols = c(starts_with("ssx_"),
                        starts_with("ig_")),
              .fns = as.numeric)) %>% 
# Create variable adding all category A manifestations per case
group_by(case_number) %>% 
mutate(ssx_total_a = sum(across(ends_with("_a_mod"))),
       ssx_total_b = sum(across(ends_with("_b_mod")))) %>%
ungroup() %>% 
# Change case_classification according to the tiered case definitions
mutate(case_classification = case_when(
                ig_m_mod == "1" & ssx_total_a >= "1"        ~ "conf",
                ig_g_mod == "1" & ssx_total_a >= "1"        ~ "conf",
                ssx_total_a > "1"                       ~ "prob",
                ssx_total_a == "1" & ssx_total_b >= "1" ~ "prob",
                TRUE                                    ~ "susp")) # then, if anything else if it kept as is

1 Like

Hi Ian,

I would approach the problem by taking the maximum date, though you could switch this out for a minimum or something else, across all of the dates for a given case ID. Here is some code demonstrating how this can be done, note that it uses the map_vec function from the purrr package:

library(tidyverse)

# create demo dataset
demo_raw <- data.frame(
    stringsAsFactors = FALSE,
    case_number = c(101, 101, 102, 102, 103, 103, 104, 104, 104),
    date_abstracted = c(
        "2023-04-01",
        "2023-04-03",
        "2023-04-02",
        "2023-04-03",
        "2023-04-03",
        "2023-04-10",
        "2023-04-10",
        "2023-04-20",
        "2023-05-01"
    ),
    ssx_cataract_a = c(0, 1, 0, 0, 0, 0, 1, 1, 1),
    ssx_hearing_a = c(0, 0, 0, 0, 1, 1, 0, 0, 0),
    ssx_chd_a = c(1, 0, 1, 1, 1, 1, 0, 0, 0),
    ssx_microcephaly_b = c(0, 0, 0, 1, 0, 0, 0, 0, 0),
    ssx_delay_b = c(0, 0, 0, 0, 0, 0, 0, 0, 0),
    ssx_jaundice_b = c(0, 0, 0, 0, 0, 0, 0, 0, 0),
    ig_m = c("NA", "NA", "NA", "NA", "NA", "1", "NA", "NA", "NA"),
    ig_g = c("NA", "NA", "NA", "NA", "NA", "1", "NA", "NA", "NA"),
    case_classification = c(
        "susp",
        "susp",
        "susp",
        "susp",
        "prob",
        "conf",
        "susp",
        "susp",
        "susp"
    ),
    age_years_mother = c("26", "26", "35", "35", "NA", "NA", "26", "27", "NA")
)

# roll-up and overwrite with hierarchy

demo_roll <- demo_raw |>
    # Roll-up values into one row and keeping only unique values
    group_by(case_number) |>
    summarise(across(everything(),                           # apply to all columns
                                     ~ paste0(unique(.x), collapse = "; ")))  # function is defined which combines unique non-NA values

demo_roll |>
    mutate(max_date = map_vec(str_split(date_abstracted, ";"), \(x) {x |> ymd() |> max()}))
#> # A tibble: 4 Γ— 13
#>   case_number date_abstracted             ssx_cataract_a ssx_hearing_a ssx_chd_a
#>         <dbl> <chr>                       <chr>          <chr>         <chr>    
#> 1         101 2023-04-01; 2023-04-03      0; 1           0             1; 0     
#> 2         102 2023-04-02; 2023-04-03      0              0             1        
#> 3         103 2023-04-03; 2023-04-10      0              1             1        
#> 4         104 2023-04-10; 2023-04-20; 20… 1              0             0        
#> # β„Ή 8 more variables: ssx_microcephaly_b <chr>, ssx_delay_b <chr>,
#> #   ssx_jaundice_b <chr>, ig_m <chr>, ig_g <chr>, case_classification <chr>,
#> #   age_years_mother <chr>, max_date <date>

Created on 2023-05-13 with reprex v2.0.2

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.3 (2023-03-15)
#>  os       macOS Big Sur ... 10.16
#>  system   x86_64, darwin17.0
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       America/Toronto
#>  date     2023-05-13
#>  pandoc   2.19.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  cli           3.6.1   2023-03-23 [1] CRAN (R 4.2.0)
#>  colorspace    2.1-0   2023-01-23 [1] CRAN (R 4.2.0)
#>  digest        0.6.31  2022-12-11 [1] CRAN (R 4.2.0)
#>  dplyr       * 1.1.2   2023-04-20 [1] CRAN (R 4.2.0)
#>  evaluate      0.21    2023-05-05 [1] CRAN (R 4.2.0)
#>  fansi         1.0.4   2023-01-22 [1] CRAN (R 4.2.0)
#>  fastmap       1.1.1   2023-02-24 [1] CRAN (R 4.2.0)
#>  forcats     * 1.0.0   2023-01-29 [1] CRAN (R 4.2.2)
#>  fs            1.6.2   2023-04-25 [1] CRAN (R 4.2.0)
#>  generics      0.1.3   2022-07-05 [1] CRAN (R 4.2.0)
#>  ggplot2     * 3.4.2   2023-04-03 [1] CRAN (R 4.2.0)
#>  glue          1.6.2   2022-02-24 [1] CRAN (R 4.2.0)
#>  gtable        0.3.3   2023-03-21 [1] CRAN (R 4.2.0)
#>  hms           1.1.3   2023-03-21 [1] CRAN (R 4.2.0)
#>  htmltools     0.5.5   2023-03-23 [1] CRAN (R 4.2.0)
#>  knitr         1.42    2023-01-25 [1] CRAN (R 4.2.0)
#>  lifecycle     1.0.3   2022-10-07 [1] CRAN (R 4.2.1)
#>  lubridate   * 1.9.2   2023-02-10 [1] CRAN (R 4.2.2)
#>  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.2.0)
#>  munsell       0.5.0   2018-06-12 [1] CRAN (R 4.2.0)
#>  pillar        1.9.0   2023-03-22 [1] CRAN (R 4.2.3)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.2.0)
#>  purrr       * 1.0.1   2023-01-10 [1] CRAN (R 4.2.2)
#>  R.cache       0.16.0  2022-07-21 [1] CRAN (R 4.2.0)
#>  R.methodsS3   1.8.2   2022-06-13 [1] CRAN (R 4.2.0)
#>  R.oo          1.25.0  2022-06-12 [1] CRAN (R 4.2.0)
#>  R.utils       2.12.2  2022-11-11 [1] CRAN (R 4.2.0)
#>  R6            2.5.1   2021-08-19 [1] CRAN (R 4.2.0)
#>  readr       * 2.1.4   2023-02-10 [1] CRAN (R 4.2.2)
#>  reprex        2.0.2   2022-08-17 [1] RSPM (R 4.2.1)
#>  rlang         1.1.1   2023-04-28 [1] CRAN (R 4.2.0)
#>  rmarkdown     2.21    2023-03-26 [1] CRAN (R 4.2.0)
#>  rstudioapi    0.14    2022-08-22 [1] RSPM (R 4.2.1)
#>  scales        1.2.1   2022-08-20 [1] RSPM (R 4.2.1)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.2.0)
#>  stringi       1.7.12  2023-01-11 [1] CRAN (R 4.2.2)
#>  stringr     * 1.5.0   2022-12-02 [1] CRAN (R 4.2.0)
#>  styler        1.9.1   2023-03-04 [1] CRAN (R 4.2.0)
#>  tibble      * 3.2.1   2023-03-20 [1] CRAN (R 4.2.3)
#>  tidyr       * 1.3.0   2023-01-24 [1] CRAN (R 4.2.0)
#>  tidyselect    1.2.0   2022-10-10 [1] CRAN (R 4.2.0)
#>  tidyverse   * 2.0.0   2023-02-22 [1] CRAN (R 4.2.2)
#>  timechange    0.2.0   2023-01-11 [1] CRAN (R 4.2.2)
#>  tzdb          0.3.0   2022-03-28 [1] CRAN (R 4.2.0)
#>  utf8          1.2.3   2023-01-31 [1] CRAN (R 4.2.2)
#>  vctrs         0.6.2   2023-04-19 [1] CRAN (R 4.2.0)
#>  withr         2.5.0   2022-03-03 [1] CRAN (R 4.2.0)
#>  xfun          0.39    2023-04-20 [1] CRAN (R 4.2.0)
#>  yaml          2.3.7   2023-01-23 [1] CRAN (R 4.2.0)
#> 
#>  [1] /Users/timothychisamore/Library/R/x86_64/4.2/library
#>  [2] /Library/Frameworks/R.framework/Versions/4.2/Resources/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

All the best,

Tim

Hi Tim,

Thank you for your response. I tried applying map_vec function to age_years_mother variable, however it seems that when there is an NA, it takes the NA rather than the max age?

demo_merge <- demo_roll |>
  mutate(max_date = map_vec(str_split(date_abstracted, ";"), \(x) {x |> ymd() |> max()})) %>% 
  mutate(max_age = map_vec(str_split(age_years_mother, ";"), \(x) {x |> as.numeric() |> max()}))

Alternatively, when I try to add na.rm = T within max(), case 104 correctly has the right max_age of 27, but case 103 with only an NA in age_years_mother gets a -Inf in max_age.

1 Like

Hi Ian,

You are correct in how to handle missing values in the max function, as well as many others within R. However, you will need to decide whether this is how you want to handle the situation, or if some other summary may be better.

With regards to why -Inf is returned for case 103, that is because there is only a single value for age_years_mother and it happens to be NA. As a result, since you’ve told R to remove NAs, the lowest possible value you could have is -Inf so it is returned. You could write additional code to check first if the only value(s) in these variable are NA and to return NA if so. Again, this will ultimately require you to decide how you want to handle the data, I don’t think there’s any one answer that is the absolute best.

Finally, I think the following code is a better solution for the problem:

library(tidyverse)

# create demo dataset
demo_raw <- data.frame(
    stringsAsFactors = FALSE,
    case_number = c(101, 101, 102, 102, 103, 103, 104, 104, 104),
    date_abstracted = c(
        "2023-04-01",
        "2023-04-03",
        "2023-04-02",
        "2023-04-03",
        "2023-04-03",
        "2023-04-10",
        "2023-04-10",
        "2023-04-20",
        "2023-05-01"
    ),
    ssx_cataract_a = c(0, 1, 0, 0, 0, 0, 1, 1, 1),
    ssx_hearing_a = c(0, 0, 0, 0, 1, 1, 0, 0, 0),
    ssx_chd_a = c(1, 0, 1, 1, 1, 1, 0, 0, 0),
    ssx_microcephaly_b = c(0, 0, 0, 1, 0, 0, 0, 0, 0),
    ssx_delay_b = c(0, 0, 0, 0, 0, 0, 0, 0, 0),
    ssx_jaundice_b = c(0, 0, 0, 0, 0, 0, 0, 0, 0),
    ig_m = c("NA", "NA", "NA", "NA", "NA", "1", "NA", "NA", "NA"),
    ig_g = c("NA", "NA", "NA", "NA", "NA", "1", "NA", "NA", "NA"),
    case_classification = c(
        "susp",
        "susp",
        "susp",
        "susp",
        "prob",
        "conf",
        "susp",
        "susp",
        "susp"
    ),
    age_years_mother = c("26", "26", "35", "35", "NA", "NA", "26", "27", "NA")
)

# roll-up and overwrite with hierarchy

demo_roll <- demo_raw |>
    # Roll-up values into one row and keeping only unique values
    group_by(case_number) |>
    summarise(across(everything(),                           # apply to all columns
                                     ~ paste0(unique(.x), collapse = "; ")))  # function is defined which combines unique non-NA values

demo_roll |>
    mutate(
        max_date = map(str_split(date_abstracted, ";"), ymd),
        max_age = map(str_split(age_years_mother, ";"), as.integer)
    ) |>
    rowwise() |>
    mutate(max_date = max(max_date, na.rm = TRUE),
                 max_age = max(max_age, na.rm = TRUE)) |>
    ungroup()
#> Warning: There were 2 warnings in `mutate()`.
#> The first warning was:
#> β„Ή In argument: `max_age = map(str_split(age_years_mother, ";"), as.integer)`.
#> Caused by warning:
#> ! NAs introduced by coercion
#> β„Ή Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
#> Warning: There was 1 warning in `mutate()`.
#> β„Ή In argument: `max_age = max(max_age, na.rm = TRUE)`.
#> β„Ή In row 3.
#> Caused by warning in `max()`:
#> ! no non-missing arguments to max; returning -Inf
#> # A tibble: 4 Γ— 14
#>   case_number date_abstracted             ssx_cataract_a ssx_hearing_a ssx_chd_a
#>         <dbl> <chr>                       <chr>          <chr>         <chr>    
#> 1         101 2023-04-01; 2023-04-03      0; 1           0             1; 0     
#> 2         102 2023-04-02; 2023-04-03      0              0             1        
#> 3         103 2023-04-03; 2023-04-10      0              1             1        
#> 4         104 2023-04-10; 2023-04-20; 20… 1              0             0        
#> # β„Ή 9 more variables: ssx_microcephaly_b <chr>, ssx_delay_b <chr>,
#> #   ssx_jaundice_b <chr>, ig_m <chr>, ig_g <chr>, case_classification <chr>,
#> #   age_years_mother <chr>, max_date <date>, max_age <dbl>

Created on 2023-05-17 with reprex v2.0.2

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.3 (2023-03-15)
#>  os       macOS Big Sur ... 10.16
#>  system   x86_64, darwin17.0
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       America/Toronto
#>  date     2023-05-17
#>  pandoc   2.19.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  cli           3.6.1   2023-03-23 [1] CRAN (R 4.2.0)
#>  colorspace    2.1-0   2023-01-23 [1] CRAN (R 4.2.0)
#>  digest        0.6.31  2022-12-11 [1] CRAN (R 4.2.0)
#>  dplyr       * 1.1.2   2023-04-20 [1] CRAN (R 4.2.0)
#>  evaluate      0.21    2023-05-05 [1] CRAN (R 4.2.0)
#>  fansi         1.0.4   2023-01-22 [1] CRAN (R 4.2.0)
#>  fastmap       1.1.1   2023-02-24 [1] CRAN (R 4.2.0)
#>  forcats     * 1.0.0   2023-01-29 [1] CRAN (R 4.2.2)
#>  fs            1.6.2   2023-04-25 [1] CRAN (R 4.2.0)
#>  generics      0.1.3   2022-07-05 [1] CRAN (R 4.2.0)
#>  ggplot2     * 3.4.2   2023-04-03 [1] CRAN (R 4.2.0)
#>  glue          1.6.2   2022-02-24 [1] CRAN (R 4.2.0)
#>  gtable        0.3.3   2023-03-21 [1] CRAN (R 4.2.0)
#>  hms           1.1.3   2023-03-21 [1] CRAN (R 4.2.0)
#>  htmltools     0.5.5   2023-03-23 [1] CRAN (R 4.2.0)
#>  knitr         1.42    2023-01-25 [1] CRAN (R 4.2.0)
#>  lifecycle     1.0.3   2022-10-07 [1] CRAN (R 4.2.1)
#>  lubridate   * 1.9.2   2023-02-10 [1] CRAN (R 4.2.2)
#>  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.2.0)
#>  munsell       0.5.0   2018-06-12 [1] CRAN (R 4.2.0)
#>  pillar        1.9.0   2023-03-22 [1] CRAN (R 4.2.3)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.2.0)
#>  purrr       * 1.0.1   2023-01-10 [1] CRAN (R 4.2.2)
#>  R.cache       0.16.0  2022-07-21 [1] CRAN (R 4.2.0)
#>  R.methodsS3   1.8.2   2022-06-13 [1] CRAN (R 4.2.0)
#>  R.oo          1.25.0  2022-06-12 [1] CRAN (R 4.2.0)
#>  R.utils       2.12.2  2022-11-11 [1] CRAN (R 4.2.0)
#>  R6            2.5.1   2021-08-19 [1] CRAN (R 4.2.0)
#>  readr       * 2.1.4   2023-02-10 [1] CRAN (R 4.2.2)
#>  reprex        2.0.2   2022-08-17 [1] RSPM (R 4.2.1)
#>  rlang         1.1.1   2023-04-28 [1] CRAN (R 4.2.0)
#>  rmarkdown     2.21    2023-03-26 [1] CRAN (R 4.2.0)
#>  rstudioapi    0.14    2022-08-22 [1] RSPM (R 4.2.1)
#>  scales        1.2.1   2022-08-20 [1] RSPM (R 4.2.1)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.2.0)
#>  stringi       1.7.12  2023-01-11 [1] CRAN (R 4.2.2)
#>  stringr     * 1.5.0   2022-12-02 [1] CRAN (R 4.2.0)
#>  styler        1.9.1   2023-03-04 [1] CRAN (R 4.2.0)
#>  tibble      * 3.2.1   2023-03-20 [1] CRAN (R 4.2.3)
#>  tidyr       * 1.3.0   2023-01-24 [1] CRAN (R 4.2.0)
#>  tidyselect    1.2.0   2022-10-10 [1] CRAN (R 4.2.0)
#>  tidyverse   * 2.0.0   2023-02-22 [1] CRAN (R 4.2.2)
#>  timechange    0.2.0   2023-01-11 [1] CRAN (R 4.2.2)
#>  tzdb          0.3.0   2022-03-28 [1] CRAN (R 4.2.0)
#>  utf8          1.2.3   2023-01-31 [1] CRAN (R 4.2.2)
#>  vctrs         0.6.2   2023-04-19 [1] CRAN (R 4.2.0)
#>  withr         2.5.0   2022-03-03 [1] CRAN (R 4.2.0)
#>  xfun          0.39    2023-04-20 [1] CRAN (R 4.2.0)
#>  yaml          2.3.7   2023-01-23 [1] CRAN (R 4.2.0)
#> 
#>  [1] /Users/timothychisamore/Library/R/x86_64/4.2/library
#>  [2] /Library/Frameworks/R.framework/Versions/4.2/Resources/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

All the best,

Tim

Hi Tim,

With my dataset, I would like to get the max of date or age but if the only value after the β€œroll” is NA, then I would like to retain it as NA rather than -Inf. Adding mutate(max_age = na_if(max_age, -Inf)) at the end of the code after ungroup() seems to do the job, but is there a better way of doing it?

Thank you.

demo_merge <- demo_roll |>
  mutate(
    max_date = map(str_split(date_abstracted, ";"), ymd),
    max_age = map(str_split(age_years_mother, ";"), as.integer)
  ) |>
  rowwise() |>
  mutate(max_date = max(max_date, na.rm = TRUE),
         max_age = max(max_age, na.rm = TRUE)) |>
  ungroup() %>% 
  mutate(max_age = na_if(max_age, -Inf))
1 Like

Hi Ian,

That will definitely do the job and I don’t think there’s any β€œright way” per se. My preference would be to control for this upfront but I think either approach works, here is an example:

library(tidyverse)

# create demo dataset
demo_raw <- data.frame(
    stringsAsFactors = FALSE,
    case_number = c(101, 101, 102, 102, 103, 103, 104, 104, 104),
    date_abstracted = c(
        "2023-04-01",
        "2023-04-03",
        "2023-04-02",
        "2023-04-03",
        "2023-04-03",
        "2023-04-10",
        "2023-04-10",
        "2023-04-20",
        "2023-05-01"
    ),
    ssx_cataract_a = c(0, 1, 0, 0, 0, 0, 1, 1, 1),
    ssx_hearing_a = c(0, 0, 0, 0, 1, 1, 0, 0, 0),
    ssx_chd_a = c(1, 0, 1, 1, 1, 1, 0, 0, 0),
    ssx_microcephaly_b = c(0, 0, 0, 1, 0, 0, 0, 0, 0),
    ssx_delay_b = c(0, 0, 0, 0, 0, 0, 0, 0, 0),
    ssx_jaundice_b = c(0, 0, 0, 0, 0, 0, 0, 0, 0),
    ig_m = c("NA", "NA", "NA", "NA", "NA", "1", "NA", "NA", "NA"),
    ig_g = c("NA", "NA", "NA", "NA", "NA", "1", "NA", "NA", "NA"),
    case_classification = c(
        "susp",
        "susp",
        "susp",
        "susp",
        "prob",
        "conf",
        "susp",
        "susp",
        "susp"
    ),
    age_years_mother = c("26", "26", "35", "35", "NA", "NA", "26", "27", "NA")
)

# roll-up and overwrite with hierarchy

demo_roll <- demo_raw |>
    # Roll-up values into one row and keeping only unique values
    group_by(case_number) |>
    summarise(across(everything(),                           # apply to all columns
                                     ~ paste0(unique(.x), collapse = "; ")))  # function is defined which combines unique non-NA values

demo_roll |>
    mutate(dates = map(str_split(date_abstracted, "; "), ymd),
                 ages = map(
                    str_split(age_years_mother, "; "),
                    \(x) as.integer(na_if(x, "NA")))
                 ) |>
    rowwise() |>
    mutate(max_date = if_else(all(is.na(dates)), NA_Date_, max(dates, na.rm = TRUE)),
                 max_age = if_else(all(is.na(ages)), NA_integer_, max(ages, na.rm = TRUE))) |>
    ungroup()
#> Warning: There was 1 warning in `mutate()`.
#> β„Ή In argument: `max_age = if_else(all(is.na(ages)), NA_integer_, max(ages,
#>   na.rm = TRUE))`.
#> β„Ή In row 3.
#> Caused by warning in `max()`:
#> ! no non-missing arguments to max; returning -Inf
#> # A tibble: 4 Γ— 16
#>   case_number date_abstracted             ssx_cataract_a ssx_hearing_a ssx_chd_a
#>         <dbl> <chr>                       <chr>          <chr>         <chr>    
#> 1         101 2023-04-01; 2023-04-03      0; 1           0             1; 0     
#> 2         102 2023-04-02; 2023-04-03      0              0             1        
#> 3         103 2023-04-03; 2023-04-10      0              1             1        
#> 4         104 2023-04-10; 2023-04-20; 20… 1              0             0        
#> # β„Ή 11 more variables: ssx_microcephaly_b <chr>, ssx_delay_b <chr>,
#> #   ssx_jaundice_b <chr>, ig_m <chr>, ig_g <chr>, case_classification <chr>,
#> #   age_years_mother <chr>, dates <list>, ages <list>, max_date <date>,
#> #   max_age <dbl>

Created on 2023-05-18 with reprex v2.0.2

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.3 (2023-03-15)
#>  os       macOS Big Sur ... 10.16
#>  system   x86_64, darwin17.0
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       America/Toronto
#>  date     2023-05-18
#>  pandoc   2.19.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  cli           3.6.1   2023-03-23 [1] CRAN (R 4.2.0)
#>  colorspace    2.1-0   2023-01-23 [1] CRAN (R 4.2.0)
#>  digest        0.6.31  2022-12-11 [1] CRAN (R 4.2.0)
#>  dplyr       * 1.1.2   2023-04-20 [1] CRAN (R 4.2.0)
#>  evaluate      0.21    2023-05-05 [1] CRAN (R 4.2.0)
#>  fansi         1.0.4   2023-01-22 [1] CRAN (R 4.2.0)
#>  fastmap       1.1.1   2023-02-24 [1] CRAN (R 4.2.0)
#>  forcats     * 1.0.0   2023-01-29 [1] CRAN (R 4.2.2)
#>  fs            1.6.2   2023-04-25 [1] CRAN (R 4.2.0)
#>  generics      0.1.3   2022-07-05 [1] CRAN (R 4.2.0)
#>  ggplot2     * 3.4.2   2023-04-03 [1] CRAN (R 4.2.0)
#>  glue          1.6.2   2022-02-24 [1] CRAN (R 4.2.0)
#>  gtable        0.3.3   2023-03-21 [1] CRAN (R 4.2.0)
#>  hms           1.1.3   2023-03-21 [1] CRAN (R 4.2.0)
#>  htmltools     0.5.5   2023-03-23 [1] CRAN (R 4.2.0)
#>  knitr         1.42    2023-01-25 [1] CRAN (R 4.2.0)
#>  lifecycle     1.0.3   2022-10-07 [1] CRAN (R 4.2.1)
#>  lubridate   * 1.9.2   2023-02-10 [1] CRAN (R 4.2.2)
#>  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.2.0)
#>  munsell       0.5.0   2018-06-12 [1] CRAN (R 4.2.0)
#>  pillar        1.9.0   2023-03-22 [1] CRAN (R 4.2.3)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.2.0)
#>  purrr       * 1.0.1   2023-01-10 [1] CRAN (R 4.2.2)
#>  R.cache       0.16.0  2022-07-21 [1] CRAN (R 4.2.0)
#>  R.methodsS3   1.8.2   2022-06-13 [1] CRAN (R 4.2.0)
#>  R.oo          1.25.0  2022-06-12 [1] CRAN (R 4.2.0)
#>  R.utils       2.12.2  2022-11-11 [1] CRAN (R 4.2.0)
#>  R6            2.5.1   2021-08-19 [1] CRAN (R 4.2.0)
#>  readr       * 2.1.4   2023-02-10 [1] CRAN (R 4.2.2)
#>  reprex        2.0.2   2022-08-17 [1] RSPM (R 4.2.1)
#>  rlang         1.1.1   2023-04-28 [1] CRAN (R 4.2.0)
#>  rmarkdown     2.21    2023-03-26 [1] CRAN (R 4.2.0)
#>  rstudioapi    0.14    2022-08-22 [1] RSPM (R 4.2.1)
#>  scales        1.2.1   2022-08-20 [1] RSPM (R 4.2.1)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.2.0)
#>  stringi       1.7.12  2023-01-11 [1] CRAN (R 4.2.2)
#>  stringr     * 1.5.0   2022-12-02 [1] CRAN (R 4.2.0)
#>  styler        1.9.1   2023-03-04 [1] CRAN (R 4.2.0)
#>  tibble      * 3.2.1   2023-03-20 [1] CRAN (R 4.2.3)
#>  tidyr       * 1.3.0   2023-01-24 [1] CRAN (R 4.2.0)
#>  tidyselect    1.2.0   2022-10-10 [1] CRAN (R 4.2.0)
#>  tidyverse   * 2.0.0   2023-02-22 [1] CRAN (R 4.2.2)
#>  timechange    0.2.0   2023-01-11 [1] CRAN (R 4.2.2)
#>  tzdb          0.3.0   2022-03-28 [1] CRAN (R 4.2.0)
#>  utf8          1.2.3   2023-01-31 [1] CRAN (R 4.2.2)
#>  vctrs         0.6.2   2023-04-19 [1] CRAN (R 4.2.0)
#>  withr         2.5.0   2022-03-03 [1] CRAN (R 4.2.0)
#>  xfun          0.39    2023-04-20 [1] CRAN (R 4.2.0)
#>  yaml          2.3.7   2023-01-23 [1] CRAN (R 4.2.0)
#> 
#>  [1] /Users/timothychisamore/Library/R/x86_64/4.2/library
#>  [2] /Library/Frameworks/R.framework/Versions/4.2/Resources/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

All the best,

Tim

Thank you, Tim! Could you elaborate a bit more regarding this line of code: \(x) as.integer(na_if(x, "NA"))

It seems that it will apply NA only to variable ages if the value is β€œNA”, otherwise it will change the class to integer? However, for the variable dates, there is no code to apply NA if the date value is β€œNA”?

1 Like

Hi Ian,

You will notice that str_split() returns missing values as characters, i.e., as β€œNA” and not as a true NA value. As a result, as.integer() throws a warning as it cannot coerce them appropriately. The above step converts the character β€œNA” values into true NA values to avoid the warning being thrown. You could do the same for the date field as well, however, I didn’t bother since there were no β€œNA” values there.

All the best,

Tim

1 Like

Hi Tim,

Thank you for the explanation. It seems that this process requires the creation and retention of new variables (dates, ages, max_date, and max_age). I tried to directly overwrite the variables date_abstracted and age_years_mother with the following code:

demo_roll |>
    mutate(date_abstracted = map(str_split(date_abstracted, "; "), ymd),
                 age_years_mother = map(
                    str_split(age_years_mother, "; "),
                    \(x) as.integer(na_if(x, "NA")))
                 ) |>
    rowwise() |>
    mutate(date_abstracted = if_else(all(is.na(date_abstracted)), NA_Date_, max(date_abstracted, na.rm = TRUE)),
                 age_years_mother = if_else(all(is.na(age_years_mother)), NA_integer_, max(age_years_mother, na.rm = TRUE))) |>
    ungroup()

The process of creating new variables becomes unwieldy when applying the script to the full dataset which requires the deduplication of >100 variables. However, is there an advantage of creating new variables rather than directly overwriting?

Best,
Ian

1 Like

Hi Ian,

I think it’s a matter of preference - if you see no value in retaining the original variables then I don’t think there is any need to do so. However, I would encourage you to keep your raw data unaltered both as a separate object during your analysis and within a read only file.

All the best,

Tim

1 Like

Thank you for your insights, Tim. I see the value of retaining the original variables. What would the value be for retaining the interim variables of dates and ages aside from the final variables of max_date and max_age?

1 Like

Hi Ian,

Without knowing too much about your process, it would be hard to say whether there is any value in keeping the interim variables. If you were early in your analysis and doing more exploratory work or if you were validating the deduplicated data then I could see value in keeping the intermediate variables. However, if you are just producing the deduplicated data then I don’t see any need in keeping the intermediate variables.

All the best,

Tim

1 Like

Thank you, Tim. Your explanations are always helpful.

Ian

1 Like