Deduplicating one column based on values in a second column

Kate · May 23, 2022, 5:42pm

Hi All,
I have a deduplication problem. For deduplication we are using a combination of name and phone number, however phone number can be in one of two columns: “home phone” or “cell phone”
Sometimes the “home phone” value ends up in the “cell phone” column and vice versa.

The data looks like this:

name ← c(“Mike”, “Billy”, “Tom”, “Emilie”, “Geo”, “Monique”, “Emilie”, “Mike”,“Tom”)
home_phone ← c(“831-458-8050”, “654-493-7589”, “512-203-0917”,“631-485-8157”,“419-301-7861”,“920-337-5912”, “NA”, “831-458-8050”, “916-698-4843”)
cell_phone ← c(“NA”,“803-445-0053”,“NA”,“626-831-6981”, “405-364-7320”, “530-457-2565”, “631-485-8157”, “831-738-2701”,“512-203-0917”)

df ← data.frame(name, home_phone, cell_phone)

In the example above Mike is clearly a duplicate as he has the same value for home phone twice, but Emilie and Tom are also duplicates but they don’t look like it because the home phone value is in the cell phone column.

Any help with this would be appreciated.

neale · May 24, 2022, 10:19pm

Hi Kate,

Thanks for posting in the site! Here is one way to approach this, that I hope gets what you want:

pacman::p_load(tidyverse, janitor)

name <- c("Mike", "Billy", "Tom", "Emilie", "Geo", "Monique", "Emilie", "Mike","Tom")
home_phone <- c("831-458-8050", "654-493-7589", "512-203-0917","631-485-8157","419-301-7861","920-337-5912", "NA", "831-458-8050", "916-698-4843")
cell_phone <- c("NA","803-445-0053","NA", "626-831-6981", "405-364-7320", "530-457-2565", "631-485-8157", "831-738-2701","512-203-0917")
df <- data.frame(name, home_phone, cell_phone)

# names and home phones
home_phones <- df %>% 
     select(name, home_phone)

# names and cell phones
cell_phones <- df %>% 
     select(name, cell_phone)

# rows where name matches, and home phone matches cell phone
semi_join(home_phones, cell_phones, by = c("name", "home_phone" = "cell_phone"))
#>     name   home_phone
#> 1    Tom 512-203-0917
#> 2 Emilie 631-485-8157

^{Created on 2022-05-24 by the reprex package (v2.0.1)}

Also, a small tip for the future: If you put three backticks above and below your code in the post, it appear as a kind of code-text in the post (or highlight it and click the code icon next to the quote mark icon). As it was written in your post, I had to manually convert all the arrows and quote marks when I copied and pasted into R.

←
“Mike”

vs.

```
<-
"Mike"
```

Or use the {reprex} package as described here

machupovirus · May 24, 2022, 10:56pm

Hi @Kate,

In addition to a great answer by @neale, here is another possible solution. We look for records with a non-misssing home phone value that is in the cell phone values or a non-missing cell phone value that is in the home phone values. Additionally, we also handle the case where the same name and home phone value are used.

# loading packages
library(tidyverse)

# creating the data
df <- tibble(
    name = c("Mike", "Billy", "Tom", "Emilie", "Geo", "Monique", "Emilie", "Mike", "Tom"),
    home_phone = c("831-458-8050", "654-493-7589", "512-203-0917","631-485-8157","419-301-7861","920-337-5912", "NA", "831-458-8050", "916-698-4843"),
    cell_phone = c("NA", "803-445-0053", "NA", "626-831-6981", "405-364-7320", "530-457-2565", "631-485-8157", "831-738-2701", "512-203-0917")
)

# returning names where home_phone is in cell_phone or cell_phone is in
# home_phone
duplicate_names <- df |>
    filter((home_phone != "NA" & home_phone %in% df$cell_phone) |
                    (cell_phone != "NA" & cell_phone %in% df$home_phone)) |>
    pull(name) |>
    unique()

# returning all records in df that are duplicates
df |>
    add_count(name, home_phone) |>
    filter(name %in% duplicate_names | n > 1) |>
    select(-n)
#> # A tibble: 6 × 3
#>   name   home_phone   cell_phone  
#>   <chr>  <chr>        <chr>       
#> 1 Mike   831-458-8050 NA          
#> 2 Tom    512-203-0917 NA          
#> 3 Emilie 631-485-8157 626-831-6981
#> 4 Emilie NA           631-485-8157
#> 5 Mike   831-458-8050 831-738-2701
#> 6 Tom    916-698-4843 512-203-0917

^{Created on 2022-05-24 by the reprex package (v2.0.1)}

Session info

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.1.3 (2022-03-10)
#>  os       macOS Big Sur/Monterey 10.16
#>  system   x86_64, darwin17.0
#>  ui       X11
#>  language (EN)
#>  collate  en_CA.UTF-8
#>  ctype    en_CA.UTF-8
#>  tz       America/Toronto
#>  date     2022-05-24
#>  pandoc   2.17.1.1 @ /Applications/RStudio.app/Contents/MacOS/quarto/bin/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.1.0)
#>  backports     1.4.1   2021-12-13 [1] CRAN (R 4.1.0)
#>  broom         0.8.0   2022-04-13 [1] CRAN (R 4.1.3)
#>  cellranger    1.1.0   2016-07-27 [1] CRAN (R 4.1.0)
#>  cli           3.3.0   2022-04-25 [1] CRAN (R 4.1.3)
#>  colorspace    2.0-3   2022-02-21 [1] RSPM (R 4.1.2)
#>  crayon        1.5.1   2022-03-26 [1] CRAN (R 4.1.3)
#>  DBI           1.1.2   2021-12-20 [1] CRAN (R 4.1.1)
#>  dbplyr        2.1.1   2021-04-06 [1] CRAN (R 4.1.0)
#>  digest        0.6.29  2021-12-01 [1] CRAN (R 4.1.1)
#>  dplyr       * 1.0.9   2022-04-28 [1] CRAN (R 4.1.3)
#>  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.1.0)
#>  evaluate      0.15    2022-02-18 [1] RSPM (R 4.1.2)
#>  fansi         1.0.3   2022-03-24 [1] CRAN (R 4.1.3)
#>  fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.1.0)
#>  forcats     * 0.5.1   2021-01-27 [1] CRAN (R 4.1.0)
#>  fs            1.5.2   2021-12-08 [1] CRAN (R 4.1.1)
#>  generics      0.1.2   2022-01-31 [1] RSPM (R 4.1.2)
#>  ggplot2     * 3.3.6   2022-05-03 [1] CRAN (R 4.1.3)
#>  glue          1.6.2   2022-02-24 [1] RSPM (R 4.1.2)
#>  gtable        0.3.0   2019-03-25 [1] CRAN (R 4.1.0)
#>  haven         2.5.0   2022-04-15 [1] CRAN (R 4.1.3)
#>  highr         0.9     2021-04-16 [1] CRAN (R 4.1.0)
#>  hms           1.1.1   2021-09-26 [1] CRAN (R 4.1.1)
#>  htmltools     0.5.2   2021-08-25 [1] CRAN (R 4.1.0)
#>  httr          1.4.3   2022-05-04 [1] CRAN (R 4.1.2)
#>  jsonlite      1.8.0   2022-02-22 [1] RSPM (R 4.1.2)
#>  knitr         1.39    2022-04-26 [1] CRAN (R 4.1.3)
#>  lifecycle     1.0.1   2021-09-24 [1] CRAN (R 4.1.1)
#>  lubridate     1.8.0   2021-10-07 [1] CRAN (R 4.1.1)
#>  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.1.3)
#>  modelr        0.1.8   2020-05-19 [1] CRAN (R 4.1.0)
#>  munsell       0.5.0   2018-06-12 [1] CRAN (R 4.1.0)
#>  pillar        1.7.0   2022-02-01 [1] RSPM (R 4.1.2)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.1.0)
#>  purrr       * 0.3.4   2020-04-17 [1] CRAN (R 4.1.0)
#>  R.cache       0.15.0  2021-04-30 [1] CRAN (R 4.1.0)
#>  R.methodsS3   1.8.1   2020-08-26 [1] CRAN (R 4.1.0)
#>  R.oo          1.24.0  2020-08-26 [1] CRAN (R 4.1.0)
#>  R.utils       2.11.0  2021-09-26 [1] CRAN (R 4.1.1)
#>  R6            2.5.1   2021-08-19 [1] CRAN (R 4.1.0)
#>  readr       * 2.1.2   2022-01-30 [1] RSPM (R 4.1.2)
#>  readxl        1.4.0   2022-03-28 [1] CRAN (R 4.1.3)
#>  reprex        2.0.1   2021-08-05 [1] CRAN (R 4.1.0)
#>  rlang         1.0.2   2022-03-04 [1] CRAN (R 4.1.2)
#>  rmarkdown     2.14    2022-04-25 [1] CRAN (R 4.1.3)
#>  rstudioapi    0.13    2020-11-12 [1] CRAN (R 4.1.0)
#>  rvest         1.0.2   2021-10-16 [1] CRAN (R 4.1.1)
#>  scales        1.2.0   2022-04-13 [1] CRAN (R 4.1.3)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.1.1)
#>  stringi       1.7.6   2021-11-29 [1] CRAN (R 4.1.1)
#>  stringr     * 1.4.0   2019-02-10 [1] CRAN (R 4.1.0)
#>  styler        1.7.0   2022-03-13 [1] CRAN (R 4.1.2)
#>  tibble      * 3.1.7   2022-05-03 [1] CRAN (R 4.1.3)
#>  tidyr       * 1.2.0   2022-02-01 [1] RSPM (R 4.1.2)
#>  tidyselect    1.1.2   2022-02-21 [1] RSPM (R 4.1.2)
#>  tidyverse   * 1.3.1   2021-04-15 [1] CRAN (R 4.1.0)
#>  tzdb          0.3.0   2022-03-28 [1] CRAN (R 4.1.3)
#>  utf8          1.2.2   2021-07-24 [1] CRAN (R 4.1.0)
#>  vctrs         0.4.1   2022-04-13 [1] CRAN (R 4.1.3)
#>  withr         2.5.0   2022-03-03 [1] RSPM (R 4.1.2)
#>  xfun          0.31    2022-05-10 [1] CRAN (R 4.1.3)
#>  xml2          1.3.3   2021-11-30 [1] CRAN (R 4.1.1)
#>  yaml          2.3.5   2022-02-21 [1] RSPM (R 4.1.2)
#> 
#>  [1] /Users/timothychisamore/Library/R/x86_64/4.1/library
#>  [2] /Library/Frameworks/R.framework/Versions/4.1/Resources/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

All the best,

Tim