Need help with handling duplicate IDC records with conditional deletion in COVID database

I’m working with the COVID database from Module 10. While examining the data, I noticed that although there are no duplicate rows, there are duplicate IDC values. Looking at these duplicates, I see that some differ only in the city.

I need to apply the following logic:

If there are duplicate IDC values and they share the same CP (postal code), delete the record where the city is “Atlanta”.

If duplicate IDC values have different CP values, do not delete any records.

I’m providing a reproducible example below. Could you suggest an efficient approach in R/dplyr to handle this?

# reprex script  

# installar y cargar paquetes
pacman::p_load(
  rio, 
  here, 
  janitor,
  reprex,
  datapasta,
  tidyverse)

# importar datos
vig_bruta <- data.frame(
  IDC = c(NA, "a1", "b2","c3","d4", "b2","c3"),
  ciudad = c(NA, "Atlanta", "East Point", "College Park", "Union City", "Atlanta", "Atlanta"),
  cp = c(NA, "400", "401", "402", "403", "401", "402")
)
# limpiar el listado de vigilancia
vig <- vig_bruta %>% 
  clean_names()

Created on 2025-12-26 with reprex v2.1.1

Hey Jimena!

It’s very interesting that you found this in COVID case study from our Intro to R course :slight_smile: congratulations on going into this level of detail for cleaning your data!

You can do it cleanly with a grouped flag: only drop “Atlanta” rows when (a) the IDC appears more than once and (b) all duplicates share the same cp within that IDC.

Tidyverse approach that preserves everything else (including non-duplicates and duplicates with different cp):

pacman::p_load(tidyverse,janitor,reprex)

vig_bruta <- data.frame( IDC = c(NA, "a1", "b2","c3","d4", "b2","c3"), ciudad = c(NA, "Atlanta", "East Point", "College Park", "Union City", "Atlanta", "Atlanta"), cp = c(NA, "400", "401", "402", "403", "401", "402") )

vig <- vig_bruta %>%
  clean_names() %>%
  group_by(idc) %>%
  mutate(
    n_idc = n(),
    cp_unique = n_distinct(cp, na.rm = TRUE),
    drop_atlanta = n_idc > 1 & cp_unique == 1 & ciudad == "Atlanta"
  ) %>%
  ungroup() %>%
  filter(!drop_atlanta) %>%
  select(-n_idc, -cp_unique, -drop_atlanta)

vig
#> # A tibble: 5 × 3
#>   idc   ciudad       cp   
#>   <chr> <chr>        <chr>
#> 1 <NA>  <NA>         <NA> 
#> 2 a1    Atlanta      400  
#> 3 b2    East Point   401  
#> 4 c3    College Park 402  
#> 5 d4    Union City   403

Created on 2025-12-26 with reprex v2.1.1

  • n_idc > 1 ensures you only act on true duplicates of idc.
  • cp_unique == 1 enforces “same CP among duplicates”.
  • Only the “Atlanta” row is removed, and only under that condition.
  • If duplicates have different cp values (cp_unique > 1), nothing is deleted.

For your sample data, this will drop the c3 / Atlanta / 402 row (because c3 duplicates share CP=402), and keep the b2 rows (because b2 duplicates share CP=401 and one of them is Atlanta, so it will drop that Atlanta one as well).

Let me know if this is what you were looking for!

Best,

Luis

1 Like

Thank Luis for the code! It worked perfectly
Happy new year!

2 Likes