Deduplicating but keeping all NA

Deduplicating but not removing NA

  • I want to deduplicate a large dataset in which around 90% of records have a unique national health number (NHI) and the remainder are NA.
  • When I deduplicate, all but one of the NA rows are removed (because the NA values are treated as duplicates of each other), but I would like to keep all of these cases.
  • This is a gonorrhoea dataset in which people can receive multiple tests and multiple positive results.

What steps have you already taken to find an answer?

  • I have tried the Applied Epi handbook and various Google searches (e.g. Stack Exchange).

Provide an example of your R code

pacman::p_load(tidyverse)

# Create dataframe
azithromycin <- data.frame(
  stringsAsFactors = FALSE,
  NHIx = c("4jkd8dns84ndkjfa8", "3jkdssnfdk33n9nnjkfds", "3nnjfds89nnk32nnkda9",
           "3nkds9dnk2nkndsa89nk", "4nknnds9njknk3n8dnjanj3", "4njds84nj8fnj4nnjnfd",
           "4njds84nj8fnj4nnjnfd", NA, NA, NA),
  iMonth = c(4L, 6L, 6L, 5L, 4L, 4L, 4L, 4L, 4L, 4L),
  iYear = as.factor(c("2018", "2018", "2018", "2018", "2018", "2018",
                      "2018", "2018", "2018", "2018"))
)
  
# Recreate problem: distinct() keeps only one of the NA rows (and only the NHIx column),
# but I want to keep every row where NHIx is NA
deduplicated <- azithromycin %>% distinct(NHIx)

# Recreate current deduplication approach
only.nhi <- azithromycin %>% filter(!is.na(NHIx))  # rows with an NHI
null.nhi <- azithromycin %>% filter(is.na(NHIx))   # rows without an NHI

# Remove duplicates with the same year, month and NHI
dedup.azith <- only.nhi %>%
  group_by(iYear, iMonth) %>%
  distinct(NHIx, .keep_all = TRUE) %>%
  ungroup()  # drop the grouping before recombining

# Add the rows with no NHI back in
azithromycin.nhi <- bind_rows(dedup.azith, null.nhi)

Follow-up

  • Thanks for any ideas!

Hi Callum,
What you have done in your code example is the best way to go: split the rows where is.na(NHIx) is TRUE from those where it is FALSE, deduplicate the ones where it is FALSE, and then recombine the two data frames.
The less-good alternative would be to fill the NA values with unique pseudo NHIx IDs, but I wouldn’t recommend that…!
Isaac
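
For anyone who prefers a single step, the split / deduplicate / recombine approach described above can also be sketched as one filter() call. This is only an illustration, not something from the original posts: the toy data and the name dedup_one_step are made up, and it assumes the same column names (NHIx, iMonth, iYear) as the question. Unlike the split approach, it leaves the rows in their original order.

```r
library(dplyr)

# Toy data shaped like the example in the question (values are invented)
azithromycin <- data.frame(
  NHIx   = c("a1", "b2", "c3", "c3", NA, NA, NA),
  iMonth = c(4L, 6L, 4L, 4L, 4L, 4L, 4L),
  iYear  = factor(rep("2018", 7)),
  stringsAsFactors = FALSE
)

# Keep every row with a missing NHIx, plus the first row of each
# iYear / iMonth / NHIx combination among the rows that have an NHIx
dedup_one_step <- azithromycin %>%
  filter(is.na(NHIx) | !duplicated(data.frame(iYear, iMonth, NHIx)))

nrow(dedup_one_step)  # 6: three distinct NHIx rows plus all three NA rows
```

The is.na(NHIx) condition is what protects the NA rows: duplicated() on its own would flag the second and third NA rows as repeats of the first.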


Ok, thanks Isaac,

I thought I was missing something, but maybe what I’ve done is just fine!

Thanks,
Callum
