Deduplicating without removing NA values
- I want to deduplicate a large dataset where around 90% of records have a national health number (NHI) and the remainder are NA.
- When I deduplicate, all but one of the NA rows are removed (they are treated as duplicates of each other), but I would like to keep all of these cases.
- This is a gonorrhoea dataset where people can receive multiple tests and multiple positive results.
- I have looked through the Applied Epi handbook and various Google results (Stack Exchange) without finding a solution.

    pacman::p_load(tidyverse)

    # Create dataframe
    azithromycin <- data.frame(
      stringsAsFactors = FALSE,
      NHIx = c("4jkd8dns84ndkjfa8", "3jkdssnfdk33n9nnjkfds", "3nnjfds89nnk32nnkda9",
               "3nkds9dnk2nkndsa89nk", "4nknnds9njknk3n8dnjanj3", "4njds84nj8fnj4nnjnfd",
               "4njds84nj8fnj4nnjnfd", NA, NA, NA),
      iMonth = c(4L, 6L, 6L, 5L, 4L, 4L, 4L, 4L, 4L, 4L),
      iYear = as.factor(c("2018", "2018", "2018", "2018", "2018", "2018",
                          "2018", "2018", "2018", "2018"))
    )

    # Recreate problem
    deduplicated <- azithromycin %>%
      distinct(NHIx) # this removes all except one row with NA but I want to keep all with NA

    # Recreate current deduplication approach
    only.nhi <- azithromycin %>%
      filter(!is.na(NHIx)) # extract those with NHI

    null.nhi <- azithromycin %>%
      filter(is.na(NHIx)) # extract those without NHI

    dedup.azith <- only.nhi %>%
      group_by(iYear, iMonth) %>%
      distinct(NHIx, .keep_all = TRUE) # remove duplicates with same year, month and NHI

    azithromycin.nhi <- rbind(dedup.azith, null.nhi) # add those with no NHI back in

- Thanks for any ideas!
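
One possible way to do the split, deduplicate and recombine steps in a single pipeline is sketched below. It assumes dplyr is installed and reuses the toy data from the post. The name `dedup.onestep` is hypothetical and used only for illustration. The idea is to group by year, month and NHI, then keep a row if its NHI is missing or if it is the first row in its group. This is a sketch under those assumptions, not necessarily the best approach for the real dataset.

```r
library(dplyr)

# Same toy data as in the post
azithromycin <- data.frame(
  stringsAsFactors = FALSE,
  NHIx = c("4jkd8dns84ndkjfa8", "3jkdssnfdk33n9nnjkfds", "3nnjfds89nnk32nnkda9",
           "3nkds9dnk2nkndsa89nk", "4nknnds9njknk3n8dnjanj3", "4njds84nj8fnj4nnjnfd",
           "4njds84nj8fnj4nnjnfd", NA, NA, NA),
  iMonth = c(4L, 6L, 6L, 5L, 4L, 4L, 4L, 4L, 4L, 4L),
  iYear = as.factor(rep("2018", 10))
)

# dedup.onestep is a hypothetical name for the result.
# Rows with a missing NHI are always kept; for all other rows only the
# first occurrence within each year/month/NHI combination is kept.
dedup.onestep <- azithromycin %>%
  group_by(iYear, iMonth, NHIx) %>%
  filter(is.na(NHIx) | row_number() == 1) %>%
  ungroup()
```

Because `filter()` keeps rows in their original order, the NA rows stay where they were rather than being appended at the end, which may or may not matter for later steps.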