Deduplicating without removing NA values
- I want to deduplicate a large dataset in which around 90% of records have a national health number (NHI) and the remainder are NA.
- When I deduplicate on NHI, all but one of the NA rows are removed (because the NA values are treated as duplicates of each other). However, I would like to keep all of these cases.
- This is a gonorrhoea dataset where people can receive multiple tests and multiple positive results.
What steps have you already taken to find an answer?
- I have looked through the Applied Epi handbook and various Google results (e.g. Stack Exchange).
Provide an example of your R code
pacman::p_load(tidyverse)
# Create dataframe
azithromycin <- data.frame(
  stringsAsFactors = FALSE,
  NHIx = c("4jkd8dns84ndkjfa8", "3jkdssnfdk33n9nnjkfds", "3nnjfds89nnk32nnkda9",
           "3nkds9dnk2nkndsa89nk", "4nknnds9njknk3n8dnjanj3", "4njds84nj8fnj4nnjnfd",
           "4njds84nj8fnj4nnjnfd", NA, NA, NA),
  iMonth = c(4L, 6L, 6L, 5L, 4L, 4L, 4L, 4L, 4L, 4L),
  iYear = as.factor(c("2018", "2018", "2018", "2018", "2018", "2018",
                      "2018", "2018", "2018", "2018"))
)
# Recreate the problem
# distinct() keeps only one of the NA rows, but I want to keep all rows where NHIx is NA
deduplicated <- azithromycin %>% distinct(NHIx)
# Recreate my current deduplication approach
only.nhi <- azithromycin %>% filter(!is.na(NHIx)) # extract rows with an NHI
null.nhi <- azithromycin %>% filter(is.na(NHIx))  # extract rows without an NHI
# Remove duplicates that share the same year, month and NHI
dedup.azith <- only.nhi %>%
  group_by(iYear, iMonth) %>%
  distinct(NHIx, .keep_all = TRUE) %>%
  ungroup()
# Add the rows with no NHI back in
azithromycin.nhi <- bind_rows(dedup.azith, null.nhi)
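For comparison, here is a rough sketch of what a single-step version of the split-and-recombine approach above might look like. It assumes that "duplicate" means the same year, month and NHI, and that every row with a missing NHI should be kept. The name `dedup_keep_na` is just a placeholder I made up, and I have not checked this against the full dataset.

```r
library(dplyr)

# Same example data as above
azithromycin <- data.frame(
  stringsAsFactors = FALSE,
  NHIx = c("4jkd8dns84ndkjfa8", "3jkdssnfdk33n9nnjkfds", "3nnjfds89nnk32nnkda9",
           "3nkds9dnk2nkndsa89nk", "4nknnds9njknk3n8dnjanj3", "4njds84nj8fnj4nnjnfd",
           "4njds84nj8fnj4nnjnfd", NA, NA, NA),
  iMonth = c(4L, 6L, 6L, 5L, 4L, 4L, 4L, 4L, 4L, 4L),
  iYear = as.factor(rep("2018", 10))
)

# Assumption: a duplicate is a repeated (iYear, iMonth, NHIx) combination.
# Within each such group keep only the first row, unless NHIx is NA,
# in which case keep every row.
dedup_keep_na <- azithromycin %>%
  group_by(iYear, iMonth, NHIx) %>%
  filter(is.na(NHIx) | row_number() == 1) %>%
  ungroup()

nrow(dedup_keep_na)              # number of rows kept
sum(is.na(dedup_keep_na$NHIx))   # number of NA rows kept
```

Unlike the split-and-recombine version, this should leave the rows in their original order, since `filter()` only drops rows rather than moving the NA rows to the end.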
Follow-up
- Thanks for any ideas!