Hello again! Sorry for flooding the forum lately. I still have a lot to learn and this community has been the most helpful!
I’m trying to use a combination of str_detect() and regex(), based on the Characters and strings chapter of the handbook, to identify certain diagnoses in a free-text field. The end product should be a new binary variable for each diagnosis.
The problem is with certain diagnosis abbreviations. For example, the pattern “as” (for “aortic stenosis”) also incorrectly matches “asd” (for “atrial septal defect”), presumably because “asd” contains “as”.
How do I make “as” match only “as”, and not “asd” or other strings that contain “as”?
Provide an example of your R code
# load packages (dplyr for %>%, mutate, case_when; stringr for str_detect, regex)
library(dplyr)
library(stringr)

# create df
df <- data.frame(
  stringsAsFactors = FALSE,
  specify_other_chd = c("aortic stenosis",
                        "as",
                        "atrial septal defect",
                        "asd",
                        "patent foramen ovale, aortic stenosis",
                        "pfo",
                        "pfo, asd",
                        "aortic stenosis, asd",
                        "foramen ovale",
                        "as, asd"))
# create new variables using str_detect and regex
df <- df %>%
  mutate(
    ssx_as = case_when(
      str_detect(specify_other_chd,                  # string variable to search in
                 regex("as|aortic stenosis",         # term search within variable
                       ignore_case = TRUE)) ~ 1,     # not case sensitive
      TRUE ~ 0                                       # all other terms not matching
    ),
    ssx_asd = case_when(
      str_detect(specify_other_chd,                  # string variable to search in
                 regex("asd|atrial septal defect",   # term search within variable
                       ignore_case = TRUE)) ~ 1,     # not case sensitive
      TRUE ~ 0                                       # all other terms not matching
    ),
    ssx_pfo = case_when(
      str_detect(specify_other_chd,                  # string variable to search in
                 regex("pfo|patent foramen ovale|foramen ovale", # term search within variable
                       ignore_case = TRUE)) ~ 1,     # not case sensitive
      TRUE ~ 0                                       # all other terms not matching
    )
  )
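
One possible way to keep “as” from matching inside “asd” is to wrap each pattern in word boundaries (`\\b` in an R string). This is only a minimal sketch, not a definitive answer. It assumes the abbreviations always appear as whole words separated by spaces or punctuation, it uses a shortened version of the example data, and it swaps case_when() for as.integer() purely for brevity.

```r
library(dplyr)
library(stringr)

# Shortened copy of the example data (assumption: same column name as above)
df <- data.frame(
  stringsAsFactors = FALSE,
  specify_other_chd = c("aortic stenosis",
                        "as",
                        "atrial septal defect",
                        "asd",
                        "pfo, asd",
                        "aortic stenosis, asd",
                        "as, asd"))

df <- df %>%
  mutate(
    # \\b marks a word boundary, so "as" is matched only as a whole word
    ssx_as = as.integer(str_detect(
      specify_other_chd,
      regex("\\b(as|aortic stenosis)\\b", ignore_case = TRUE))),
    ssx_asd = as.integer(str_detect(
      specify_other_chd,
      regex("\\b(asd|atrial septal defect)\\b", ignore_case = TRUE)))
  )

print(df)
```

With this pattern, the rows “asd” and “pfo, asd” should get ssx_as = 0, while “as, asd” gets 1 for both variables. One difference from the case_when() version: a missing value in specify_other_chd stays NA here instead of becoming 0.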