Using str_detect() and regex() to identify specific phrases and not just parts of strings

Describe your issue

  • Hello again! Sorry for flooding the forum lately. I still have a lot to learn and this community has been the most helpful!
  • I’m trying to use a combination of str_detect() and regex(), based on the characters and strings chapter of the handbook, to identify certain diagnoses from a free-text field. The end product should be a new binary variable for each diagnosis.
  • The problem is with certain diagnosis abbreviations. For example, the pattern “as” for “aortic stenosis” also incorrectly matches “asd” (“atrial septal defect”), presumably because “asd” contains “as”.
  • How do I make “as” match only “as” and not “asd” or other strings that contain “as”?
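
A minimal sketch of the behaviour described above, assuming only that stringr is installed. The vector name diagnoses is made up for illustration. It suggests that str_detect() reports a match whenever the pattern occurs anywhere inside the string, which is why “asd” is counted as “as”:

```r
library(stringr)

# Hypothetical mini-sample of the free-text field
diagnoses <- c("as", "asd", "aortic stenosis, asd")

# TRUE whenever "as" occurs anywhere in the string, even inside "asd"
detected <- str_detect(diagnoses, "as")

# The matched piece is just the two letters "as", whatever surrounds them
extracted <- str_extract(diagnoses, "as")

print(data.frame(diagnoses, detected, extracted))
```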


Provide an example of your R code

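# load the packages that provide %>%, mutate(), case_when(), str_detect() and regex()
library(dplyr)
library(stringr)

# create df

df <- data.frame(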
stringsAsFactors = FALSE,
specify_other_chd = c("aortic stenosis", 
                      "as", 
                      "atrial septal defect", 
                      "asd",
                      "patent foramen ovale, aortic stenosis",
                      "pfo", 
                      "pfo, asd",
                      "aortic stenosis, asd",
                      "foramen ovale", 
                      "as, asd"))

# create new variables using str_detect and regex
df <- df %>% 
mutate(
  ssx_as = case_when(
              str_detect(specify_other_chd,        # string variable to search in
              regex("as|aortic stenosis",          # term search within variable
              ignore_case = TRUE))            ~ 1, # not case sensitive
              TRUE                            ~ 0  # all other terms not matching
                    ),
  ssx_asd = case_when(
                str_detect(specify_other_chd,                      # string variable to search in
                regex("asd|atrial septal defect", # term search within variable
                ignore_case = TRUE))            ~ 1,    # not case sensitive
                TRUE                            ~ 0    # all other terms not matching
                       ),
  ssx_pfo = case_when(
                str_detect(specify_other_chd,                      # string variable to search in
                regex("pfo|patent foramen ovale|foramen ovale", # term search within variable
                ignore_case = TRUE))            ~ 1,    # not case sensitive
                TRUE                            ~ 0    # all other terms not matching
                       )

  )


Hello @iancgmd

A useful approach is to add word-boundary rules to the pattern passed to regex().

Try this code:

df <- df %>%
        mutate(
                ssx_as = case_when(
                        str_detect(specify_other_chd, regex("\\b(as|aortic stenosis)\\b", ignore_case = TRUE)) ~ 1,
                        TRUE ~ 0),
                
                ssx_asd = case_when(
                        str_detect(specify_other_chd, regex("\\b(asd|atrial septal defect)\\b", ignore_case = TRUE)) ~ 1,
                        TRUE ~ 0),
                
                ssx_pfo = case_when(
                        str_detect(specify_other_chd, 
                                   regex("pfo|patent foramen ovale|foramen ovale", ignore_case = TRUE)) ~ 1,
                        TRUE ~ 0))

The \\b matches a word boundary, which ensures that the pattern “as” is matched only when it appears as a standalone word. The parentheses group the alternatives so that both \\b anchors apply to each of them.
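
As a rough check of how the word boundary and the grouping behave, here is a small sketch on a made-up vector x (not the original data), assuming stringr is installed:

```r
library(stringr)

# Hypothetical test strings, not taken from the original data
x <- c("as", "asd", "as, asd", "has", "AS")

# With \\b on both sides, "as" must stand alone as a word
bounded <- str_detect(x, regex("\\bas\\b", ignore_case = TRUE))

# Without parentheses each \\b binds to only one alternative:
# the pattern reads as "\\bas" OR "aortic stenosis\\b",
# so "asd" still matches through the first alternative
ungrouped <- str_detect(x, regex("\\bas|aortic stenosis\\b", ignore_case = TRUE))

print(data.frame(x, bounded, ungrouped))
# bounded:   TRUE FALSE TRUE FALSE TRUE
# ungrouped: TRUE TRUE  TRUE FALSE TRUE
```

The ungrouped column is the reason the pattern in the answer wraps the alternatives in parentheses before adding \\b on either side.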

These kinds of regex rules are not very intuitive. You can find more information in R for Data Science (2e) - 16 Regular expressions.

The output

                       specify_other_chd ssx_as ssx_asd ssx_pfo
1                        aortic stenosis      1       0       0
2                                     as      1       0       0
3                   atrial septal defect      0       1       0
4                                    asd      0       1       0
5  patent foramen ovale, aortic stenosis      1       0       1
6                                    pfo      0       0       1
7                               pfo, asd      0       1       1
8                   aortic stenosis, asd      1       1       0
9                          foramen ovale      0       0       1
10                               as, asd      1       1       0

Thank you! This worked perfectly. And thanks for pointing out additional resources as well!


I never get regexes right (ever!) - but this package is pretty nice for helping to put them together
