Recoding variables by one word or phrase

Hello,

The following sample dataset has a column, Serotype, that contains Salmonella serotypes. However, some of these serotypes are written in different ways.

For example, Salmonella Serotype Montevideo has several categories for the same serotype: Montevideo, Salmonella Group C Mont, Salmonella montevideo, Salmonella Serotype Mon.

I was wondering if anyone knows how to identify specific words or phrases in a string variable and recode the matching values to one category. For example, how would I recode all cells in this column that contain “Mont” or “Montevideo” to Salmonella Montevideo?

Thanks in advance!

Best,
Wilma
Sero_Sample.txt (958 Bytes)

hey wilma - the “detect within logic” section of the epirhandbook, which uses the stringr package, might help

Hi Alex,

I gave the chapter a read-through, and I'm not sure I explained myself correctly in my initial post.

I want to recode the following serotypes as “Salmonella Montevideo” by detecting a fragment of the cell's text, such as “Mont”: Montevideo, Salmonella Group C Mont, Salmonella montevideo, Salmonella Serotype Mon.

Please let me know which part of the chapter provides guidance on how to do this.

Thanks,
Wilma

hey wilma - see example below. You could also first convert the Serotype variable to lower case using str_to_lower(); that way you would need fewer search variations

library(rio)
library(dplyr)
library(stringr)

sero_sample <- rio::import("C:/Users/spina/Downloads/Sero_Sample.txt")

# using your dataset 
sero_sample %>% 
  mutate(
    ## create a new variable (you could also just overwrite the original)
    new_var = if_else(
      # if the serotype var contains mont or Mont or Montevideo
      str_detect(Serotype, "mont|Mont|Montevideo"), 
      # recode to Salmonella Montevideo
      "Salmonella Montevideo", 
      # otherwise leave as it was 
      Serotype
  ))
#>                                    Serotype n
#> 1                                  MBANDAKA 1
#> 2                                 Manhattan 1
#> 3                                  Mbandaka 1
#> 4                                Montevideo 1
#> 5                                  Muenchen 1
#> 6                                  Muenster 1
#> 7             Salmonella Group C Montevideo 1
#> 8         Salmonella Group C1 Serotype Ohio 1
#> 9            Salmonella Group D Enteritidis 1
#> 10      Salmonella Group D Serotype Javiana 1
#> 11               Salmonella Group E1 London 1
#> 12 Salmonella Group EI Serotype Weltevreden 1
#> 13                 Salmonella I 4,5,12:i:-- 1
#> 14                Salmonella I 4,[5],12:i:- 1
#> 15                       Salmonella Javiana 1
#> 16                      Salmonella Mbandaka 1
#> 17                       Salmonella Newport 1
#> 18                   Salmonella Oranienburg 1
#> 19                         Salmonella Poona 1
#> 20                     Salmonella Saintpaul 1
#> 21            Salmonella Serotype 4,512:i:- 1
#> 22               Salmonella Serotype Cotham 1
#> 23          Salmonella Serotype Enteritidis 1
#> 24           Salmonella Serotype Montevideo 1
#> 25              Salmonella Serotype Newport 1
#> 26              Salmonella Serotype Reading 1
#> 27            Salmonella Serotype Saintpaul 1
#> 28   Salmonella Serotype Typhimurium var 5- 1
#> 29                       Salmonella Stanley 1
#> 30                   Salmonella TYPHIMURIUM 1
#> 31                      Salmonella enterica 1
#> 32                   Salmonella enteritidis 1
#> 33                      Salmonella infantis 1
#> 34                    Salmonella montevideo 1
#> 35                       Salmonella newport 1
#> 36                         Salmonella poona 1
#>                                     new_var
#> 1                                  MBANDAKA
#> 2                                 Manhattan
#> 3                                  Mbandaka
#> 4                     Salmonella Montevideo
#> 5                                  Muenchen
#> 6                                  Muenster
#> 7                     Salmonella Montevideo
#> 8         Salmonella Group C1 Serotype Ohio
#> 9            Salmonella Group D Enteritidis
#> 10      Salmonella Group D Serotype Javiana
#> 11               Salmonella Group E1 London
#> 12 Salmonella Group EI Serotype Weltevreden
#> 13                 Salmonella I 4,5,12:i:--
#> 14                Salmonella I 4,[5],12:i:-
#> 15                       Salmonella Javiana
#> 16                      Salmonella Mbandaka
#> 17                       Salmonella Newport
#> 18                   Salmonella Oranienburg
#> 19                         Salmonella Poona
#> 20                     Salmonella Saintpaul
#> 21            Salmonella Serotype 4,512:i:-
#> 22               Salmonella Serotype Cotham
#> 23          Salmonella Serotype Enteritidis
#> 24                    Salmonella Montevideo
#> 25              Salmonella Serotype Newport
#> 26              Salmonella Serotype Reading
#> 27            Salmonella Serotype Saintpaul
#> 28   Salmonella Serotype Typhimurium var 5-
#> 29                       Salmonella Stanley
#> 30                   Salmonella TYPHIMURIUM
#> 31                      Salmonella enterica
#> 32                   Salmonella enteritidis
#> 33                      Salmonella infantis
#> 34                    Salmonella Montevideo
#> 35                       Salmonella newport
#> 36                         Salmonella poona


## Alternative option to recode multiple different ones 
sero_sample %>% 
  mutate(
    new_var = case_when(
      ## if find montevideo recode to that 
      str_detect(Serotype, "mont|Mont|Montevideo") ~ "Salmonella Montevideo", 
      # if find mbandaka recode to that 
      str_detect(Serotype, "Mbandaka|MBANDAKA")   ~ "Salmonella Mbandaka", 
      # otherwise leave as original variable
      TRUE ~ Serotype
    )
  )
#>                                    Serotype n
#> 1                                  MBANDAKA 1
#> 2                                 Manhattan 1
#> 3                                  Mbandaka 1
#> 4                                Montevideo 1
#> 5                                  Muenchen 1
#> 6                                  Muenster 1
#> 7             Salmonella Group C Montevideo 1
#> 8         Salmonella Group C1 Serotype Ohio 1
#> 9            Salmonella Group D Enteritidis 1
#> 10      Salmonella Group D Serotype Javiana 1
#> 11               Salmonella Group E1 London 1
#> 12 Salmonella Group EI Serotype Weltevreden 1
#> 13                 Salmonella I 4,5,12:i:-- 1
#> 14                Salmonella I 4,[5],12:i:- 1
#> 15                       Salmonella Javiana 1
#> 16                      Salmonella Mbandaka 1
#> 17                       Salmonella Newport 1
#> 18                   Salmonella Oranienburg 1
#> 19                         Salmonella Poona 1
#> 20                     Salmonella Saintpaul 1
#> 21            Salmonella Serotype 4,512:i:- 1
#> 22               Salmonella Serotype Cotham 1
#> 23          Salmonella Serotype Enteritidis 1
#> 24           Salmonella Serotype Montevideo 1
#> 25              Salmonella Serotype Newport 1
#> 26              Salmonella Serotype Reading 1
#> 27            Salmonella Serotype Saintpaul 1
#> 28   Salmonella Serotype Typhimurium var 5- 1
#> 29                       Salmonella Stanley 1
#> 30                   Salmonella TYPHIMURIUM 1
#> 31                      Salmonella enterica 1
#> 32                   Salmonella enteritidis 1
#> 33                      Salmonella infantis 1
#> 34                    Salmonella montevideo 1
#> 35                       Salmonella newport 1
#> 36                         Salmonella poona 1
#>                                     new_var
#> 1                       Salmonella Mbandaka
#> 2                                 Manhattan
#> 3                       Salmonella Mbandaka
#> 4                     Salmonella Montevideo
#> 5                                  Muenchen
#> 6                                  Muenster
#> 7                     Salmonella Montevideo
#> 8         Salmonella Group C1 Serotype Ohio
#> 9            Salmonella Group D Enteritidis
#> 10      Salmonella Group D Serotype Javiana
#> 11               Salmonella Group E1 London
#> 12 Salmonella Group EI Serotype Weltevreden
#> 13                 Salmonella I 4,5,12:i:--
#> 14                Salmonella I 4,[5],12:i:-
#> 15                       Salmonella Javiana
#> 16                      Salmonella Mbandaka
#> 17                       Salmonella Newport
#> 18                   Salmonella Oranienburg
#> 19                         Salmonella Poona
#> 20                     Salmonella Saintpaul
#> 21            Salmonella Serotype 4,512:i:-
#> 22               Salmonella Serotype Cotham
#> 23          Salmonella Serotype Enteritidis
#> 24                    Salmonella Montevideo
#> 25              Salmonella Serotype Newport
#> 26              Salmonella Serotype Reading
#> 27            Salmonella Serotype Saintpaul
#> 28   Salmonella Serotype Typhimurium var 5-
#> 29                       Salmonella Stanley
#> 30                   Salmonella TYPHIMURIUM
#> 31                      Salmonella enterica
#> 32                   Salmonella enteritidis
#> 33                      Salmonella infantis
#> 34                    Salmonella Montevideo
#> 35                       Salmonella newport
#> 36                         Salmonella poona

Created on 2023-04-07 with reprex v2.0.2
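
For reference, the lower-casing idea mentioned at the top of this reply could look something like the sketch below. It is only an illustration: sero_demo is a small made-up stand-in for the real file, and serotype_lower and sero_clean are hypothetical names, so adjust them to your own data.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for a few rows of the Serotype column
sero_demo <- data.frame(
  Serotype = c("Montevideo",
               "Salmonella Group C Montevideo",
               "Salmonella montevideo",
               "MBANDAKA",
               "Salmonella Newport"),
  stringsAsFactors = FALSE
)

sero_clean <- sero_demo %>%
  mutate(
    # lower-case once so each pattern needs only one spelling
    serotype_lower = str_to_lower(Serotype),
    new_var = case_when(
      str_detect(serotype_lower, "mont")     ~ "Salmonella Montevideo",
      str_detect(serotype_lower, "mbandaka") ~ "Salmonella Mbandaka",
      # otherwise keep the original spelling
      TRUE                                   ~ Serotype
    )
  )
```

A similar effect should be possible without the helper column by wrapping the pattern in stringr's regex(), e.g. str_detect(Serotype, regex("mont", ignore_case = TRUE)). Either way, a short fragment like "mont" will also match any other serotype that happens to contain those letters, so it is worth checking the recoded values against the originals.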

Thanks, Alex! This solved the problem! I was missing case_when in my code, which is why I kept getting errors when trying to run mutate and str_detect.

Thanks again!

Best,
Wilma
