The following sample dataset has a column, Serotype, that contains Salmonella serotypes. However, some of these serotypes are written in several different ways.
For example, Salmonella serotype Montevideo appears under several categories: Montevideo, Salmonella Group C Mont, Salmonella montevideo, Salmonella Serotype Mon.
Does anyone know how to identify specific words or phrases in a string variable and recode the matching values to one category? For example, suppose I wanted to recode every cell in this column that contains the word “Mont” or “Montevideo” to Salmonella Montevideo. How would someone approach that code?
I gave the chapter a read-through, and I'm not sure I explained myself correctly in my initial post.
I want to recode the following serotypes as “Salmonella Montevideo” by detecting the word “Mont” in the cell: Montevideo, Salmonella Group C Mont, Salmonella montevideo, Salmonella Serotype Mon.
Please let me know which part of the chapter provides guidance on how to do this.
Hey Wilma, see the example below. You could also first convert the Serotype variable to lower case using str_to_lower(); that way you would need fewer search variations.
library(rio)
library(dplyr)
library(stringr)
sero_sample <- rio::import("C:/Users/spina/Downloads/Sero_Sample.txt")
# using your dataset
sero_sample %>%
  mutate(
    # create a new variable (you could also just overwrite the original)
    new_var = if_else(
      # if the Serotype variable contains mont, Mont or Montevideo
      str_detect(Serotype, "mont|Mont|Montevideo"),
      # recode to Salmonella Montevideo
      "Salmonella Montevideo",
      # otherwise leave as it was
      Serotype
    )
  )
#> Serotype n
#> 1 MBANDAKA 1
#> 2 Manhattan 1
#> 3 Mbandaka 1
#> 4 Montevideo 1
#> 5 Muenchen 1
#> 6 Muenster 1
#> 7 Salmonella Group C Montevideo 1
#> 8 Salmonella Group C1 Serotype Ohio 1
#> 9 Salmonella Group D Enteritidis 1
#> 10 Salmonella Group D Serotype Javiana 1
#> 11 Salmonella Group E1 London 1
#> 12 Salmonella Group EI Serotype Weltevreden 1
#> 13 Salmonella I 4,5,12:i:-- 1
#> 14 Salmonella I 4,[5],12:i:- 1
#> 15 Salmonella Javiana 1
#> 16 Salmonella Mbandaka 1
#> 17 Salmonella Newport 1
#> 18 Salmonella Oranienburg 1
#> 19 Salmonella Poona 1
#> 20 Salmonella Saintpaul 1
#> 21 Salmonella Serotype 4,512:i:- 1
#> 22 Salmonella Serotype Cotham 1
#> 23 Salmonella Serotype Enteritidis 1
#> 24 Salmonella Serotype Montevideo 1
#> 25 Salmonella Serotype Newport 1
#> 26 Salmonella Serotype Reading 1
#> 27 Salmonella Serotype Saintpaul 1
#> 28 Salmonella Serotype Typhimurium var 5- 1
#> 29 Salmonella Stanley 1
#> 30 Salmonella TYPHIMURIUM 1
#> 31 Salmonella enterica 1
#> 32 Salmonella enteritidis 1
#> 33 Salmonella infantis 1
#> 34 Salmonella montevideo 1
#> 35 Salmonella newport 1
#> 36 Salmonella poona 1
#> new_var
#> 1 MBANDAKA
#> 2 Manhattan
#> 3 Mbandaka
#> 4 Salmonella Montevideo
#> 5 Muenchen
#> 6 Muenster
#> 7 Salmonella Montevideo
#> 8 Salmonella Group C1 Serotype Ohio
#> 9 Salmonella Group D Enteritidis
#> 10 Salmonella Group D Serotype Javiana
#> 11 Salmonella Group E1 London
#> 12 Salmonella Group EI Serotype Weltevreden
#> 13 Salmonella I 4,5,12:i:--
#> 14 Salmonella I 4,[5],12:i:-
#> 15 Salmonella Javiana
#> 16 Salmonella Mbandaka
#> 17 Salmonella Newport
#> 18 Salmonella Oranienburg
#> 19 Salmonella Poona
#> 20 Salmonella Saintpaul
#> 21 Salmonella Serotype 4,512:i:-
#> 22 Salmonella Serotype Cotham
#> 23 Salmonella Serotype Enteritidis
#> 24 Salmonella Montevideo
#> 25 Salmonella Serotype Newport
#> 26 Salmonella Serotype Reading
#> 27 Salmonella Serotype Saintpaul
#> 28 Salmonella Serotype Typhimurium var 5-
#> 29 Salmonella Stanley
#> 30 Salmonella TYPHIMURIUM
#> 31 Salmonella enterica
#> 32 Salmonella enteritidis
#> 33 Salmonella infantis
#> 34 Salmonella Montevideo
#> 35 Salmonella newport
#> 36 Salmonella poona
## Alternative option to recode multiple different ones
sero_sample %>%
  mutate(
    new_var = case_when(
      # if Montevideo is found, recode to that
      str_detect(Serotype, "mont|Mont|Montevideo") ~ "Salmonella Montevideo",
      # if Mbandaka is found, recode to that
      str_detect(Serotype, "Mbandaka|MBANDAKA") ~ "Salmonella Mbandaka",
      # otherwise leave as the original value
      TRUE ~ Serotype
    )
  )
#> Serotype n
#> 1 MBANDAKA 1
#> 2 Manhattan 1
#> 3 Mbandaka 1
#> 4 Montevideo 1
#> 5 Muenchen 1
#> 6 Muenster 1
#> 7 Salmonella Group C Montevideo 1
#> 8 Salmonella Group C1 Serotype Ohio 1
#> 9 Salmonella Group D Enteritidis 1
#> 10 Salmonella Group D Serotype Javiana 1
#> 11 Salmonella Group E1 London 1
#> 12 Salmonella Group EI Serotype Weltevreden 1
#> 13 Salmonella I 4,5,12:i:-- 1
#> 14 Salmonella I 4,[5],12:i:- 1
#> 15 Salmonella Javiana 1
#> 16 Salmonella Mbandaka 1
#> 17 Salmonella Newport 1
#> 18 Salmonella Oranienburg 1
#> 19 Salmonella Poona 1
#> 20 Salmonella Saintpaul 1
#> 21 Salmonella Serotype 4,512:i:- 1
#> 22 Salmonella Serotype Cotham 1
#> 23 Salmonella Serotype Enteritidis 1
#> 24 Salmonella Serotype Montevideo 1
#> 25 Salmonella Serotype Newport 1
#> 26 Salmonella Serotype Reading 1
#> 27 Salmonella Serotype Saintpaul 1
#> 28 Salmonella Serotype Typhimurium var 5- 1
#> 29 Salmonella Stanley 1
#> 30 Salmonella TYPHIMURIUM 1
#> 31 Salmonella enterica 1
#> 32 Salmonella enteritidis 1
#> 33 Salmonella infantis 1
#> 34 Salmonella montevideo 1
#> 35 Salmonella newport 1
#> 36 Salmonella poona 1
#> new_var
#> 1 Salmonella Mbandaka
#> 2 Manhattan
#> 3 Salmonella Mbandaka
#> 4 Salmonella Montevideo
#> 5 Muenchen
#> 6 Muenster
#> 7 Salmonella Montevideo
#> 8 Salmonella Group C1 Serotype Ohio
#> 9 Salmonella Group D Enteritidis
#> 10 Salmonella Group D Serotype Javiana
#> 11 Salmonella Group E1 London
#> 12 Salmonella Group EI Serotype Weltevreden
#> 13 Salmonella I 4,5,12:i:--
#> 14 Salmonella I 4,[5],12:i:-
#> 15 Salmonella Javiana
#> 16 Salmonella Mbandaka
#> 17 Salmonella Newport
#> 18 Salmonella Oranienburg
#> 19 Salmonella Poona
#> 20 Salmonella Saintpaul
#> 21 Salmonella Serotype 4,512:i:-
#> 22 Salmonella Serotype Cotham
#> 23 Salmonella Serotype Enteritidis
#> 24 Salmonella Montevideo
#> 25 Salmonella Serotype Newport
#> 26 Salmonella Serotype Reading
#> 27 Salmonella Serotype Saintpaul
#> 28 Salmonella Serotype Typhimurium var 5-
#> 29 Salmonella Stanley
#> 30 Salmonella TYPHIMURIUM
#> 31 Salmonella enterica
#> 32 Salmonella enteritidis
#> 33 Salmonella infantis
#> 34 Salmonella Montevideo
#> 35 Salmonella newport
#> 36 Salmonella poona
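The str_to_lower() idea mentioned at the top of this reply could look roughly like the sketch below. It is only an illustration: sero_demo is a made-up stand-in built from a few values in the output above (not the real file), and serotype_lower is a hypothetical helper column used only for matching.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for sero_sample, using a few values from the output above
sero_demo <- data.frame(
  Serotype = c("Montevideo", "Salmonella montevideo", "MBANDAKA",
               "Salmonella Mbandaka", "Salmonella Newport"),
  stringsAsFactors = FALSE
)

sero_clean <- sero_demo %>%
  mutate(
    # lower-case copy used only for matching; the original column is untouched
    serotype_lower = str_to_lower(Serotype),
    new_var = case_when(
      # one lower-case pattern now covers mont, Mont, MONT, Montevideo, ...
      str_detect(serotype_lower, "mont")     ~ "Salmonella Montevideo",
      str_detect(serotype_lower, "mbandaka") ~ "Salmonella Mbandaka",
      # otherwise keep the original value
      TRUE                                   ~ Serotype
    )
  ) %>%
  # drop the helper column once the recode is done
  select(-serotype_lower)

sero_clean
```

If you would rather not create a helper column, str_detect(Serotype, regex("mont", ignore_case = TRUE)) should give the same matches. Either way, a short pattern like "mont" will match any value that contains those letters, so it is worth checking the recoded values with count(new_var) afterwards.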