R code for data management

Thank you in advance for your help! I am a new user (R user) and I have a few questions regarding data management. I have our monthly report data in Excel, extracted from Calredie.
In the ‘Disease’ column, there are various disease names that need to be cleaned first and then be grouped into six different categories, such as STDs or foodborne diseases, for reporting six distinct bar charts for reporting purposes.
My questions are: 1. How can I rename certain disease names within the ‘Disease’ column without creating an additional column?
2. How can I create six different columns, possibly using the ‘mutate’ code, based on each category of disease?
I truly appreciate any tips you can provide!

1 Like

Hi @myekta, thanks for sharing your questions! Typically, we recommend providing a reproducible example with a sample of your dataset, but I believe I can assist you. For future questions, check the link How to Post an R Code Question.

Firstly, I’ve generated a “fake” dataset to illustrate potential solutions:

df <- data.frame(Disease = sample(c("Syphiliz", "Gonorrhea", "Influenza", "Salmonella", "Tuberculosis", "E. coli"), 20, replace = TRUE))

Concerning your initial question, I suggest utilizing the case_when function within a mutate. Consider the following example:

df2 <- df %>%
  mutate(Disease = case_when(
    Disease == "Syphiliz" ~ "Syphilis",
    Disease == "E. coli" ~ "Escherichia coli",
    TRUE ~ Disease

In this example, I am reassigning values to certain disease names in the ‘Disease’ column, and the TRUE ~ Disease ensures that everything else remains unchanged. The case_when function operates similarly to ifelse; it checks if a condition is TRUE or FALSE and assigns a new value accordingly.

Regarding your second question, I recommend using the ifelse function within a mutate. For instance:

df3 <- df2 %>%
  mutate(stds = ifelse(Disease %in% c("Gonorrhea", "Syphilis"), "yes", "no"),
         foodborne_diseases = ifelse(Disease %in% c("Salmonella", "Escherichia coli"), "yes", "no"))

You can repeat this process for other categories by creating new columns based on specific conditions. Thus, if certain values are found in Disease column, we fill the new column with “yes” otherwise, “no”.

I hope it helps. Let me know.



Thank you for your question. @lnielsen has provided a great solution. For some further context if needed.

  1. Renaming a column within the same column

Some programmes like SPSS have seperate commands to "recode into same variable " and “recode into different variable”. The tidyverse however covers both these SPSS functions with a single command mutate . If you want to create a new column simply give the variable on the left a name that does not exist in the datset. If you want to change an existing column then you need to use the name of an existing column.

So in the example given by @lnielsen if you were to say


df2 <- df %>%
  mutate(new_disease = case_when(
    Disease == "Syphiliz" ~ "Syphilis",
    Disease == "E. coli" ~ "Escherichia coli",
    TRUE ~ Disease

that would create a new column called new_disease . Notice how he has used Disease instead of new_disease. Since there is a existing column called Disease it renames inside the same column.

  1. @lnielsen has provided a great approach. In this he creates 2 new columns stds and foodborne_diseases. The stds column will fill out “Yes” if the value in the Disease column is “Gonorrhea or Syphillis” and “No” if it is any other value. Similarly the foodborne_disease column will fill out “yes” for Salmonella and E.Coli and “No” for any other value. This should allow you to count the number of STDs or Foodborne Illnesses.

Hope this helps

1 Like

@myekta please let us know if any of the above answers helped you, or not.
Thank you.

Thank you so much Lucca for your response!

Thank you Deepak!

Thank you for the follow-up! I sent feedback via email, but it appears that it hasn’t been received.