Removing repetitions

lmonteiro · December 7, 2023, 11:06pm

Hi everyone. Is there a way to remove repetitions from variable names using dplyr? I imported the data from an ODK-based platform, and the resulting names contain unnecessary repetitions, for example us_us_visit and demo_demo_residence. How can I shorten us_us_visit to us_visit and demo_demo_residence to demo_residence? Thanks in advance.

lnielsen · December 8, 2023, 2:09pm

Hello @lmonteiro,

I don’t know of an existing function to address this specific issue, so I attempted to create a function using regular expressions.

Dealing with this problem can be challenging because the regular expression might need to be adapted based on the specific conditions for renaming your variables. I’ve provided a function (remove_repetitions) as a starting point. Please note that this solution may not cover all possible column name variations, so you may need to adjust it based on your specific data.

To learn more about regular expressions, I recommend checking out this chapter of the R for Data Science book or this chapter of the Epi R handbook.

Here’s a reproducible example that you can use to test the solution:

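library(dplyr)  # provides %>% and the renaming helpers used below

# Toy data frame with duplicated prefixes in the column names
df <- data.frame(
  us_us_visit = 1:5,
  demo_demo_residence = 6:10
)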

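# Function to remove repetitions from variable names.
# (.+) captures some text, and _\\1 matches that same text again after an
# underscore; the repeated pair is replaced by a single copy.
remove_repetitions <- function(name) {
  gsub("(.+)_\\1", "\\1", name)
}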

# Apply remove_repetitions to every column name
# (rename_with() supersedes rename_all() as of dplyr 1.0.0)
df <- df %>%
  rename_with(remove_repetitions)
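
One caveat worth illustrating: the pattern in remove_repetitions is not anchored, so it can also match a repeated fragment in the middle of a name (for example, data_age contains a_a and would become datage). Below is a minimal sketch of a stricter variant. It assumes the duplicated prefix always sits at the start of the name and is a whole underscore-delimited chunk; the function name remove_leading_repetition is hypothetical and only used for illustration.

```r
# Hypothetical stricter variant (an assumption, not part of the original answer):
# collapse a duplicated prefix only at the start of the name, and only when the
# second copy is followed by an underscore or the end of the name.
remove_leading_repetition <- function(name) {
  # ^(.+)   captures the prefix from the start of the name
  # _\\1    requires the same prefix again after an underscore
  # (_|$)   requires the second copy to end at an underscore or end of string
  sub("^(.+)_\\1(_|$)", "\\1\\2", name)
}

remove_leading_repetition(c("us_us_visit", "demo_demo_residence", "data_age"))
# "us_visit" "demo_residence" "data_age"
```

If your names follow that layout, this function could be passed to the renaming step in place of remove_repetitions. If repetitions can also occur in the middle of a name, the pattern would need further adjustment.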

Feel free to test this on your actual data, and let me know if it works for your specific case or if further adjustments are needed. I hope this helps you find a path to solve your variable renaming problem.

Lucca

lmonteiro · December 9, 2023, 12:33am

Hi @lnielsen, many thanks for your insights.