Problem with generating age groups

iancgmd · July 26, 2022, 12:27pm

I have a dataset with age (in years). I tried to create age groupings with the following code which partially worked but did not put an age group for the last group (75-79) even though there was an observation with age = 78. I can’t figure out what’s wrong.

# quick summary of age

summary(linelist$age)
#Min. 1st Qu. Median Mean 3rd Qu. Max.
#0.00 7.00 13.00 17.79 23.00 78.00

# create 5-year age groups

linelist <- linelist %>%
  mutate(
    # Create categories
    age_grp5 = dplyr::case_when(
      age < 5               ~ "0-4",
      age >= 5 & age <= 9   ~ "5-9",
      age >= 10 & age <= 14 ~ "10-14",
      age >= 15 & age <= 19 ~ "15-19",
      age >= 20 & age <= 24 ~ "20-24",
      age >= 25 & age <= 29 ~ "25-29",
      age >= 30 & age <= 34 ~ "30-34",
      age >= 35 & age <= 39 ~ "35-39",
      age >= 40 & age <= 44 ~ "40-44",
      age >= 45 & age <= 49 ~ "45-49",
      age >= 50 & age <= 54 ~ "50-54",
      age >= 55 & age <= 59 ~ "55-59",
      age >= 60 & age <= 64 ~ "60-64",
      age >= 65 & age <= 69 ~ "65-69",
      age >= 70 & age <= 74 ~ "70-74",
      age >= 75 & age <= 79 ~ "75-59"
    ),
    # Convert to factor
    age_grp5 = factor(
      age_grp5,
      level = c("0-4", "5-9", "10-14", "15-19", "20-24", "25-29", "30-34",
                "35-39", "40-44", "45-49", "50-54", "55-59", "60-64",
                "65-69", "70-74", "75-79")
    )
  )

# see table of age_grp5

table(linelist$age_grp5)
#0-4 5-9 10-14 15-19 20-24 25-29 30-34 35-39 40-44 45-49 50-54 55-59 60-64 65-69 70-74 75-79
#34 47 36 28 12 11 4 8 10 3 4 2 5 1 0 0

neale · July 26, 2022, 9:58pm

Hi Ian, thanks for posting!

I would suggest using the age_categories() function from {epikit} instead - see this part of the Epi R Handbook. It also returns the result as class factor automatically, so no separate factor() step is needed.

linelist <- linelist %>%
   mutate(age_grp5 = age_categories(
      age, 
      lower = 0,
      upper = 80,
      by = 5))

As to why your code did not work as expected: there is a typo in your final case_when() line, which assigns the last age range the label "75-59". That label is not among the levels in your factor() call (where you wrote the correct "75-79"), so factor() turns those values into NA, and table() leaves NA out by default. That is why the 75-79 group shows a count of 0.
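To make that concrete, here is a minimal base R sketch of the same effect. The three values are made up for illustration and the level list is shortened; only the "75-59" / "75-79" mismatch is taken from the code above.

```r
# Toy values standing in for the case_when() output (hypothetical data)
age_grp5 <- c("0-4", "70-74", "75-59")   # the last label carries the typo

# Levels as in the factor() call, shortened to three for the example
f <- factor(age_grp5, levels = c("0-4", "70-74", "75-79"))

# "75-59" has no matching level, so that element becomes NA
f

# table() leaves NA out by default, so the 75-79 group looks empty
table(f)

# Ask for the NA count explicitly to see the "lost" observation
table(f, useNA = "ifany")

# List labels that were produced but are missing from the levels
setdiff(unique(age_grp5), levels(f))
```

Running something like the last line, or table(..., useNA = "ifany"), right after building a factor is a cheap way to catch a mismatch between labels and levels.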

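If you still want to use case_when() for a similar scenario, you can take advantage of the fact that, for each row in the data, it evaluates the criteria from top to bottom and uses the first one that is TRUE. Any age that reaches a given line has already failed the lines above it, so each line only needs the upper bound. This allows you to simplify the code: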

linelist <- linelist %>%
  mutate(
    # The first TRUE condition wins, so only the upper bound is needed
    age_grp5 = dplyr::case_when(
      age < 5   ~ "0-4",
      age <= 9  ~ "5-9",
      age <= 14 ~ "10-14",
      age <= 19 ~ "15-19",
      age <= 24 ~ "20-24",
      age <= 29 ~ "25-29",
      age <= 34 ~ "30-34",
      age <= 39 ~ "35-39",
      age <= 44 ~ "40-44",
      age <= 49 ~ "45-49",
      age <= 54 ~ "50-54",
      age <= 59 ~ "55-59",
      age <= 64 ~ "60-64",
      age <= 69 ~ "65-69",
      age <= 74 ~ "70-74",
      age <= 79 ~ "75-79"
    ),
    # Convert to factor; ages of 80 or more and missing ages stay NA
    age_grp5 = factor(
      age_grp5,
      levels = c("0-4", "5-9", "10-14", "15-19", "20-24", "25-29", "30-34",
                 "35-39", "40-44", "45-49", "50-54", "55-59", "60-64",
                 "65-69", "70-74", "75-79")
    )
  )
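If typing sixteen labels twice feels error-prone, one alternative (a different technique from case_when(), shown only as a sketch) is to build the labels programmatically and let base R's cut() do the grouping, so the labels and the factor levels cannot disagree. The age values and the helper names lower and labels below are made up for illustration.

```r
# Hypothetical ages for illustration
age <- c(0, 4, 5, 13, 74, 75, 78)

# Lower bound of each 5-year group: 0, 5, ..., 75
lower <- seq(0, 75, by = 5)

# Build "0-4", "5-9", ..., "75-79" from the bounds instead of typing them
labels <- paste(lower, lower + 4, sep = "-")

age_grp5 <- cut(
  age,
  breaks = c(lower, 80),  # 0, 5, ..., 75, 80
  right  = FALSE,         # intervals are [lower, upper), so age 5 goes to "5-9"
  labels = labels
)

# cut() already returns a factor with the levels in order
table(age_grp5)
```

Inside a pipeline the same call can go in mutate(age_grp5 = cut(age, ...)). As with the case_when() version, ages of 80 or more fall outside the breaks and come back as NA.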

iancgmd · July 27, 2022, 2:07am

Thank you, Neale. I looked over the code a couple of times and I really missed the simple typo. Though I’m still glad I posted it since I learned how to make the case_when() more efficient and I learned about the age_categories() function. Cheers!