R dropping age categories when there are no values in them


Hello,

I’ve created the following age categories on a data set that contains death data:

# fragment of a dplyr pipeline: map each age category to a sort order
mutate(
    Order = case_when(
        age == "0 to 4 years" ~ 1,
        age == "05 to 9 years" ~ 2,
        age == "10 to 14 years" ~ 3,
        age == "15 to 19 years" ~ 4,
        age == "20 to 24 years" ~ 5,
        age == "25 to 34 years" ~ 6,
        age == "35 to 44 years" ~ 7,
        age == "45 to 54 years" ~ 8,
        age == "55 to 59 years" ~ 9,
        age == "60 to 64 years" ~ 10,
        age == "65 to 74 years" ~ 11,
        age == "75 to 84 years" ~ 12,
        age == "85 and older" ~ 13,
        age == "Under investigation" ~ 14
    )
)

I am calculating age-adjusted rates for opioid overdose deaths for these categories for the years 2015-2023. The problem is that when a category has no data in a particular year, that age category is dropped for that year. For example, in 2015 every age category has deaths, so all categories appear in the output. In other years, some age categories have no deaths, and R omits those categories instead of showing 0 or NA.
Can anyone suggest how to make all age categories appear even when they have no data? I need them because I want to add population data to the data frame for each category, but I get an error saying the number of rows doesn’t match the data I am trying to add, because some age categories were dropped for having no values.

Thank you


Hello,

I would use the complete function from tidyr for this task; see below:

# loading packages
library(tidyverse)

# simulating fake data
sim_data <- tibble(
    year = sample(
        x = 2015L:2023L,
        size = 100,
        replace = TRUE
    ),
    age_group = sample(
        x = c(
            "0 to 4 years",
            "05 to 9 years",
            "10 to 14 years",
            "15 to 19 years",
            "20 to 24 years",
            "25 to 34 years",
            "35 to 44 years",
            "45 to 54 years",
            "55 to 59 years",
            "60 to 64 years",
            "65 to 74 years",
            "75 to 84 years",
            "85 and older",
            "Under investigation"
        ),
        size = 100,
        replace = TRUE
    )
)

# aggregating the data and filling missing combinations with 0
agg_data <- sim_data |>
    count(year, age_group) |>
    complete(
        year = 2015L:2023L,
        age_group = c(
            "0 to 4 years",
            "05 to 9 years",
            "10 to 14 years",
            "15 to 19 years",
            "20 to 24 years",
            "25 to 34 years",
            "35 to 44 years",
            "45 to 54 years",
            "55 to 59 years",
            "60 to 64 years",
            "65 to 74 years",
            "75 to 84 years",
            "85 and older",
            "Under investigation"
        ),
        fill = list(n = 0)
    )

# check dimensions
n_distinct(2015L:2023L) * n_distinct(
    c(
        "0 to 4 years",
        "05 to 9 years",
        "10 to 14 years",
        "15 to 19 years",
        "20 to 24 years",
        "25 to 34 years",
        "35 to 44 years",
        "45 to 54 years",
        "55 to 59 years",
        "60 to 64 years",
        "65 to 74 years",
        "75 to 84 years",
        "85 and older",
        "Under investigation"
    )
) == nrow(agg_data)
#> [1] TRUE

Created on 2024-05-16 with reprex v2.1.0
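
As a possible follow-up to the population step mentioned in the question, here is a minimal sketch under stated assumptions. Once every year and age-group combination exists, population data can be attached with a join on the key columns instead of adding a column that has to match row for row. The `deaths` records, the `pop_data` table, and its placeholder `population` values are all invented for illustration and are not from the original data. The rate computed is a plain crude rate, not an age-adjusted one.

```r
library(dplyr)
library(tidyr)

# store the category labels once so they are not repeated
age_levels <- c(
    "0 to 4 years", "05 to 9 years", "10 to 14 years", "15 to 19 years",
    "20 to 24 years", "25 to 34 years", "35 to 44 years", "45 to 54 years",
    "55 to 59 years", "60 to 64 years", "65 to 74 years", "75 to 84 years",
    "85 and older", "Under investigation"
)

# made-up death records (hypothetical; only three rows)
deaths <- tibble(
    year = c(2015L, 2015L, 2023L),
    age_group = c("25 to 34 years", "25 to 34 years", "85 and older")
)

# count deaths, then add every missing year/age combination with n = 0
agg_data <- deaths |>
    count(year, age_group) |>
    complete(year = 2015L:2023L, age_group = age_levels, fill = list(n = 0L))

# hypothetical population table: one row per year and age group
pop_data <- expand_grid(year = 2015L:2023L, age_group = age_levels) |>
    mutate(population = 1000L) # placeholder values, not real populations

# join on the key columns instead of relying on row order or row counts
rates <- agg_data |>
    left_join(pop_data, by = c("year", "age_group")) |>
    mutate(rate_per_100k = n / population * 1e5)

nrow(rates) # 9 years x 14 categories
```

Joining by `year` and `age_group` means a missing or extra row shows up as an NA in `population` rather than as a length-mismatch error, which tends to be easier to diagnose.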

All the best,

Tim


Thank you, Tim! I’ll check it out.

Hi, I just wanted to loop back and say thank you for the help! I was able to use the complete function in my code to fix this problem. Thank you so much :sob:


Hello! I am an R learner so maybe this is a silly question. But why is a function that turns cells with missing values into zeroes (like complete()) better than a function that turns cells with missing values into NA (like na_if( , ""))? You can always exclude the NA values in later aggregations, and you don’t lose information on data completeness/missingness. I was just wondering!


Hello,

These functions do slightly different things. complete actually generates the missing combinations of variables in your data and assigns whatever value you’d like to them; it doesn’t have to be NA. na_if, by contrast, converts values that already exist in the data to NA.

See below for an example where complete allows you to create combinations of year and sex and fill the values with 0. na_if alone could not achieve this because the combinations do not exist in the data prior to using complete.

# loading packages
library(tidyverse)

# creating data
fake_data <- tibble(
    year = c(2016L, 2016L, 2018L, 2021L),
    sex = c("Male", "Female", "Male", "Male"),
    n = c(65L, 3L, 72L, 100L)
)

# creating missing combinations
fake_data |>
    complete(
        year = full_seq(c(2016L, 2021L), 1),
        sex = c("Female", "Male"),
        fill = list(n = 0)
    )
#> # A tibble: 12 × 3
#>     year sex        n
#>    <dbl> <chr>  <int>
#>  1  2016 Female     3
#>  2  2016 Male      65
#>  3  2017 Female     0
#>  4  2017 Male       0
#>  5  2018 Female     0
#>  6  2018 Male      72
#>  7  2019 Female     0
#>  8  2019 Male       0
#>  9  2020 Female     0
#> 10  2020 Male       0
#> 11  2021 Female     0
#> 12  2021 Male     100

Created on 2024-05-24 with reprex v2.1.0
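
To tie this back to the NA versus 0 question, here is a small sketch that reuses the `fake_data` values from above. Leaving out `fill` makes complete put NA in the rows it creates, so the missingness information is kept if that is preferred. na_if, in contrast, only changes values in rows that already exist. The `blank_notes` table and its `notes` column are invented for illustration.

```r
library(dplyr)
library(tidyr)

fake_data <- tibble(
    year = c(2016L, 2016L, 2018L, 2021L),
    sex = c("Male", "Female", "Male", "Male"),
    n = c(65L, 3L, 72L, 100L)
)

# without `fill`, the newly created combinations get NA instead of 0
with_na <- fake_data |>
    complete(year = 2016L:2021L, sex = c("Female", "Male"))

nrow(with_na)          # 12 rows: 6 years x 2 sexes
sum(is.na(with_na$n))  # 8 combinations were not in the original data

# na_if() replaces a value in existing cells; it never adds rows
blank_notes <- tibble(id = 1:3, notes = c("ok", "", "ok")) |>
    mutate(notes = na_if(notes, ""))

nrow(blank_notes)      # still 3 rows
```

Whether 0 or NA is the right filler depends on the data: for counts of events, an absent combination usually does mean zero events, while for measurements it more often means "not observed".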

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.3.3 (2024-02-29)
#>  os       macOS Sonoma 14.4.1
#>  system   x86_64, darwin20
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       America/Toronto
#>  date     2024-05-24
#>  pandoc   3.1.11 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/x86_64/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  cli           3.6.2   2023-12-11 [1] CRAN (R 4.3.0)
#>  colorspace    2.1-0   2023-01-23 [1] CRAN (R 4.3.0)
#>  digest        0.6.35  2024-03-11 [1] RSPM (R 4.3.0)
#>  dplyr       * 1.1.4   2023-11-17 [1] CRAN (R 4.3.0)
#>  evaluate      0.23    2023-11-01 [1] CRAN (R 4.3.0)
#>  fansi         1.0.6   2023-12-08 [1] CRAN (R 4.3.0)
#>  fastmap       1.1.1   2023-02-24 [1] CRAN (R 4.3.0)
#>  forcats     * 1.0.0   2023-01-29 [1] CRAN (R 4.3.0)
#>  fs            1.6.4   2024-04-25 [1] RSPM (R 4.3.0)
#>  generics      0.1.3   2022-07-05 [1] CRAN (R 4.3.0)
#>  ggplot2     * 3.5.1   2024-04-23 [1] RSPM (R 4.3.0)
#>  glue          1.7.0   2024-01-09 [1] RSPM (R 4.3.0)
#>  gtable        0.3.5   2024-04-22 [1] RSPM (R 4.3.0)
#>  hms           1.1.3   2023-03-21 [1] CRAN (R 4.3.0)
#>  htmltools     0.5.8.1 2024-04-04 [1] RSPM (R 4.3.0)
#>  knitr         1.46    2024-04-06 [1] RSPM (R 4.3.0)
#>  lifecycle     1.0.4   2023-11-07 [1] CRAN (R 4.3.0)
#>  lubridate   * 1.9.3   2023-09-27 [1] CRAN (R 4.3.0)
#>  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.3.0)
#>  munsell       0.5.1   2024-04-01 [1] RSPM (R 4.3.0)
#>  pillar        1.9.0   2023-03-22 [1] CRAN (R 4.3.0)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.3.0)
#>  purrr       * 1.0.2   2023-08-10 [1] CRAN (R 4.3.0)
#>  R.cache       0.16.0  2022-07-21 [1] CRAN (R 4.3.0)
#>  R.methodsS3   1.8.2   2022-06-13 [1] CRAN (R 4.3.0)
#>  R.oo          1.26.0  2024-01-24 [1] RSPM (R 4.3.0)
#>  R.utils       2.12.3  2023-11-18 [1] CRAN (R 4.3.0)
#>  R6            2.5.1   2021-08-19 [1] CRAN (R 4.3.0)
#>  readr       * 2.1.5   2024-01-10 [1] RSPM (R 4.3.0)
#>  reprex        2.1.0   2024-01-11 [1] RSPM (R 4.3.0)
#>  rlang         1.1.3   2024-01-10 [1] RSPM (R 4.3.0)
#>  rmarkdown     2.26    2024-03-05 [1] RSPM (R 4.3.0)
#>  rstudioapi    0.16.0  2024-03-24 [1] RSPM (R 4.3.0)
#>  scales        1.3.0   2023-11-28 [1] CRAN (R 4.3.0)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.3.0)
#>  stringi       1.8.3   2023-12-11 [1] CRAN (R 4.3.0)
#>  stringr     * 1.5.1   2023-11-14 [1] CRAN (R 4.3.0)
#>  styler        1.10.3  2024-04-07 [1] RSPM (R 4.3.0)
#>  tibble      * 3.2.1   2023-03-20 [1] CRAN (R 4.3.0)
#>  tidyr       * 1.3.1   2024-01-24 [1] RSPM (R 4.3.0)
#>  tidyselect    1.2.1   2024-03-11 [1] RSPM (R 4.3.0)
#>  tidyverse   * 2.0.0   2023-02-22 [1] CRAN (R 4.3.0)
#>  timechange    0.3.0   2024-01-18 [1] RSPM (R 4.3.0)
#>  tzdb          0.4.0   2023-05-12 [1] CRAN (R 4.3.0)
#>  utf8          1.2.4   2023-10-22 [1] CRAN (R 4.3.0)
#>  vctrs         0.6.5   2023-12-01 [1] CRAN (R 4.3.0)
#>  withr         3.0.0   2024-01-16 [1] RSPM (R 4.3.0)
#>  xfun          0.43    2024-03-25 [1] RSPM (R 4.3.0)
#>  yaml          2.3.8   2023-12-11 [1] CRAN (R 4.3.0)
#> 
#>  [1] /Users/timothychisamore/Library/R/x86_64/4.3/library
#>  [2] /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

All the best,

Tim


Wow, thank you. That was a clear demonstration!
