R dropping age categories when there are no values in them

dmelchor · May 16, 2024, 11:02pm

Hello,

I’ve created the following age categories on a data set that contains death data:

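# loading packages
library(tidyverse)

# `deaths` is a placeholder name for the data frame; the surrounding
# mutate() call is assumed from the closing parentheses of the snippet
deaths <- deaths |>
    mutate(
        # numeric sort order for each age category
        Order = case_when(
            age == "0 to 4 years" ~ 1,
            age == "05 to 9 years" ~ 2,
            age == "10 to 14 years" ~ 3,
            age == "15 to 19 years" ~ 4,
            age == "20 to 24 years" ~ 5,
            age == "25 to 34 years" ~ 6,
            age == "35 to 44 years" ~ 7,
            age == "45 to 54 years" ~ 8,
            age == "55 to 59 years" ~ 9,
            age == "60 to 64 years" ~ 10,
            age == "65 to 74 years" ~ 11,
            age == "75 to 84 years" ~ 12,
            age == "85 and older" ~ 13,
            age == "Under investigation" ~ 14
        )
    )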

I am calculating age-adjusted rates for opioid overdose deaths in these categories for the years 2015-2023. The problem is that when a category has no data in a given year, that age category is dropped for that year. For example, in 2015 every age category has deaths, so all categories appear in the output. In other years, some age categories have no deaths, and R omits those categories instead of showing a 0 or NA.
Can anyone suggest how to make all age categories appear even when they contain no data? I need them because I want to add population data to the data frame for each category, but I get an error saying the number of rows doesn’t match the data I am trying to add, since some age categories were dropped for having no values.
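
For illustration, here is a minimal sketch of the behaviour described above. The `deaths` data frame and its values are made up and are not the real data; the point is only that counting rows can never produce a row for a combination that does not occur.

```r
# loading packages
library(tidyverse)

# hypothetical death records, one row per death (made-up values)
deaths <- tibble(
    year = c(2015L, 2015L, 2016L),
    age = c("0 to 4 years", "85 and older", "85 and older")
)

# count() only returns year/age combinations that occur in the data,
# so 2016 gets no row for "0 to 4 years"
counts <- deaths |>
    count(year, age)

counts
```

Here 2015 has two rows and 2016 has only one, which is why attaching a population column of fixed length fails.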

Thank you

machupovirus · May 16, 2024, 11:59pm

Hello,

I would use the complete function for this task, see below:

# loading packages
library(tidyverse)

# simulating fake data
sim_data <- tibble(
    year = sample(
        x = 2015L:2023L,
        size = 100,
        replace = TRUE
    ),
    age_group = sample(
        x = c(
            "0 to 4 years",
            "05 to 9 years",
            "10 to 14 years",
            "15 to 19 years",
            "20 to 24 years",
            "25 to 34 years",
            "35 to 44 years",
            "45 to 54 years",
            "55 to 59 years",
            "60 to 64 years",
            "65 to 74 years",
            "75 to 84 years",
            "85 and older",
            "Under investigation"
        ),
        size = 100,
        replace = TRUE
    )
)

# aggregating the data and filling missing combinations with 0
agg_data <- sim_data |>
    count(year, age_group) |>
    complete(
        year = 2015L:2023L,
        age_group = c(
            "0 to 4 years",
            "05 to 9 years",
            "10 to 14 years",
            "15 to 19 years",
            "20 to 24 years",
            "25 to 34 years",
            "35 to 44 years",
            "45 to 54 years",
            "55 to 59 years",
            "60 to 64 years",
            "65 to 74 years",
            "75 to 84 years",
            "85 and older",
            "Under investigation"
        ),
        fill = list(n = 0)
    )

# check dimensions
n_distinct(2015L:2023L) * n_distinct(
    c(
        "0 to 4 years",
        "05 to 9 years",
        "10 to 14 years",
        "15 to 19 years",
        "20 to 24 years",
        "25 to 34 years",
        "35 to 44 years",
        "45 to 54 years",
        "55 to 59 years",
        "60 to 64 years",
        "65 to 74 years",
        "75 to 84 years",
        "85 and older",
        "Under investigation"
    )
) == nrow(agg_data)
#> [1] TRUE

Created on 2024-05-16 with reprex v2.1.0
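
Since the stated goal is to attach population data to each category, one possible follow-up is to join the population by key instead of adding it by position. This is only a sketch under assumptions: the `pop` table, its `population` column, and all numbers are hypothetical, and the rate computed is a simple crude rate per 100,000, not an age-adjusted one.

```r
# loading packages
library(tidyverse)

# hypothetical counts, completed so every year/age group combination exists
agg_small <- tibble(
    year = c(2015L, 2015L, 2016L),
    age_group = c("0 to 4 years", "85 and older", "85 and older"),
    n = c(2L, 5L, 4L)
) |>
    complete(
        year = 2015L:2016L,
        age_group = c("0 to 4 years", "85 and older"),
        fill = list(n = 0L)
    )

# hypothetical population denominators, one row per year and age group
pop <- tibble(
    year = c(2015L, 2015L, 2016L, 2016L),
    age_group = c(
        "0 to 4 years", "85 and older",
        "0 to 4 years", "85 and older"
    ),
    population = c(1000, 200, 1010, 210)
)

# matching on year and age group avoids relying on row order or row count
rates <- agg_small |>
    left_join(pop, by = c("year", "age_group")) |>
    mutate(crude_rate = n / population * 100000)
```

With the completed counts, every population row finds a match, and categories with no deaths get a rate of 0 instead of disappearing.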

All the best,

Tim

dmelchor · May 17, 2024, 4:35pm

Thank you, Tim! I’ll check it out.

Hi, I just wanted to loop back and say thank you for the help! I was able to use the complete function in my code to fix this problem. Thank you so much.

kirstin.huiber · May 24, 2024, 5:23pm

Hello! I am an R learner, so maybe this is a silly question. But why is a function that turns cells with missing values into zeroes (like complete()) better than a function that turns them into NA (like na_if( , ""))? You can always exclude the NA values in later aggregations, and you don’t lose information on data completeness/missingness. I was just wondering!

machupovirus · May 25, 2024, 12:40am

Hello,

These functions do slightly different things. complete actually generates the missing combinations of variables in your data and assigns whatever value you’d like to them; it doesn’t have to be NA. na_if, by contrast, only converts values that already exist in the data to NA.
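
As a small sketch of the na_if side of this, assuming a made-up tibble called `raw` in which an empty string marks a missing value:

```r
# loading packages
library(tidyverse)

# hypothetical data with "" standing in for a missing value
raw <- tibble(
    year = c(2016L, 2018L, 2021L),
    sex = c("Male", "", "Male")
)

# na_if() swaps values that are already present for NA;
# it never adds rows, so the result still has three rows
cleaned <- raw |>
    mutate(sex = na_if(sex, ""))
```

The row for 2018 now has NA for sex, but no rows for 2017, 2019, or 2020 have been created.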

See below for an example where complete allows you to create combinations of year and sex and fill the values with 0. na_if alone could not achieve this because the combinations do not exist in the data prior to using complete.

# loading packages
library(tidyverse)

# creating data
fake_data <- tibble(
    year = c(2016L, 2016L, 2018L, 2021L),
    sex = c("Male", "Female", "Male", "Male"),
    n = c(65L, 3L, 72L, 100L)
)

# creating missing combinations
fake_data |>
    complete(
        year = full_seq(c(2016L, 2021L), 1),
        sex = c("Female", "Male"),
        fill = list(n = 0)
    )
#> # A tibble: 12 × 3
#>     year sex        n
#>    <dbl> <chr>  <int>
#>  1  2016 Female     3
#>  2  2016 Male      65
#>  3  2017 Female     0
#>  4  2017 Male       0
#>  5  2018 Female     0
#>  6  2018 Male      72
#>  7  2019 Female     0
#>  8  2019 Male       0
#>  9  2020 Female     0
#> 10  2020 Male       0
#> 11  2021 Female     0
#> 12  2021 Male     100

Created on 2024-05-24 with reprex v2.1.0
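
On the point about keeping information on missingness: if the fill argument is left out, complete puts NA in the remaining columns of the rows it creates, so the choice between NA and 0 can be made later. A minimal sketch with the same fake data (the names `with_na` and `with_zero` are just for illustration):

```r
# loading packages
library(tidyverse)

# same fake data as above
fake_data <- tibble(
    year = c(2016L, 2016L, 2018L, 2021L),
    sex = c("Male", "Female", "Male", "Male"),
    n = c(65L, 3L, 72L, 100L)
)

# without `fill`, the newly created combinations get NA in n
# (12 rows in total, 8 of them with NA)
with_na <- fake_data |>
    complete(
        year = full_seq(c(2016L, 2021L), 1),
        sex = c("Female", "Male")
    )

# the NA values can still be turned into 0 later if the analysis needs it
with_zero <- with_na |>
    mutate(n = replace_na(n, 0L))
```

Whether 0 or NA is appropriate depends on the data: 0 fits when an absent row really means no events, while NA fits when it means the value is unknown.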

Session info

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.3.3 (2024-02-29)
#>  os       macOS Sonoma 14.4.1
#>  system   x86_64, darwin20
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       America/Toronto
#>  date     2024-05-24
#>  pandoc   3.1.11 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/x86_64/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  cli           3.6.2   2023-12-11 [1] CRAN (R 4.3.0)
#>  colorspace    2.1-0   2023-01-23 [1] CRAN (R 4.3.0)
#>  digest        0.6.35  2024-03-11 [1] RSPM (R 4.3.0)
#>  dplyr       * 1.1.4   2023-11-17 [1] CRAN (R 4.3.0)
#>  evaluate      0.23    2023-11-01 [1] CRAN (R 4.3.0)
#>  fansi         1.0.6   2023-12-08 [1] CRAN (R 4.3.0)
#>  fastmap       1.1.1   2023-02-24 [1] CRAN (R 4.3.0)
#>  forcats     * 1.0.0   2023-01-29 [1] CRAN (R 4.3.0)
#>  fs            1.6.4   2024-04-25 [1] RSPM (R 4.3.0)
#>  generics      0.1.3   2022-07-05 [1] CRAN (R 4.3.0)
#>  ggplot2     * 3.5.1   2024-04-23 [1] RSPM (R 4.3.0)
#>  glue          1.7.0   2024-01-09 [1] RSPM (R 4.3.0)
#>  gtable        0.3.5   2024-04-22 [1] RSPM (R 4.3.0)
#>  hms           1.1.3   2023-03-21 [1] CRAN (R 4.3.0)
#>  htmltools     0.5.8.1 2024-04-04 [1] RSPM (R 4.3.0)
#>  knitr         1.46    2024-04-06 [1] RSPM (R 4.3.0)
#>  lifecycle     1.0.4   2023-11-07 [1] CRAN (R 4.3.0)
#>  lubridate   * 1.9.3   2023-09-27 [1] CRAN (R 4.3.0)
#>  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.3.0)
#>  munsell       0.5.1   2024-04-01 [1] RSPM (R 4.3.0)
#>  pillar        1.9.0   2023-03-22 [1] CRAN (R 4.3.0)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.3.0)
#>  purrr       * 1.0.2   2023-08-10 [1] CRAN (R 4.3.0)
#>  R.cache       0.16.0  2022-07-21 [1] CRAN (R 4.3.0)
#>  R.methodsS3   1.8.2   2022-06-13 [1] CRAN (R 4.3.0)
#>  R.oo          1.26.0  2024-01-24 [1] RSPM (R 4.3.0)
#>  R.utils       2.12.3  2023-11-18 [1] CRAN (R 4.3.0)
#>  R6            2.5.1   2021-08-19 [1] CRAN (R 4.3.0)
#>  readr       * 2.1.5   2024-01-10 [1] RSPM (R 4.3.0)
#>  reprex        2.1.0   2024-01-11 [1] RSPM (R 4.3.0)
#>  rlang         1.1.3   2024-01-10 [1] RSPM (R 4.3.0)
#>  rmarkdown     2.26    2024-03-05 [1] RSPM (R 4.3.0)
#>  rstudioapi    0.16.0  2024-03-24 [1] RSPM (R 4.3.0)
#>  scales        1.3.0   2023-11-28 [1] CRAN (R 4.3.0)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.3.0)
#>  stringi       1.8.3   2023-12-11 [1] CRAN (R 4.3.0)
#>  stringr     * 1.5.1   2023-11-14 [1] CRAN (R 4.3.0)
#>  styler        1.10.3  2024-04-07 [1] RSPM (R 4.3.0)
#>  tibble      * 3.2.1   2023-03-20 [1] CRAN (R 4.3.0)
#>  tidyr       * 1.3.1   2024-01-24 [1] RSPM (R 4.3.0)
#>  tidyselect    1.2.1   2024-03-11 [1] RSPM (R 4.3.0)
#>  tidyverse   * 2.0.0   2023-02-22 [1] CRAN (R 4.3.0)
#>  timechange    0.3.0   2024-01-18 [1] RSPM (R 4.3.0)
#>  tzdb          0.4.0   2023-05-12 [1] CRAN (R 4.3.0)
#>  utf8          1.2.4   2023-10-22 [1] CRAN (R 4.3.0)
#>  vctrs         0.6.5   2023-12-01 [1] CRAN (R 4.3.0)
#>  withr         3.0.0   2024-01-16 [1] RSPM (R 4.3.0)
#>  xfun          0.43    2024-03-25 [1] RSPM (R 4.3.0)
#>  yaml          2.3.8   2023-12-11 [1] CRAN (R 4.3.0)
#> 
#>  [1] /Users/timothychisamore/Library/R/x86_64/4.3/library
#>  [2] /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

All the best,

Tim

kirstin.huiber · May 28, 2024, 6:30pm

Wow, thank you. That was a clear demonstration!