Hello,
I have an aggregate data set that contains the frequency of each combination of certain variables of interest. I would like to convert this aggregate data back to a linelist format, with one row per observation. Is there an easy way to do this with existing functions in the tidyverse packages?
Here is an example of what my data looks like:
library(tidyverse)
tibble::tribble(
~sex, ~age_group, ~n,
"female", "0-19", 0,
"female", "20-29", 4,
"male", "0-19", 1
)
#> # A tibble: 3 × 3
#> sex age_group n
#> <chr> <chr> <dbl>
#> 1 female 0-19 0
#> 2 female 20-29 4
#> 3 male 0-19 1
Created on 2022-04-21 by the reprex package (v2.0.1)
Regards,
John
Hello John,
I was recently in this exact scenario! As you suspected, there is already a function for this: uncount() in the tidyr package.
Here is some code demonstrating how it works to "unaggregate" aggregate data into linelist data:
#Loading tidyverse
library(tidyverse)
#Generating aggregate data
agg_data <- expand_grid(
sex = c("female", "male"),
age_group = c("0-19", "20-29", "30-39", "40-49", "50-59", "60+")
) |>
rowwise() |>
mutate(n = rpois(n = 1, lambda = 2)) |>
ungroup()
#Examining the aggregate data
agg_data |>
slice_head(n = 5)
#> # A tibble: 5 × 3
#> sex age_group n
#> <chr> <chr> <int>
#> 1 female 0-19 4
#> 2 female 20-29 1
#> 3 female 30-39 1
#> 4 female 40-49 2
#> 5 female 50-59 0
#Using the uncount function
linelist_data <- agg_data |>
uncount(weights = n)
#Examining the linelist data
linelist_data |>
slice_head(n = 5)
#> # A tibble: 5 × 2
#> sex age_group
#> <chr> <chr>
#> 1 female 0-19
#> 2 female 0-19
#> 3 female 0-19
#> 4 female 0-19
#> 5 female 20-29
Created on 2022-04-22 by the reprex package (v2.0.1)
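As a further sketch (not part of the reprex above), the same call can be applied to the table from the original question. It illustrates two things worth knowing: rows with a count of zero contribute no rows to the linelist, and counting the linelist again should recover the non-zero frequencies. The object names agg_example, linelist_example and recounted, and the case_id column, are made up for illustration.

```r
library(tidyr)
library(dplyr)
library(tibble)

# The aggregate table from the original question
agg_example <- tribble(
  ~sex,     ~age_group, ~n,
  "female", "0-19",     0,
  "female", "20-29",    4,
  "male",   "0-19",     1
)

# Repeat each row n times; the weights column is dropped by default
# (.remove = TRUE) and the zero-count row produces no rows at all
linelist_example <- agg_example |>
  uncount(weights = n) |>
  # Hypothetical running identifier, one per linelist row
  mutate(case_id = row_number())

# Round trip: counting the linelist gives back the non-zero frequencies
recounted <- linelist_example |>
  count(sex, age_group)
```

If the zero-count combinations matter downstream, they would need to be added back separately, since the linelist itself carries no trace of them.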
Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.1.3 (2022-03-10)
#> os macOS Big Sur/Monterey 10.16
#> system x86_64, darwin17.0
#> ui X11
#> language (EN)
#> collate en_CA.UTF-8
#> ctype en_CA.UTF-8
#> tz America/Toronto
#> date 2022-04-22
#> pandoc 2.17.1.1 @ /Applications/RStudio.app/Contents/MacOS/quarto/bin/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.1.0)
#> backports 1.4.1 2021-12-13 [1] CRAN (R 4.1.0)
#> broom 0.8.0 2022-04-13 [1] CRAN (R 4.1.3)
#> cellranger 1.1.0 2016-07-27 [1] CRAN (R 4.1.0)
#> cli 3.2.0 2022-02-14 [1] RSPM (R 4.1.2)
#> colorspace 2.0-3 2022-02-21 [1] RSPM (R 4.1.2)
#> crayon 1.5.1 2022-03-26 [1] CRAN (R 4.1.3)
#> DBI 1.1.2 2021-12-20 [1] CRAN (R 4.1.1)
#> dbplyr 2.1.1 2021-04-06 [1] CRAN (R 4.1.0)
#> digest 0.6.29 2021-12-01 [1] CRAN (R 4.1.1)
#> dplyr * 1.0.8 2022-02-08 [1] RSPM (R 4.1.2)
#> ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.1.0)
#> evaluate 0.15 2022-02-18 [1] RSPM (R 4.1.2)
#> fansi 1.0.3 2022-03-24 [1] CRAN (R 4.1.3)
#> fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.1.0)
#> forcats * 0.5.1 2021-01-27 [1] CRAN (R 4.1.0)
#> fs 1.5.2 2021-12-08 [1] CRAN (R 4.1.1)
#> generics 0.1.2 2022-01-31 [1] RSPM (R 4.1.2)
#> ggplot2 * 3.3.5 2021-06-25 [1] CRAN (R 4.1.0)
#> glue 1.6.2 2022-02-24 [1] RSPM (R 4.1.2)
#> gtable 0.3.0 2019-03-25 [1] CRAN (R 4.1.0)
#> haven 2.5.0 2022-04-15 [1] CRAN (R 4.1.3)
#> highr 0.9 2021-04-16 [1] CRAN (R 4.1.0)
#> hms 1.1.1 2021-09-26 [1] CRAN (R 4.1.1)
#> htmltools 0.5.2 2021-08-25 [1] CRAN (R 4.1.0)
#> httr 1.4.2 2020-07-20 [1] CRAN (R 4.1.0)
#> jsonlite 1.8.0 2022-02-22 [1] RSPM (R 4.1.2)
#> knitr 1.38 2022-03-25 [1] CRAN (R 4.1.3)
#> lifecycle 1.0.1 2021-09-24 [1] CRAN (R 4.1.1)
#> lubridate 1.8.0 2021-10-07 [1] CRAN (R 4.1.1)
#> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.1.3)
#> modelr 0.1.8 2020-05-19 [1] CRAN (R 4.1.0)
#> munsell 0.5.0 2018-06-12 [1] CRAN (R 4.1.0)
#> pillar 1.7.0 2022-02-01 [1] RSPM (R 4.1.2)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.1.0)
#> purrr * 0.3.4 2020-04-17 [1] CRAN (R 4.1.0)
#> R.cache 0.15.0 2021-04-30 [1] CRAN (R 4.1.0)
#> R.methodsS3 1.8.1 2020-08-26 [1] CRAN (R 4.1.0)
#> R.oo 1.24.0 2020-08-26 [1] CRAN (R 4.1.0)
#> R.utils 2.11.0 2021-09-26 [1] CRAN (R 4.1.1)
#> R6 2.5.1 2021-08-19 [1] CRAN (R 4.1.0)
#> readr * 2.1.2 2022-01-30 [1] RSPM (R 4.1.2)
#> readxl 1.4.0 2022-03-28 [1] CRAN (R 4.1.3)
#> reprex 2.0.1 2021-08-05 [1] CRAN (R 4.1.0)
#> rlang 1.0.2 2022-03-04 [1] CRAN (R 4.1.2)
#> rmarkdown 2.13 2022-03-10 [1] CRAN (R 4.1.2)
#> rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.1.0)
#> rvest 1.0.2 2021-10-16 [1] CRAN (R 4.1.1)
#> scales 1.2.0 2022-04-13 [1] CRAN (R 4.1.3)
#> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.1.1)
#> stringi 1.7.6 2021-11-29 [1] CRAN (R 4.1.1)
#> stringr * 1.4.0 2019-02-10 [1] CRAN (R 4.1.0)
#> styler 1.7.0 2022-03-13 [1] CRAN (R 4.1.2)
#> tibble * 3.1.6 2021-11-07 [1] CRAN (R 4.1.1)
#> tidyr * 1.2.0 2022-02-01 [1] RSPM (R 4.1.2)
#> tidyselect 1.1.2 2022-02-21 [1] RSPM (R 4.1.2)
#> tidyverse * 1.3.1 2021-04-15 [1] CRAN (R 4.1.0)
#> tzdb 0.3.0 2022-03-28 [1] CRAN (R 4.1.3)
#> utf8 1.2.2 2021-07-24 [1] CRAN (R 4.1.0)
#> vctrs 0.4.1 2022-04-13 [1] CRAN (R 4.1.3)
#> withr 2.5.0 2022-03-03 [1] RSPM (R 4.1.2)
#> xfun 0.30 2022-03-02 [1] RSPM (R 4.1.2)
#> xml2 1.3.3 2021-11-30 [1] CRAN (R 4.1.1)
#> yaml 2.3.5 2022-02-21 [1] RSPM (R 4.1.2)
#>
#> [1] /Users/timothychisamore/Library/R/x86_64/4.1/library
#> [2] /Library/Frameworks/R.framework/Versions/4.1/Resources/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
All the best,
Tim
This is such a simple solution, thank you for taking the time to share it with me!
Regards