Creating overlapping population pyramid from DHS across years, for comparison

berhe_tesfay · October 28, 2022, 8:54pm

Hello everyone, the {apyramid} have been quite good for plotting age pyramid plots for quick age-sex pyramid plots. We have a question from Cambodia cohort on how to plot population pyramid from DHS data across years with overlapping bars from the different years in one plot?
I have searched across forums and [plot - Population pyramid w projection in R - Stack Overflow] have similar examples to what it should look like except the projections are not needed.

Is there a possibility to do the same plots using the age_pyramid function {apyramid} package or any other simpler method? I have tried the stack_by argument, but it gives stacked plots instead.

age_pyramid(data = combined,
            age_group = "age_cat", 
            split_by = "gender",
            stack_by =year,
            proportional = TRUE,
            na.rm = TRUE)

thanks a lot for taking your time.

machupovirus · October 28, 2022, 9:56pm

Hello,

I am not overly familiar with the apyramid package but it appears it is not capable of doing what you are interested in.

However, you could totally write some code to do it using the ggplot2 package. Below is how I would approach the problem. Of course, you could generalize it but this should get you started.

library(tidyverse)

fake_data <- bind_rows(list(`2008` = apyramid::us_2008, `2018` = apyramid::us_2018),
                                             .id = "year")

label_abs <- function(x) {
    scales::label_comma()(abs(x))
}

fake_data |>
    group_by(year, age) |>
    summarize(count = sum(count), .groups = "drop") |>
    mutate(year = fct(x = year, levels = c("2008", "2018")),
                 negative_count = -count) |>
    ggplot(mapping = aes(y = age)) +
    geom_col(mapping = aes(x = count, fill = fct_rev(year))) +
    geom_col(mapping = aes(x = negative_count, fill = fct_rev(year))) +
    scale_x_continuous(breaks = scales::breaks_extended(),
                                         labels = label_abs) +
    scale_fill_brewer(type = "seq") +
    labs(x = "\nPopulation",
             y = "Age Group\n",
             fill = "Year") +
    guides(fill = guide_legend(reverse = TRUE)) +
    theme_minimal() +
    theme(legend.position = "bottom")

^{Created on 2022-10-28 with reprex v2.0.2}

Session info

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.1 (2022-06-23)
#>  os       macOS Big Sur ... 10.16
#>  system   x86_64, darwin17.0
#>  ui       X11
#>  language (EN)
#>  collate  en_CA.UTF-8
#>  ctype    en_CA.UTF-8
#>  tz       America/Toronto
#>  date     2022-10-28
#>  pandoc   2.18 @ /Applications/RStudio.app/Contents/MacOS/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package       * version date (UTC) lib source
#>  apyramid        0.1.2   2020-05-08 [1] CRAN (R 4.2.0)
#>  assertthat      0.2.1   2019-03-21 [1] CRAN (R 4.2.0)
#>  backports       1.4.1   2021-12-13 [1] CRAN (R 4.2.0)
#>  broom           1.0.1   2022-08-29 [1] RSPM (R 4.2.1)
#>  cellranger      1.1.0   2016-07-27 [1] CRAN (R 4.2.0)
#>  cli             3.4.1   2022-09-23 [1] CRAN (R 4.2.0)
#>  colorspace      2.0-3   2022-02-21 [1] CRAN (R 4.2.0)
#>  crayon          1.5.2   2022-09-29 [1] CRAN (R 4.2.0)
#>  curl            4.3.3   2022-10-06 [1] CRAN (R 4.2.0)
#>  DBI             1.1.3   2022-06-18 [1] CRAN (R 4.2.0)
#>  dbplyr          2.2.1   2022-06-27 [1] CRAN (R 4.2.1)
#>  digest          0.6.30  2022-10-18 [1] CRAN (R 4.2.0)
#>  dplyr         * 1.0.10  2022-09-01 [1] RSPM (R 4.2.1)
#>  ellipsis        0.3.2   2021-04-29 [1] CRAN (R 4.2.0)
#>  evaluate        0.17    2022-10-07 [1] CRAN (R 4.2.1)
#>  fansi           1.0.3   2022-03-24 [1] CRAN (R 4.2.0)
#>  farver          2.1.1   2022-07-06 [1] CRAN (R 4.2.0)
#>  fastmap         1.1.0   2021-01-25 [1] CRAN (R 4.2.0)
#>  forcats       * 0.5.2   2022-08-19 [1] RSPM (R 4.2.1)
#>  fs              1.5.2   2021-12-08 [1] CRAN (R 4.2.0)
#>  gargle          1.2.1   2022-09-08 [1] RSPM (R 4.2.1)
#>  generics        0.1.3   2022-07-05 [1] CRAN (R 4.2.0)
#>  ggplot2       * 3.3.6   2022-05-03 [1] CRAN (R 4.2.0)
#>  glue            1.6.2   2022-02-24 [1] CRAN (R 4.2.0)
#>  googledrive     2.0.0   2021-07-08 [1] CRAN (R 4.2.0)
#>  googlesheets4   1.0.1   2022-08-13 [1] CRAN (R 4.2.1)
#>  gtable          0.3.1   2022-09-01 [1] RSPM (R 4.2.1)
#>  haven           2.5.1   2022-08-22 [1] RSPM (R 4.2.1)
#>  highr           0.9     2021-04-16 [1] CRAN (R 4.2.0)
#>  hms             1.1.2   2022-08-19 [1] RSPM (R 4.2.1)
#>  htmltools       0.5.3   2022-07-18 [1] CRAN (R 4.2.0)
#>  httr            1.4.4   2022-08-17 [1] RSPM (R 4.2.1)
#>  jsonlite        1.8.3   2022-10-21 [1] CRAN (R 4.2.1)
#>  knitr           1.40    2022-08-24 [1] RSPM (R 4.2.1)
#>  labeling        0.4.2   2020-10-20 [1] CRAN (R 4.2.0)
#>  lifecycle       1.0.3   2022-10-07 [1] CRAN (R 4.2.1)
#>  lubridate       1.8.0   2021-10-07 [1] CRAN (R 4.2.0)
#>  magrittr        2.0.3   2022-03-30 [1] CRAN (R 4.2.0)
#>  mime            0.12    2021-09-28 [1] CRAN (R 4.2.0)
#>  modelr          0.1.9   2022-08-19 [1] RSPM (R 4.2.1)
#>  munsell         0.5.0   2018-06-12 [1] CRAN (R 4.2.0)
#>  pillar          1.8.1   2022-08-19 [1] RSPM (R 4.2.1)
#>  pkgconfig       2.0.3   2019-09-22 [1] CRAN (R 4.2.0)
#>  purrr         * 0.3.5   2022-10-06 [1] CRAN (R 4.2.0)
#>  R.cache         0.16.0  2022-07-21 [1] CRAN (R 4.2.0)
#>  R.methodsS3     1.8.2   2022-06-13 [1] CRAN (R 4.2.0)
#>  R.oo            1.25.0  2022-06-12 [1] CRAN (R 4.2.0)
#>  R.utils         2.12.0  2022-06-28 [1] CRAN (R 4.2.1)
#>  R6              2.5.1   2021-08-19 [1] CRAN (R 4.2.0)
#>  RColorBrewer    1.1-3   2022-04-03 [1] CRAN (R 4.2.0)
#>  readr         * 2.1.3   2022-10-01 [1] CRAN (R 4.2.0)
#>  readxl          1.4.1   2022-08-17 [1] RSPM (R 4.2.1)
#>  reprex          2.0.2   2022-08-17 [1] RSPM (R 4.2.1)
#>  rlang           1.0.6   2022-09-24 [1] CRAN (R 4.2.0)
#>  rmarkdown       2.17    2022-10-07 [1] CRAN (R 4.2.1)
#>  rstudioapi      0.14    2022-08-22 [1] RSPM (R 4.2.1)
#>  rvest           1.0.3   2022-08-19 [1] RSPM (R 4.2.1)
#>  scales          1.2.1   2022-08-20 [1] RSPM (R 4.2.1)
#>  sessioninfo     1.2.2   2021-12-06 [1] CRAN (R 4.2.0)
#>  stringi         1.7.8   2022-07-11 [1] CRAN (R 4.2.1)
#>  stringr       * 1.4.1   2022-08-20 [1] RSPM (R 4.2.1)
#>  styler          1.8.0   2022-10-22 [1] CRAN (R 4.2.1)
#>  tibble        * 3.1.8   2022-07-22 [1] CRAN (R 4.2.1)
#>  tidyr         * 1.2.1   2022-09-08 [1] RSPM (R 4.2.1)
#>  tidyselect      1.2.0   2022-10-10 [1] CRAN (R 4.2.0)
#>  tidyverse     * 1.3.2   2022-07-18 [1] CRAN (R 4.2.0)
#>  tzdb            0.3.0   2022-03-28 [1] CRAN (R 4.2.0)
#>  utf8            1.2.2   2021-07-24 [1] CRAN (R 4.2.0)
#>  vctrs           0.5.0   2022-10-22 [1] CRAN (R 4.2.0)
#>  withr           2.5.0   2022-03-03 [1] CRAN (R 4.2.0)
#>  xfun            0.34    2022-10-18 [1] CRAN (R 4.2.0)
#>  xml2            1.3.3   2021-11-30 [1] CRAN (R 4.2.0)
#>  yaml            2.3.6   2022-10-18 [1] CRAN (R 4.2.0)
#> 
#>  [1] /Users/timothychisamore/Library/R/x86_64/4.2/library
#>  [2] /Library/Frameworks/R.framework/Versions/4.2/Resources/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

All the best,

Tim

aspina · November 5, 2022, 10:02am

hey berhe - the apyramid code you showed, should do what you have asked for.
We have an example in the vignette that stack_by a categorical var (down the bottom just before survey data), but i think we set a limit so if the number of levels >=2 then it would switch to a side-by-side bar. Maybe we need to reconsider this…

aspina · November 6, 2022, 3:43pm

sorry @berhe_tesfay I now see what you mean about the stacks.
The first dense bit of this code is just copying the data from stackoverflow. Code for the plot is from “ADAPT DATA” onwards.
Need to do a bit of fiddling to get the difference in counts by year - and then can pass it to {apyramid}.
Not entirely sure this is easier than just using {ggplot} as @machupovirus did …

library("dplyr") ## data wrangling
library("tidyr") ## data wrangling
library("wpp2015") ## example data from stackoverflow
library("apyramid") ## plot pyramid 
library("ggplot2")  ## edit pyramid (add steps)

############# GET DATA 
## this section is just copy pasted from 
# https://stackoverflow.com/questions/15816073/population-pyramid-w-projection-in-r
#load data in wpp2015
data(popF)
data(popM)
data(popFprojMed)
data(popMprojMed)

#combine past and future female population
df0 <- popF %>% 
  left_join(popFprojMed) %>%
  mutate(gender = "female")
#> Joining, by = c("country", "country_code", "age")

#combine past and future male population, add on female population
df1 <- popM %>% 
  left_join(popMprojMed) %>%
  mutate(gender = "male") %>%
  bind_rows(df0) %>%
  mutate(age = factor(age, levels = unique(age)))
#> Joining, by = c("country", "country_code", "age")

#stack data for ggplot, filter World population and required years
df2 <- df1 %>%
  gather(key = year, value = pop, -country, -country_code, -age, -gender) %>%
  mutate(pop = pop/1e03) %>%
  filter(country == "World", year %in% c(1950, 2000, 2050, 2100))

#add extra rows and numeric age variable for geom_step used for 2100
df2 <- df2 %>% 
  mutate(ageno = as.numeric(age) - 0.5)

df2 <- df2 %>%
  bind_rows(df2 %>% filter(year==2100, age=="100+") %>% mutate(ageno = ageno + 1)) 



####### ADAPT DATA 

## here we need to edit our counts so that just the difference is shown
## (i.e. the extra for each year)

## create a new data set
df3 <- df2 %>% 
  ## drop the prediction year 
  filter(year != 2100) %>% 
  ## sort data by appropriate groups
  arrange(age, gender, year) %>%
  ## calculate the difference between current and previous year
  mutate(new_pop = pop - dplyr::lag(pop)) %>% 
  ## replace first year with the original count
  mutate(new_pop = if_else(year == 1950, pop, new_pop))

## plot age pyramid as normal 
age_pyramid(df3, 
            age_group = age, 
            split_by = gender, 
            stack_by = year, 
            count = new_pop, 
            show_midpoint = FALSE) + 
  ## add in line for 2100 (this is copy pasted from stack overflow but just added in pop)
  ## note that this uses the original dataset
  #steps for 2100
  geom_step(data =  df2 %>% filter(gender == "female", year == 2100), 
            aes(x = ageno, y = pop), size = 1) +
  geom_step(data =  df2 %>% filter(gender == "male", year == 2100), 
            aes(x = ageno, y = -pop), size = 1)

^{Created on 2022-11-06 with reprex v2.0.2}

neale · November 9, 2022, 8:47pm

I am tagging @bsomethea because he was originally asking the question in the Cambodia cohort.

Somethea, @berhe_tesfay was investigating your question about stacked age pyramids. There are some solutions posted by the team. You can use this forum to post other questions as well.

Please see the instructions on how to post an R code question.
http://community.appliedepi.org/t/how-to-post-an-r-code-question/623/4

maworh · June 16, 2023, 7:31pm

I want to do an age pyramid splitting by gender for different countries in my dataset.

age pyramid

age_pyramid(data = combined,

age_group = “age_cat”, #note that the column must be enclosed in quotation marks (" ")

split_by = “gender”,

proportional = TRUE,

na.rm = TRUE)

Can I use split_by = “gender” and add another split_by =“country” at the same time? This has been my challenge.

This code does not work. I need to fix this urgently. Been on it for over a week and close to my deadline.

age_pyramid(data = combined,

        age_group = "age_cat", #note that the column must be enclosed in quotation marks (" ")
        split_by = "gender",
        proportional = TRUE,
        na.rm = TRUE)  +
 facet_wrap(~country)

Any idea will go a long way. Thanks

berhe_tesfay · June 17, 2023, 5:21pm

Hi @maworh, if you are looking to facet the pyramid plot, the following code should solve it for you. If you are looking to fill it by country just change the fill gender to the district which is multi level. after that the game is about

pyr_data <- combined %>% 
        # select the variables of interest (consider district is a country)
        select(age_cat, gender, district) %>% 
        
        # group  by the selecting variables 
        group_by(age_cat,gender, district) %>% 
        
        # create the grouped case count
        summarise(
                n = n()
        )

# use ggplot to plot the population pyramid 
ggplot(data = pyr_data,
       aes(x = age_cat,         # x-axis is age cat
           fill = gender,       # fill by gender 
           # create the left and right values
       y = ifelse(gender == "male", -n, n))) +   
        # fix identity to stat otherwaie will throw error
        geom_bar(stat = "identity") + 
        # use abs for labels
        scale_y_continuous(labels = abs,  
                           # change the negative value to the left to positive my multplying fixed interval
                           limits = max(payramid_data$n)*c(-1,1)) + 
        labs(x = "Gender of the cases (splitted)",
             y = "Age catgory of the cases",
             fill = "Gender") +
        # flip the bar chart
        coord_flip() + 
        # facet by district (in your case it will be country)
        facet_wrap(~district)

If you want to use country as a fill, just change the fill = gender in to fill = district in this cases and country in your scenario. I hope this is helpful.

mabelaworh · June 17, 2023, 10:25pm

Thanks for your help.
However, the age categories were not complete even after converting to a factor before trying the codes

Species$age_group ← factor(Species$age_group, levels = c(“0-2”, “3-12”, “13-64”, “65+”))

category “3-12” was missing.

How do I remove the “NA”?

This is the image I got

berhe_tesfay · June 18, 2023, 6:18am

@mabelaworh you can drop the NA values before you plot the pyramid. however in you case first you need to change the empty place holder to NA I guess, as it is not clear in your legend. to drop the NA value, you can use multiple ways including filter but the easiest see below.

# first create the summarised data set from the combined data 
pyr_data <- combined %>% 
        # first make sure NA values are properly leveled
        mutate(age_cat = na_if(age_cat, "")) %>% 
        # drop the missing values of NA
        drop_na(age_cat) %>% 
        # select the variables of interest (consider distict is a country)
        select(age_cat, gender, district) %>% 
        # group  by the selecting variables 
        group_by(age_cat,gender, district) %>% 
        # create the grouped case count
        summarise(
                n = n()
        )

mabelaworh · June 19, 2023, 6:00pm

Thanks Berhe for your help.

Got another challenge while trying to plot multiple bar charts with this code everything is fine.
ggplot(
data = regions,
mapping = aes(
x = species,
fill = Phenotype)) + scale_fill_manual(values = phenotype_colors) +
geom_bar() +

scale_y_continuous(breaks = seq(from = 0,
to = 10000,
by = 1000),
expand = c(0,0)) +

scale_x_discrete(expand = c(0,0))+
coord_flip()+
theme_classic()+
facet_wrap(~ Regions)+
labs(title = ) +
xlab(“WHO priority pathogens”) + ylab(“Number of isolates”)

But with this code some rows are deleted even though there are no missing data
ggplot(
data = regions2,
mapping = aes(
x = species, y = percentages,
fill = Phenotype)) +
geom_bar(stat=“identity”) +
scale_fill_manual(values = phenotype_colors) +
scale_y_continuous(labels = scales::percent, limits = c(0, 1)) +
scale_x_discrete(expand = c(0, 0)) +
coord_flip() +
facet_wrap(~ Regions) +
labs(
title = ,
x = “Species”,
y = “Percentage”)

I want the y axis in percentage and dont want any rows removed because there are no missing data. How do I fix my codes to make this work.

Thanks

neale · June 19, 2023, 9:23pm

Hi @mabelaworh,

If @berhe_tesfay was able to help you, can you mark his response as the “Solution” please?

Then you can ask your next question in a separate post. This way, it is easier for other people to find the helpful conversations later.

If possible, can you view this video about how to provide a “reproducible” code example with your question? It makes it much easier for us to help you.

At the least, it is nice to put you code in “back ticks” like this:

```
#paste your code here
data <- import(...)
```

This makes it easier for us to copy and paste it into R.

Happy to answer any questions you have about the above!
Neale

maworh · June 20, 2023, 6:54pm

SOLUTION.

Thanks. it worked