# Creating overlapping population pyramid from DHS across years, for comparison

Hello everyone, the {apyramid} have been quite good for plotting age pyramid plots for quick age-sex pyramid plots. We have a question from Cambodia cohort on how to plot population pyramid from DHS data across years with overlapping bars from the different years in one plot?
I have searched across forums and [plot - Population pyramid w projection in R - Stack Overflow] have similar examples to what it should look like except the projections are not needed.

Is there a possibility to do the same plots using the `age_pyramid` function {apyramid} package or any other simpler method? I have tried the stack_by argument, but it gives stacked plots instead.

``````age_pyramid(data = combined,
age_group = "age_cat",
split_by = "gender",
stack_by =year,
proportional = TRUE,
na.rm = TRUE)
``````

thanks a lot for taking your time.

2 Likes

Hello,

I am not overly familiar with the apyramid package but it appears it is not capable of doing what you are interested in.

However, you could totally write some code to do it using the ggplot2 package. Below is how I would approach the problem. Of course, you could generalize it but this should get you started.

``````library(tidyverse)

fake_data <- bind_rows(list(`2008` = apyramid::us_2008, `2018` = apyramid::us_2018),
.id = "year")

label_abs <- function(x) {
scales::label_comma()(abs(x))
}

fake_data |>
group_by(year, age) |>
summarize(count = sum(count), .groups = "drop") |>
mutate(year = fct(x = year, levels = c("2008", "2018")),
negative_count = -count) |>
ggplot(mapping = aes(y = age)) +
geom_col(mapping = aes(x = count, fill = fct_rev(year))) +
geom_col(mapping = aes(x = negative_count, fill = fct_rev(year))) +
scale_x_continuous(breaks = scales::breaks_extended(),
labels = label_abs) +
scale_fill_brewer(type = "seq") +
labs(x = "\nPopulation",
y = "Age Group\n",
fill = "Year") +
guides(fill = guide_legend(reverse = TRUE)) +
theme_minimal() +
theme(legend.position = "bottom")
``````

Created on 2022-10-28 with reprex v2.0.2

Session info
``````sessioninfo::session_info()
#> β Session info βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
#>  setting  value
#>  version  R version 4.2.1 (2022-06-23)
#>  os       macOS Big Sur ... 10.16
#>  system   x86_64, darwin17.0
#>  ui       X11
#>  language (EN)
#>  collate  en_CA.UTF-8
#>  ctype    en_CA.UTF-8
#>  tz       America/Toronto
#>  date     2022-10-28
#>  pandoc   2.18 @ /Applications/RStudio.app/Contents/MacOS/quarto/bin/tools/ (via rmarkdown)
#>
#> β Packages βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
#>  package       * version date (UTC) lib source
#>  apyramid        0.1.2   2020-05-08 [1] CRAN (R 4.2.0)
#>  assertthat      0.2.1   2019-03-21 [1] CRAN (R 4.2.0)
#>  backports       1.4.1   2021-12-13 [1] CRAN (R 4.2.0)
#>  broom           1.0.1   2022-08-29 [1] RSPM (R 4.2.1)
#>  cellranger      1.1.0   2016-07-27 [1] CRAN (R 4.2.0)
#>  cli             3.4.1   2022-09-23 [1] CRAN (R 4.2.0)
#>  colorspace      2.0-3   2022-02-21 [1] CRAN (R 4.2.0)
#>  crayon          1.5.2   2022-09-29 [1] CRAN (R 4.2.0)
#>  curl            4.3.3   2022-10-06 [1] CRAN (R 4.2.0)
#>  DBI             1.1.3   2022-06-18 [1] CRAN (R 4.2.0)
#>  dbplyr          2.2.1   2022-06-27 [1] CRAN (R 4.2.1)
#>  digest          0.6.30  2022-10-18 [1] CRAN (R 4.2.0)
#>  dplyr         * 1.0.10  2022-09-01 [1] RSPM (R 4.2.1)
#>  ellipsis        0.3.2   2021-04-29 [1] CRAN (R 4.2.0)
#>  evaluate        0.17    2022-10-07 [1] CRAN (R 4.2.1)
#>  fansi           1.0.3   2022-03-24 [1] CRAN (R 4.2.0)
#>  farver          2.1.1   2022-07-06 [1] CRAN (R 4.2.0)
#>  fastmap         1.1.0   2021-01-25 [1] CRAN (R 4.2.0)
#>  forcats       * 0.5.2   2022-08-19 [1] RSPM (R 4.2.1)
#>  fs              1.5.2   2021-12-08 [1] CRAN (R 4.2.0)
#>  gargle          1.2.1   2022-09-08 [1] RSPM (R 4.2.1)
#>  generics        0.1.3   2022-07-05 [1] CRAN (R 4.2.0)
#>  ggplot2       * 3.3.6   2022-05-03 [1] CRAN (R 4.2.0)
#>  glue            1.6.2   2022-02-24 [1] CRAN (R 4.2.0)
#>  googledrive     2.0.0   2021-07-08 [1] CRAN (R 4.2.0)
#>  googlesheets4   1.0.1   2022-08-13 [1] CRAN (R 4.2.1)
#>  gtable          0.3.1   2022-09-01 [1] RSPM (R 4.2.1)
#>  haven           2.5.1   2022-08-22 [1] RSPM (R 4.2.1)
#>  highr           0.9     2021-04-16 [1] CRAN (R 4.2.0)
#>  hms             1.1.2   2022-08-19 [1] RSPM (R 4.2.1)
#>  htmltools       0.5.3   2022-07-18 [1] CRAN (R 4.2.0)
#>  httr            1.4.4   2022-08-17 [1] RSPM (R 4.2.1)
#>  jsonlite        1.8.3   2022-10-21 [1] CRAN (R 4.2.1)
#>  knitr           1.40    2022-08-24 [1] RSPM (R 4.2.1)
#>  labeling        0.4.2   2020-10-20 [1] CRAN (R 4.2.0)
#>  lifecycle       1.0.3   2022-10-07 [1] CRAN (R 4.2.1)
#>  lubridate       1.8.0   2021-10-07 [1] CRAN (R 4.2.0)
#>  magrittr        2.0.3   2022-03-30 [1] CRAN (R 4.2.0)
#>  mime            0.12    2021-09-28 [1] CRAN (R 4.2.0)
#>  modelr          0.1.9   2022-08-19 [1] RSPM (R 4.2.1)
#>  munsell         0.5.0   2018-06-12 [1] CRAN (R 4.2.0)
#>  pillar          1.8.1   2022-08-19 [1] RSPM (R 4.2.1)
#>  pkgconfig       2.0.3   2019-09-22 [1] CRAN (R 4.2.0)
#>  purrr         * 0.3.5   2022-10-06 [1] CRAN (R 4.2.0)
#>  R.cache         0.16.0  2022-07-21 [1] CRAN (R 4.2.0)
#>  R.methodsS3     1.8.2   2022-06-13 [1] CRAN (R 4.2.0)
#>  R.oo            1.25.0  2022-06-12 [1] CRAN (R 4.2.0)
#>  R.utils         2.12.0  2022-06-28 [1] CRAN (R 4.2.1)
#>  R6              2.5.1   2021-08-19 [1] CRAN (R 4.2.0)
#>  RColorBrewer    1.1-3   2022-04-03 [1] CRAN (R 4.2.0)
#>  readr         * 2.1.3   2022-10-01 [1] CRAN (R 4.2.0)
#>  readxl          1.4.1   2022-08-17 [1] RSPM (R 4.2.1)
#>  reprex          2.0.2   2022-08-17 [1] RSPM (R 4.2.1)
#>  rlang           1.0.6   2022-09-24 [1] CRAN (R 4.2.0)
#>  rmarkdown       2.17    2022-10-07 [1] CRAN (R 4.2.1)
#>  rstudioapi      0.14    2022-08-22 [1] RSPM (R 4.2.1)
#>  rvest           1.0.3   2022-08-19 [1] RSPM (R 4.2.1)
#>  scales          1.2.1   2022-08-20 [1] RSPM (R 4.2.1)
#>  sessioninfo     1.2.2   2021-12-06 [1] CRAN (R 4.2.0)
#>  stringi         1.7.8   2022-07-11 [1] CRAN (R 4.2.1)
#>  stringr       * 1.4.1   2022-08-20 [1] RSPM (R 4.2.1)
#>  styler          1.8.0   2022-10-22 [1] CRAN (R 4.2.1)
#>  tibble        * 3.1.8   2022-07-22 [1] CRAN (R 4.2.1)
#>  tidyr         * 1.2.1   2022-09-08 [1] RSPM (R 4.2.1)
#>  tidyselect      1.2.0   2022-10-10 [1] CRAN (R 4.2.0)
#>  tidyverse     * 1.3.2   2022-07-18 [1] CRAN (R 4.2.0)
#>  tzdb            0.3.0   2022-03-28 [1] CRAN (R 4.2.0)
#>  utf8            1.2.2   2021-07-24 [1] CRAN (R 4.2.0)
#>  vctrs           0.5.0   2022-10-22 [1] CRAN (R 4.2.0)
#>  withr           2.5.0   2022-03-03 [1] CRAN (R 4.2.0)
#>  xfun            0.34    2022-10-18 [1] CRAN (R 4.2.0)
#>  xml2            1.3.3   2021-11-30 [1] CRAN (R 4.2.0)
#>  yaml            2.3.6   2022-10-18 [1] CRAN (R 4.2.0)
#>
#>  [1] /Users/timothychisamore/Library/R/x86_64/4.2/library
#>  [2] /Library/Frameworks/R.framework/Versions/4.2/Resources/library
#>
#> ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
``````

All the best,

Tim

hey berhe - the apyramid code you showed, should do what you have asked for.
We have an example in the vignette that stack_by a categorical var (down the bottom just before survey data), but i think we set a limit so if the number of levels >=2 then it would switch to a side-by-side bar. Maybe we need to reconsider thisβ¦

1 Like

sorry @berhe_tesfay I now see what you mean about the stacks.
The first dense bit of this code is just copying the data from stackoverflow. Code for the plot is from βADAPT DATAβ onwards.
Need to do a bit of fiddling to get the difference in counts by year - and then can pass it to {apyramid}.
Not entirely sure this is easier than just using {ggplot} as @machupovirus did β¦

``````library("dplyr") ## data wrangling
library("tidyr") ## data wrangling
library("wpp2015") ## example data from stackoverflow
library("apyramid") ## plot pyramid
library("ggplot2")  ## edit pyramid (add steps)

############# GET DATA
## this section is just copy pasted from
# https://stackoverflow.com/questions/15816073/population-pyramid-w-projection-in-r
data(popF)
data(popM)
data(popFprojMed)
data(popMprojMed)

#combine past and future female population
df0 <- popF %>%
left_join(popFprojMed) %>%
mutate(gender = "female")
#> Joining, by = c("country", "country_code", "age")

#combine past and future male population, add on female population
df1 <- popM %>%
left_join(popMprojMed) %>%
mutate(gender = "male") %>%
bind_rows(df0) %>%
mutate(age = factor(age, levels = unique(age)))
#> Joining, by = c("country", "country_code", "age")

#stack data for ggplot, filter World population and required years
df2 <- df1 %>%
gather(key = year, value = pop, -country, -country_code, -age, -gender) %>%
mutate(pop = pop/1e03) %>%
filter(country == "World", year %in% c(1950, 2000, 2050, 2100))

#add extra rows and numeric age variable for geom_step used for 2100
df2 <- df2 %>%
mutate(ageno = as.numeric(age) - 0.5)

df2 <- df2 %>%
bind_rows(df2 %>% filter(year==2100, age=="100+") %>% mutate(ageno = ageno + 1))

## here we need to edit our counts so that just the difference is shown
## (i.e. the extra for each year)

## create a new data set
df3 <- df2 %>%
## drop the prediction year
filter(year != 2100) %>%
## sort data by appropriate groups
arrange(age, gender, year) %>%
## calculate the difference between current and previous year
mutate(new_pop = pop - dplyr::lag(pop)) %>%
## replace first year with the original count
mutate(new_pop = if_else(year == 1950, pop, new_pop))

## plot age pyramid as normal
age_pyramid(df3,
age_group = age,
split_by = gender,
stack_by = year,
count = new_pop,
show_midpoint = FALSE) +
## add in line for 2100 (this is copy pasted from stack overflow but just added in pop)
## note that this uses the original dataset
#steps for 2100
geom_step(data =  df2 %>% filter(gender == "female", year == 2100),
aes(x = ageno, y = pop), size = 1) +
geom_step(data =  df2 %>% filter(gender == "male", year == 2100),
aes(x = ageno, y = -pop), size = 1)
``````

Created on 2022-11-06 with reprex v2.0.2

3 Likes

I am tagging @bsomethea because he was originally asking the question in the Cambodia cohort.

Somethea, @berhe_tesfay was investigating your question about stacked age pyramids. There are some solutions posted by the team. You can use this forum to post other questions as well.

Please see the instructions on how to post an R code question.
http://community.appliedepi.org/t/how-to-post-an-r-code-question/623/4

1 Like

I want to do an age pyramid splitting by gender for different countries in my dataset.

# age pyramid

age_pyramid(data = combined,

age_group = βage_catβ, #note that the column must be enclosed in quotation marks (" ")

split_by = βgenderβ,

proportional = TRUE,

na.rm = TRUE)

Can I use split_by = βgenderβ and add another split_by =βcountryβ at the same time? This has been my challenge.

This code does not work. I need to fix this urgently. Been on it for over a week and close to my deadline.

# age_pyramid(data = combined,

``````        age_group = "age_cat", #note that the column must be enclosed in quotation marks (" ")
split_by = "gender",
proportional = TRUE,
na.rm = TRUE)  +
facet_wrap(~country)
``````

Any idea will go a long way. Thanks

1 Like

Hi @maworh, if you are looking to facet the pyramid plot, the following code should solve it for you. If you are looking to fill it by country just change the fill gender to the district which is multi level. after that the game is about

``````pyr_data <- combined %>%
# select the variables of interest (consider district is a country)
select(age_cat, gender, district) %>%

# group  by the selecting variables
group_by(age_cat,gender, district) %>%

# create the grouped case count
summarise(
n = n()
)

# use ggplot to plot the population pyramid
ggplot(data = pyr_data,
aes(x = age_cat,         # x-axis is age cat
fill = gender,       # fill by gender
# create the left and right values
y = ifelse(gender == "male", -n, n))) +
# fix identity to stat otherwaie will throw error
geom_bar(stat = "identity") +
# use abs for labels
scale_y_continuous(labels = abs,
# change the negative value to the left to positive my multplying fixed interval
limits = max(payramid_data\$n)*c(-1,1)) +
labs(x = "Gender of the cases (splitted)",
y = "Age catgory of the cases",
fill = "Gender") +
# flip the bar chart
coord_flip() +
# facet by district (in your case it will be country)
facet_wrap(~district)
``````

If you want to use country as a fill, just change the `fill = gender` in to `fill = district` in this cases and country in your scenario. I hope this is helpful.

1 Like

However, the age categories were not complete even after converting to a factor before trying the codes

Species\$age_group β factor(Species\$age_group, levels = c(β0-2β, β3-12β, β13-64β, β65+β))

category β3-12β was missing.

How do I remove the βNAβ?

This is the image I got

1 Like

@mabelaworh you can drop the NA values before you plot the pyramid. however in you case first you need to change the empty place holder to NA I guess, as it is not clear in your legend. to drop the NA value, you can use multiple ways including filter but the easiest see below.

``````# first create the summarised data set from the combined data
pyr_data <- combined %>%
# first make sure NA values are properly leveled
mutate(age_cat = na_if(age_cat, "")) %>%
# drop the missing values of NA
drop_na(age_cat) %>%
# select the variables of interest (consider distict is a country)
select(age_cat, gender, district) %>%
# group  by the selecting variables
group_by(age_cat,gender, district) %>%
# create the grouped case count
summarise(
n = n()
)
``````
1 Like

Got another challenge while trying to plot multiple bar charts with this code everything is fine.
ggplot(
data = regions,
mapping = aes(
x = species,
fill = Phenotype)) + scale_fill_manual(values = phenotype_colors) +
geom_bar() +

scale_y_continuous(breaks = seq(from = 0,
to = 10000,
by = 1000),
expand = c(0,0)) +

scale_x_discrete(expand = c(0,0))+
coord_flip()+
theme_classic()+
facet_wrap(~ Regions)+
labs(title = ) +
xlab(βWHO priority pathogensβ) + ylab(βNumber of isolatesβ)

But with this code some rows are deleted even though there are no missing data
ggplot(
data = regions2,
mapping = aes(
x = species, y = percentages,
fill = Phenotype)) +
geom_bar(stat=βidentityβ) +
scale_fill_manual(values = phenotype_colors) +
scale_y_continuous(labels = scales::percent, limits = c(0, 1)) +
scale_x_discrete(expand = c(0, 0)) +
coord_flip() +
facet_wrap(~ Regions) +
labs(
title = ,
x = βSpeciesβ,
y = βPercentageβ)

I want the y axis in percentage and dont want any rows removed because there are no missing data. How do I fix my codes to make this work.

Thanks

Hi @mabelaworh,

Then you can ask your next question in a separate post. This way, it is easier for other people to find the helpful conversations later.

If possible, can you view this video about how to provide a βreproducibleβ code example with your question? It makes it much easier for us to help you.

At the least, it is nice to put you code in βback ticksβ like this:

`````````
data <- import(...)
```
``````

This makes it easier for us to copy and paste it into R.