Separate summary statistics for different levels of group_by()

stevensola · June 25, 2024, 2:30pm

First real post here! I’m creating summary tables for my dissertation using group_by() %>% summarise(). My data is in a multi-level format, persons located within households. The goal is to group households by whether they have any animals, and look at the diarrhea prevalence within those households. Here is the data and code that I have so far:

library(tidyverse)

# Create person level data
person_level <- tibble::tribble(
                 ~household_id, ~person_id, ~age, ~diarrhea, ~animal, ~animal_total,
                 1,          1,    2,         1,       1,             5,
                 1,          2,    3,         0,       1,             5,
                 1,          3,   65,        NA,       1,             5,
                 1,          4,   66,        NA,       1,             5,
                 2,          1,    2,         1,       0,             0,
                 2,          2,    4,         0,       0,             0,
                 2,          3,    8,        NA,       0,             0,
                 2,          4,   55,        NA,       0,             0
               )

# "convert" to household level
# Used to get # of animals per household
# Person level shows household 1 has 20 animals
# While household 1 really only has 5 animals
household_level <- person_level %>% 
                    distinct(household_id, .keep_all = TRUE) 

# Create summary table
person_level %>% 
  group_by(animal) %>% 
  summarise(n = n(),
            animal_total = household_level %>% pull(animal_total) %>% sum,
            n_under5 = sum(age <= 5, na.rm = TRUE),
            num_diarrhea = sum(diarrhea, na.rm = TRUE),
            percent_diarrhea = round((num_diarrhea / n_under5), 4)*100) %>% 
  arrange(desc(animal))

I “convert” the person level data into household level to get the number of animals that each household has. If we keep it in the person level, it will show the household has 20 animals (5 animals * 4 people), whereas the household really only has 5 animals.

Within my summary table, I’m able to get the total number of animals among all households that own animals, but this number also shows for the group of people where animal = 0.

Goal: I want the “animal_total” within summarise() to only show for one group, and not the other. In other words, have the sum(animal) show only for the group who owns animals, and have a 0 show for those people that don’t have animals in the household.

Thanks in advance!

temuulen · June 25, 2024, 5:33pm

Hey Steven,

It looks like an if else statement would do the job here like this:

person_level %>%
  group_by(animal) %>%
  summarise(
    n = n(),
    animal_total = if (first(animal) == 1)
      sum(unique(animal_total))
    else
      0,
    n_under5 = sum(age <= 5, na.rm = TRUE),
    num_diarrhea = sum(diarrhea, na.rm = TRUE),
    percent_diarrhea = round((num_diarrhea / n_under5), 4) * 100
  ) %>%
  arrange(desc(animal))

Let me know if this works.

A few quick things to keep in mind:

This assumes everyone in a household has the same animal info.
For extra accuracy, you might want to join the household data back to the person data before summarizing.
Watch out for small numbers when calculating percentages - dividing by zero can cause issues.

Just something to think about as you work on your analysis.

stevensola · June 25, 2024, 10:59pm

Hi Temuulen,
Thanks for the solution! Honestly, that worked for the dataset that I posted. Unfortunately, it was probably too minimal of a dataset. Here’s another (slightly bigger) dataset that shows where the code breaks down:

person_level <- tibble::tribble(
                 ~household_id, ~person_id, ~age, ~diarrhea, ~animal, ~animal_total,
                 1,          1,    2,         1,       1,             5,
                 1,          2,    3,         0,       1,             5,
                 1,          3,   65,        NA,       1,             5,
                 1,          4,   66,        NA,       1,             5,
                 2,          1,    2,         1,       0,             0,
                 2,          2,    4,         0,       0,             0,
                 2,          3,    8,        NA,       0,             0,
                 2,          4,   55,        NA,       0,             0,
                 3,          1,    3,         1,       1,             5,
                 3,          2,   45,        NA,       1,             5,
               )

In this dataset, the “animal_total” has a duplicate value. This is similar to my larger dataset, where the values range from 1-95. I understand that the sum(unique(animal_total)) just sums the unique values of animal_total.

How can we change the code so that the summary table shows “10” for animal_total instead of 5?

Removing the unique() results in 30. It would be great if the unique() “reset” for each household.

My best at editing the code that you provided:

person_level %>%
  group_by(animal) %>%
  summarise(
    n = n(),
    animal_total = if (first(animal) == 1)
                   unique(household_id) %>% sum(unique(animal_total))
                   else 0,
    n_under5 = sum(age <= 5, na.rm = TRUE),
    num_diarrhea = sum(diarrhea, na.rm = TRUE),
    percent_diarrhea = round((num_diarrhea / n_under5), 4) * 100) %>%
  arrange(desc(animal))

If you run this code, you’ll see that animal_total = 9 in the summary table. I honestly have no idea where this 9 came from…

stevensola · June 25, 2024, 11:20pm

This was a solution that was proposed that seems to work perfectly:

person_level %>% 
  group_by(household_id) %>% 
  mutate(first_animal_total = if_else(row_number()==1, animal_total, 0 )) %>% 
  group_by(animal) %>% 
  summarise(n = n(),
            animal_total = sum(first_animal_total),
            n_under5 = sum(age <= 5, na.rm = TRUE),
            num_diarrhea = sum(diarrhea, na.rm = TRUE),
            percent_diarrhea = round((num_diarrhea / n_under5), 4)*100) %>% 
  arrange(desc(animal))

The key was a first group_by(), then create an entirely separate variable, then use that variable in a second group_by() and summarise() statement.

Interesting the two different ways to go about this. It’s good to see the many different ways to attack the problem.

lnielsen · June 26, 2024, 12:19pm

Great @stevensola, I was considering this problem scenario yesterday, and perhaps a lazy workaround could be simple using mutate and case_when to set animal_total to 0 when animal equals 0 before arrange. This might produce the desired outcome also right?, though I acknowledge it’s not the ideal solution.