Grouping Cancer Types by Site Code

lurkpa · May 31, 2024, 4:36pm

I am having trouble grouping cancer sites by site code. The data comes with three columns, site_code, site_description, and tot_count. I would like to group the site codes so that it combines the total amounts and lists them as such:

1-10 Oral Cancer
1-10, 34, 35 Head & Neck Cancer
14-24 Colorectal Cancer
60 & 61 Brain Cancer
64 & 65 Hodgkins Lymphoma
66 & 67 Non-hodgkins Lymphoma
69-77 Leukemia
all other unspecified codes would not need to be grouped but their tot counts are still important for analysis.

Please notice that for Oral Cancer and Head & Neck Cancer share overlap of codes. Is it possible to have the tot count count twice? Once in Oral Cancer and again in Head & Neck Cancer?

random_reprex <- data.frame(
  stringsAsFactors = FALSE,
  site_code = c(1,2,3,4,5,6,7,8,9,10,
                11,12,13,14,15,16,17,18,19,20,21,22,23,24,
                25,26,27,28,29,30,31,32,33,34,35,60,61,64,65,
                66,67,68,69,70,71,72,73,74,75,76,77),
  site_description = c("Lip","Tongue",
                       "Salivary Gland","Floor of Mouth","Gum and Other Mouth",
                       "Nasopharynx","Tonsil","Oropharynx","Hypopharynx",
                       "Other Oral Cavity and Pharynx","Esophagus","Stomach",
                       "Small Intestine","Cecum","Appendix","Ascending Colon",
                       "Hepatic Flexure","Transverse Colon","Splenic Flexure",
                       "Descending Colon","Sigmoid Colon","Large Intestine, NOS",
                       "Rectosigmoid Junction","Rectum",
                       "Anus, Anal Canal and Anorectum","Liver","Intrahepatic Bile Duct","Gallbladder",
                       "Other Biliary","Pancreas","Retroperitoneum",
                       "Peritoneum, Omentum and Mesentery","Other Digestive Organs",
                       "Nose, Nasal Cavity and Middle Ear","Larynx","Brain",
                       "Cranial Nerves Other Nervous System","Hodgkin - Nodal",
                       "Hodgkin - Extranodal","Non-Hodgkin Lymphoma - Nodal",
                       "Non-Hodgkin Lymphoma - Extranodal","Myeloma",
                       "Acute Lymphocytic Leukemia","Chronic Lymphocytic Leukemia",
                       "Other Lymphocytic Leukemia","Acute Myeloid Leukemia",
                       "Chronic Myeloid Leukemia","Other Myeloid/Monocytic Leukemia",
                       "Acute Monocytic Leukemia","Other Acute Leukemia",
                       "Aleukemic, subleukemic and NOS"),
  tot_count = c(162,1816,617,262,644,347,
                995,572,376,244,1599,2104,902,1730,646,1762,427,
                928,393,502,2029,858,790,2516,900,2399,785,
                478,637,4906,294,282,363,395,1135,1836,193,835,
                105,4831,2023,2886,516,2032,265,1691,821,182,235,
                197,334)
)

machupovirus · May 31, 2024, 7:03pm

Hello,

Assuming you want to keep these as mutual exclusive groups, I wouldn’t recommend overlapping the first two groups as you’ve stated. Below is how I would approach this question:

# loading packages
library(tidyverse)

# creating fake data
fake_data <- data.frame(
  stringsAsFactors = FALSE,
  site_code = c(
    1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
    11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24,
    25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 60, 61, 64, 65,
    66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77
  ),
  site_description = c(
    "Lip", "Tongue",
    "Salivary Gland", "Floor of Mouth", "Gum and Other Mouth",
    "Nasopharynx", "Tonsil", "Oropharynx", "Hypopharynx",
    "Other Oral Cavity and Pharynx", "Esophagus", "Stomach",
    "Small Intestine", "Cecum", "Appendix", "Ascending Colon",
    "Hepatic Flexure", "Transverse Colon", "Splenic Flexure",
    "Descending Colon", "Sigmoid Colon", "Large Intestine, NOS",
    "Rectosigmoid Junction", "Rectum",
    "Anus, Anal Canal and Anorectum", "Liver", "Intrahepatic Bile Duct", "Gallbladder",
    "Other Biliary", "Pancreas", "Retroperitoneum",
    "Peritoneum, Omentum and Mesentery", "Other Digestive Organs",
    "Nose, Nasal Cavity and Middle Ear", "Larynx", "Brain",
    "Cranial Nerves Other Nervous System", "Hodgkin - Nodal",
    "Hodgkin - Extranodal", "Non-Hodgkin Lymphoma - Nodal",
    "Non-Hodgkin Lymphoma - Extranodal", "Myeloma",
    "Acute Lymphocytic Leukemia", "Chronic Lymphocytic Leukemia",
    "Other Lymphocytic Leukemia", "Acute Myeloid Leukemia",
    "Chronic Myeloid Leukemia", "Other Myeloid/Monocytic Leukemia",
    "Acute Monocytic Leukemia", "Other Acute Leukemia",
    "Aleukemic, subleukemic and NOS"
  ),
  tot_count = c(
    162, 1816, 617, 262, 644, 347,
    995, 572, 376, 244, 1599, 2104, 902, 1730, 646, 1762, 427,
    928, 393, 502, 2029, 858, 790, 2516, 900, 2399, 785,
    478, 637, 4906, 294, 282, 363, 395, 1135, 1836, 193, 835,
    105, 4831, 2023, 2886, 516, 2032, 265, 1691, 821, 182, 235,
    197, 334
  )
) |> as_tibble()

# aggregating data
fake_data |>
    mutate(site_group = case_when(
        site_code %in% 1:10 ~ "Oral Cancer",
        site_code %in% c(34, 35) ~ "Head & Neck Cancer",
        site_code %in% 14:24 ~ "Colorectal Cancer",
        site_code %in% c(60, 61) ~ "Brain Cancer",
        site_code %in% c(64, 65) ~ "Hodgkins Lymphoma",
        site_code %in% c(66, 67) ~ "Non-hodgkins Lymphoma",
        site_code %in% 69:77 ~ "Leukemia",
        .default = "Residual"
    )) |>
    summarize(tot_count = sum(tot_count), .by = site_group)
#> # A tibble: 8 × 2
#>   site_group            tot_count
#>   <chr>                     <dbl>
#> 1 Oral Cancer                6035
#> 2 Residual                  18535
#> 3 Colorectal Cancer         12581
#> 4 Head & Neck Cancer         1530
#> 5 Brain Cancer               2029
#> 6 Hodgkins Lymphoma           940
#> 7 Non-hodgkins Lymphoma      6854
#> 8 Leukemia                   6273

^{Created on 2024-05-31 with reprex v2.1.0}

Session info

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.4.0 (2024-04-24)
#>  os       macOS Sonoma 14.4.1
#>  system   x86_64, darwin20
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       America/Toronto
#>  date     2024-05-31
#>  pandoc   3.1.11 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/x86_64/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  cli           3.6.2   2023-12-11 [1] RSPM (R 4.4.0)
#>  colorspace    2.1-0   2023-01-23 [1] RSPM (R 4.4.0)
#>  digest        0.6.35  2024-03-11 [1] RSPM (R 4.4.0)
#>  dplyr       * 1.1.4   2023-11-17 [1] RSPM (R 4.4.0)
#>  evaluate      0.23    2023-11-01 [1] RSPM (R 4.4.0)
#>  fansi         1.0.6   2023-12-08 [1] RSPM (R 4.4.0)
#>  fastmap       1.2.0   2024-05-15 [1] RSPM (R 4.4.0)
#>  forcats     * 1.0.0   2023-01-29 [1] RSPM (R 4.4.0)
#>  fs            1.6.4   2024-04-25 [1] RSPM (R 4.4.0)
#>  generics      0.1.3   2022-07-05 [1] RSPM (R 4.4.0)
#>  ggplot2     * 3.5.1   2024-04-23 [1] RSPM (R 4.4.0)
#>  glue          1.7.0   2024-01-09 [1] RSPM (R 4.4.0)
#>  gtable        0.3.5   2024-04-22 [1] RSPM (R 4.4.0)
#>  hms           1.1.3   2023-03-21 [1] RSPM (R 4.4.0)
#>  htmltools     0.5.8.1 2024-04-04 [1] RSPM (R 4.4.0)
#>  knitr         1.46    2024-04-06 [1] RSPM (R 4.4.0)
#>  lifecycle     1.0.4   2023-11-07 [1] RSPM (R 4.4.0)
#>  lubridate   * 1.9.3   2023-09-27 [1] RSPM (R 4.4.0)
#>  magrittr      2.0.3   2022-03-30 [1] RSPM (R 4.4.0)
#>  munsell       0.5.1   2024-04-01 [1] RSPM (R 4.4.0)
#>  pillar        1.9.0   2023-03-22 [1] RSPM (R 4.4.0)
#>  pkgconfig     2.0.3   2019-09-22 [1] RSPM (R 4.4.0)
#>  purrr       * 1.0.2   2023-08-10 [1] RSPM (R 4.4.0)
#>  R.cache       0.16.0  2022-07-21 [1] RSPM (R 4.4.0)
#>  R.methodsS3   1.8.2   2022-06-13 [1] RSPM (R 4.4.0)
#>  R.oo          1.26.0  2024-01-24 [1] RSPM (R 4.4.0)
#>  R.utils       2.12.3  2023-11-18 [1] RSPM (R 4.4.0)
#>  R6            2.5.1   2021-08-19 [1] RSPM (R 4.4.0)
#>  readr       * 2.1.5   2024-01-10 [1] RSPM (R 4.4.0)
#>  reprex        2.1.0   2024-01-11 [1] RSPM (R 4.4.0)
#>  rlang         1.1.3   2024-01-10 [1] RSPM (R 4.4.0)
#>  rmarkdown     2.27    2024-05-17 [1] RSPM (R 4.4.0)
#>  rstudioapi    0.16.0  2024-03-24 [1] RSPM (R 4.4.0)
#>  scales        1.3.0   2023-11-28 [1] RSPM (R 4.4.0)
#>  sessioninfo   1.2.2   2021-12-06 [1] RSPM (R 4.4.0)
#>  stringi       1.8.4   2024-05-06 [1] RSPM (R 4.4.0)
#>  stringr     * 1.5.1   2023-11-14 [1] RSPM (R 4.4.0)
#>  styler        1.10.3  2024-04-07 [1] RSPM (R 4.4.0)
#>  tibble      * 3.2.1   2023-03-20 [1] RSPM (R 4.4.0)
#>  tidyr       * 1.3.1   2024-01-24 [1] RSPM (R 4.4.0)
#>  tidyselect    1.2.1   2024-03-11 [1] RSPM (R 4.4.0)
#>  tidyverse   * 2.0.0   2023-02-22 [1] RSPM (R 4.4.0)
#>  timechange    0.3.0   2024-01-18 [1] RSPM (R 4.4.0)
#>  tzdb          0.4.0   2023-05-12 [1] RSPM (R 4.4.0)
#>  utf8          1.2.4   2023-10-22 [1] RSPM (R 4.4.0)
#>  vctrs         0.6.5   2023-12-01 [1] RSPM (R 4.4.0)
#>  withr         3.0.0   2024-01-16 [1] RSPM (R 4.4.0)
#>  xfun          0.44    2024-05-15 [1] RSPM (R 4.4.0)
#>  yaml          2.3.8   2023-12-11 [1] RSPM (R 4.4.0)
#> 
#>  [1] /Users/timothychisamore/Library/R/x86_64/4.4/library
#>  [2] /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

All the best,

Tim

lurkpa · June 3, 2024, 6:16pm

This worked perfectly. Thank you!

neale · June 4, 2024, 11:04pm

Great response, Tim! I did not realize that summarize() has a .by argument!

machupovirus · June 5, 2024, 12:03am

Hi @neale,

The .by argument is relatively new, I think I like it because it reduces the overhead of having to use group_by() and .drop in the summarize() or ungroup() to remove the group structure of the data. Accidentally leaving this grouping can cause chaos later on!

All the best,

Tim