Parse Messy Dates from an Object Column and Replace with Formatted Dates

Hi! I’m trying to make a script for cleaning messy dates. Basically, I have a linelist of deaths and it has a column called “Date of Death”. This column contains dates that are not uniformly formatted so I’d like to parse and format the dates into a uniform date format however, I’m getting the following error (cleanlist object not found) however, when I manually try to run the cleanlist code, it runs without any issues. I just get the issue below if I try to run the whole script. Also, I want to know if how would I properly use the mutate function to update the date_of_death column in the clean_list table/object. Thank you!

# Loading packages --------------------------------------------------------

# Checks if package is installed, installs if necessary, and loads package for current session

pacman::p_load(
  lubridate,  # general package for handling and converting dates  
  parsedate,  # has function to "guess" messy dates
  here,       # file management
  rio,        # data import/export
  janitor,    # data cleaning
  epikit,     # for creating age categories
  tidyverse,  # data management and visualization 
  magrittr,
  dplyr,
  reprex,     # minimal repr example
  datapasta   # sample data
)

datecolumn <- "date_of_death"
  
# Data Import -------------------------------------------------------------

  linelist <- data.frame(
    stringsAsFactors = FALSE,
    check.names = FALSE,
    `Date of Death` = c("45236","45212","45152",
                        "JANUARY 19, 2023","June 25, 2023","45200","45164",
                        "5/16/2023","45277","44930"))
  
# Data Cleaning -----------------------------------------------------------

cleanlist <- linelist %>%      # the raw dataset
  clean_names() %>%            # automatically clean column names
  
# Format Dates ------------------------------------------------------------
  
  # parse the date values
  mutate(parsedate::parse_date(cleanlist[[datecolumn]]))
#> Error in `mutate()`:
#> ℹ In argument: `parsedate::parse_date(cleanlist[[datecolumn]])`.
#> Caused by error:
#> ! object 'cleanlist' not found

Created on 2024-08-12 with reprex v2.1.1

1 Like

Hello,

This is how I would approach the problem using the parsedate package:

# loading packages
library(tidyverse)
library(janitor)
#> 
#> Attaching package: 'janitor'
#> The following objects are masked from 'package:stats':
#> 
#>     chisq.test, fisher.test
library(parsedate)
#> 
#> Attaching package: 'parsedate'
#> The following object is masked from 'package:readr':
#> 
#>     parse_date

# creating fake data
linelist <- data.frame(
    stringsAsFactors = FALSE,
    check.names = FALSE,
    `Date of Death` = c(
        "45236",
        "45212",
        "45152",
        "JANUARY 19, 2023",
        "June 25, 2023",
        "45200",
        "45164",
        "5/16/2023",
        "45277",
        "44930"
    )
) |>
    as_tibble()

# cleaning data
cleanlist <- linelist |>
    clean_names()
    
# formatting date
cleanlist |>
    mutate(date_of_death = parsedate::parse_date(date_of_death)) |>
    slice_head(n = 10)
#> # A tibble: 10 × 1
#>    date_of_death      
#>    <dttm>             
#>  1 2024-08-12 08:07:45
#>  2 2024-08-12 08:07:45
#>  3 2024-08-12 08:07:45
#>  4 2023-01-19 00:00:00
#>  5 2023-06-25 00:00:00
#>  6 2024-08-12 08:07:45
#>  7 2024-08-12 08:07:45
#>  8 2023-05-16 00:00:00
#>  9 2024-08-12 08:07:45
#> 10 2024-08-12 08:07:45

Created on 2024-08-12 with reprex v2.1.1

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.4.1 (2024-06-14)
#>  os       macOS Sonoma 14.5
#>  system   x86_64, darwin20
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       America/Toronto
#>  date     2024-08-12
#>  pandoc   3.1.11 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/x86_64/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  cli           3.6.3   2024-06-21 [1] RSPM (R 4.4.0)
#>  colorspace    2.1-0   2023-01-23 [1] RSPM (R 4.4.0)
#>  digest        0.6.36  2024-06-23 [1] RSPM (R 4.4.0)
#>  dplyr       * 1.1.4   2023-11-17 [1] RSPM (R 4.4.0)
#>  evaluate      0.24.0  2024-06-10 [1] RSPM (R 4.4.0)
#>  fansi         1.0.6   2023-12-08 [1] RSPM (R 4.4.0)
#>  fastmap       1.2.0   2024-05-15 [1] RSPM (R 4.4.0)
#>  forcats     * 1.0.0   2023-01-29 [1] RSPM (R 4.4.0)
#>  fs            1.6.4   2024-04-25 [1] RSPM (R 4.4.0)
#>  generics      0.1.3   2022-07-05 [1] RSPM (R 4.4.0)
#>  ggplot2     * 3.5.1   2024-04-23 [1] RSPM (R 4.4.0)
#>  glue          1.7.0   2024-01-09 [1] RSPM (R 4.4.0)
#>  gtable        0.3.5   2024-04-22 [1] RSPM (R 4.4.0)
#>  hms           1.1.3   2023-03-21 [1] RSPM (R 4.4.0)
#>  htmltools     0.5.8.1 2024-04-04 [1] RSPM (R 4.4.0)
#>  janitor     * 2.2.0   2023-02-02 [1] CRAN (R 4.4.0)
#>  knitr         1.48    2024-07-07 [1] RSPM (R 4.4.0)
#>  lifecycle     1.0.4   2023-11-07 [1] RSPM (R 4.4.0)
#>  lubridate   * 1.9.3   2023-09-27 [1] RSPM (R 4.4.0)
#>  magrittr      2.0.3   2022-03-30 [1] RSPM (R 4.4.0)
#>  munsell       0.5.1   2024-04-01 [1] RSPM (R 4.4.0)
#>  parsedate   * 1.3.1   2022-10-27 [1] CRAN (R 4.4.0)
#>  pillar        1.9.0   2023-03-22 [1] RSPM (R 4.4.0)
#>  pkgconfig     2.0.3   2019-09-22 [1] RSPM (R 4.4.0)
#>  purrr       * 1.0.2   2023-08-10 [1] RSPM (R 4.4.0)
#>  R6            2.5.1   2021-08-19 [1] RSPM (R 4.4.0)
#>  readr       * 2.1.5   2024-01-10 [1] RSPM (R 4.4.0)
#>  reprex        2.1.1   2024-07-06 [1] RSPM (R 4.4.0)
#>  rlang         1.1.4   2024-06-04 [1] RSPM (R 4.4.0)
#>  rmarkdown     2.27    2024-05-17 [1] RSPM (R 4.4.0)
#>  rstudioapi    0.16.0  2024-03-24 [1] RSPM (R 4.4.0)
#>  scales        1.3.0   2023-11-28 [1] RSPM (R 4.4.0)
#>  sessioninfo   1.2.2   2021-12-06 [1] RSPM (R 4.4.0)
#>  snakecase     0.11.1  2023-08-27 [1] CRAN (R 4.4.0)
#>  stringi       1.8.4   2024-05-06 [1] RSPM (R 4.4.0)
#>  stringr     * 1.5.1   2023-11-14 [1] RSPM (R 4.4.0)
#>  tibble      * 3.2.1   2023-03-20 [1] RSPM (R 4.4.0)
#>  tidyr       * 1.3.1   2024-01-24 [1] RSPM (R 4.4.0)
#>  tidyselect    1.2.1   2024-03-11 [1] RSPM (R 4.4.0)
#>  tidyverse   * 2.0.0   2023-02-22 [1] RSPM (R 4.4.0)
#>  timechange    0.3.0   2024-01-18 [1] RSPM (R 4.4.0)
#>  tzdb          0.4.0   2023-05-12 [1] RSPM (R 4.4.0)
#>  utf8          1.2.4   2023-10-22 [1] RSPM (R 4.4.0)
#>  vctrs         0.6.5   2023-12-01 [1] RSPM (R 4.4.0)
#>  withr         3.0.0   2024-01-16 [1] RSPM (R 4.4.0)
#>  xfun          0.45    2024-06-16 [1] RSPM (R 4.4.0)
#>  yaml          2.3.9   2024-07-05 [1] RSPM (R 4.4.0)
#> 
#>  [1] /Users/timothychisamore/Library/R/x86_64/4.4/library
#>  [2] /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

All the best,

Tim

1 Like

Thanks for the response, Tim. I’ll try this out :slight_smile:

1 Like

Hello,

Just to be clear, I only used the slice_head function to print a sample of the data into the console, you wouldn’t use this in the actual process of formatting your dates. Let me know if you have any questions!

All the best,

Tim

Yeah, I tried the code without the slice_head function and it worked fine however, there seems to be a problem in parsing the Excel dates. Ex. the ‘45236’ value was converted to “2024-08-12” but the correct date should be “2023-11-06”

1 Like

Hello,

I think that is a common issue with Excel numeric dates, I know the janitor package has the function excel_numeric_to_date() that might be useful for you.

All the best,

Tim

1 Like

hello - this thread might be useful to you

2 Likes

Thank you for sharing this! I’ll have a look.

1 Like