Applied Epi Community

Cleaning a date column with multiple formats

Please ensure you have done the following before posting:
[/] I have read these instructions and this guidance on how to produce a minimal reproducible example.

[/] I have searched this forum and stackoverflow for solutions

Provide an overview of your problem

  • I regularly receive an excel file with dates in multiple formats found in a single column (excel date i.e. 44663, MM-DD-YY, MM/DD/YY, MM/DD/YYYY, etc).

Describe your issue in detail

  • I searched and found that the Janitor convert_to_date function might serve this purpose, but I can’t seem to make it work.
  • This is the typical linelist that I receive.
  • Timeline/urgency
  • (No sensitive or identifiable information)

What steps have you already taken to find an answer?

I tried doing the following: linelist_clean <- convert_to_date(linelist2$date_birth)

But I get the following error:

Error in convert_to_datetime_helper.character(x, …, character_fun = character_fun, :
Not all character strings converted to class Date. Values not converted were: β€œ01-19-2000”, β€œ10/16/2018”, β€œ8-28-2021”, β€œ01-29-1971”, β€œ07-25-2018”, β€œ02/28/2002”, β€œ03/28/2008”, β€œ04-28-2001”, β€œ05/21/1993” … and 41 other values.
In addition: Warning message:
All formats failed to parse. No formats found.

I hope I attached a reproducible example the right way since this is my first time. Looking forward to your suggestions!

A tibble: 5 x 6

street age_years date_birth date_admit date_onset lab_res

1 ACACIA 39 30345 44663 44662 NS1-AG POSITIVE
2 ACACIA 35 31842 44685 44684 NS1-AG POSITIVE
3 SUN FLOWER 22 01/19/2000 05/18/2022 44900 NS1 NEGATIVE
4 KASOY 3 43355 44625 44597 NS1-AG POSITIVE
5 7 2 43663 44688 44685 NS1 NEGATIVE

Hi @iancgmd - Welcome to the forum!
Dates can be a very painful thing to deal with in R especially when they come in different messy formats, but luckily there’s quite a few packages that deal with this.
Janitor only really works if when all of the dates are the same excel date format (the number type e.g. your 4463 above).
This blog post gives a really nice overview of the various ways of dealing with messy dates.

I would suggest the parsedate package - it does a relatively good job of guessing and doesn’t need you to do any defining.
Otherwise the lubridate package is quite good but needs you to define all the different formats that dates show up in (which is fine if you know this, but less great if people are adding new messy ways each week!).

Another thing to mention is that the Rio package for reading in Excel sheets does a pretty good job of finding date columns.

Hope this helps

2 Likes

Thank you, @aspina ! Thank you for referring me to the parsedate package! I think it will do the job, however I’m not quite sure how to execute the package in the cleaning pipechain. I’m learning R by myself and I’m trying to apply the cleaning data and core functions chapter of the handbook to my dataset. Suppose I have done the following:

pipe the raw dataset through the function clean_names(), assign result as β€œlinelist”

linelist_clean ← linelist_raw %>%
janitor::clean_names() %>%

manual name changes. new column name is given before the old column name.

  #new name      #old name

rename(name_first = first_name,
name_last = family_name,
date_birth = dob,
date_admit = d_admit,
date_onset = d_onset,
name_dru = name_of_dru) %>%

remove 100% duplicates

distinct()

How would I then incorporate the parsedate package to clean the date_birth, date_admit, and date_onset columns respectively?

hi @iancgmd sorry for the delay, you could use the across() function within a mutate() to run parsedate over each variable. See below (let me know if this throws an error as I havent run the code)

# pipe the raw dataset through the function clean_names(), assign result as β€œlinelist”

linelist_clean ← linelist_raw %>%
janitor::clean_names() %>%

# manual name changes. new column name is given before the old column name.
  #new name      #old name

rename(name_first = first_name,
name_last = family_name,
date_birth = dob,
date_admit = d_admit,
date_onset = d_onset,
name_dru = name_of_dru) %>%

# remove 100% duplicates

distinct() %>% 

# fix date variables one by one 
mutate(
      across(.cols = all_of(date_birth, date_admit, date_onset), 
      ~ parse_date
)
)
2 Likes

Thank you for this! I tried running it but I get the following error:

Error in across():
! Must be used inside dplyr verbs.
Run rlang::last_error() to see where the error occurred.

1 Like

hmmm looks like you might not have run across() within the brackets of mutate()? Or did you forget the %>% to include the data from the previous step?

1 Like

I tried copy-pasting the code (variable names have changed slightly to DOB, DAdmit, DOnset), and got the following error:

Error in mutate():
! Problem while computing ..1 = across(.cols = all_of(DOB, DAdmit, DOnset), ~parse_date).
Caused by error in across():
! unused arguments (DAdmit, DOnset)
Run rlang::last_error() to see where the error occurred

1 Like

oops sorry!
try this - instead of all_of() use c() and remove the ~

# fix date variables one by one 
mutate(
      across(.cols = c(DOB, DAdmit, DOnset), 
      parse_date
)
)
2 Likes

Thanks! The code finally ran, but I got the following warning after:

Warning: Problem while computing ..1 = across(.cols = c(DOB, DAdmit, DOnset), parse_date).
i 167 parsing failures. row col expected actual 1 – date like 30345 2 – date like 31842 3 – date like 01-19-2000 4
– date like 43355 5 – date like 43663 … … … … See problems(…) for more details.
Warning: Problem while computing ..1 = across(.cols = c(DOB, DAdmit, DOnset), parse_date).
i 171 parsing failures. row col expected actual 1 – date like 44663 2 – date like 44685 3 – date like 05-18-2022 4
– date like 44625 5 – date like 44688 … … … … See problems(…) for more details.
Warning: Problem while computing ..1 = across(.cols = c(DOB, DAdmit, DOnset), parse_date).
i 170 parsing failures. row col expected actual 1 – date like 44662 2 – date like 44684 3 – date like 44900 4 –
date like 44597 5 – date like 44685 … … … … See problems(…) for more details.

Then when I look at the linelist data frame, the 3 date columns all have NA.

Update, I still haven’t figured out how to go about this.

Hello! In case anyone encounters a similar issue, the following ultimately worked for me:

mutate(date_onset = parse_date_time(date_onset, orders =c(β€œmdy”, β€œdBY”, β€œmdY”))) %>%
mutate(date_onset = format(date_onset, β€œ%Y-%m-%d”))

1 Like