Please ensure you have done the following before posting:
[/] I have read these instructions and this guidance on how to produce a minimal reproducible example.
[/] I have searched this forum and stackoverflow for solutions
Provide an overview of your problem
I regularly receive an excel file with dates in multiple formats found in a single column (excel date i.e. 44663, MM-DD-YY, MM/DD/YY, MM/DD/YYYY, etc).
Describe your issue in detail
I searched and found that the Janitor convert_to_date function might serve this purpose, but I can’t seem to make it work.
This is the typical linelist that I receive.
Timeline/urgency
(No sensitive or identifiable information)
What steps have you already taken to find an answer?
I tried doing the following: linelist_clean <- convert_to_date(linelist2$date_birth)
But I get the following error:
Error in convert_to_datetime_helper.character(x, …, character_fun = character_fun, :
Not all character strings converted to class Date. Values not converted were: “01-19-2000”, “10/16/2018”, “8-28-2021”, “01-29-1971”, “07-25-2018”, “02/28/2002”, “03/28/2008”, “04-28-2001”, “05/21/1993” … and 41 other values.
In addition: Warning message:
All formats failed to parse. No formats found.
I hope I attached a reproducible example the right way since this is my first time. Looking forward to your suggestions!
A tibble: 5 x 6
street age_years date_birth date_admit date_onset lab_res
Hi @iancgmd - Welcome to the forum!
Dates can be a very painful thing to deal with in R especially when they come in different messy formats, but luckily there’s quite a few packages that deal with this.
Janitor only really works if when all of the dates are the same excel date format (the number type e.g. your 4463 above).
This blog post gives a really nice overview of the various ways of dealing with messy dates.
I would suggest the parsedate package - it does a relatively good job of guessing and doesn’t need you to do any defining.
Otherwise the lubridate package is quite good but needs you to define all the different formats that dates show up in (which is fine if you know this, but less great if people are adding new messy ways each week!).
Another thing to mention is that the Rio package for reading in Excel sheets does a pretty good job of finding date columns.
Thank you, @aspina ! Thank you for referring me to the parsedate package! I think it will do the job, however I’m not quite sure how to execute the package in the cleaning pipechain. I’m learning R by myself and I’m trying to apply the cleaning data and core functions chapter of the handbook to my dataset. Suppose I have done the following:
pipe the raw dataset through the function clean_names(), assign result as “linelist”
hi @iancgmd sorry for the delay, you could use the across() function within a mutate() to run parsedate over each variable. See below (let me know if this throws an error as I havent run the code)
# pipe the raw dataset through the function clean_names(), assign result as “linelist”
linelist_clean ← linelist_raw %>%
janitor::clean_names() %>%
# manual name changes. new column name is given before the old column name.
#new name #old name
rename(name_first = first_name,
name_last = family_name,
date_birth = dob,
date_admit = d_admit,
date_onset = d_onset,
name_dru = name_of_dru) %>%
# remove 100% duplicates
distinct() %>%
# fix date variables one by one
mutate(
across(.cols = all_of(date_birth, date_admit, date_onset),
~ parse_date
)
)
I tried copy-pasting the code (variable names have changed slightly to DOB, DAdmit, DOnset), and got the following error:
Error in mutate():
! Problem while computing ..1 = across(.cols = all_of(DOB, DAdmit, DOnset), ~parse_date).
Caused by error in across():
! unused arguments (DAdmit, DOnset)
Run rlang::last_error() to see where the error occurred
Thanks! The code finally ran, but I got the following warning after:
Warning: Problem while computing ..1 = across(.cols = c(DOB, DAdmit, DOnset), parse_date).
i 167 parsing failures. row col expected actual 1 – date like 30345 2 – date like 31842 3 – date like 01-19-2000 4
– date like 43355 5 – date like 43663 … … … … See problems(…) for more details.
Warning: Problem while computing ..1 = across(.cols = c(DOB, DAdmit, DOnset), parse_date).
i 171 parsing failures. row col expected actual 1 – date like 44663 2 – date like 44685 3 – date like 05-18-2022 4
– date like 44625 5 – date like 44688 … … … … See problems(…) for more details.
Warning: Problem while computing ..1 = across(.cols = c(DOB, DAdmit, DOnset), parse_date).
i 170 parsing failures. row col expected actual 1 – date like 44662 2 – date like 44684 3 – date like 44900 4 –
date like 44597 5 – date like 44685 … … … … See problems(…) for more details.
Then when I look at the linelist data frame, the 3 date columns all have NA.