Hello Sanja,
If I am interpreting your question correctly, it sounds like you will need survival analysis for this. Specifically, COVID-19 diagnosis appears to be your outcome and COVID-19 vaccination status your predictor. You are also right-censoring at 2022-09-27, the final date of follow-up, although I would treat death as censoring as well, assuming it was not due to COVID-19.
Another issue you will have to contend with is the potential for immortal time bias, since the date of full vaccination seems to fall after the cancer diagnosis date. I would therefore recommend measuring the time to event for vaccinated individuals from the full vaccination date rather than from the date of cancer diagnosis.
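To make the immortal time concrete, here is a minimal sketch in base R. The data frame `toy` and its dates are made up purely for illustration; only the column names follow the ones used elsewhere in this answer.

```r
# Hypothetical toy data: two patients with the same cancer diagnosis date,
# one of whom is later fully vaccinated
toy <- data.frame(
  id = c(1, 2),
  date_cancer_diagnosis = as.Date(c("2021-01-10", "2021-01-10")),
  date_fully_vaccinated = as.Date(c("2021-06-10", NA))
)

# Days between cancer diagnosis and full vaccination. During this window the
# patient was not yet vaccinated, so counting it as vaccinated follow-up would
# inflate the person-time of the vaccinated group.
toy$immortal_days <- as.numeric(
  toy$date_fully_vaccinated - toy$date_cancer_diagnosis
)

toy
```

Starting the clock at the full vaccination date for patient 1 removes this window from the vaccinated follow-up.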
I would begin by creating an indicator for infection status and an indicator for censoring (the latter will include those who died before being diagnosed with COVID-19). You will then need to derive the time to event.
Here is some R code to demonstrate how you could do this. I was unable to use your data since the field names and the data itself did not display, so the data frame and column names in the code are placeholders.
library(tidyverse)

# fake_data is a placeholder for your data frame. It is assumed to have one
# row per individual and the date columns date_cancer_diagnosis,
# date_fully_vaccinated, date_covid_diagnosis and date_of_death, with NA
# wherever the event did not occur.
end_of_follow_up <- lubridate::ymd("2022-09-27")

fake_data |>
  mutate(
    # indicator for whether the individual was fully vaccinated
    is_vaccinated = !is.na(date_fully_vaccinated),
    # indicator for whether the individual was diagnosed with COVID-19
    is_infected = !is.na(date_covid_diagnosis),
    # indicator for whether the individual died, kept so that deaths prior to
    # diagnosis can be identified
    is_dead = !is.na(date_of_death),
    # indicator for whether the individual was censored
    # this includes those who did not have COVID-19 by the end of
    # observation and those who died prior to diagnosis of COVID-19
    is_censored = !is_infected,
    # follow-up starts at full vaccination for the vaccinated and at cancer
    # diagnosis for everyone else
    follow_up_start = coalesce(
      lubridate::ymd(date_fully_vaccinated),
      lubridate::ymd(date_cancer_diagnosis)
    ),
    # follow-up ends at COVID-19 diagnosis if there is one, otherwise at
    # death, otherwise at the end of observation
    follow_up_end = coalesce(
      lubridate::ymd(date_covid_diagnosis),
      lubridate::ymd(date_of_death),
      end_of_follow_up
    ),
    days_of_follow_up = lubridate::time_length(
      x = lubridate::interval(start = follow_up_start, end = follow_up_end),
      unit = "days"
    )
  )
Once you have the data in the right format, you can then use survival analysis techniques to estimate the cumulative incidence or hazard in each group and compare the groups, for example with a hazard ratio.
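As a rough sketch of that last step, the survival package can fit a Kaplan-Meier curve per group and a Cox model. The data frame `analysis_data` and its values are invented for illustration; the column names are assumed to match the ones derived earlier. This is one possible approach, not the only one, and it treats death as ordinary censoring rather than as a competing risk.

```r
library(survival)

# Hypothetical analysis-ready data: one row per individual
# (values are made up for illustration only)
analysis_data <- data.frame(
  days_of_follow_up = c(30, 200, 150, 400, 90, 365, 45, 300, 60, 250),
  is_infected = c(TRUE, FALSE, TRUE, FALSE, TRUE,
                  FALSE, TRUE, FALSE, TRUE, TRUE),
  is_vaccinated = c(FALSE, TRUE, FALSE, TRUE, FALSE,
                    TRUE, FALSE, TRUE, TRUE, FALSE)
)

# Kaplan-Meier estimate by vaccination status; 1 - survival approximates the
# cumulative incidence of COVID-19 diagnosis in each group
km_fit <- survfit(
  Surv(days_of_follow_up, is_infected) ~ is_vaccinated,
  data = analysis_data
)
summary(km_fit)

# Cox proportional hazards model; exp(coef) is the hazard ratio for
# vaccinated versus unvaccinated individuals
cox_fit <- coxph(
  Surv(days_of_follow_up, is_infected) ~ is_vaccinated,
  data = analysis_data
)
summary(cox_fit)
```

With real data you would also want to check the proportional hazards assumption and adjust for relevant confounders before interpreting the hazard ratio.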
EDIT: I forgot to mention that you should probably also add a buffer to the date of full vaccination, as an individual will not be immune instantaneously. There is probably some length of time after this date during which a diagnosis should be attributed to the non-vaccinated stratum. You would need to find this from the literature if you haven’t done so already.
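A minimal sketch of how such a buffer could be applied, in base R. The value of `buffer_days` is a placeholder assumption rather than a recommendation, and `toy_buffer` and its dates are made up for illustration.

```r
# Assumed buffer (in days) between the recorded full-vaccination date and the
# onset of protection; 14 is a placeholder, take the real value from the
# literature
buffer_days <- 14

# Hypothetical toy data
toy_buffer <- data.frame(
  id = 1:3,
  date_fully_vaccinated = as.Date(c("2021-06-01", "2021-06-01", NA)),
  date_covid_diagnosis = as.Date(c("2021-06-05", "2021-08-01", "2021-07-01"))
)

# Date from which an individual is treated as vaccinated in the analysis
toy_buffer$date_vaccine_effective <-
  toy_buffer$date_fully_vaccinated + buffer_days

# A diagnosis before the effective date is attributed to the non-vaccinated
# stratum, as is any diagnosis in someone who was never vaccinated
toy_buffer$diagnosed_while_vaccinated <-
  !is.na(toy_buffer$date_vaccine_effective) &
  !is.na(toy_buffer$date_covid_diagnosis) &
  toy_buffer$date_covid_diagnosis >= toy_buffer$date_vaccine_effective

toy_buffer
```

Here individual 1 is diagnosed four days after the recorded vaccination date, so the diagnosis falls inside the buffer and is not counted against the vaccinated group.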
All the best,
Tim