Creating New Numeric Variable based on a mathematical expression

Hi, I am trying to create a numeric variable (converting info to number of days) using case_when based on two columns. One of the variables where I have to apply a formula has a few characters (other than numeric values).
I think, My script is failing to apply mathematical formula (multiplication) for the reason the column is of class “character”. If I try to force it to as numeric then lots of values turn to “NA”

Can someone help me figure out, how to get rid of this error?

Error in mutate():
:information_source: In argument: New_variable= case_when(...).
Caused by error in case_when():
! Failed to evaluate the right-hand side of formula 1.
Caused by error in ExistingVAR_numeric * 30:
! non-numeric argument to binary operator
Run rlang::last_trace() to see where the error occurred.

Data%<>%
mutate(New_variable= case_when(
  
  ExistingVAR_unit == "month"              ~  ExistingVAR_numeric *  30,
  ExistingVAR_unit == "Month"              ~ ExistingVAR_numeric *  30,
  ExistingVAR_unit == "Month(XXX)"    ~  ExistingVAR_numeric *  30,
  
  ExistingVAR_unit == "Year"                ~  ExistingVAR_numeric *  365,
  
  ExistingVAR_numeric == "0"       ~ ExistingVAR_numeric ,
  ExistingVAR_numeric == "NO"    ~ "NA",
  ExistingVAR_numeric == "na"    ~  "NA",
  
  ExistingVAR_unit == ""        ~ ExistingVAR_numeric ,
  is.na(ExistingVAR_unit )       ~ ExistingVAR_numeric ))

Thank you for looking into this!

1 Like

Hi @shoaibraee , thanks for posting again! I’ll try to help.

In the future, I’ll be able to answer more quickly if you can include a small piece of your data too - see this post on how to use the {datapasta} r package to do that.

Now to your case_when() issue:

  • case_when() expects all possible values on the right side of the ~ to be of the same class. Assuming that your ExistingVAR_numeric is class numeric, the 6th and 7th options you have written for “NO” and “na” that assign the character value “NA” will cause a problem because “NA” is class character, not numeric like the other options.

  • Try removing the 6th and 7th options from your command. case_when() will default to assigning “NO” and “na” (and anything else not addressed in your command) to NA Best of all, it will automatically use the NA of the correct class (yes, annoyingly there are variations of NA with different classes, such as NA_character. You can read more about this here.)

Does that work?

Finally, it is best to build your case_when() command one step at a time. Add a line, run the command, look at your data to see the changes. If you get an error, then you will know exactly which line in the case_when() command caused the error.

2 Likes

@shoaibraee if the answer helped solve your problem, please reply and/or mark it as the solution.
Or if you need further help just say so.
Thanks!

1 Like

Hi @neale. Thank you for the help. If i remove the multiplication, It works fine for the case_when recoding to NA. However, keeping the multiplication (* 30 or *365) gives an error as below. I think it is due to the fact that variable is still character. Any thoughts on that? Thank you!

Error in mutate():
:information_source: In argument: Amenorrhea_duration_estimated = case_when(...).
Caused by error in case_when():
! Failed to evaluate the right-hand side of formula 1.
Caused by error in Amenorrhea_duration_numeric * 30:
! non-numeric argument to binary operator

1 Like

@shoaibraee Yes it would seem that that column is not numeric. I suggest to use class() to ensure the class is numeric right before this case_when() error.

At this point, I will be able to help you if you can share a small piece of your data and the code leading to this problem. See this post and this video to help you.

In brief: Take your dataset right before this problem occurs. Keep only the top 10 rows using head(10) and only the relevant columns for the error (using select()) and give it a name such as test_data. Then run dpasta(test_data) from the {datapasta} package. This will provide code that you can paste here so that we can help fix your problem.

The next step is to install/load the {reprex} package. As explained in the video, highlight your code from the test_data object to the point of the error, and run the reprex() function. Paste the result into this thread.

Best of luck! Let us know if anything can be clarified.

2 Likes
pacman::p_load(rio, lubridate, datapasta, reprex, tidyverse) 

demo_data <- tibble::tribble(~table2_dur_of_amenorrhea_female, ~table2_vomiting3,
                                                              "0",               "0",
                                                               NA,               "0",
                                                               NA,               "0",
                                                              "0",               "0",
                                                               NA,               "0",
                                                               NA,               "0",
                                                              "0",               "0",
                                                              "0",               "0",
                                                               NA,               "0",
                                                               NA,               "0",
                                                               NA,               "0",
                                                               NA,               "0",
                                                               NA,               "0",
                                                               NA,               "0",
                                                              "0",               "0",
                                                               NA,               "0",
                                                               NA,               "0",
                                                               NA,               "0",
                                                               NA,         "2 weeks",
                                                              "0",               "0"
                                 )
# Checking the class of columns
class(demo_data$table2_dur_of_amenorrhea_female)
#> [1] "character"
class(demo_data$table2_vomiting3)
#> [1] "character"

# Splitting column into two and converting to days
# QUESTION: can we also split the other variable "table2_vomiting3" in the same line below 
Amnr_test <-  str_split_fixed(demo_data$table2_dur_of_amenorrhea_female , " "  , 2)

# Binding demo data and newly created variables 
demo_data <- bind_cols(demo_data , Amnr_test)
#> New names:
#> • `` -> `...3`
#> • `` -> `...4`
names(demo_data)
#> [1] "table2_dur_of_amenorrhea_female" "table2_vomiting3"               
#> [3] "...3"                            "...4"

# Renaming the newly created split columns 
names(demo_data)[3:4] = c("Amnr_test_numeric" , "Amnr_test_unit")

# Cross checking that newly created naming went well
names(demo_data)
#> [1] "table2_dur_of_amenorrhea_female" "table2_vomiting3"               
#> [3] "Amnr_test_numeric"               "Amnr_test_unit"


# Question: Please see if you recommend this code to change the class. I found it online. But this cannot force character to numeric if there are characters entered among the numeric variables.
#demo_data <- type.convert(demo_data , as.is=TRUE)

unique(demo_data$Amnr_test_numeric) 
#> [1] "0" NA
unique(demo_data$Amnr_test_unit) 
#> [1] ""

demo_data %>% 
  mutate(Amnr_test_new = case_when(

    Amnr_test_unit == "month"              ~ Amnr_test_numeric * 30  ,
    Amnr_test_unit == "Month"              ~ Amnr_test_numeric * 30  ,
    Amnr_test_unit == "Month(menopause)"   ~ Amnr_test_numeric * 30  ,
    Amnr_test_unit == "(post menopausal)"  ~ Amnr_test_numeric       ,
    
    Amnr_test_unit == "Year"               ~ Amnr_test_numeric  *365 ,
    
    # Amnr_test_numeric == "NO"    ~ "NA",
    # Amnr_test_numeric ==    NA      ~ NA ,
    # Amnr_test_numeric == "na"    ~ "NA",
    
    # Amnr_test_unit == ""        ~ Amnr_test_numeric,
    # is.na(Amnr_test_unit)       ~ Amnr_test_numeric
  ))
#> Error in `mutate()`:
#> ℹ In argument: `Amnr_test_new = case_when(...)`.
#> Caused by error in `case_when()`:
#> ! Failed to evaluate the right-hand side of formula 1.
#> Caused by error in `Amnr_test_numeric * 30`:
#> ! non-numeric argument to binary operator
#> Backtrace:
#>      ▆
#>   1. ├─demo_data %>% ...
#>   2. ├─dplyr::mutate(...)
#>   3. ├─dplyr:::mutate.data.frame(...)
#>   4. │ └─dplyr:::mutate_cols(.data, dplyr_quosures(...), by)
#>   5. │   ├─base::withCallingHandlers(...)
#>   6. │   └─dplyr:::mutate_col(dots[[i]], data, mask, new_columns)
#>   7. │     └─mask$eval_all_mutate(quo)
#>   8. │       └─dplyr (local) eval()
#>   9. ├─dplyr::case_when(...)
#>  10. │ └─dplyr:::case_formula_evaluate(...)
#>  11. │   ├─base::withCallingHandlers(...)
#>  12. │   └─rlang::eval_tidy(pair$rhs, env = default_env)
#>  13. └─base::.handleSimpleError(...)
#>  14.   └─dplyr (local) h(simpleError(msg, call))
#>  15.     └─rlang::abort(message, parent = cnd, call = error_call)

Created on 2023-10-13 with reprex v2.0.2

Session info
sessionInfo()
#> R version 4.3.1 (2023-06-16 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19045)
#> 
#> Matrix products: default
#> 
#> 
#> locale:
#> [1] LC_COLLATE=Norwegian Bokmål_Norway.utf8 
#> [2] LC_CTYPE=Norwegian Bokmål_Norway.utf8   
#> [3] LC_MONETARY=Norwegian Bokmål_Norway.utf8
#> [4] LC_NUMERIC=C                            
#> [5] LC_TIME=Norwegian Bokmål_Norway.utf8    
#> 
#> time zone: Europe/Paris
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] forcats_1.0.0   stringr_1.5.0   dplyr_1.1.2     purrr_1.0.1    
#>  [5] readr_2.1.4     tidyr_1.3.0     tibble_3.2.1    ggplot2_3.4.2  
#>  [9] tidyverse_2.0.0 reprex_2.0.2    datapasta_3.1.0 lubridate_1.9.2
#> [13] rio_0.5.29     
#> 
#> loaded via a namespace (and not attached):
#>  [1] utf8_1.2.3        generics_0.1.3    stringi_1.7.12    hms_1.1.3        
#>  [5] digest_0.6.33     magrittr_2.0.3    evaluate_0.21     grid_4.3.1       
#>  [9] timechange_0.2.0  fastmap_1.1.1     cellranger_1.1.0  zip_2.3.0        
#> [13] fansi_1.0.4       scales_1.2.1      cli_3.6.1         rlang_1.1.1      
#> [17] munsell_0.5.0     withr_2.5.0       yaml_2.3.7        tools_4.3.1      
#> [21] tzdb_0.4.0        colorspace_2.1-0  pacman_0.5.1      curl_5.0.1       
#> [25] vctrs_0.6.3       R6_2.5.1          lifecycle_1.0.3   fs_1.6.3         
#> [29] foreign_0.8-84    pkgconfig_2.0.3   pillar_1.9.0      openxlsx_4.2.5.2 
#> [33] gtable_0.3.3      data.table_1.14.8 glue_1.6.2        Rcpp_1.0.11      
#> [37] haven_2.5.3       xfun_0.39         tidyselect_1.2.0  rstudioapi_0.15.0
#> [41] knitr_1.43        htmltools_0.5.5   rmarkdown_2.23    compiler_4.3.1   
#> [45] readxl_1.4.3
1 Like

Hi @shoaibraee, thanks for the reproducible example - it is very helpful. Below are two things that may help you:

1) I inserted a line using as.numeric() to convert your column to class numeric, right before the case_when() command. The command now runs without error. There was no data in the amnr_test_unit column however, so it is not obvious what the result will be.

pacman::p_load(rio, lubridate, datapasta, reprex, tidyverse) 

demo_data <- tibble::tribble(~table2_dur_of_amenorrhea_female, ~table2_vomiting3,
                             "0",               "0",
                             NA,               "0",
                             NA,               "0",
                             "0",               "0",
                             NA,               "0",
                             NA,               "0",
                             "0",               "0",
                             "0",               "0",
                             NA,               "0",
                             NA,               "0",
                             NA,               "0",
                             NA,               "0",
                             NA,               "0",
                             NA,               "0",
                             "0",               "0",
                             NA,               "0",
                             NA,               "0",
                             NA,               "0",
                             NA,         "2 weeks",
                             "0",               "0"
)
# Checking the class of columns
class(demo_data$table2_dur_of_amenorrhea_female)
#> [1] "character"
class(demo_data$table2_vomiting3)
#> [1] "character"

# Splitting column into two and converting to days
# QUESTION: can we also split the other variable "table2_vomiting3" in the same line below 
Amnr_test <-  str_split_fixed(demo_data$table2_dur_of_amenorrhea_female , " "  , 2)

# Binding demo data and newly created variables 
demo_data <- bind_cols(demo_data , Amnr_test)
#> New names:
#> • `` -> `...3`
#> • `` -> `...4`

names(demo_data)
#> [1] "table2_dur_of_amenorrhea_female" "table2_vomiting3"               
#> [3] "...3"                            "...4"
                          
# Renaming the newly created split columns 
names(demo_data)[3:4] = c("Amnr_test_numeric" , "Amnr_test_unit")

# Cross checking that newly created naming went well
names(demo_data)
#> [1] "table2_dur_of_amenorrhea_female" "table2_vomiting3"               
#> [3] "Amnr_test_numeric"               "Amnr_test_unit"

# Question: Please see if you recommend this code to change the class. I found it online. But this cannot force character to numeric if there are characters entered among the numeric variables.
#demo_data <- type.convert(demo_data , as.is=TRUE)

unique(demo_data$Amnr_test_numeric) 
#> [1] "0" NA
unique(demo_data$Amnr_test_unit) 
#> [1] ""

class(demo_data$Amnr_test_numeric)
#> [1] "character"

# !!! Add this step to convert the column call from character to numeric
demo_data <- demo_data %>% 
  mutate(Amnr_test_numeric = as.numeric(Amnr_test_numeric))

demo_data %>% 
  mutate(Amnr_test_new = case_when(
    
    Amnr_test_unit == "month"              ~ Amnr_test_numeric * 30  ,
    Amnr_test_unit == "Month"              ~ Amnr_test_numeric * 30  ,
    Amnr_test_unit == "Month(menopause)"   ~ Amnr_test_numeric * 30  ,
    Amnr_test_unit == "(post menopausal)"  ~ Amnr_test_numeric       ,
    
    Amnr_test_unit == "Year"               ~ Amnr_test_numeric  *365 ,
    
    # Amnr_test_numeric == "NO"    ~ "NA",
    # Amnr_test_numeric ==    NA      ~ NA ,
    # Amnr_test_numeric == "na"    ~ "NA",
    
    # Amnr_test_unit == ""        ~ Amnr_test_numeric,
    # is.na(Amnr_test_unit)       ~ Amnr_test_numeric
  ))
#> # A tibble: 20 × 5
#>    table2_dur_of_amenorrhea_…¹ table2_vomiting3 Amnr_test_numeric Amnr_test_unit
#>    <chr>                       <chr>                        <dbl> <chr>         
#>  1 0                           0                                0 ""            
#>  2 <NA>                        0                               NA ""            
#>  3 <NA>                        0                               NA ""            
#>  4 0                           0                                0 ""            
#>  5 <NA>                        0                               NA ""            
#>  6 <NA>                        0                               NA ""            
#>  7 0                           0                                0 ""            
#>  8 0                           0                                0 ""            
#>  9 <NA>                        0                               NA ""            
#> 10 <NA>                        0                               NA ""            
#> 11 <NA>                        0                               NA ""            
#> 12 <NA>                        0                               NA ""            
#> 13 <NA>                        0                               NA ""            
#> 14 <NA>                        0                               NA ""            
#> 15 0                           0                                0 ""            
#> 16 <NA>                        0                               NA ""            
#> 17 <NA>                        0                               NA ""            
#> 18 <NA>                        0                               NA ""            
#> 19 <NA>                        2 weeks                         NA ""            
#> 20 0                           0                                0 ""            
#> # ℹ abbreviated name: ¹​table2_dur_of_amenorrhea_female
#> # ℹ 1 more variable: Amnr_test_new <dbl>
攀x
#> Error in eval(expr, envir, enclos): object '攀x' not found

Created on 2023-10-14 with reprex v2.0.2

2) I tried to understand what you want to do, and provide some alternative code here to split a column using separate(), and then convert to days using your case_when code. I hope it is helpful and perhaps more efficient.

pacman::p_load(rio, lubridate, datapasta, reprex, tidyverse) 

demo_data <- tibble::tribble(~table2_vomiting3,
                                    "0",
                                   "0",
                                   "0",
                                    "0",
                                   "0",
                                   "0",
                                    "0",
                                    "0",
                                   "0",
                                   "0",
                                   "0",
                                   "0",
                                   "0",
                                   "0",
                                    "0",
                                   "0",
                                   "0",
                                   "0",
                             "2 weeks",
                                    "0"
)

# Split a column into two other columns
demo_data <- demo_data %>%                       # re-define your dataset
  separate(col = table2_vomiting3,               # the column to split
           into = c("vom_duration", "vom_unit"), # create two new columns
           sep = " ",                            # split the value at space
           extra = "merge",                      # if more values then expected, keep
           remove = FALSE                        # keep the original column for records
           ) %>% 
  mutate(vom_duration = as.numeric(vom_duration)) # convert duration column to numeric
#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 19 rows [1, 2,
#> 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 20].

# Apply your case_when() logic to result in days
demo_data <- demo_data %>% 
  mutate(vom_duration_days = case_when(
    vom_unit %in% c("week", "weeks")                    ~ vom_duration * 7,  # I added this about week - please check if you want to keep

    # Look for the word "month"
    str_detect(vom_unit, "month|Month|months|Months")   ~ vom_duration * 30, # be careful using 30 as a proxy for month! Not exact
    vom_unit == "(post menopausal)"                     ~ vom_duration,      # I am not sure what you are doing here
    vom_unit == "Year"                                  ~ vom_duration * 365 
  ))

demo_data
#> # A tibble: 20 × 4
#>    table2_vomiting3 vom_duration vom_unit vom_duration_days
#>    <chr>                   <dbl> <chr>                <dbl>
#>  1 0                           0 <NA>                    NA
#>  2 0                           0 <NA>                    NA
#>  3 0                           0 <NA>                    NA
#>  4 0                           0 <NA>                    NA
#>  5 0                           0 <NA>                    NA
#>  6 0                           0 <NA>                    NA
#>  7 0                           0 <NA>                    NA
#>  8 0                           0 <NA>                    NA
#>  9 0                           0 <NA>                    NA
#> 10 0                           0 <NA>                    NA
#> 11 0                           0 <NA>                    NA
#> 12 0                           0 <NA>                    NA
#> 13 0                           0 <NA>                    NA
#> 14 0                           0 <NA>                    NA
#> 15 0                           0 <NA>                    NA
#> 16 0                           0 <NA>                    NA
#> 17 0                           0 <NA>                    NA
#> 18 0                           0 <NA>                    NA
#> 19 2 weeks                     2 weeks                   14
#> 20 0                           0 <NA>                    NA

Created on 2023-10-14 with reprex v2.0.2

2 Likes

@neale thanks so much for the awesome guidance and for helping me reach a solution. The second option is more efficient and elegant! :slight_smile:

2 Likes