Removing units from a character variable to make it numeric

Describe your issue

How do I remove the units from a variable that is supposed to be numeric but became character because the unit was entered along with the number? For example, 100mmHg instead of just 100 in the variable systolicbp_mmhg. Note that some values have no space between the number and the unit (100mmHg), while others have a space (110 mmHg).

What steps have you already taken to find an answer?

  • I searched Stack Overflow, but the solutions I found use base R rather than tidyverse functions that I could use in my cleaning pipe chain.

Provide an example of your R code

sample <- data.frame(name = c('john', 'dan', 'chris'), 
     systolicbp_mmhg = c('100 mmHg', '110mmHg', '120 mmHg'),
     stringsAsFactors=FALSE)

Hi Ian,

I have provided two potential solutions below:

# loading packages
library(tidyverse)

# creating fake data
fake_data <- data.frame(
    name = c('john', 'dan', 'chris'),
    systolicbp_mmhg = c('100 mmHg', '110mmHg', '120 mmHg'),
    stringsAsFactors = FALSE
) |>
    as_tibble()

# removing text from variables

# specific
fake_data |>
    mutate(systolicbp_mmhg = as.numeric(str_squish(str_remove(
        systolicbp_mmhg, "mmHg"
    ))))
#> # A tibble: 3 × 2
#>   name  systolicbp_mmhg
#>   <chr>           <dbl>
#> 1 john              100
#> 2 dan               110
#> 3 chris             120

# general
fake_data |>
    mutate(systolicbp_mmhg = as.numeric(str_squish(str_remove_all(
        systolicbp_mmhg, regex("[a-z]", ignore_case = TRUE)
    ))))
#> # A tibble: 3 × 2
#>   name  systolicbp_mmhg
#>   <chr>           <dbl>
#> 1 john              100
#> 2 dan               110
#> 3 chris             120

Created on 2023-12-10 with reprex v2.0.2

Session info
sessionInfo()
#> R version 4.3.1 (2023-06-16)
#> Platform: x86_64-apple-darwin20 (64-bit)
#> Running under: macOS Ventura 13.6.1
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: America/Toronto
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] lubridate_1.9.3 forcats_1.0.0   stringr_1.5.0   dplyr_1.1.3    
#>  [5] purrr_1.0.2     readr_2.1.4     tidyr_1.3.0     tibble_3.2.1   
#>  [9] ggplot2_3.4.4   tidyverse_2.0.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] gtable_0.3.4      compiler_4.3.1    reprex_2.0.2      tidyselect_1.2.0 
#>  [5] scales_1.2.1      yaml_2.3.7        fastmap_1.1.1     R6_2.5.1         
#>  [9] generics_0.1.3    knitr_1.44        munsell_0.5.0     R.cache_0.16.0   
#> [13] tzdb_0.4.0        pillar_1.9.0      R.utils_2.12.2    rlang_1.1.1      
#> [17] utf8_1.2.4        stringi_1.7.12    xfun_0.40         fs_1.6.3         
#> [21] timechange_0.2.0  cli_3.6.1         withr_2.5.1       magrittr_2.0.3   
#> [25] digest_0.6.33     grid_4.3.1        rstudioapi_0.15.0 hms_1.1.3        
#> [29] lifecycle_1.0.3   R.methodsS3_1.8.2 R.oo_1.25.0       vctrs_0.6.4      
#> [33] evaluate_0.22     glue_1.6.2        styler_1.10.2     fansi_1.0.5      
#> [37] colorspace_2.1-0  rmarkdown_2.25    tools_4.3.1       pkgconfig_2.0.3  
#> [41] htmltools_0.5.6.1

All the best,

Tim


Hi Ian,

Also, just FYI about Tim’s two great solutions: the first one looks specifically for the string ‘mmHg’ (matched exactly, including case) and removes it, while the second is more generic and removes any letters, using a regular expression that covers the whole alphabet and is case-insensitive.
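To make the difference concrete, here is a small sketch using a made-up vector (the value "120 MMHG" is hypothetical and not from the original data). It assumes only that stringr is installed, and it suggests how the two approaches could diverge when the unit is spelled inconsistently:

```r
library(stringr)

# hypothetical values; the last one spells the unit in upper case
bp <- c("100 mmHg", "110mmHg", "120 MMHG")

# specific: removes only the exact string "mmHg" (case-sensitive),
# so "120 MMHG" is left untouched and cannot be converted
specific <- suppressWarnings(
  as.numeric(str_squish(str_remove(bp, "mmHg")))
)
specific
#> [1] 100 110  NA

# general: removes every letter regardless of case
general <- as.numeric(
  str_squish(str_remove_all(bp, regex("[a-z]", ignore_case = TRUE)))
)
general
#> [1] 100 110 120
```

The specific version may be preferable when an unexpected unit should surface as NA rather than be silently stripped.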


Hello!

The solution by @machupovirus is very good.

However, I have another solution:

I would use a regular expression, which is very popular among data scientists, something like this:

I work with the same data as @machupovirus.
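The code from this reply is not shown. As a minimal sketch of what a str_extract() approach could look like, assuming the pattern "[0-9]{2,3}" (inferred from the later discussion of {2,3} in this thread, not confirmed as the original code):

```r
library(tidyverse)

# same fake data as in the earlier reply
fake_data <- tibble(
  name = c("john", "dan", "chris"),
  systolicbp_mmhg = c("100 mmHg", "110mmHg", "120 mmHg")
)

# assumed pattern: keep the first run of 2 to 3 digits, then convert to numeric
extracted <- fake_data |>
  mutate(
    systolicbp_mmhg = as.numeric(str_extract(systolicbp_mmhg, "[0-9]{2,3}"))
  )

extracted$systolicbp_mmhg
#> [1] 100 110 120
```

Here the pattern describes what to keep (the digits) rather than what to remove (the unit).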


Hi Tim, Amy, and Jorge! Thank you for the elegant solutions. I ended up doing something like the code below, since I didn’t know that the str_remove() and str_remove_all() functions existed. I also didn’t know that you could nest the different str_ functions, like str_squish(str_remove(

# loading packages
library(tidyverse)

# creating fake data
fake_data <- data.frame(
    name = c('john', 'dan', 'chris'),
    systolicbp_mmhg = c('100 mmHg', '110mmHg', '120 mmHg'),
    stringsAsFactors = FALSE
) |>
    as_tibble()

# clean
fake_data_clean <- fake_data %>%
  mutate(
    systolicbp_mmhg = str_replace_all(
      string = systolicbp_mmhg,
      pattern = "mmHg",
      replacement = ""
    ),
    systolicbp_mmhg = str_trim(systolicbp_mmhg, "both"),
    systolicbp_mmhg = as.numeric(systolicbp_mmhg)
  )

Comparing @machupovirus’s and @jestrada’s solutions, it seems that they are opposites of each other: str_remove_all() specifies what to remove (in this case, the string “mmHg”), while str_extract() specifies what to retain (the actual measurement). Is that right?

Can you elaborate on the regex portion, specifically the {2,3} in the str_extract? Does it mean it will retain numeric values (0-9) that are 2-3 digits in length (so 10-999)?

Thanks again!


Hello @iancgmd!

Yes, of course, you can put the {2,3} inside str_extract(); in fact, that is what I did.

You are correct: with {2,3}, it retains numbers of 2 or 3 digits. If you need more digits, you can change it.
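As a small illustration of how the quantifier behaves, here is a sketch on a made-up vector (the values are hypothetical, chosen to show edge cases). It compares {2,3} with the open-ended quantifier +, which is one possible alternative when the number of digits is not known in advance:

```r
library(stringr)

# hypothetical values with 1, 2, 3 and 4 digits
x <- c("100 mmHg", "90mmHg", "8 mmHg", "1000 mmHg")

# {2,3}: a run of 2 to 3 digits
# a single digit does not match, and a 4-digit number is cut to its first 3 digits
two_to_three <- str_extract(x, "[0-9]{2,3}")
two_to_three
# gives "100", "90", NA, "100"

# +: a run of one or more digits, whatever its length
one_or_more <- str_extract(x, "[0-9]+")
one_or_more
# gives "100", "90", "8", "1000"
```

So {2,3} does not check that the whole number is between 10 and 999; it only extracts the first run of 2 to 3 digits it finds.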


Thank you for clarifying, this has been very helpful!
