Stuck with an error

sarman · January 31, 2024, 1:47pm

Hello everyone,

Can someone help me understand this error? I have two datasets. In both, all columns are of type “character”, but the code works for only one of them. For the second dataset I get the error below.

This is my super easy code: skimr::skim(surv_raw_paper)

And this is what it says when I run it:
Error in dplyr::summarize():
In argument: skimmed = purrr::map2(...).
In group 1: skim_type = "character".
Caused by error in purrr::map2():
In index: 1.
With name: character.
Caused by error in purrr::map():
In index: 1.
Caused by error in dplyr::select():
! Problem while evaluating tidyselect::starts_with(delim_name, ignore.case = FALSE).
Caused by error in substr():
! invalid multibyte string, element 169
Run rlang::last_trace() to see where the error occurred.

I was not able to solve it with the Epi R Handbook, nor with Google. But that is probably my R-beginner problem, not that the answer is not out there.

Question: Am I right that the problem is the type of my columns? Or is there another problem? Can someone help me understand the error?

Thanks a lot!
Best
Navina

machupovirus · February 1, 2024, 1:29pm

Hello Navina,

This looks like an issue with a string value for a certain variable in your data. However, it is hard to say definitively without a reproducible example. Please see the following post to help you create a reprex so that we can troubleshoot this issue further.

All the best,

Tim

elham.yusufali · February 1, 2024, 2:07pm

+1 to Tim’s comment. Would like to see a reproducible example to help with potential solutions.

The error message isn’t directly about the column types being “character.” Instead, it is related to the handling of character strings, specifically dealing with a string that contains invalid or problematic multibyte characters.

The specific issue is with the substr() function, which extracts or replaces substrings in a character vector. The error message ! invalid multibyte string, element 169 indicates that there’s an issue with the 169th element of a character vector, likely involving a string that is not properly encoded or has invalid multibyte characters.
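One way to locate the offending element is sketched below. Because tidyselect::starts_with() matches against column names, element 169 may well be the 169th column name rather than a data value; that reading is an assumption, not something confirmed in this thread. The vector `col_names` is a made-up stand-in for `names(surv_raw_paper)`, and the sketch relies on base R's validUTF8() to flag strings that are not valid UTF-8.

```r
# Hypothetical stand-in for names(surv_raw_paper); the third name contains
# a single Latin-1 byte (0xE4, "ä"), which is not valid UTF-8.
col_names <- c("GeburtsJahr", "Geschlecht", "RA_Gesch\xe4ftsreise")

# validUTF8() returns FALSE for strings that are not valid UTF-8,
# so which() gives the positions of the problematic names
bad <- which(!validUTF8(col_names))
bad

# Assuming the stray bytes are Latin-1, convert only the flagged names
col_names[bad] <- iconv(col_names[bad], from = "latin1", to = "UTF-8")
all(validUTF8(col_names))
```

On the real data, the same check could be run on names(surv_raw_paper) and on each character column to see where the invalid strings sit.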

sarman · February 1, 2024, 4:53pm

Hi Tim, hi Elham,

thank you so much for your quick and super helpful answers. Much appreciated!

I somehow managed to create a small version of the data set (a reprex) so that you can reproduce it, but the thing is: with this it works perfectly. There is no error. Now the challenge is to find the variable with the problem (I randomly chose 5 variables for the small example, but those did not cause the problem). I have an idea, but I cannot include it in my reprex. There are some variables with ä/ü/ö which are shown in R after the upload as a question mark (?). I have no idea how to change that (but I will find out), and I have no idea how to put them in the small data set for you, because R accepts neither “RA_Gesch?ftsreise” nor “RA_Geschäftsreise” in the test_data code. Could it be that those special letters (ä/ü/ö) cause the error when I try to skim?
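On getting such a name into a reprex: one option, sketched here, is to write the special character as an escape so that the script itself stays plain ASCII. `\u00e4` yields a valid UTF-8 "ä", while `\xe4` injects the raw Latin-1 byte that is suspected (not confirmed) to be present in the real data. The names `ok_name` and `bad_name` are made up for illustration.

```r
ok_name  <- "RA_Gesch\u00e4ftsreise"  # valid UTF-8 "ä"
bad_name <- "RA_Gesch\xe4ftsreise"    # lone Latin-1 byte, invalid as UTF-8

# Shows which of the two strings is valid UTF-8
validUTF8(c(ok_name, bad_name))

# In a UTF-8 session, substr() on the bad name is expected to fail in a
# similar way to the skim() traceback; tryCatch() keeps the script running
# and stores the error message instead of stopping.
msg <- tryCatch(substr(bad_name, 1, 3),
                error = function(e) conditionMessage(e))
msg
```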

I hope I could express my thoughts in a way so that you can understand what I am trying to say… It might sound a bit confusing. I hope I don’t have to ask strange questions forever!

Best
Navina

# Load packages -----------------------------------------------------------

pacman::p_load(rio, 
               here, 
               tidyverse, 
               skimr,
               plyr,
               janitor,
               lubridate,
               gtsummary, 
               flextable,
               officer,
               epikit, 
               apyramid, 
               scales,
               datapasta, 
               reprex)




# create a small version of your data set

test_data <- data.frame(
  stringsAsFactors = FALSE,
  row.names = c("2", "3", "4", "5", "6", "7", "8", "9", "10", "11"),
  GeburtsJahr = c("1981","1981","1996","1985",
                  "1979","1997","2004","1950","1985","2018"),
  GeburtsMonat = c("11", "3", "4", "9", "3", "2", "12", "10", "10", "11"),
  Geschlecht = c("weiblich","mÃ¤nnlich",
                 "weiblich","weiblich","mÃ¤nnlich","mÃ¤nnlich","weiblich",
                 "mÃ¤nnlich","mÃ¤nnlich","mÃ¤nnlich"),
  Spezies = c("Plasmodium falciparum (M. tropica)","Plasmodium falciparum (M. tropica)",
              "Plasmodium falciparum (M. tropica)","-nicht ermittelbar-",
              "Plasmodium falciparum (M. tropica)",
              "Plasmodium ovale (M. tertiana)","Plasmodium falciparum (M. tropica)",
              "Plasmodium falciparum (M. tropica)",
              "Plasmodium falciparum (M. tropica)","Plasmodium falciparum (M. tropica)"),
  Infektionsland = c("-nicht erhoben-",
                     "-nicht erhoben-","Kamerun","Nigeria","Togo","Ostafrika",
                     "-nicht erhoben-","-nicht erhoben-","-nicht erhoben-","Ghana")
)


# Optimize spelling (ä, ü, ...)
test_data$Geschlecht <- iconv(test_data$Geschlecht, from = "ISO-8859-1", to = "UTF-8")

# Look at the data
skimr::skim(test_data)


Name	test_data
Number of rows	10
Number of columns	5
_______________________
Column type frequency:
character	5
________________________
Group variables	None

Data summary

Variable type: character

skim_variable	complete_rate	min	max	n_unique
GeburtsJahr	1	4	4	8
GeburtsMonat	1	1	2	7
Geschlecht	1	8	11	2
Spezies	1	19	34	3
Infektionsland	1	4	15	6

Created on 2024-02-01 with reprex v2.0.2

Session info

sessionInfo()
#> R version 4.3.0 (2023-04-21 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19045)
#> 
#> Matrix products: default
#> 
#> 
#> locale:
#> [1] LC_COLLATE=German_Germany.utf8  LC_CTYPE=German_Germany.utf8   
#> [3] LC_MONETARY=German_Germany.utf8 LC_NUMERIC=C                   
#> [5] LC_TIME=German_Germany.utf8    
#> 
#> time zone: Europe/Berlin
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] reprex_2.0.2    datapasta_3.1.0 scales_1.2.1    apyramid_0.1.3 
#>  [5] epikit_0.1.5    officer_0.6.2   flextable_0.9.2 gtsummary_1.7.2
#>  [9] janitor_2.2.0   plyr_1.8.9      skimr_2.1.5     lubridate_1.9.2
#> [13] forcats_1.0.0   stringr_1.5.0   dplyr_1.1.3     purrr_1.0.2    
#> [17] readr_2.1.4     tidyr_1.3.0     tibble_3.2.1    ggplot2_3.4.3  
#> [21] tidyverse_2.0.0 here_1.0.1      rio_0.5.30     
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyselect_1.2.0        fastmap_1.1.1           fontquiver_0.2.1       
#>  [4] pacman_0.5.1            promises_1.2.1          broom.helpers_1.14.0   
#>  [7] digest_0.6.33           timechange_0.2.0        mime_0.12              
#> [10] lifecycle_1.0.3         sf_1.0-14               gfonts_0.2.0           
#> [13] ellipsis_0.3.2          magrittr_2.0.3          compiler_4.3.0         
#> [16] rlang_1.1.1             tools_4.3.0             utf8_1.2.3             
#> [19] yaml_2.3.7              gt_0.9.0                data.table_1.14.8      
#> [22] knitr_1.43              askpass_1.2.0           classInt_0.4-9         
#> [25] curl_5.0.2              xml2_1.3.5              repr_1.1.6             
#> [28] KernSmooth_2.23-20      httpcode_0.3.0          withr_2.5.0            
#> [31] foreign_0.8-84          grid_4.3.0              fansi_1.0.4            
#> [34] gdtools_0.3.3           e1071_1.7-13            xtable_1.8-4           
#> [37] colorspace_2.1-0        crul_1.4.0              cli_3.6.1              
#> [40] rmarkdown_2.24          crayon_1.5.2            ragg_1.2.5             
#> [43] generics_0.1.3          rstudioapi_0.15.0       tzdb_0.4.0             
#> [46] readxl_1.4.3            proxy_0.4-27            DBI_1.1.3              
#> [49] cellranger_1.1.0        base64enc_0.1-3         vctrs_0.6.3            
#> [52] jsonlite_1.8.7          fontBitstreamVera_0.1.1 hms_1.1.3              
#> [55] systemfonts_1.0.4       units_0.8-3             glue_1.6.2             
#> [58] stringi_1.7.12          gtable_0.3.4            later_1.3.1            
#> [61] munsell_0.5.0           pillar_1.9.0            htmltools_0.5.6        
#> [64] openssl_2.1.0           R6_2.5.1                textshaping_0.3.6      
#> [67] rprojroot_2.0.3         evaluate_0.21           shiny_1.7.5            
#> [70] haven_2.5.3             openxlsx_4.2.5.2        snakecase_0.11.1       
#> [73] fontLiberation_0.1.0    httpuv_1.6.11           class_7.3-21           
#> [76] uuid_1.1-1              Rcpp_1.0.11             zip_2.3.0              
#> [79] xfun_0.40               fs_1.6.3                pkgconfig_2.0.3
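
A note on the "mÃ¤nnlich" values in the reprex: they look like UTF-8 text that was decoded as Latin-1 at some point. If that diagnosis is right (it is an assumption), converting from ISO-8859-1 to UTF-8 would encode the already garbled characters a second time, which may explain the maximum length of 11 for Geschlecht in the skim output. A minimal sketch of a repair that goes in the opposite direction, using a made-up vector `garbled`:

```r
# Values as they appear in the reprex: "ä" shows up as "Ã¤"
garbled <- c("weiblich", "m\u00c3\u00a4nnlich")

# Step 1: turn each character back into the single byte it came from
repaired <- iconv(garbled, from = "UTF-8", to = "latin1")

# Step 2: declare that those bytes are in fact UTF-8
Encoding(repaired) <- "UTF-8"

repaired
nchar(repaired)
```

This only helps when every affected character survived the wrong decoding intact; fixing the encoding at import time is usually the cleaner route.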

sarman · February 2, 2024, 12:19pm

Tim, Elham, a quick update: Problem solved! It was indeed the special characters in the German language that were troubling me. When I save the Excel sheet as CSV UTF-8, R can read it perfectly. Thanks a lot for your time and for helping me solve it. Best, Navina
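
For readers who cannot re-save the file, an alternative is to declare the file's encoding when importing. The sketch below uses base R's read.csv() with its fileEncoding argument on a tiny temporary Latin-1 file created only for demonstration; the file contents are made up, and a UTF-8 session like the one in the session info above is assumed. If the real file uses a different encoding (for example Windows-1252), the fileEncoding value would need to change accordingly.

```r
# Write a tiny Latin-1 encoded CSV to a temporary file (demo data only)
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")
writeBin(c(charToRaw("Geschlecht\nm"), as.raw(0xe4),
           charToRaw("nnlich\nweiblich\n")), con)
close(con)

# Declare the file's encoding so R re-encodes the text while reading
dat <- read.csv(tmp, fileEncoding = "latin1", stringsAsFactors = FALSE)

dat$Geschlecht
validUTF8(dat$Geschlecht)

unlink(tmp)  # remove the temporary file
```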

machupovirus · February 2, 2024, 1:07pm

Happy to hear it!

All the best,

Tim