Error: must be compatible with existing data

Good afternoon,

I get the following error. However, my data frame is still created and appears to be all there. I’ve looked on this community, the Epi handbook, and Stack Overflow for reasons for this type of error but have not had much luck finding the root cause. What I am trying to do is create a “wide” table that colors the most recent year red if the previous year has a higher total. I have my data in both wide and long format but have not been successful in figuring out how to color a value based on a previous value when arranging by year.

Sorry I can’t seem to color but below in bold is what I am trying to do.

County Measure Year
A 1 2018
A 0 2019
A 1 2020
B 2 2018
B 3 2019
B 3 2020
C 4 2018
C 5 2019
C 4 2020

Here is my code as to where I get my error.

rn_fs <- clean_fs %>%
  group_by(county) %>%
  arrange(county, desc(clean_fs$year)) %>%
  mutate(RN = row_number())

Error in [<-:
! Assigned data rn_fs <- ... must be compatible with existing data.
✖ Existing data has 91 rows.
✖ Element 1 of assigned data has 1092 rows.
ℹ Only vectors of size 1 are recycled.
Caused by error in vectbl_recycle_rhs_rows():
! Can’t recycle input of size 1092 to size 91.

Backtrace:

  1. ├─base::[<-(*tmp*, "previousTwoYears", value = <gropd_df[,9]>)
  2. └─tibble:::[<-.tbl_df(*tmp*, "previousTwoYears", value = <gropd_df[,9]>)
  3. └─tibble:::tbl_subassign(x, i, j, value, i_arg, j_arg, substitute(value))
  4. └─tibble:::vectbl_recycle_rhs_rows(value, fast_nrow(xo), i_arg = NULL, value_arg, call)
    
Hello,

Without having any further information about clean_fs, it’s hard to figure out what may be causing this issue. If you could provide a reproducible example that would help greatly.

All the best,

Tim

Hi there. The error you’re having looks like it’s related to a mismatch in the number of rows when you’re trying to assign new data to your existing data frame. This kind of error typically happens when the size of the data you’re trying to assign (rn_fs) does not match the size of the existing data in your data frame (clean_fs).

In your code, the error is happening at the point where you are trying to mutate the clean_fs data frame with row_number(). The key part of the error message is this:

  • Existing data has 91 rows
  • Element 1 of assigned data has 1092 rows

This means that the clean_fs data frame has 91 rows, but the result of the operation you’re performing (mutate(RN = row_number())) is generating a data frame with 1092 rows.

I’m thinking of a few steps you can do to fix this issue:

  1. Check the grouping: Since you are using group_by(county), make sure that the data is correctly grouped. Sometimes, unexpected behaviors in grouped data can lead to errors like this.
  2. Inspect row numbers: After the arrange() and before the mutate(), inspect the number of rows in your data frame. You can use nrow(clean_fs).
  3. Examine mutate() and row_number(): The mutate() function with row_number() should not change the number of rows in your data frame. If it does, there might be something else going on with these functions. Make sure that row_number() is being calculated as you expect. You might want to try running just the mutate(RN = row_number()) part on a subset of your data to see what happens.

Try running each step of your pipeline (group_by(), arrange(), mutate()) separately and inspect the output after each step. This can help you isolate exactly where the issue is occurring.

  4. One last idea is to check the variable scope: Is clean_fs$year the correct way to reference the year variable? In a dplyr chain, you usually don’t need to use the data frame name to reference its columns. So desc(year) might be what you need.
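
To make that last suggestion concrete, here is a minimal sketch. The small clean_fs below is a made-up stand-in (the real data was not shared), and the column names county, year, and total are assumptions. It refers to the column as year rather than clean_fs$year, and sorts before grouping so that row 1 within each county is the most recent year.

```r
library(dplyr)

# Hypothetical stand-in for clean_fs; the real data frame was not posted
clean_fs <- tibble::tibble(
  county = c("A", "A", "A", "B", "B", "B"),
  year   = c(2018L, 2019L, 2020L, 2018L, 2019L, 2020L),
  total  = c(1L, 0L, 1L, 2L, 3L, 3L)
)

rn_fs <- clean_fs %>%
  # refer to the column directly; no clean_fs$ prefix inside the pipeline
  arrange(county, desc(year)) %>%
  group_by(county) %>%
  # row_number() restarts at 1 within each county
  mutate(RN = row_number()) %>%
  ungroup()

rn_fs
```

The number of rows should be unchanged, and RN should be 1 for the latest year in each county.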

Let me know if any of these work.

Thank you Elham and Tim,

My R script was pulling a couple of lines above the code I shared that included the previousTwoYears variable. It wasn’t piped so I figured it couldn’t be causing the error. However, once I deleted the two lines of code, I no longer get the error message and it is working correctly.
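
For anyone landing here with the same message, below is a rough sketch of how an unpiped assignment can raise it. The two deleted lines were never posted, so small_tbl and longer_vec are purely hypothetical. The only point is that assigning a column whose length differs from the tibble's row count (and is not 1) fails with an error of this kind.

```r
library(tibble)

# Hypothetical objects; not the original script
small_tbl  <- tibble(id = 1:3)   # 3 rows
longer_vec <- 1:6                # 6 values, cannot be recycled to 3

result <- tryCatch(
  {
    # the kind of standalone assignment that can fail on its own line
    small_tbl$previousTwoYears <- longer_vec
    "no error"
  },
  error = function(e) e
)

# TRUE when the assignment was rejected
inherits(result, "error")
```

Because a line like this runs on its own, it can fail even when the pipeline underneath it is fine, which would explain why the data frame still appeared to be created.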

My progress is a wide table with my trend data and a subset table with the first year and last year with the difference. Next, I’ll do a join on the county and once I have the difference within the wide format, I hope to apply some background color logic. There is probably a more efficient way to do this, but I haven’t found any functions that compare data values based on a condition in long format. Doing this as a static process isn’t that difficult but my goal is to have this be dynamic so when the next year arrives, it will update correctly.
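
As a rough illustration of the plan described above (a subset with the first and last year, a difference, then a join on county), here is a minimal sketch. It assumes long data with columns County, Measure, and Year; the example values and the names long_data, diff_tbl, and wide_with_diff are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Made-up long data with assumed column names
long_data <- tibble::tribble(
  ~County, ~Measure, ~Year,
  "A", 1L, 2018L,
  "A", 0L, 2019L,
  "A", 1L, 2020L,
  "B", 2L, 2018L,
  "B", 3L, 2019L,
  "B", 3L, 2020L,
  "C", 4L, 2018L,
  "C", 5L, 2019L,
  "C", 4L, 2020L
)

# One row per county: Measure in the latest year minus Measure in the earliest
diff_tbl <- long_data |>
  group_by(County) |>
  summarise(
    difference = Measure[which.max(Year)] - Measure[which.min(Year)]
  )

# Wide table of the trend data, with the difference joined on by county
wide_with_diff <- long_data |>
  pivot_wider(names_from = Year, values_from = Measure) |>
  left_join(diff_tbl, by = "County")

wide_with_diff
```

Since min and max are taken from the data, nothing here needs editing when another year is added.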

Hi Patrick,

I put together a snippet of code based on what I think you are trying to do, please let me know if I can help further:

# loading packages
library(tidyverse)
library(gt)

# creating fake data
fake_data <- tibble::tribble(
               ~County, ~Measure, ~Year,
                   "A",       1L, 2018L,
                   "A",       0L, 2019L,
                   "A",       1L, 2020L,
                   "B",       2L, 2018L,
                   "B",       3L, 2019L,
                   "B",       3L, 2020L,
                   "C",       4L, 2018L,
                   "C",       5L, 2019L,
                   "C",       4L, 2020L
               )

# transforming the data into a wide format
wide_data <- fake_data |>
    pivot_wider(names_from = Year, values_from = Measure)

# creating table
gt(data = wide_data) |>
    tab_style(
        style = cell_fill(color = "red"),
        locations = cells_body(columns = `2019`, rows = `2019` > `2018`)
    ) |>
    tab_style(
        style = cell_fill(color = "red"),
        locations = cells_body(columns = `2020`, rows = `2020` > `2019`)
    )
County 2018 2019 2020
A 1 0 1
B 2 3 3
C 4 5 4

Created on 2023-11-27 with reprex v2.0.2

Session info
sessionInfo()
#> R version 4.3.1 (2023-06-16)
#> Platform: x86_64-apple-darwin20 (64-bit)
#> Running under: macOS Ventura 13.6.1
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: America/Toronto
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] gt_0.10.0       lubridate_1.9.3 forcats_1.0.0   stringr_1.5.0  
#>  [5] dplyr_1.1.3     purrr_1.0.2     readr_2.1.4     tidyr_1.3.0    
#>  [9] tibble_3.2.1    ggplot2_3.4.4   tidyverse_2.0.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] gtable_0.3.4      compiler_4.3.1    reprex_2.0.2      tidyselect_1.2.0 
#>  [5] xml2_1.3.5        scales_1.2.1      yaml_2.3.7        fastmap_1.1.1    
#>  [9] R6_2.5.1          generics_0.1.3    knitr_1.44        munsell_0.5.0    
#> [13] R.cache_0.16.0    tzdb_0.4.0        pillar_1.9.0      R.utils_2.12.2   
#> [17] rlang_1.1.1       utf8_1.2.4        stringi_1.7.12    xfun_0.40        
#> [21] sass_0.4.7        fs_1.6.3          timechange_0.2.0  cli_3.6.1        
#> [25] withr_2.5.1       magrittr_2.0.3    digest_0.6.33     grid_4.3.1       
#> [29] rstudioapi_0.15.0 hms_1.1.3         lifecycle_1.0.3   R.methodsS3_1.8.2
#> [33] R.oo_1.25.0       vctrs_0.6.4       evaluate_0.22     glue_1.6.2       
#> [37] styler_1.10.2     fansi_1.0.5       colorspace_2.1-0  rmarkdown_2.25   
#> [41] tools_4.3.1       pkgconfig_2.0.3   htmltools_0.5.6.1

Note that to see the red filling you will need to run this code in your own R session.

All the best,

Tim

Thank you Tim. This is really close to what I am working on. Is there a way to compare columns dynamically? The table will grow over time, and once the data for the next year is available, I’d prefer not to have to edit the hard-coded years in locations = cells_body(columns = `2019`, rows = `2019` > `2018`).
Below is what I’ve tried to do to make it more dynamic. When data is in long format, I can compare the data relatively easily. However, once I put the data in wide format, getting the column dynamically has been difficult.

# Create most recent two years
most_recent_compare <- fake_data %>%
  select(County, Measure, Year) %>%
  filter(Year == min(Year) | Year == max(Year)) %>%
  distinct()

# row number in R (not quite right)
rn_fs <- most_recent_compare %>%
  group_by(County) %>%
  arrange(County, desc(most_recent_compare$Year)) %>%
  mutate(RN = row_number())

wide_data_rn_fs <- most_recent_compare |>
  pivot_wider(names_from = Year, values_from = Measure)

# Left join data
left_join(wide_data, wide_data_rn_fs, by = "County")

Hi Patrick,

Just to clarify, are you only looking to compare the minimum and maximum year in the data? For example, in the data I used above the minimum year would be 2018 and the maximum 2020, am I correct in assuming you aren’t using 2019 at all?

All the best,

Tim

Hi Tim,

It would be nice to be able to pick any two years to compare (for scaling later), but yes, right now I want column 2020 to appear in red if column 2020 is less than column 2018. I can do this in a static way, but I am after a dynamic solution (when 2021 populates I don’t want to have to update the script). I do want to show all the columns 2018, 2019, and 2020 and then color column 2020 with the conditional logic. I feel the current process I am on is inefficient; it seems like a lot of data is getting duplicated in data frames. Long format appears to be the easiest way to compare data. However, when I compare rows it treats each row as its own observational group (in your example "A", 1L, 2018L, "A", 0L, 2019L, "A", 1L, 2020L are all treated as individual data).

Is there a way to create a variable that compares measures for year 2018 to year 2020 dependent on county? I’m hoping there is. Could it be something with utilizing group_by or do I need to create a new data frame?

Hi Patrick,

I think what I have below would work for your specific case of comparing 2018 to 2020, however, you would need to generalize this and create a function if you are going to compare different years.

# loading packages
library(tidyverse)
library(gt)

# creating fake data
fake_data <- tibble::tribble(
               ~County, ~Measure, ~Year,
                   "A",       1L, 2018L,
                   "A",       0L, 2019L,
                   "A",       1L, 2020L,
                   "B",       2L, 2018L,
                   "B",       3L, 2019L,
                   "B",       3L, 2020L,
                   "C",       4L, 2018L,
                   "C",       5L, 2019L,
                   "C",       4L, 2020L
               )

# creating flag
fake_data |>
    filter(Year %in% c(2018, 2020)) |>
    group_by(County) |>
    mutate(flag = Measure > lag(
        x = Measure,
        n = 1,
        default = NA_integer_,
        order_by = Year
    )) |>
    ungroup()
#> # A tibble: 6 × 4
#>   County Measure  Year flag 
#>   <chr>    <int> <int> <lgl>
#> 1 A            1  2018 NA   
#> 2 A            1  2020 FALSE
#> 3 B            2  2018 NA   
#> 4 B            3  2020 TRUE 
#> 5 C            4  2018 NA   
#> 6 C            4  2020 FALSE

Created on 2023-11-29 with reprex v2.0.2

All the best,

Tim

Thank you Tim

You are getting me on the correct path. I tried the below syntax to try and create a dynamic solution. It doesn’t work due to putting a function in the concatenation step, but I think all I have to do is create a min and max variable and then compare them.

fake_data |>
    filter(Year %in% c(max(year), min(year))) |>
    group_by(County) |>
    mutate(flag = Measure > lag(
        x = Measure,
        n = 1,
        default = NA_integer_,
        order_by = Year
    )) |>
    ungroup()

Thanks for all your input.

Quick question: How come you use |> instead of %>%?

Hi Patrick,

You didn’t capitalize Year in your code and that’s why things didn’t run. However, presumably you won’t just be comparing the minimum and maximum years in your data which is why I suggested turning this into a function so you can provide two years or more.
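
As a quick check of the capitalization point, here is a minimal sketch of the same pipeline with Year spelled consistently. Calling min() and max() inside c() is fine, and the flagged object name is only for illustration.

```r
library(dplyr)

fake_data <- tibble::tribble(
  ~County, ~Measure, ~Year,
  "A", 1L, 2018L,
  "A", 0L, 2019L,
  "A", 1L, 2020L,
  "B", 2L, 2018L,
  "B", 3L, 2019L,
  "B", 3L, 2020L,
  "C", 4L, 2018L,
  "C", 5L, 2019L,
  "C", 4L, 2020L
)

flagged <- fake_data |>
  # keep only the earliest and latest year present in the data
  filter(Year %in% c(min(Year), max(Year))) |>
  group_by(County) |>
  # compare each row with the previous year's value within the county
  mutate(flag = Measure > lag(Measure, order_by = Year)) |>
  ungroup()

flagged
```

The flag column should match the output from the earlier 2018 vs. 2020 example, without the years being typed in by hand.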

See below:

# loading packages
library(tidyverse)
library(gt)

# creating fake data
fake_data <- tibble::tribble(
               ~County, ~Measure, ~Year,
                   "A",       1L, 2018L,
                   "A",       0L, 2019L,
                   "A",       1L, 2020L,
                   "B",       2L, 2018L,
                   "B",       3L, 2019L,
                   "B",       3L, 2020L,
                   "C",       4L, 2018L,
                   "C",       5L, 2019L,
                   "C",       4L, 2020L
               )

# creating a function that takes two years
two_years <- function(.data, year_1, year_2) {
    .data |>
        filter(Year %in% c(year_1, year_2)) |>
        group_by(County) |>
        mutate(flag = Measure > lag(
            x = Measure,
            n = 1,
            default = NA_integer_,
            order_by = Year
        )) |>
        ungroup()
}

# creating a function that takes multiple years
multiple_years <- function(.data, years) {
    .data |>
        filter(Year %in% years) |>
        group_by(County) |>
        mutate(flag = Measure > lag(
            x = Measure,
            n = 1,
            default = NA_integer_,
            order_by = Year
        )) |>
        ungroup()
}

# applying functions
two_years(fake_data, min(fake_data$Year), max(fake_data$Year))
#> # A tibble: 6 × 4
#>   County Measure  Year flag 
#>   <chr>    <int> <int> <lgl>
#> 1 A            1  2018 NA   
#> 2 A            1  2020 FALSE
#> 3 B            2  2018 NA   
#> 4 B            3  2020 TRUE 
#> 5 C            4  2018 NA   
#> 6 C            4  2020 FALSE
multiple_years(fake_data, c(2018, 2019, 2020))
#> # A tibble: 9 × 4
#>   County Measure  Year flag 
#>   <chr>    <int> <int> <lgl>
#> 1 A            1  2018 NA   
#> 2 A            0  2019 FALSE
#> 3 A            1  2020 TRUE 
#> 4 B            2  2018 NA   
#> 5 B            3  2019 TRUE 
#> 6 B            3  2020 FALSE
#> 7 C            4  2018 NA   
#> 8 C            5  2019 TRUE 
#> 9 C            4  2020 FALSE

Created on 2023-11-30 with reprex v2.0.2
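
To tie this back to the colored wide table, here is one possible sketch that picks the two most recent year columns from the data instead of hard-coding them. It assumes every column other than County is a year, and the names years, last_col, prev_col, and flag_rows are made up for illustration. It colors the latest year red when it is lower than the year before; swapping prev_col for the earliest year would give the first vs. last comparison instead.

```r
library(dplyr)
library(tidyr)
library(gt)

fake_data <- tibble::tribble(
  ~County, ~Measure, ~Year,
  "A", 1L, 2018L,
  "A", 0L, 2019L,
  "A", 1L, 2020L,
  "B", 2L, 2018L,
  "B", 3L, 2019L,
  "B", 3L, 2020L,
  "C", 4L, 2018L,
  "C", 5L, 2019L,
  "C", 4L, 2020L
)

wide_data <- fake_data |>
  pivot_wider(names_from = Year, values_from = Measure)

# Assumption: all columns except County are years
years    <- sort(as.integer(setdiff(names(wide_data), "County")))
last_col <- as.character(years[length(years)])
prev_col <- as.character(years[length(years) - 1])

# Row indices where the latest year is lower than the year before
flag_rows <- which(wide_data[[last_col]] < wide_data[[prev_col]])

tbl <- gt(data = wide_data) |>
    tab_style(
        style = cell_fill(color = "red"),
        locations = cells_body(columns = all_of(last_col), rows = flag_rows)
    )

tbl
```

With this data only county C (4 in 2020 versus 5 in 2019) is flagged. When a 2021 column appears, last_col and prev_col should shift on their own.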

With respect to |> vs. %>%, the former is the native pipe, built into base R since version 4.1.0; it behaves very similarly and does not require the tidyverse ecosystem.
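
A small side-by-side sketch of the two pipes, assuming R 4.1.0 or later and the magrittr package installed (the variable names are only for illustration):

```r
# %>% comes from magrittr (also re-exported by dplyr and the tidyverse)
library(magrittr)

# native pipe: available without loading any package
native_result <- c(1, 4, 9) |> sqrt() |> sum()

# magrittr pipe: same result for a simple chain like this
magrittr_result <- c(1, 4, 9) %>% sqrt() %>% sum()

identical(native_result, magrittr_result)
```

For simple chains like this the two are interchangeable; the differences only show up in less common uses such as placeholders.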

All the best,

Tim