Difference between == and %in% during filtering

iancgmd · October 22, 2023, 7:13am

Describe your issue

Hello! I’m trying to work on the exercises in the Forecasting: Principles and Practice textbook. In exercise 2.10, number 4, we are asked to:

Install the USgas package.
Create a tsibble from us_total with year as the index and state as the key.
Plot the annual natural gas consumption by state for the New England area (comprising the states of Maine, Vermont, New Hampshire, Massachusetts, Connecticut and Rhode Island).

I tried filter(state == c("Maine", "Vermont", "New Hampshire", "Massachusetts", "Connecticut", "Rhode Island")), which reduced the dataset to 23 observations, while filter(state %in% c("Maine", "Vermont", "New Hampshire", "Massachusetts", "Connecticut", "Rhode Island")) returned 138 observations.

My question is: why do the two operators (== and %in%) produce different results when filtering?

Provide an example of your R code

# load packages
pacman::p_load(fpp3, USgas)

# create tsibble from us_total
us_tot <- as_tsibble(us_total,
                    index = year,
                    key = state)

# filter and plot using ==, this results in 23 observations
us_filter <- us_tot %>%
  filter(state == c("Maine", "Vermont", "New Hampshire",
                    "Massachusetts", "Connecticut", "Rhode Island"))

us_filter %>% autoplot(y/1e3) +
  labs(y = "billion cubic feet")

# filter and plot using %in%, this results in 138 observations
us_filter2 <- us_tot |>
  filter(state %in% c("Maine", "Vermont", "New Hampshire", "Massachusetts",
                      "Connecticut", "Rhode Island"))

us_filter2 %>% autoplot(y/1e3) +
  labs(y = "billion cubic feet")

machupovirus · October 22, 2023, 2:10pm

Hi Ian,

You should not be using == in this scenario. In fact, the only reason you are not receiving a warning is that the number of rows in the data is a multiple of 6, so R silently recycles the six-element state vector n/6 times, where n is the number of rows, and compares it to the state column position by position. Instead, you should use the %in% operator or fct_match from the forcats package to filter here.
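To make the recycling concrete outside of dplyr, here is a minimal base R sketch. The names x and targets are made up for illustration and are not part of the USgas data.

```r
# Hypothetical column of six values and a two-element lookup vector
x <- c("a", "b", "a", "b", "b", "a")
targets <- c("a", "b")

# `==` compares position by position, recycling `targets` to length 6,
# i.e. it effectively evaluates x == c("a", "b", "a", "b", "a", "b")
eq_result <- x == targets
eq_result
#> [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE

# `%in%` asks, for each element of x, whether it appears anywhere in `targets`
in_result <- x %in% targets
in_result
#> [1] TRUE TRUE TRUE TRUE TRUE TRUE

# The recycled comparison is the same as comparing against rep()
identical(eq_result, x == rep(targets, length.out = length(x)))
#> [1] TRUE
```

Because 6 is a multiple of 2, no warning is raised here, even though the last two elements are silently compared against the "wrong" target.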

See below:

# Loading tidyverse
library(tidyverse)

# Simulating data
sim_data <- tibble(id = seq_len(length.out = 100)) |>
    rowwise() |>
    mutate(letters = sample(x = letters, size = 1, replace = TRUE)) |>
    ungroup()

# Testing both operators

# 100 is not a multiple of 3
sim_data |>
    filter(letters == c("a", "b", "c"))
#> Warning: There was 1 warning in `filter()`.
#> ℹ In argument: `letters == c("a", "b", "c")`.
#> Caused by warning in `letters == c("a", "b", "c")`:
#> ! longer object length is not a multiple of shorter object length
#> # A tibble: 2 × 2
#>      id letters
#>   <int> <chr>  
#> 1    12 c      
#> 2    61 a

# Essentially, the vector you filter letters by is being repeated and then
# checked for equality at each row, as seen below
sim_data |>
    mutate(filter = rep(x = c("a", "b", "c"), length.out = 100)) |>
    filter(letters == filter)
#> # A tibble: 2 × 3
#>      id letters filter
#>   <int> <chr>   <chr> 
#> 1    12 c       c     
#> 2    61 a       a

# 100 is a multiple of 4, so the recycling happens without a warning
sim_data |>
    filter(letters == c("a", "b", "c", "d"))
#> # A tibble: 3 × 2
#>      id letters
#>   <int> <chr>  
#> 1    60 d      
#> 2    61 a      
#> 3    76 d

# Essentially, the vector you filter letters by is being repeated and then
# checked for equality at each row, as seen below
sim_data |>
    mutate(filter = rep(x = c("a", "b", "c", "d"), length.out = 100)) |>
    filter(letters == filter)
#> # A tibble: 3 × 3
#>      id letters filter
#>   <int> <chr>   <chr> 
#> 1    60 d       d     
#> 2    61 a       a     
#> 3    76 d       d

# %in% operator
sim_data |>
    filter(letters %in% c("a", "b", "c"))
#> # A tibble: 9 × 2
#>      id letters
#>   <int> <chr>  
#> 1     8 c      
#> 2    12 c      
#> 3    45 b      
#> 4    50 a      
#> 5    51 a      
#> 6    61 a      
#> 7    73 b      
#> 8    80 c      
#> 9    87 a

# fct_match
sim_data |>
    filter(fct_match(f = letters, lvls = c("a", "b", "c")))
#> # A tibble: 9 × 2
#>      id letters
#>   <int> <chr>  
#> 1     8 c      
#> 2    12 c      
#> 3    45 b      
#> 4    50 a      
#> 5    51 a      
#> 6    61 a      
#> 7    73 b      
#> 8    80 c      
#> 9    87 a

Created on 2023-10-22 with reprex v2.0.2
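The same effect can be sketched on data shaped like the original question. The toy data frame below is made up (it is not the USgas data, and the two extra states and the year range are assumptions), and it uses base R only so that it runs without extra packages. With rows sorted by state, == keeps a row only when that state happens to line up with its own position in the recycled vector, which is roughly one row in six. That is consistent with the 23 versus 138 observations reported in the question.

```r
# Toy data (made up, not the USgas data): 8 states x 24 years, sorted by state
new_england <- c("Maine", "Vermont", "New Hampshire",
                 "Massachusetts", "Connecticut", "Rhode Island")
toy <- expand.grid(year = 1997:2020,
                   state = sort(c(new_england, "Texas", "Ohio")),
                   stringsAsFactors = FALSE)

# 192 rows is a multiple of 6, so `==` recycles without any warning
nrow(toy) %% length(new_england)

# `==` keeps a row only when its state matches the recycled position
n_eq <- sum(toy$state == new_england)

# `%in%` keeps every row whose state is in the set
n_in <- sum(toy$state %in% new_england)

# n_eq is 24 and n_in is 144: `==` found one sixth of the matching rows
c(equal = n_eq, in_set = n_in)
```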

Session info

sessionInfo()
#> R version 4.3.1 (2023-06-16)
#> Platform: x86_64-apple-darwin20 (64-bit)
#> Running under: macOS Ventura 13.5.2
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: America/Toronto
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] lubridate_1.9.2 forcats_1.0.0   stringr_1.5.0   dplyr_1.1.3    
#>  [5] purrr_1.0.2     readr_2.1.4     tidyr_1.3.0     tibble_3.2.1   
#>  [9] ggplot2_3.4.3   tidyverse_2.0.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] gtable_0.3.4      compiler_4.3.1    reprex_2.0.2      tidyselect_1.2.0 
#>  [5] scales_1.2.1      yaml_2.3.7        fastmap_1.1.1     R6_2.5.1         
#>  [9] generics_0.1.3    knitr_1.44        munsell_0.5.0     R.cache_0.16.0   
#> [13] tzdb_0.4.0        pillar_1.9.0      R.utils_2.12.2    rlang_1.1.1      
#> [17] utf8_1.2.3        stringi_1.7.12    xfun_0.40         fs_1.6.3         
#> [21] timechange_0.2.0  cli_3.6.1         withr_2.5.0       magrittr_2.0.3   
#> [25] digest_0.6.33     grid_4.3.1        rstudioapi_0.15.0 hms_1.1.3        
#> [29] lifecycle_1.0.3   R.methodsS3_1.8.2 R.oo_1.25.0       vctrs_0.6.3      
#> [33] evaluate_0.21     glue_1.6.2        styler_1.10.2     fansi_1.0.4      
#> [37] colorspace_2.1-0  rmarkdown_2.25    tools_4.3.1       pkgconfig_2.0.3  
#> [41] htmltools_0.5.6

All the best,

Tim

iancgmd · October 23, 2023, 12:03pm

Hi Tim! Thanks for the quick response. When would it be appropriate to use == instead of %in% during filtering?

machupovirus · October 23, 2023, 12:13pm

Hi Ian,

You should use the == operator when you are comparing against a single value (in R, a vector of length 1), or element by element against a vector of the same length as the column, such as another column. The %in% operator or fct_match should be used when you are testing membership in a set of values, as in your problem.

See below:

# Loading tidyverse
library(tidyverse)

# Simulating data
sim_data <- tibble(id = seq_len(length.out = 100)) |>
    rowwise() |>
    mutate(letters = sample(x = letters, size = 1, replace = TRUE)) |>
    ungroup()

# Filtering for a scalar value
sim_data |>
    filter(letters == "z")
#> # A tibble: 5 × 2
#>      id letters
#>   <int> <chr>  
#> 1     3 z      
#> 2    14 z      
#> 3    88 z      
#> 4    91 z      
#> 5    95 z

# Filtering for a vector value

## %in%
sim_data |>
    filter(letters %in% c("a", "b", "c"))
#> # A tibble: 14 × 2
#>       id letters
#>    <int> <chr>  
#>  1    10 c      
#>  2    18 c      
#>  3    20 c      
#>  4    26 c      
#>  5    29 a      
#>  6    32 b      
#>  7    51 b      
#>  8    68 c      
#>  9    73 a      
#> 10    75 b      
#> 11    76 a      
#> 12    81 c      
#> 13    90 a      
#> 14    96 c

## fct_match
sim_data |>
    filter(fct_match(f = letters, lvls = c("a", "b", "c")))
#> # A tibble: 14 × 2
#>       id letters
#>    <int> <chr>  
#>  1    10 c      
#>  2    18 c      
#>  3    20 c      
#>  4    26 c      
#>  5    29 a      
#>  6    32 b      
#>  7    51 b      
#>  8    68 c      
#>  9    73 a      
#> 10    75 b      
#> 11    76 a      
#> 12    81 c      
#> 13    90 a      
#> 14    96 c

Created on 2023-10-23 with reprex v2.0.2
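As a compact summary of when each operator fits, here is a small base R sketch. The data frame and its column names (planned, actual) are hypothetical and exist only for illustration.

```r
# Hypothetical data frame for illustration
df <- data.frame(
  planned = c("a", "b", NA, "d"),
  actual  = c("a", "x", "c", "d"),
  stringsAsFactors = FALSE
)

# 1. Length-1 right-hand side: `==` is the natural choice
scalar_cmp <- df$actual == "a"
scalar_cmp
#> [1]  TRUE FALSE FALSE FALSE

# 2. Two columns of equal length: `==` compares them row by row
col_cmp <- df$planned == df$actual
col_cmp
#> [1]  TRUE FALSE    NA  TRUE

# 3. Membership in a set of values: `%in%`
set_cmp <- df$actual %in% c("a", "d")
set_cmp
#> [1]  TRUE FALSE FALSE  TRUE

# `==` propagates NA, while `%in%` always returns TRUE or FALSE
NA == "a"
#> [1] NA
NA %in% "a"
#> [1] FALSE
```

One practical consequence: dplyr's filter() drops rows where the condition evaluates to NA, so the missing value in the second comparison would be excluded rather than raising an error.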

All the best,

Tim

iancgmd · October 23, 2023, 1:09pm

Thank you for the explanation and the detailed examples, Tim!