Left join changed my new column values to NA

Thank you for posting! Here is an outline of an effective post:

Describe your issue

  • What specifically do you need help with
  • Relevant context (e.g. public health trends or associated data infrastructure)
  • Timeline/urgency
  • (No sensitive or identifiable information)

What steps have you already taken to find an answer?

  • I’ve searched everywhere and can’t find a solution. I had a call with instructor and we can’t fix it.

Provide an example of your R code

pacman::p_load(rio, here,tidyverse,reprex)


data1 <- data.frame(
  stringsAsFactors = FALSE,
  facility_name = c("11:11 Eleven Eleven",
                    "1350 DISTILLING","18 Wok","2000 WOK","2000 WOK"),
  facility_address = c("1710 BRIARGATE BLVD STE 450, COLORADO SPRINGS, CO 80920",
                       "520 E PIKES PEAK AVE, COLORADO SPRINGS, CO 80903",
                       "6616 DALBY DR STE 130, COLORADO SPRINGS, CO 80923",
                       "115 E FILLMORE ST, COLORADO SPRINGS, CO 80907",
                       "115 E FILLMORE ST, COLORADO SPRINGS, CO 80907"),
  permit_type = c("Restaurant (0-100 seats)",
                  "Limited Food Service","Restaurant (0-100 seats)",
                  "Restaurant (0-100 seats)","Restaurant (0-100 seats)"),
  inspection_date = c("2023-12-05","2023-02-01",
                      "2023-07-28","2023-05-04","2023-05-25"),
  inspection_purpose = c("Routine","Routine",
                         "Routine","Routine","Re-Inspection"),
  inspection_result = c("Pass","Pass","Pass",
                        "Re-Inspection Required","Pass"),
  permit_number = c("PR8287","PR0003934",
                    "PR7492","PR0007126","PR0007126")
)

data2 <- data.frame(
  stringsAsFactors = FALSE,
  establishment_name = c("LASTRELLA ENTERPRISE LLC",
                         "Cygnus Home Service LLC",
                         "OLD CHICAGO TAPROOM II LLC","Summit Subco LLC","Colorado Coffee Cart LLC"),
  address = c("5490 POWERS CENTER POINT UNIT 164 COLORADO SPRINGS CO 80920",
              "3440 KIMBALL CIR  COLORADO SPRINGS CO 80910",
              "4110 N ACADEMY BLVD  COLORADO SPRINGS CO 80918",
              "1018 W BAPTIST RD  COLORADO SPRINGS CO 80921",
              "10755 S HOLMES RD  COLORADO SPRINGS CO 80908"),
  ownership_type = c("Corporation/LLC",
                     "Corporation/LLC","Corporation/LLC","Corporation/LLC",
                     "Corporation/LLC")
)
 
joined_data2 <- left_join(data1, data2,
                           by = c("facility_address" = "address"))%>%
  select(facility_name, facility_address, establishment_name, ownership_type)

print(joined_data2)
#>         facility_name                                        facility_address
#> 1 11:11 Eleven Eleven 1710 BRIARGATE BLVD STE 450, COLORADO SPRINGS, CO 80920
#> 2     1350 DISTILLING        520 E PIKES PEAK AVE, COLORADO SPRINGS, CO 80903
#> 3              18 Wok       6616 DALBY DR STE 130, COLORADO SPRINGS, CO 80923
#> 4            2000 WOK           115 E FILLMORE ST, COLORADO SPRINGS, CO 80907
#> 5            2000 WOK           115 E FILLMORE ST, COLORADO SPRINGS, CO 80907
#>   establishment_name ownership_type
#> 1               <NA>           <NA>
#> 2               <NA>           <NA>
#> 3               <NA>           <NA>
#> 4               <NA>           <NA>
#> 5               <NA>           <NA>

Created on 2024-06-20 with reprex v2.1.0

Session info
sessionInfo()
#> R version 4.3.2 (2023-10-31 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19045)
#> 
#> Matrix products: default
#> 
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.utf8 
#> [2] LC_CTYPE=English_United States.utf8   
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.utf8    
#> 
#> time zone: America/Denver
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] reprex_2.1.0    lubridate_1.9.3 forcats_1.0.0   stringr_1.5.1  
#>  [5] dplyr_1.1.4     purrr_1.0.2     readr_2.1.5     tidyr_1.3.1    
#>  [9] tibble_3.2.1    ggplot2_3.4.4   tidyverse_2.0.0 here_1.0.1     
#> [13] rio_1.0.1      
#> 
#> loaded via a namespace (and not attached):
#>  [1] gtable_0.3.4      compiler_4.3.2    tidyselect_1.2.0  scales_1.3.0     
#>  [5] yaml_2.3.8        fastmap_1.1.1     R6_2.5.1          generics_0.1.3   
#>  [9] knitr_1.45        munsell_0.5.0     rprojroot_2.0.4   tzdb_0.4.0       
#> [13] pillar_1.9.0      rlang_1.1.3       utf8_1.2.4        stringi_1.8.3    
#> [17] xfun_0.41         fs_1.6.3          timechange_0.3.0  cli_3.6.2        
#> [21] withr_3.0.0       magrittr_2.0.3    digest_0.6.34     grid_4.3.2       
#> [25] rstudioapi_0.16.0 hms_1.1.3         lifecycle_1.0.4   vctrs_0.6.5      
#> [29] evaluate_0.23     glue_1.7.0        fansi_1.0.6       colorspace_2.1-0 
#> [33] pacman_0.5.1      rmarkdown_2.25    tools_4.3.2       pkgconfig_2.0.3  
#> [37] htmltools_0.5.7

Follow-up

  • Thank the volunteers who try to help you
  • Mark one reply as the “Solution” if appropriate

Hi Alicia,

Thanks for posting. From your example the column facility_address doesn’t have any matches in address and that is why the values are NA.

If you run the below then it will tell you how many rows actually match in your data.

table(data1$facility_address %in% data2$address)
table(data2$address %in% data1$facility_address)

rows_in_data1 <- nrow(data1)
rows_in_data2 <- nrow(data2)
rows_in_data1_not_data2 <- anti_join(data1, data2,by = c("facility_address" = "address")) %>% nrow()
rows_in_data2_not_data1 <- anti_join(data2, data1 ,by = c("address" = "facility_address")) %>% nrow()

rows_in_data1 - rows_in_data1_not_data2 # see if these are zero
rows_in_data2 - rows_in_data2_not_data1 ## see if these are zero

If the output of these tables is that everything is false then you’ll need to clean the addresses so they are in the same format. It might be that you don’t have any matches between your two datasets. The most likely reason for you to get NA is that the columns do not match.