4.4 Relational Data

It is rarely the case that we will be working with one table. Often we need to join multiple tables together because information is scattered in different databases. Typically the relationships are defined using at set of columns.

Suppose there is a table that includes the license number of the operator and we want to include it into the tibble nola_str_tib.


nola_operators <- read_csv("data/NOLA_STR_operators.csv")
## Rows: 59 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Operator Name, Operator Permit Number
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

First notice that Operator Name is the same between the tables. Presumably this is what we can use to join the tables.

Notice that there are only 59 operators in the table compared to 89 rows in nola_str_tib. Either there are duplicates in nola_str_tib or there are missing operators in nola_operators. It could also very well be the case that they are in different order than nola_str_tib. In any case, we cannot simply take the column from nola_operators and column bind it. We have to use *_join() type of functions to join the datasets together.


nola_str_tib_join <- left_join(nola_str_tib, nola_operators,
  by = c("Operator Name" = "Operator Name")
)

str(nola_str_tib_join)
## spec_tbl_df [89 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Permit Number         : chr [1:89] NA NA NA NA ...
##  $ Address               : chr [1:89] "1717 Robert C Blakes, SR Dr" "1006 Race St" "2634 Louisiana Ave" "3323 Rosalie Aly" ...
##  $ Permit Type           : Factor w/ 3 levels "Short Term Rental Commercial Owner",..: 1 1 1 2 2 2 3 2 2 2 ...
##  $ Residential Subtype   : chr [1:89] "N/A" "N/A" "N/A" "Residential Partial Unit" ...
##  $ Current Status        : chr [1:89] "Pending" "Pending" "Pending" "Pending" ...
##  $ Expiration Date       : chr [1:89] NA NA NA NA ...
##  $ Bedroom Limit         : num [1:89] 5 5 3 1 3 1 2 2 1 2 ...
##  $ Guest Occupancy Limit : num [1:89] 10 10 6 2 6 2 4 4 2 4 ...
##  $ Operator Name         : chr [1:89] "Melissa Taranto" "Michael Heyne" "Michael Heyne" "Caroline Stas" ...
##  $ License Holder Name   : chr [1:89] "Scott Taranto" "Boutique Hospitality" "Resonance Home LLC" "Caroline Stas" ...
##  $ Application Date      : Date[1:89], format: "2022-08-09" "2022-08-09" ...
##  $ Issue_Date            : Date[1:89], format: NA NA ...
##  $ Operator Permit Number: chr [1:89] "20-ostr-01377" "20-OSTR-30075" "20-OSTR-30075" "20-OSTR-00297" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   `Permit Number` = col_character(),
##   ..   Address = col_character(),
##   ..   `Permit Type` = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   `Residential Subtype` = col_character(),
##   ..   `Current Status` = col_character(),
##   ..   `Expiration Date` = col_character(),
##   ..   `Bedroom Limit` = col_double(),
##   ..   `Guest Occupancy Limit` = col_double(),
##   ..   `Operator Name` = col_character(),
##   ..   `License Holder Name` = col_character(),
##   ..   `Application Date` = col_date(format = "%m/%d/%y"),
##   ..   Issue_Date = col_date(format = "%m/%d/%y")
##   .. )
##  - attr(*, "problems")=<externalptr>

Couple of things to notice here

  • left_join keeps all the rows in the first table. So Operator Permit Number should have the same length as the rest of the table even if some are NA.

  • Notice the use “” in the by rather than ``. This is because the matching is based on strings rather than names. This is a subtle difference and would occasionally trip students up.