4.4 Relational Data
It is rarely the case that we will be working with one table. Often we need to join multiple tables together because information is scattered in different databases. Typically the relationships are defined using at set of columns.
Suppose there is a table that includes the license number of the
operator and we want to include it into the tibble nola_str_tib
.
<- read_csv("data/NOLA_STR_operators.csv")
nola_operators ## Rows: 59 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Operator Name, Operator Permit Number
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
First notice that Operator Name
is the same between the tables.
Presumably this is what we can use to join the tables.
Notice that there are only 59 operators in the
table compared to 89 rows in nola_str_tib
. Either
there are duplicates in nola_str_tib
or there are missing operators in
nola_operators
. It could also very well be the case that they are in
different order than nola_str_tib
. In any case, we cannot simply take
the column from nola_operators
and column bind it. We have to use
*_join()
type of functions to join the datasets together.
<- left_join(nola_str_tib, nola_operators,
nola_str_tib_join by = c("Operator Name" = "Operator Name")
)
str(nola_str_tib_join)
## spec_tbl_df [89 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Permit Number : chr [1:89] NA NA NA NA ...
## $ Address : chr [1:89] "1717 Robert C Blakes, SR Dr" "1006 Race St" "2634 Louisiana Ave" "3323 Rosalie Aly" ...
## $ Permit Type : Factor w/ 3 levels "Short Term Rental Commercial Owner",..: 1 1 1 2 2 2 3 2 2 2 ...
## $ Residential Subtype : chr [1:89] "N/A" "N/A" "N/A" "Residential Partial Unit" ...
## $ Current Status : chr [1:89] "Pending" "Pending" "Pending" "Pending" ...
## $ Expiration Date : chr [1:89] NA NA NA NA ...
## $ Bedroom Limit : num [1:89] 5 5 3 1 3 1 2 2 1 2 ...
## $ Guest Occupancy Limit : num [1:89] 10 10 6 2 6 2 4 4 2 4 ...
## $ Operator Name : chr [1:89] "Melissa Taranto" "Michael Heyne" "Michael Heyne" "Caroline Stas" ...
## $ License Holder Name : chr [1:89] "Scott Taranto" "Boutique Hospitality" "Resonance Home LLC" "Caroline Stas" ...
## $ Application Date : Date[1:89], format: "2022-08-09" "2022-08-09" ...
## $ Issue_Date : Date[1:89], format: NA NA ...
## $ Operator Permit Number: chr [1:89] "20-ostr-01377" "20-OSTR-30075" "20-OSTR-30075" "20-OSTR-00297" ...
## - attr(*, "spec")=
## .. cols(
## .. `Permit Number` = col_character(),
## .. Address = col_character(),
## .. `Permit Type` = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
## .. `Residential Subtype` = col_character(),
## .. `Current Status` = col_character(),
## .. `Expiration Date` = col_character(),
## .. `Bedroom Limit` = col_double(),
## .. `Guest Occupancy Limit` = col_double(),
## .. `Operator Name` = col_character(),
## .. `License Holder Name` = col_character(),
## .. `Application Date` = col_date(format = "%m/%d/%y"),
## .. Issue_Date = col_date(format = "%m/%d/%y")
## .. )
## - attr(*, "problems")=<externalptr>
Couple of things to notice here
left_join
keeps all the rows in the first table. SoOperator Permit Number
should have the same length as the rest of the table even if some are NA.Notice the use “” in the
by
rather than ``. This is because the matching is based on strings rather than names. This is a subtle difference and would occasionally trip students up.