4.3 Wrangling Data

dplyr is the workhorse package for exploratory data analysis. In particular, select, filter and mutate are useful.

  • select is used to select on columns.
  • filter is used to select on rows.
  • mutate changes/adds columns



nola_str_tib %>%
  select(c(`Permit Type`, `Residential Subtype`))
## # A tibble: 89 × 2
##    `Permit Type`                       `Residential Subtype`   
##    <fct>                               <chr>                   
##  1 Short Term Rental Commercial Owner  N/A                     
##  2 Short Term Rental Commercial Owner  N/A                     
##  3 Short Term Rental Commercial Owner  N/A                     
##  4 Short Term Rental Residential Owner Residential Partial Unit
##  5 Short Term Rental Residential Owner Residential Small Unit  
##  6 Short Term Rental Residential Owner Residential Partial Unit
##  7 Commercial STR                      N/A                     
##  8 Short Term Rental Residential Owner Residential Small Unit  
##  9 Short Term Rental Residential Owner Residential Partial Unit
## 10 Short Term Rental Residential Owner Residential Partial Unit
## # … with 79 more rows


nola_str_tib %>%
  filter(`Permit Type` == "Short Term Rental Residential Owner")
## # A tibble: 53 × 12
##    `Permit Number` Address       `Permit Type` `Residential S…` `Current Status`
##    <chr>           <chr>         <fct>         <chr>            <chr>           
##  1 <NA>            3323 Rosalie… Short Term R… Residential Par… Pending         
##  2 <NA>            1525 Melpome… Short Term R… Residential Sma… Pending         
##  3 22-RSTR-15568   3112 Octavia… Short Term R… Residential Par… Issued          
##  4 <NA>            1952 Treasur… Short Term R… Residential Sma… Pending         
##  5 <NA>            4616 S Rober… Short Term R… Residential Par… Pending         
##  6 <NA>            3043 St Phil… Short Term R… Residential Par… Pending         
##  7 22-RSTR-15454   1731 Third S… Short Term R… Residential Lar… Issued          
##  8 <NA>            1731 Third S… Short Term R… Residential Lar… Pending         
##  9 <NA>            4632 Frankli… Short Term R… Residential Par… Pending         
## 10 <NA>            621 Desire St Short Term R… Residential Par… Pending         
## # … with 43 more rows, and 7 more variables: `Expiration Date` <chr>,
## #   `Bedroom Limit` <dbl>, `Guest Occupancy Limit` <dbl>,
## #   `Operator Name` <chr>, `License Holder Name` <chr>,
## #   `Application Date` <date>, Issue_Date <date>


library(lubridate)

nola_str_tib %>%
  mutate(Backlogged = if_else(
    (today() - `Application Date` >= 15) & is.na(Issue_Date),
    T, F
  )) %>%
  filter(Backlogged == T) %>%
  select(`Address`, `Operator Name`, `License Holder Name`, `Application Date`)
## # A tibble: 42 × 4
##    Address                   `Operator Name`   `License Holde…` `Application D…`
##    <chr>                     <chr>             <chr>            <date>          
##  1 3149 Chartres St          Bruce Michael Fe… Bruce Ferweda    2022-08-05      
##  2 1952 Treasure St          Zedrick Price     Zedrick L Price  2022-08-05      
##  3 4616 S Robertson St       Paul Macres       Paul Macres      2022-08-05      
##  4 3043 St Philip St         Rachel Wellons    Rachel Wellons   2022-08-04      
##  5 1727 Henriette Delille St Jane Chaisson     David Trocquet … 2022-08-04      
##  6 1731 Third St 1B          Brenna White      Nicolette Jones  2022-08-04      
##  7 4859 Tchoupitoulas St     Jane Chaisson     David Trocquet   2022-08-04      
##  8 3826 Canal St             Jordan Jacobs     3828 Canal Stre… 2022-08-04      
##  9 4632 Franklin Ave         Cornelious Celes… Cornelious Cele… 2022-08-03      
## 10 621 Desire St             Angela Larson     Angela Rosetta … 2022-08-02      
## # … with 32 more rows

Few things to notice here.

  • There is no need to specify the tibble using $ to access the column names. dplyr assumes that you are working with the dataset that you specified earlier in the pipe. This makes the code cleaner.

  • When there are no spaces in the column names you can omit the ticks.

  • You can chain a number of functions as in mutate, filter and select to do complicated analyses in a simple way using pipe operator %>%.

  • We also used if_else to create a logical variable called Backlogged. Notice that mutate automatically creates and populates this new column. Also notice the date arithmetic as well as logical operator &. Be careful to make sure that the new column is the same length as the rest of the dataset.

  • Notice the use of (today()...). They are used to improve readability of code but also to specify the order of operations.