4.2 Reading external data

readr package, part of tidyverse reads flat files. Most of readr’s functions are concerned with turning flat files into tibbles. In particular, * read_csv reads comma delimited files * read_csv2 reads semicolon separated files (common in countries where , is used as the decimal place) * read_tsv reads tab delimited files * read_delim reads in files with any delimiter.

Notice the _ in the functions compared to . in base R.

You can download the data used in the rest of the chapter

library(tidyverse)

nola_str_tib <- read_csv("data/NOLA_STR.csv")
## Rows: 89 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): Permit Number, Address, Permit Type, Residential Subtype, Current ...
## dbl  (2): Bedroom Limit, Guest Occupancy Limit
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

nola_str_tib
## # A tibble: 89 × 12
##    `Permit Number` Address       `Permit Type` `Residential S…` `Current Status`
##    <chr>           <chr>         <chr>         <chr>            <chr>           
##  1 <NA>            1717 Robert … Short Term R… N/A              Pending         
##  2 <NA>            1006 Race St  Short Term R… N/A              Pending         
##  3 <NA>            2634 Louisia… Short Term R… N/A              Pending         
##  4 <NA>            3323 Rosalie… Short Term R… Residential Par… Pending         
##  5 <NA>            1525 Melpome… Short Term R… Residential Sma… Pending         
##  6 22-RSTR-15568   3112 Octavia… Short Term R… Residential Par… Issued          
##  7 <NA>            3149 Chartre… Commercial S… N/A              Pending         
##  8 <NA>            1952 Treasur… Short Term R… Residential Sma… Pending         
##  9 <NA>            4616 S Rober… Short Term R… Residential Par… Pending         
## 10 <NA>            3043 St Phil… Short Term R… Residential Par… Pending         
## # … with 79 more rows, and 7 more variables: `Expiration Date` <chr>,
## #   `Bedroom Limit` <dbl>, `Guest Occupancy Limit` <dbl>,
## #   `Operator Name` <chr>, `License Holder Name` <chr>,
## #   `Application Date` <chr>, Issue_Date <chr>

Sometimes there are a few lines of metadata at the top of the file. You can use skip = n to skip the first n lines; or use comment = "#" to drop all lines that start with (e.g.) #.

By default, read_csv assumes that the first line is a header. If this is not the case, use col_names = FALSE or pass a character vector to col_names

readr uses a heuristic to figure out the type of each column: it reads the first 1000 rows and uses some (moderately conservative) heuristics to figure out the type of each column. In large data files the first 1000 rows may not sufficiently general, so you might have to specify the col_types.


nola_str_tib <- read_csv("data/NOLA_STR.csv",
  col_types = cols(
    `Permit Number` = col_character(),
    `Address` = col_character(),
    `Permit Type` = col_factor(),
    `Application Date` = col_date(format = "%m/%d/%y"),
    `Issue_Date` = col_date(format = "%m/%d/%y")
  )
)

str(nola_str_tib)
## spec_tbl_df [89 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Permit Number        : chr [1:89] NA NA NA NA ...
##  $ Address              : chr [1:89] "1717 Robert C Blakes, SR Dr" "1006 Race St" "2634 Louisiana Ave" "3323 Rosalie Aly" ...
##  $ Permit Type          : Factor w/ 3 levels "Short Term Rental Commercial Owner",..: 1 1 1 2 2 2 3 2 2 2 ...
##  $ Residential Subtype  : chr [1:89] "N/A" "N/A" "N/A" "Residential Partial Unit" ...
##  $ Current Status       : chr [1:89] "Pending" "Pending" "Pending" "Pending" ...
##  $ Expiration Date      : chr [1:89] NA NA NA NA ...
##  $ Bedroom Limit        : num [1:89] 5 5 3 1 3 1 2 2 1 2 ...
##  $ Guest Occupancy Limit: num [1:89] 10 10 6 2 6 2 4 4 2 4 ...
##  $ Operator Name        : chr [1:89] "Melissa Taranto" "Michael Heyne" "Michael Heyne" "Caroline Stas" ...
##  $ License Holder Name  : chr [1:89] "Scott Taranto" "Boutique Hospitality" "Resonance Home LLC" "Caroline Stas" ...
##  $ Application Date     : Date[1:89], format: "2022-08-09" "2022-08-09" ...
##  $ Issue_Date           : Date[1:89], format: NA NA ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   `Permit Number` = col_character(),
##   ..   Address = col_character(),
##   ..   `Permit Type` = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   `Residential Subtype` = col_character(),
##   ..   `Current Status` = col_character(),
##   ..   `Expiration Date` = col_character(),
##   ..   `Bedroom Limit` = col_double(),
##   ..   `Guest Occupancy Limit` = col_double(),
##   ..   `Operator Name` = col_character(),
##   ..   `License Holder Name` = col_character(),
##   ..   `Application Date` = col_date(format = "%m/%d/%y"),
##   ..   Issue_Date = col_date(format = "%m/%d/%y")
##   .. )
##  - attr(*, "problems")=<externalptr>

Notice how Permit Type is now read in as a factor instead of a character.