4.2 Reading external data
readr
package, part of tidyverse reads flat files. Most of readr’s
functions are concerned with turning flat files into tibbles. In
particular, * read_csv
reads comma delimited files * read_csv2
reads semicolon separated files (common in countries where , is used as
the decimal place) * read_tsv
reads tab delimited files *
read_delim
reads in files with any delimiter.
Notice the _
in the functions compared to .
in base R.
You can download the data used in the rest of the chapter
library(tidyverse)
<- read_csv("data/NOLA_STR.csv")
nola_str_tib ## Rows: 89 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): Permit Number, Address, Permit Type, Residential Subtype, Current ...
## dbl (2): Bedroom Limit, Guest Occupancy Limit
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nola_str_tib## # A tibble: 89 × 12
## `Permit Number` Address `Permit Type` `Residential S…` `Current Status`
## <chr> <chr> <chr> <chr> <chr>
## 1 <NA> 1717 Robert … Short Term R… N/A Pending
## 2 <NA> 1006 Race St Short Term R… N/A Pending
## 3 <NA> 2634 Louisia… Short Term R… N/A Pending
## 4 <NA> 3323 Rosalie… Short Term R… Residential Par… Pending
## 5 <NA> 1525 Melpome… Short Term R… Residential Sma… Pending
## 6 22-RSTR-15568 3112 Octavia… Short Term R… Residential Par… Issued
## 7 <NA> 3149 Chartre… Commercial S… N/A Pending
## 8 <NA> 1952 Treasur… Short Term R… Residential Sma… Pending
## 9 <NA> 4616 S Rober… Short Term R… Residential Par… Pending
## 10 <NA> 3043 St Phil… Short Term R… Residential Par… Pending
## # … with 79 more rows, and 7 more variables: `Expiration Date` <chr>,
## # `Bedroom Limit` <dbl>, `Guest Occupancy Limit` <dbl>,
## # `Operator Name` <chr>, `License Holder Name` <chr>,
## # `Application Date` <chr>, Issue_Date <chr>
Sometimes there are a few lines of metadata at the top of the file. You
can use skip = n
to skip the first n lines; or use comment = "#"
to
drop all lines that start with (e.g.) #.
By default, read_csv
assumes that the first line is a header. If this
is not the case, use col_names = FALSE
or pass a character vector to
col_names
readr
uses a heuristic to figure out the type of each column: it reads
the first 1000 rows and uses some (moderately conservative) heuristics
to figure out the type of each column. In large data files the first
1000 rows may not sufficiently general, so you might have to specify the
col_types
.
<- read_csv("data/NOLA_STR.csv",
nola_str_tib col_types = cols(
`Permit Number` = col_character(),
`Address` = col_character(),
`Permit Type` = col_factor(),
`Application Date` = col_date(format = "%m/%d/%y"),
`Issue_Date` = col_date(format = "%m/%d/%y")
)
)
str(nola_str_tib)
## spec_tbl_df [89 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Permit Number : chr [1:89] NA NA NA NA ...
## $ Address : chr [1:89] "1717 Robert C Blakes, SR Dr" "1006 Race St" "2634 Louisiana Ave" "3323 Rosalie Aly" ...
## $ Permit Type : Factor w/ 3 levels "Short Term Rental Commercial Owner",..: 1 1 1 2 2 2 3 2 2 2 ...
## $ Residential Subtype : chr [1:89] "N/A" "N/A" "N/A" "Residential Partial Unit" ...
## $ Current Status : chr [1:89] "Pending" "Pending" "Pending" "Pending" ...
## $ Expiration Date : chr [1:89] NA NA NA NA ...
## $ Bedroom Limit : num [1:89] 5 5 3 1 3 1 2 2 1 2 ...
## $ Guest Occupancy Limit: num [1:89] 10 10 6 2 6 2 4 4 2 4 ...
## $ Operator Name : chr [1:89] "Melissa Taranto" "Michael Heyne" "Michael Heyne" "Caroline Stas" ...
## $ License Holder Name : chr [1:89] "Scott Taranto" "Boutique Hospitality" "Resonance Home LLC" "Caroline Stas" ...
## $ Application Date : Date[1:89], format: "2022-08-09" "2022-08-09" ...
## $ Issue_Date : Date[1:89], format: NA NA ...
## - attr(*, "spec")=
## .. cols(
## .. `Permit Number` = col_character(),
## .. Address = col_character(),
## .. `Permit Type` = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
## .. `Residential Subtype` = col_character(),
## .. `Current Status` = col_character(),
## .. `Expiration Date` = col_character(),
## .. `Bedroom Limit` = col_double(),
## .. `Guest Occupancy Limit` = col_double(),
## .. `Operator Name` = col_character(),
## .. `License Holder Name` = col_character(),
## .. `Application Date` = col_date(format = "%m/%d/%y"),
## .. Issue_Date = col_date(format = "%m/%d/%y")
## .. )
## - attr(*, "problems")=<externalptr>
Notice how Permit Type
is now read in as a factor instead of a
character.