3.4 Importing data
Although creating data frames from existing data structures is extremely useful, by far the most common approach is to create a data frame by importing data from an external file. To do this, you’ll need to have your data correctly formatted and saved in a file format that R can recognise. Fortunately, R recognises a wide variety of file formats, although in practice you’ll probably end up using only two or three regularly.
As an example, the data frame below is a subset of Short Term Rental Applications and their status in Orleans Parish (City of New Orleans). It includes the owner name, operator name, permit type, status and expiration date, as well as the number of bedrooms and the guest occupancy limit.
In base R, we typically use read.csv() or read.table() to read in an external file. Please pay attention to the file path and modify it to match where the file lives on your own computer. The table below shows the first few rows of the data, and the str() output that follows shows how R has stored each variable.
| Permit.Number | Address | Permit.Type | Residential.Subtype | Current.Status | Expiration.Date | Bedroom.Limit | Guest.Occupancy.Limit | Operator.Name | License.Holder.Name | Application.Date | Issue_Date |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | 1717 Robert C Blakes, SR Dr | Short Term Rental Commercial Owner | N/A | Pending | | 5 | 10 | Melissa Taranto | Scott Taranto | 8/9/22 | |
| | 1006 Race St | Short Term Rental Commercial Owner | N/A | Pending | | 5 | 10 | Michael Heyne | Boutique Hospitality | 8/9/22 | |
| | 2634 Louisiana Ave | Short Term Rental Commercial Owner | N/A | Pending | | 3 | 6 | Michael Heyne | Resonance Home LLC | 8/9/22 | |
| | 3323 Rosalie Aly | Short Term Rental Residential Owner | Residential Partial Unit | Pending | | 1 | 2 | Caroline Stas | Caroline Stas | 8/9/22 | |
| | 1525 Melpomene St | Short Term Rental Residential Owner | Residential Small Unit | Pending | | 3 | 6 | Craig Redgrave | Craig R Redgrave | 8/8/22 | |
| 22-RSTR-15568 | 3112 Octavia St | Short Term Rental Residential Owner | Residential Partial Unit | Issued | 8/4/23 | 1 | 2 | Philip Wheeler | Philip Barrett Wheeler | 8/5/22 | 8/5/22 |
| 22-RSTR-14732 | 113 S Salcedo St | Short Term Rental Residential Owner | Residential Partial Unit | Issued | 8/2/23 | 1 | 2 | Michael Springer | Starr Nia | 7/24/22 | 8/3/22 |
| 22-RSTR-14693 | 1212 Mazant St A | Short Term Rental Residential Owner | Residential Partial Unit | Issued | 7/31/23 | 3 | 6 | Joseph Kennedy | Joseph Kennedy | 7/23/22 | 8/1/22 |
| | 2421 Chartres St | Short Term Rental Residential Owner | Residential Small Unit | Pending | | 2 | 4 | Marlene Sheely | Marlene Sheely | 7/23/22 | |
| 22-RSTR-14696 | 1214 Mazant St A | Short Term Rental Residential Owner | Residential Small Unit | Issued | 7/31/23 | 3 | 6 | Joseph Kennedy | Joseph Kennedy | 7/23/22 | 8/1/22 |
| | 814 Baronne St | Short Term Rental Commercial Owner | N/A | Pending | | 3 | 6 | Danae Columbus | Danae J Columbus | 7/22/22 | |
| 22-RSTR-14667 | 823 Mandeville St | Short Term Rental Residential Owner | Residential Partial Unit | Issued | 8/1/23 | 2 | 4 | Rachel Anthony | Rachel Anthony | 7/22/22 | 8/2/22 |
## 'data.frame': 89 obs. of 12 variables:
## $ Permit.Number : chr "" "" "" "" ...
## $ Address : chr "1717 Robert C Blakes, SR Dr" "1006 Race St" "2634 Louisiana Ave" "3323 Rosalie Aly" ...
## $ Permit.Type : chr "Short Term Rental Commercial Owner" "Short Term Rental Commercial Owner" "Short Term Rental Commercial Owner" "Short Term Rental Residential Owner" ...
## $ Residential.Subtype : chr "N/A" "N/A" "N/A" "Residential Partial Unit" ...
## $ Current.Status : chr "Pending" "Pending" "Pending" "Pending" ...
## $ Expiration.Date : chr "" "" "" "" ...
## $ Bedroom.Limit : int 5 5 3 1 3 1 2 2 1 2 ...
## $ Guest.Occupancy.Limit: int 10 10 6 2 6 2 4 4 2 4 ...
## $ Operator.Name : chr "Melissa Taranto" "Michael Heyne" "Michael Heyne" "Caroline Stas" ...
## $ License.Holder.Name : chr "Scott Taranto" "Boutique Hospitality" "Resonance Home LLC" "Caroline Stas" ...
## $ Application.Date : chr "8/9/22" "8/9/22" "8/9/22" "8/9/22" ...
## $ Issue_Date : chr "" "" "" "" ...
There are a couple of important things to bear in mind about data frames.

- These types of objects are known as rectangular data (or tidy data), as each column must have the same number of observations. Any missing data should be recorded as an NA, just as we did with our vectors.
- R guesses the data type of each column by sampling a few rows, and it often gets it wrong. This is why I explicitly specified stringsAsFactors = FALSE, to prevent character columns from being read as factors. Ideally you want to store variables such as Current.Status and Permit.Type as factors, and Address and Operator.Name as character.
- Check how the date variables such as Issue_Date and Application.Date are read and stored; str() is often your friend (see the sketch below for one way to convert them).
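For example, Application.Date and Issue_Date are stored as character vectors ("8/9/22" and so on), not as dates. A minimal sketch of converting them with base R's as.Date() is shown below; the format string assumes the month/day/two-digit-year layout seen in the str() output above.

# convert the character dates to Date objects; "%m/%d/%y" assumes
# month/day/two-digit-year, as in "8/9/22"
nola_str$Application.Date <- as.Date(nola_str$Application.Date, format = "%m/%d/%y")
nola_str$Issue_Date       <- as.Date(nola_str$Issue_Date, format = "%m/%d/%y")

# check that the conversion worked
str(nola_str$Application.Date)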
Take a look at this video for a quick introduction to importing data in R
3.4.1 Saving files to import
The easiest way to create a data file to import into R is to enter your data into a spreadsheet using either Microsoft Excel or LibreOffice Calc and save the spreadsheet as a comma separated (csv) or tab delimited file. We prefer LibreOffice Calc as it’s open source, platform independent and free, but MS Excel is OK too (see here for some gotchas).
There are a couple of things to bear in mind when saving files to import into R which will make your life easier in the long run.
- Keep your column headings (if you have them) short and informative. Also avoid spaces in your column headings by replacing them with an underscore or a dot (i.e. replace operator name with operator_name or operator.name).
- Avoid using special characters in column headings (e.g. floor area (ft^2)).
- Remember, if you have missing data in your data frame (empty cells) you should use an NA to represent these missing values. This will keep the data frame tidy (a small example file is sketched below).
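To make this concrete, here is a purely illustrative sketch of what the first few lines of a tidy, import-ready csv file might look like (the values are made up; note the short, space-free column names and the NA used for the missing permit number and expiration date):

permit_number,address,current_status,bedroom_limit,expiration_date
22-RSTR-15568,3112 Octavia St,Issued,1,8/4/23
NA,1006 Race St,Pending,5,NA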
3.4.2 Import functions
Once you’ve saved your data file in a suitable format, you can read it into R. The workhorse function for importing data into R is the read.csv() function (we discuss some alternatives later in the chapter). read.csv() is a very flexible function with a shed load of arguments (see ?read.csv), but it’s quite simple to use.
Let’s import a comma separated file called nola_STR.csv, which contains the data we saw previously in this chapter, and assign it to an object called nola_str. The file is located in a data directory which itself is located in our root directory. The first row of the data contains the variable (column) names. To use the read.csv() function to import this file:
nola_str <- read.csv(file = 'data/nola_STR.csv', header = TRUE,
                     stringsAsFactors = FALSE)
There are a few things to note about the above command. First, the file path and the filename (including the file extension) need to be enclosed in either single or double quotes (i.e. the data/nola_STR.csv bit), as the read.csv() function expects this to be a character string. If your working directory is already set to the directory which contains the file, you don’t need to include the entire file path, just the filename. In the example above, the file path is separated with a single forward slash /. This will work regardless of the operating system you are using, and we recommend you stick with this. However, Windows users may be more familiar with the single backslash notation, and if you want to keep using this you will need to include them as double backslashes. Note though that the double backslash notation will not work on computers using macOS or Linux operating systems.
nola_str <- read.csv(file = 'C:\\Documents\\Prog1\\data\\nola_STR.csv',
                     header = TRUE, stringsAsFactors = FALSE)
The header = TRUE argument specifies that the first row of your data contains the variable names (i.e. Permit.Number, Address etc). If this is not the case you can specify header = FALSE. In fact, header = TRUE is already the default for read.csv(), so you can omit this argument entirely. read.csv() also sets sep = "," by default, which tells R that the columns are separated by commas; if your file were tab delimited instead, you would need sep = "\t" (or the read.delim() function described below).
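For instance, to read a tab delimited version of the same data (the data/nola_STR.txt file used later in this section), you could add the sep argument explicitly. This is just a sketch, assuming that file exists and has the same columns:

# explicit separator: tab ("\t") instead of the read.csv() default ","
nola_str_tab <- read.csv(file = 'data/nola_STR.txt', header = TRUE,
                         sep = "\t", stringsAsFactors = FALSE)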
After importing our data into R it doesn’t appear that R has done much, at least nothing appears in the R Console! To see the contents of the data frame we could just type the name of the object as we have done previously. BUT before you do that, think about why you’re doing this. If your data frame is anything other than tiny, all you’re going to do is fill up your Console with data. It’s not like you can easily check whether there are any errors or that your data has been imported correctly. A much better solution is to use our old friend the str()
function to return a compact and informative summary of your data frame.
str(nola_str)
## 'data.frame': 89 obs. of 12 variables:
## $ Permit.Number : chr "" "" "" "" ...
## $ Address : chr "1717 Robert C Blakes, SR Dr" "1006 Race St" "2634 Louisiana Ave" "3323 Rosalie Aly" ...
## $ Permit.Type : chr "Short Term Rental Commercial Owner" "Short Term Rental Commercial Owner" "Short Term Rental Commercial Owner" "Short Term Rental Residential Owner" ...
## $ Residential.Subtype : chr "N/A" "N/A" "N/A" "Residential Partial Unit" ...
## $ Current.Status : chr "Pending" "Pending" "Pending" "Pending" ...
## $ Expiration.Date : chr "" "" "" "" ...
## $ Bedroom.Limit : int 5 5 3 1 3 1 2 2 1 2 ...
## $ Guest.Occupancy.Limit: int 10 10 6 2 6 2 4 4 2 4 ...
## $ Operator.Name : chr "Melissa Taranto" "Michael Heyne" "Michael Heyne" "Caroline Stas" ...
## $ License.Holder.Name : chr "Scott Taranto" "Boutique Hospitality" "Resonance Home LLC" "Caroline Stas" ...
## $ Application.Date : chr "8/9/22" "8/9/22" "8/9/22" "8/9/22" ...
## $ Issue_Date : chr "" "" "" "" ...
Here we see that nola_str is a ‘data.frame’ object which contains 89 rows and 12 variables (columns). Each of the variables is listed along with its data class and the first 10 values. As we mentioned previously in this chapter, it can be quite convenient to copy and paste this into your R script as a comment block for later reference.
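Besides str(), a few other base R functions are handy for a quick sanity check after importing:

head(nola_str)     # first six rows
dim(nola_str)      # number of rows and columns
summary(nola_str)  # per-column summary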
Note that from R version 4.0.0 onwards you can leave out the stringsAsFactors = FALSE argument in the read.csv() call above, as FALSE is now the default.
Other useful arguments include dec =
and na.strings =
. The dec =
argument allows you to change the default character (.
) used for a decimal point. This is useful if you’re in a country where decimal places are usually represented by a comma (i.e. dec = ","
). The na.strings =
argument allows you to import data where missing values are represented with a symbol other than NA
. This can be quite common if you are importing data from other statistical software such as Minitab which represents missing values as a *
(na.strings = "*"
).
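As a sketch (the file names here are purely illustrative), these arguments would be used like so:

# hypothetical Minitab export where '*' marks missing values
minitab_dat <- read.csv(file = 'data/minitab_export.csv', na.strings = "*")

# hypothetical European-style file: semicolon separated, comma as decimal point
eu_dat <- read.csv(file = 'data/eu_export.csv', sep = ";", dec = ",")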
If we just wanted to see the names of our variables (columns) in the data frame we can use the names() function, which will return a character vector of the variable names.

names(nola_str)

##  [1] "Permit.Number"         "Address"               "Permit.Type"
##  [4] "Residential.Subtype"   "Current.Status"        "Expiration.Date"
##  [7] "Bedroom.Limit"         "Guest.Occupancy.Limit" "Operator.Name"
## [10] "License.Holder.Name"   "Application.Date"      "Issue_Date"
read.csv
is basically a thin wrapper around the workhorse function, read.table
. R has a number of variants of the read.table()
function that you can use to import a variety of file formats. Actually, these variants just use the read.table()
function but include different combinations of arguments by default to help import different file types. Similar to read.csv()
, you can also use read.csv2()
and read.delim()
functions. The read.csv()
function is used to import comma separated value (.csv) files and assumes that the data in columns are separated by a comma (it sets sep = ","
by default). It also assumes that the first row of the data contains the variable names by default (it sets header = TRUE
by default). The read.csv2()
function assumes data are separated by semicolons and that a comma is used instead of a decimal point (as in many European countries). The read.delim()
function is used to import tab delimited data and also assumes that the first row of the data contains the variable names by default.
# import .csv file
nola_str <- read.csv(file = 'data/nola_STR.csv')

# import .csv file with dec = "," and sep = ";"
nola_str <- read.csv2(file = 'data/nola_STR.csv')

# import tab delim file with sep = "\t"
nola_str <- read.delim(file = 'data/nola_STR.txt')
You can even import spreadsheet files from MS Excel or other statistics software directly into R using add-on packages, but our advice is that this should generally be avoided if possible, as it just adds a layer of uncertainty between you and your data. In our opinion it’s almost always better to export your spreadsheets as tab or comma delimited files and then import them into R using the read.table() family of functions. If you’re hell bent on directly importing data from other software, you will need to install the foreign package, which has functions for importing Minitab, SPSS, Stata and SAS files, or the xlsx package to import Excel spreadsheets.
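If you do go down that route, a minimal sketch with the xlsx package might look like the following. This assumes the package (and the Java runtime it depends on) is installed and that an Excel copy of the data exists; the file name and sheet index are illustrative:

library(xlsx)

# read the first worksheet of a hypothetical Excel copy of the data
nola_xl <- read.xlsx(file = 'data/nola_STR.xlsx', sheetIndex = 1)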
3.4.3 Common import frustrations
It’s quite common to get a bunch of really frustrating error messages when you first start importing data into R. Perhaps the most common is
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
  cannot open file 'nola_STR.csv': No such file or directory
This error message is telling you that R cannot find the file you are trying to import. It usually rears its head for one of a couple of reasons (or all of them!). The first is that you’ve made a mistake in the spelling of either the filename or the file path. Another common mistake is that you have forgotten to include the file extension in the filename (e.g. .csv or .txt). Lastly, the file is not where you say it is, or you’ve used an incorrect file path. Using RStudio Projects and having a logical directory structure goes a long way towards avoiding these types of errors.
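A few base R functions can help you track down where the problem is:

getwd()                          # which directory is R currently working in?
list.files('data')               # which files does R actually see in 'data/'?
file.exists('data/nola_STR.csv') # TRUE if the path is correct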
3.4.4 Other import options
There are numerous other functions to import data from a variety of sources and formats. Most of these functions are contained in packages that you will need to install before using them. We list a couple of the more useful packages and functions below.
The fread() function from the data.table package is great for importing large data files quickly and efficiently (much faster than the read.table()
function). One of the great things about the fread()
function is that it will automatically detect many of the arguments you would normally need to specify (like sep =
etc). One of the things you will need to consider though is that the fread()
function will return a data.table
object not a data.frame
as would be the case with the read.table()
function. This is usually not a problem as you can pass a data.table
object to any function that only accepts data.frame
objects. To learn more about the differences between data.table
and data.frame
objects see here.
library(data.table)

all_data <- fread(file = 'data/nola_STR.txt')
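If you would rather work with a plain data.frame, you can convert the result explicitly (a one-line sketch):

# data.table objects can be converted to ordinary data frames
all_data_df <- as.data.frame(all_data)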
Various functions from the readr
package are also very efficient at reading in large data files. The readr
package is part of the ‘tidyverse’ collection of packages and provides many equivalent functions to base R for importing data. The readr
functions are used in a similar way to the read.table()
or read.csv()
functions and many of the arguments are the same (see ?readr::read_table
for more details). There are however some differences. For example, when using the read_table()
function the header = TRUE
argument is replaced by col_names = TRUE
and the function returns a tibble
class object which is the tidyverse equivalent of a data.frame
object (see here for differences).
In PLAN 672, we almost exclusively use functions from the readr package. I strongly recommend it.
library(readr)

# import white space delimited files
all_data <- read_table(file = 'data/nola_STR.txt', col_names = TRUE)

# import comma delimited files
all_data <- read_csv(file = 'data/nola_STR.csv')

# import tab delimited files
all_data <- read_delim(file = 'data/nola_STR.txt', delim = "\t")

# or use
all_data <- read_tsv(file = 'data/nola_STR.txt')
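readr also lets you declare column types up front instead of relying on type guessing. A sketch, assuming the column headers match those shown by str() earlier (adjust the names to whatever the headers in your file actually are):

all_data <- read_csv(file = 'data/nola_STR.csv',
                     col_types = cols(Bedroom.Limit         = col_integer(),
                                      Guest.Occupancy.Limit = col_integer(),
                                      .default              = col_character()))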
If your data file is ginormous, then the ff
and bigmemory
packages may be useful as they both contain import functions that are able to store large data in a memory efficient manner. You can find out more about these functions here and here.