3.4 Importing data

Although creating data frames from existing data structures is extremely useful, by far the most common approach is to create a data frame by importing data from an external file. To do this, you’ll need to have your data correctly formatted and saved in a file format that R is able to recognise. Fortunately for us, R is able to recognise a wide variety of file formats, although in reality you’ll probably end up only using two or three regularly.

As an example, the data frame below is a subset of Short Term Rental Applications and their status in Orleans Parish (City of New Orleans). It includes the Owner Name, Operator Name, Permit Type, Status and Expiration Date, as well as the number of bedrooms and the guest occupancy limit.

In base R, we typically use read.csv() or read.table() to read in an external file. Pay attention to the file path and modify it to match where the file is saved on your own machine.
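As a sketch (assuming the file has been saved as data/nola_STR.csv, the path used later in this chapter), an import along these lines produces the data preview and the structure summary shown below:

nola_str <- read.csv(file = 'data/nola_STR.csv', header = TRUE,
                     stringsAsFactors = FALSE)
head(nola_str, 12)   # first twelve rows (the table below)
str(nola_str)        # compact summary of each column (shown after the table)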

 

| Permit.Number | Address | Permit.Type | Residential.Subtype | Current.Status | Expiration.Date | Bedroom.Limit | Guest.Occupancy.Limit | Operator.Name | License.Holder.Name | Application.Date | Issue_Date |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | 1717 Robert C Blakes, SR Dr | Short Term Rental Commercial Owner | N/A | Pending | | 5 | 10 | Melissa Taranto | Scott Taranto | 8/9/22 | |
| | 1006 Race St | Short Term Rental Commercial Owner | N/A | Pending | | 5 | 10 | Michael Heyne | Boutique Hospitality | 8/9/22 | |
| | 2634 Louisiana Ave | Short Term Rental Commercial Owner | N/A | Pending | | 3 | 6 | Michael Heyne | Resonance Home LLC | 8/9/22 | |
| | 3323 Rosalie Aly | Short Term Rental Residential Owner | Residential Partial Unit | Pending | | 1 | 2 | Caroline Stas | Caroline Stas | 8/9/22 | |
| | 1525 Melpomene St | Short Term Rental Residential Owner | Residential Small Unit | Pending | | 3 | 6 | Craig Redgrave | Craig R Redgrave | 8/8/22 | |
| 22-RSTR-15568 | 3112 Octavia St | Short Term Rental Residential Owner | Residential Partial Unit | Issued | 8/4/23 | 1 | 2 | Philip Wheeler | Philip Barrett Wheeler | 8/5/22 | 8/5/22 |
| 22-RSTR-14732 | 113 S Salcedo St | Short Term Rental Residential Owner | Residential Partial Unit | Issued | 8/2/23 | 1 | 2 | Michael Springer | Starr Nia | 7/24/22 | 8/3/22 |
| 22-RSTR-14693 | 1212 Mazant St A | Short Term Rental Residential Owner | Residential Partial Unit | Issued | 7/31/23 | 3 | 6 | Joseph Kennedy | Joseph Kennedy | 7/23/22 | 8/1/22 |
| | 2421 Chartres St | Short Term Rental Residential Owner | Residential Small Unit | Pending | | 2 | 4 | Marlene Sheely | Marlene Sheely | 7/23/22 | |
| 22-RSTR-14696 | 1214 Mazant St A | Short Term Rental Residential Owner | Residential Small Unit | Issued | 7/31/23 | 3 | 6 | Joseph Kennedy | Joseph Kennedy | 7/23/22 | 8/1/22 |
| | 814 Baronne St | Short Term Rental Commercial Owner | N/A | Pending | | 3 | 6 | Danae Columbus | Danae J Columbus | 7/22/22 | |
| 22-RSTR-14667 | 823 Mandeville St | Short Term Rental Residential Owner | Residential Partial Unit | Issued | 8/1/23 | 2 | 4 | Rachel Anthony | Rachel Anthony | 7/22/22 | 8/2/22 |
## 'data.frame':    89 obs. of  12 variables:
##  $ Permit.Number        : chr  "" "" "" "" ...
##  $ Address              : chr  "1717 Robert C Blakes, SR Dr" "1006 Race St" "2634 Louisiana Ave" "3323 Rosalie Aly" ...
##  $ Permit.Type          : chr  "Short Term Rental Commercial Owner" "Short Term Rental Commercial Owner" "Short Term Rental Commercial Owner" "Short Term Rental Residential Owner" ...
##  $ Residential.Subtype  : chr  "N/A" "N/A" "N/A" "Residential Partial Unit" ...
##  $ Current.Status       : chr  "Pending" "Pending" "Pending" "Pending" ...
##  $ Expiration.Date      : chr  "" "" "" "" ...
##  $ Bedroom.Limit        : int  5 5 3 1 3 1 2 2 1 2 ...
##  $ Guest.Occupancy.Limit: int  10 10 6 2 6 2 4 4 2 4 ...
##  $ Operator.Name        : chr  "Melissa Taranto" "Michael Heyne" "Michael Heyne" "Caroline Stas" ...
##  $ License.Holder.Name  : chr  "Scott Taranto" "Boutique Hospitality" "Resonance Home LLC" "Caroline Stas" ...
##  $ Application.Date     : chr  "8/9/22" "8/9/22" "8/9/22" "8/9/22" ...
##  $ Issue_Date           : chr  "" "" "" "" ...

 

There are a couple of important things to bear in mind about data frames.

  • These types of objects are known as rectangular data (or tidy data) because each column must contain the same number of observations. Any missing data should be recorded as an NA, just as we did with our vectors.
  • R guesses the data type of each column when it imports the file, and it doesn’t always guess the way you want. This is why I explicitly specified stringsAsFactors = FALSE, to prevent the character columns being read as factors. Ideally you want to store variables such as Current.Status and Permit.Type as factors and Address and Operator.Name as character.
  • Check to see how the date variables such as Issue_Date and Application.Date are read and stored; str() is often your friend here. A sketch of converting the factors and dates yourself follows this list.
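As a sketch (assuming the data frame has been imported as nola_str, as in the code above), you could convert the categorical variables to factors and parse the dates yourself after importing:

# convert the categorical variables to factors
nola_str$Current.Status <- factor(nola_str$Current.Status)
nola_str$Permit.Type    <- factor(nola_str$Permit.Type)

# parse the dates, which arrive as "month/day/year" character strings
nola_str$Application.Date <- as.Date(nola_str$Application.Date, format = "%m/%d/%y")
nola_str$Issue_Date       <- as.Date(nola_str$Issue_Date, format = "%m/%d/%y")

# check the result
str(nola_str[, c("Current.Status", "Permit.Type", "Application.Date", "Issue_Date")])

Note that as.Date() turns the empty strings in Issue_Date into NA values, which is exactly what we want for missing dates.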

 

Take a look at this video for a quick introduction to importing data in R.

3.4.1 Saving files to import

The easiest method of creating a data file to import into R is to enter your data into a spreadsheet using either Microsoft Excel or LibreOffice Calc and save the spreadsheet as a tab delimited or comma separated value (csv) file. We prefer LibreOffice Calc as it’s open source, platform independent and free, but MS Excel is OK too (though see here for some gotchas).

There are a couple of things to bear in mind when saving files to import into R which will make your life easier in the long run.

  • Keep your column headings (if you have them) short and informative. Also avoid spaces in your column headings by replacing them with an underscore or a dot (e.g. replace operator name with operator_name or operator.name). A sketch of renaming awkward columns after import follows this list.
  • Avoid using special characters (e.g. floor area (ft^2)).
  • Remember, if you have missing data in your data frame (empty cells) you should use an NA to represent these missing values. This will keep the data frame tidy.
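If you find yourself with awkward column headings anyway, read.csv() will sanitise them when check.names = TRUE (the default), and you can also rename columns yourself after import. A minimal sketch with a hypothetical file and column names:

# hypothetical example: tidy up awkward column names after import
dat <- read.csv(file = 'data/messy_headers.csv')   # file name is for illustration only
names(dat)                                          # inspect the imported names
names(dat)[1] <- "operator_name"                    # rename a single column by position
names(dat) <- make.names(names(dat))                # or sanitise all names in one go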

3.4.2 Import functions

Once you’ve saved your data file in a suitable format we can now read this file into R. The workhorse function for importing data into R is the read.csv() function (we discuss some alternatives later in the chapter). The read.csv() function is a very flexible function with a shed load of arguments (see ?read.csv) but it’s quite simple to use.

Let’s import a comma separated value file called nola_STR.csv which contains the data we saw previously in this Chapter and assign it to an object called nola_str. The file is located in a data directory which itself is located in our root directory. The first row of the data contains the variable (column) names. To use the read.csv() function to import this file

nola_str <- read.csv(file = 'data/nola_STR.csv', header = TRUE,
                        stringsAsFactors = FALSE)

There are a few things to note about the above command. First, the file path and the filename (including the file extension) need to be enclosed in either single or double quotes (i.e. the data/nola_STR.csv bit) as the read.csv() function expects this to be a character string. If your working directory is already set to the directory which contains the file, you don’t need to include the entire file path, just the filename. In the example above, the file path is separated with a single forward slash /. This will work regardless of the operating system you are using and we recommend you stick with this. However, Windows users may be more familiar with the single backslash notation and if you want to keep using this you will need to include them as double backslashes. Note though that the double backslash notation will not work on computers using macOS or Linux operating systems.

nola_str <- read.csv(file = 'C:\\Documents\\Prog1\\data\\nola_STR.csv', 
                      header = TRUE, stringsAsFactors = FALSE)
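Another option, which sidesteps the slash question entirely, is to build the path with the file.path() function, which joins the pieces with the correct separator for you:

# build the file path with file.path() rather than typing separators yourself
csv_path <- file.path('data', 'nola_STR.csv')
nola_str <- read.csv(file = csv_path, header = TRUE, stringsAsFactors = FALSE)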

The header = TRUE argument specifies that the first row of your data contains the variable names (i.e. Permit.Number, Address etc.). If this is not the case you can specify header = FALSE (in fact, header = TRUE is the default for read.csv(), so you could omit this argument entirely). The read.csv() function also assumes that the columns are separated by commas (it sets sep = "," by default); if your file used a different delimiter, such as a tab, you would pass sep = "\t" to the read.table() function instead.
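As a sketch, if the same data had been saved as a tab delimited file (a hypothetical data/nola_STR.txt is assumed here), the equivalent read.table() call would look something like this:

nola_str <- read.table(file = 'data/nola_STR.txt', header = TRUE,
                       sep = "\t", stringsAsFactors = FALSE)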

After importing our data into R it doesn’t appear that R has done much, at least nothing appears in the R Console! To see the contents of the data frame we could just type the name of the object as we have done previously. BUT before you do that, think about why you’re doing this. If your data frame is anything other than tiny, all you’re going to do is fill up your Console with data. It’s not like you can easily check whether there are any errors or that your data has been imported correctly. A much better solution is to use our old friend the str() function to return a compact and informative summary of your data frame.

str(nola_str) 
## 'data.frame':    89 obs. of  12 variables:
##  $ Permit.Number        : chr  "" "" "" "" ...
##  $ Address              : chr  "1717 Robert C Blakes, SR Dr" "1006 Race St" "2634 Louisiana Ave" "3323 Rosalie Aly" ...
##  $ Permit.Type          : chr  "Short Term Rental Commercial Owner" "Short Term Rental Commercial Owner" "Short Term Rental Commercial Owner" "Short Term Rental Residential Owner" ...
##  $ Residential.Subtype  : chr  "N/A" "N/A" "N/A" "Residential Partial Unit" ...
##  $ Current.Status       : chr  "Pending" "Pending" "Pending" "Pending" ...
##  $ Expiration.Date      : chr  "" "" "" "" ...
##  $ Bedroom.Limit        : int  5 5 3 1 3 1 2 2 1 2 ...
##  $ Guest.Occupancy.Limit: int  10 10 6 2 6 2 4 4 2 4 ...
##  $ Operator.Name        : chr  "Melissa Taranto" "Michael Heyne" "Michael Heyne" "Caroline Stas" ...
##  $ License.Holder.Name  : chr  "Scott Taranto" "Boutique Hospitality" "Resonance Home LLC" "Caroline Stas" ...
##  $ Application.Date     : chr  "8/9/22" "8/9/22" "8/9/22" "8/9/22" ...
##  $ Issue_Date           : chr  "" "" "" "" ...

Here we see that nola_str is a ‘data.frame’ object which contains 89 rows and 12 variables (columns). Each of the variables is listed along with its data class and the first few values. As we mentioned previously in this Chapter, it can be quite convenient to copy and paste this output into your R script as a comment block for later reference.

From R version 4.0.0 onwards you can leave out the stringsAsFactors = FALSE argument altogether, as this is now the default behaviour.

Other useful arguments include dec = and na.strings =. The dec = argument allows you to change the default character (.) used for a decimal point. This is useful if you’re in a country where decimal places are usually represented by a comma (i.e. dec = ","). The na.strings = argument allows you to import data where missing values are represented with a symbol other than NA. This can be quite common if you are importing data from other statistical software such as Minitab which represents missing values as a * (na.strings = "*").
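In this dataset, for example, the missing permit numbers are stored as empty strings and the Residential.Subtype column uses the text "N/A", so if you wanted both treated as true missing values you could (as a sketch) pass both strings to na.strings =:

nola_str <- read.csv(file = 'data/nola_STR.csv',
                     na.strings = c("", "N/A"),
                     stringsAsFactors = FALSE)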

If we just wanted to see the names of our variables (columns) in the data frame we can use the names() function which will return a character vector of the variable names.

names(nola_str)
##  [1] "Permit.Number"         "Address"               "Permit.Type"          
##  [4] "Residential.Subtype"   "Current.Status"        "Expiration.Date"      
##  [7] "Bedroom.Limit"         "Guest.Occupancy.Limit" "Operator.Name"        
## [10] "License.Holder.Name"   "Application.Date"      "Issue_Date"

read.csv is basically a thin wrapper around the workhorse function, read.table. R has a number of variants of the read.table() function that you can use to import a variety of file formats. Actually, these variants just use the read.table() function but include different combinations of arguments by default to help import different file types. Similar to read.csv(), you can also use read.csv2() and read.delim() functions. The read.csv() function is used to import comma separated value (.csv) files and assumes that the data in columns are separated by a comma (it sets sep = "," by default). It also assumes that the first row of the data contains the variable names by default (it sets header = TRUE by default). The read.csv2() function assumes data are separated by semicolons and that a comma is used instead of a decimal point (as in many European countries). The read.delim() function is used to import tab delimited data and also assumes that the first row of the data contains the variable names by default.

# import .csv file
nola_str <- read.csv(file = 'data/nola_STR.csv') 

# import .csv file with dec = "," and sep = ";"
nola_str <- read.csv2(file = 'data/nola_STR.csv') 

# import tab delim file with sep = "\t"
nola_str <- read.delim(file = 'data/nola_STR.txt') 

You can even import spreadsheet files from MS Excel or other statistics software, using packages, directly into R but our advice is that this should generally be avoided if possible as it just adds a layer of uncertainty between you and your data. In our opinion it’s almost always better to export your spreadsheets as tab or comma delimited files and then import them into R using the read.table() function. If you’re hell bent on directly importing data from other software you will need to install the foreign package which has functions for importing Minitab, SPSS, Stata and SAS files or the xlsx package to import Excel spreadsheets.
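For completeness, a rough sketch of what direct imports can look like (the file names here are hypothetical and you would need to install the packages first):

# hypothetical examples of importing directly from other software
library(foreign)
spss_data <- read.spss(file = 'data/survey.sav', to.data.frame = TRUE)

library(xlsx)
excel_data <- read.xlsx(file = 'data/nola_STR.xlsx', sheetIndex = 1)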

3.4.3 Common import frustrations

It’s quite common to get a bunch of really frustrating error messages when you first start importing data into R. Perhaps the most common is

Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
  cannot open file 'nola_STR.csv': No such file or directory

This error message is telling you that R cannot find the file you are trying to import. It usually rears its head for one of a few reasons (or all of them!). The first is that you’ve made a mistake in the spelling of either the filename or the file path. Another common mistake is that you have forgotten to include the file extension in the filename (i.e. .csv or .txt). Lastly, the file is simply not where you say it is or you’ve used an incorrect file path. Using RStudio Projects and having a logical directory structure goes a long way to avoiding these types of errors.
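A couple of quick checks will usually track the problem down: ask R where it is currently looking, confirm the file really exists at the path you typed, and list what the data directory actually contains.

getwd()                            # which directory is R reading from?
file.exists('data/nola_STR.csv')   # TRUE only if the path and filename are correct
list.files('data')                 # what is actually inside the data folder?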

3.4.4 Other import options

There are numerous other functions to import data from a variety of sources and formats. Most of these functions are contained in packages that you will need to install before using them. We list a couple of the more useful packages and functions below.

The fread() function from the data.table package is great for importing large data files quickly and efficiently (much faster than the read.table() function). One of the great things about the fread() function is that it will automatically detect many of the arguments you would normally need to specify (like sep = etc.). One thing you will need to consider though is that the fread() function returns a data.table object, not a data.frame as would be the case with the read.table() function. This is usually not a problem, as a data.table inherits from data.frame and can be passed to most functions that expect a data.frame object. To learn more about the differences between data.table and data.frame objects see here.

library(data.table)
all_data <- fread(file = 'data/nola_STR.txt')
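If you would rather carry on working with a plain data frame, converting the result is straightforward; for example:

all_data <- as.data.frame(all_data)   # or data.table::setDF(all_data) to convert in place
class(all_data)
## [1] "data.frame"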

Various functions from the readr package are also very efficient at reading in large data files. The readr package is part of the ‘tidyverse’ collection of packages and provides many equivalent functions to base R for importing data. The readr functions are used in a similar way to the read.table() or read.csv() functions and many of the arguments are the same (see ?readr::read_table for more details). There are however some differences. For example, when using the read_table() function the header = TRUE argument is replaced by col_names = TRUE and the function returns a tibble class object which is the tidyverse equivalent of a data.frame object (see here for differences).

In PLAN 672, we almost exclusively use functions from the readr package. I strongly recommend it.

library(readr)
# import white space delimited files
all_data <- read_table(file = 'data/nola_STR.txt', col_names = TRUE)

# import comma delimited files
all_data <- read_csv(file = 'data/nola_STR.csv')

# import tab delimited files
all_data <- read_delim(file = 'data/nola_STR.txt', delim = "\t")

# or use
all_data <- read_tsv(file = 'data/nola_STR.txt')
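readr will also let you state up front what type each column should be, via the col_types = argument, rather than relying on its guesses. A sketch (the column names are assumed to match those shown by str() earlier; if the raw CSV headers contain spaces you would need to wrap them in backticks):

all_data <- read_csv(file = 'data/nola_STR.csv',
                     col_types = cols(Bedroom.Limit = col_integer(),
                                      Guest.Occupancy.Limit = col_integer(),
                                      .default = col_character()))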

If your data file is ginormous, then the ff and bigmemory packages may be useful as they both contain import functions that are able to store large data in a memory efficient manner. You can find out more about these functions here and here.