Working with Data in R

Nihil Kaza

2024-08-17

Agenda


  1. Data Types

Data Types

Basic Data types

  • Numeric data are numbers that can contain a decimal, e.g. 2.5, 2.0

  • Integers are whole numbers, e.g. 2L, 5L

  • Logical: TRUE, FALSE, or NA.

  • Character data are used to represent string values (e.g. “test”)

    • Factor, with levels and/or order (e.g. “Metropolitan”, “Micropolitan”)
  • Complex (Not Relevant)

  • Raw (Not Relevant)

Basic Data types

int <- 2L # Need to include L to specify integer
class(int)
[1] "integer"
num <- 2
class(num)
[1] "numeric"
identical(num, int)
[1] FALSE
all.equal(num, int)
[1] TRUE
char <- "hello"
class(char)
[1] "character"
logi <- NA
class(logi)
[1] "logical"
my_vec <- c(2, 2L, 3.5, 5)
class(my_vec)
[1] "numeric"
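The different answers from identical() and all.equal() above come down to storage type. A minimal sketch, reusing the int and num objects from the slide; the floating-point comparison at the end is an added illustration, not part of the original example:

```r
int <- 2L
num <- 2

typeof(int)   # "integer"
typeof(num)   # "double"

# identical() is strict: type and value must both match
identical(num, int)            # FALSE
# all.equal() compares values within a small numeric tolerance
isTRUE(all.equal(num, int))    # TRUE

# The same tolerance matters for floating-point arithmetic
sum_is_exact <- (0.1 + 0.2 == 0.3)                 # FALSE
sum_is_close <- isTRUE(all.equal(0.1 + 0.2, 0.3))  # TRUE
```

Wrapping all.equal() in isTRUE() is the usual pattern inside if statements, because all.equal() returns a description of the difference rather than FALSE when values differ.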

Basic Data types

my_vec[4] <- NA
my_vec
[1] 2.0 2.0 3.5  NA
class(my_vec)
[1] "numeric"
my_vec <- c(T, T, F, NA)
class(my_vec)
[1] "logical"
my_vec <- c("T", "T", "F", "NA")
class(my_vec)
[1] "character"
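A vector holds a single type, so c() silently converts mixed inputs to the most general one (logical, then integer, then double, then character). The sketch below uses made-up values to show that rule at work:

```r
# Mixed inputs are coerced to the most general type present
class(c(TRUE, 2L))    # "integer"
class(c(2L, 3.5))     # "numeric"
class(c(3.5, "a"))    # "character"

# Logical values count as 1/0 in arithmetic; NA has to be handled explicitly
n_true <- sum(c(TRUE, TRUE, FALSE, NA), na.rm = TRUE)   # 2

# The string "NA" is ordinary text, not a missing value
string_na <- is.na(c("T", "NA"))   # FALSE FALSE
```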

Complicated Data types - Time & Date

date_of_collection <- 2022-08-31
date_of_collection
[1] 1983
date_of_collection <- "2022-08-31"
class(date_of_collection)
[1] "character"
time_of_collection <- "2010-12-13 15:30:30"

To work with dates and times, use the lubridate package. More in PLAN 672.
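As a rough sketch of what lubridate offers, the strings from the slide can be parsed into proper date and date-time objects with ymd() and ymd_hms(). The comparison date 2022-08-01 is an assumption added for illustration:

```r
library(lubridate)

# Parse the character strings into Date / date-time objects
date_of_collection <- ymd("2022-08-31")
time_of_collection <- ymd_hms("2010-12-13 15:30:30")   # time zone defaults to UTC

class(date_of_collection)   # "Date"
year(date_of_collection)    # 2022
hour(time_of_collection)    # 15

# Date arithmetic now works as expected (illustrative start date)
days_elapsed <- as.numeric(date_of_collection - ymd("2022-08-01"))   # 30
```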

Complicated Data types - Factors

day_of_week <- c("M", "M", "T", "TH", "W", "SA", "SU", "TH")

class(day_of_week)
[1] "character"
day_of_week <- factor(day_of_week)
day_of_week
[1] M  M  T  TH W  SA SU TH
Levels: M SA SU T TH W
day_of_week[3] <- NA

day_of_week
[1] M    M    <NA> TH   W    SA   SU   TH  
Levels: M SA SU T TH W
day_of_week[5] <- "F"

day_of_week
[1] M    M    <NA> TH   <NA> SA   SU   TH  
Levels: M SA SU T TH W

Complicated Data types - Factors

A value that is not an existing level becomes NA (with a warning). To add a new level, extend the levels first:

levels(day_of_week) <- c(levels(day_of_week), "F")    # add new level
day_of_week[5] <- 'F'
day_of_week
[1] M    M    <NA> TH   F    SA   SU   TH  
Levels: M SA SU T TH W F

Often it is useful to reorder levels to follow convention

day_of_week <- c("M", "M", "T", "TH", "W", "SA", "SU", "TH")
day_of_week <- factor(day_of_week, levels=c("SU", "M", "T", "W", "TH", "F", "SA"))

str(day_of_week)
 Factor w/ 7 levels "SU","M","T","W",..: 2 2 3 5 4 7 1 5
# Or if you prefer to start your week on M

day_of_week <- factor(day_of_week, levels=c("M", "T", "W", "TH", "F", "SA", "SU"))

str(day_of_week)
 Factor w/ 7 levels "M","T","W","TH",..: 1 1 2 4 3 6 7 4
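Under the hood a factor stores integer codes that index into its levels, which is why the level order matters. A small sketch rebuilding the Monday-first factor from the slide and checking how the codes and counts follow that order:

```r
day_of_week <- factor(
  c("M", "M", "T", "TH", "W", "SA", "SU", "TH"),
  levels = c("M", "T", "W", "TH", "F", "SA", "SU")
)

# Integer codes are positions in levels(day_of_week)
codes <- as.integer(day_of_week)   # 1 1 2 4 3 6 7 4

# table() reports every level in level order, including unused ones ("F")
counts <- table(day_of_week)
counts["F"]                        # 0

# droplevels() removes levels that never occur in the data
n_used <- nlevels(droplevels(day_of_week))   # 6
```

Plots and summary tables follow the same level order, so setting levels explicitly is usually how you control the order of categories in output.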

Data Structures

Scalars, Vectors, Matrices & Arrays

a <- c(2, 4, 4)
dim(a)
NULL
my_mat <- matrix(1:16, nrow = 4, byrow = TRUE)
my_mat
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
[4,]   13   14   15   16
dim(my_mat)
[1] 4 4
my_array <- array(LETTERS[1:16], dim = c(2, 4, 2))
my_array
, , 1

     [,1] [,2] [,3] [,4]
[1,] "A"  "C"  "E"  "G" 
[2,] "B"  "D"  "F"  "H" 

, , 2

     [,1] [,2] [,3] [,4]
[1,] "I"  "K"  "M"  "O" 
[2,] "J"  "L"  "N"  "P" 
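Elements of matrices and arrays are addressed with one index per dimension, and leaving an index blank selects everything along that dimension. A short sketch using the same my_mat and my_array as above; the column-wise matrix at the end is an added contrast:

```r
my_mat <- matrix(1:16, nrow = 4, byrow = TRUE)
my_array <- array(LETTERS[1:16], dim = c(2, 4, 2))

my_mat[2, 3]        # 7  (row 2, column 3)
my_mat[, 1]         # 1 5 9 13  (entire first column)

my_array[1, 2, 2]   # "K"  (row 1, column 2, second layer)
dim(my_array[, , 1])   # 2 4  (first layer is a 2 x 4 matrix)

# Without byrow = TRUE, R fills matrices column by column
col_filled <- matrix(1:16, nrow = 4)
col_filled[2, 3]    # 10
```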

Lists

my_list <- list(c("black", "yellow", "orange"),
               c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE),
               matrix(1:6, nrow = 3))
my_list
[[1]]
[1] "black"  "yellow" "orange"

[[2]]
[1]  TRUE  TRUE FALSE  TRUE FALSE FALSE

[[3]]
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
names(my_list) <- c("colours", "evaluation", "time") # You can also name (or rename) list elements afterwards

Lists

To access a list element use $ (or [[ ]]); use [ to subset within that element

my_list$colours
[1] "black"  "yellow" "orange"
my_list$evaluation[4:6]
[1]  TRUE FALSE FALSE
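The distinction between [ and [[ on a list is easy to miss: single brackets return a smaller list, double brackets (and $) return the element itself. A minimal sketch with the same list, created here with names in one step:

```r
my_list <- list(
  colours    = c("black", "yellow", "orange"),
  evaluation = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE),
  time       = matrix(1:6, nrow = 3)
)

class(my_list["colours"])     # "list"       (still a list, of length 1)
class(my_list[["colours"]])   # "character"  (the element itself)

# $ is shorthand for [[ ]] with a name
same_element <- identical(my_list$colours, my_list[["colours"]])   # TRUE

# Once extracted, index the element as usual
my_list[["time"]][2, 2]       # 5
```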

Data Frames are Lists!

  • Columns need to have the same length, unlike regular lists
p.height <- c(180, 155, 160, 167, 181)

p.names <- c("Joanna", "Charlotte", "Helen", "Karen", "Amy")
dataf <- data.frame(height = p.height, names = p.names)
dim(dataf)
[1] 5 2
dataf
  height     names
1    180    Joanna
2    155 Charlotte
3    160     Helen
4    167     Karen
5    181       Amy
dataf$names
[1] "Joanna"    "Charlotte" "Helen"     "Karen"     "Amy"      
dataf$height[4]
[1] 167
dataf[4,1]
[1] 167
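Because a data frame is a list of equal-length columns, the list tools from the previous slides carry over. A brief sketch with the same dataf (assuming R 4.0 or later, where strings stay as character by default); the height threshold of 165 is made up for illustration:

```r
p.height <- c(180, 155, 160, 167, 181)
p.names <- c("Joanna", "Charlotte", "Helen", "Karen", "Amy")
dataf <- data.frame(height = p.height, names = p.names)

is.list(dataf)          # TRUE
length(dataf)           # 2  (number of columns, i.e. list elements)
dataf[["height"]][4]    # 167, same as dataf$height[4]
sapply(dataf, class)    # height: "numeric", names: "character"

# Rows can be picked with a logical condition on a column
tall <- dataf[dataf$height > 165, "names"]   # "Joanna" "Karen" "Amy"
```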

Tibbles are Data Frames!

Exercise 3

Tidyverse

Using R with tidyverse

You use R via packages


…which contain functions


…which are just verbs


The tidyverse is a collection of R packages designed around a common philosophy.

Best way to learn the tidyverse

http://r4ds.had.co.nz/

Tibbles are Data Frames!

  • Put each dataset in a tibble
  • Put each variable in a column
  • Each column should hold a single data type (integer, logical, date, character etc.)
  • Each row is ‘nominally’ independent of another row
  • Each column is ‘nominally’ independent of another column

Tibbles are Data Frames!

library(tidyverse)

data_tib <- tibble(
  `alphabet soup` = letters,
  `nums ints` = 1:26,
  `sample ints` = sample(100, 26)
)

data_df <- data.frame(
  `alphabet soup` = letters,
  `nums ints` = 1:26,
  `sample ints` = sample(100, 26)
)

glimpse(data_tib)
Rows: 26
Columns: 3
$ `alphabet soup` <chr> "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k",…
$ `nums ints`     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
$ `sample ints`   <int> 61, 97, 91, 31, 59, 87, 49, 98, 37, 47, 95, 90, 75, 12…
glimpse(data_df)
Rows: 26
Columns: 3
$ alphabet.soup <chr> "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "…
$ nums.ints     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
$ sample.ints   <int> 8, 9, 66, 24, 28, 95, 79, 22, 62, 53, 97, 33, 81, 63, 40…
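The glimpse() output above shows data.frame() replacing the spaces in column names with dots, while tibble() keeps names as written. A sketch of two practical differences, using a cut-down version of the example; check.names = FALSE is the base R switch that turns the renaming off:

```r
library(tibble)

# data.frame() keeps non-syntactic names only if told to
df <- data.frame(`alphabet soup` = letters, check.names = FALSE)
names(df)          # "alphabet soup"

tib <- tibble(`alphabet soup` = letters, `nums ints` = 1:26)

# Selecting one column with [ : a data frame drops to a vector, a tibble stays a tibble
class(df[, 1])     # "character"
class(tib[, 1])    # "tbl_df" "tbl" "data.frame"
```

The tibble behaviour is more predictable inside functions, since the result type does not depend on how many columns happen to be selected.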

Reading in External Data

  • use the read_csv function from the tidyverse
  • use the here package to build file paths
  • write your code so that it works on many different computers

First, download the data and put it in the data/raw_data subdirectory of your project folder

Always here

Jenny Bryan once said

    If the first line of your R script is

    setwd("C:\Users\jenny\path\that\only\I\have")

    I will come into your office and SET YOUR COMPUTER ON FIRE 🔥.
    
    

    If the first line of your R script is

    rm(list = ls())

    I will come into your office and SET YOUR COMPUTER ON FIRE 🔥.

I will happily supply her with the matches and fuel.

Always here

Instead, use the here function in the here package

library(here)
here("data", "raw_data", "NOLA_STR.csv") 

here() builds the file path from the project root, whatever the machine:

[1] "/Users/kaza/Dropbox/shortcourses/data/raw_data/NOLA_STR.csv"
  • which lets me change computers, operating systems, etc.
  • no need to worry about “/” (Linux, macOS) or “\” (Windows)
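One way to make a script fail early and clearly is to build the path once with here() and check that the file is actually there before reading it. This is a sketch of that habit, not something the slides prescribe; csv_path is a name chosen for illustration:

```r
library(here)

# Build the path relative to the project root, not the working directory
csv_path <- here("data", "raw_data", "NOLA_STR.csv")

basename(csv_path)            # "NOLA_STR.csv"
basename(dirname(csv_path))   # "raw_data"

# Warn early if the file has not been downloaded yet
if (!file.exists(csv_path)) {
  message("File not found; expected it at: ", csv_path)
}
```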

Reading in External Data

library(tidyverse)
library(here)

nola_str_tib <- read_csv(here("data", "raw_data", "NOLA_STR.csv" ))
nola_str_tib
Permit Number | Address | Permit Type | Residential Subtype | Current Status | Expiration Date | Bedroom Limit | Guest Occupancy Limit | Operator Name | License Holder Name | Application Date | Issue_Date
 | 1717 Robert C Blakes, SR Dr | Short Term Rental Commercial Owner | N/A | Pending |  | 5 | 10 | Melissa Taranto | Scott Taranto | 8/9/22 | 
 | 1006 Race St | Short Term Rental Commercial Owner | N/A | Pending |  | 5 | 10 | Michael Heyne | Boutique Hospitality | 8/9/22 | 
 | 2634 Louisiana Ave | Short Term Rental Commercial Owner | N/A | Pending |  | 3 | 6 | Michael Heyne | Resonance Home LLC | 8/9/22 | 
 | 3323 Rosalie Aly | Short Term Rental Residential Owner | Residential Partial Unit | Pending |  | 1 | 2 | Caroline Stas | Caroline Stas | 8/9/22 | 
 | 1525 Melpomene St | Short Term Rental Residential Owner | Residential Small Unit | Pending |  | 3 | 6 | Craig Redgrave | Craig R Redgrave | 8/8/22 | 

Contains 89 observations with 12 variables.

Better Use Pipes!

# base R pipe

nola_str_tib <- here("data", "raw_data", "NOLA_STR.csv" ) |> 
                    read_csv()

# or if you prefer magrittr's pipe, as I do,

nola_str_tib <- here("data", "raw_data", "NOLA_STR.csv" ) %>% 
                    read_csv()
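A pipe passes the value on its left as the first argument of the function on its right, so a piped chain is just a nested call read left to right. A tiny sketch with made-up numbers (the base pipe |> needs R 4.1 or later):

```r
x <- c(1, 4, 9)

# Nested call: read from the inside out
nested <- sum(sqrt(x))

# Piped call: read from left to right
piped <- x |> sqrt() |> sum()

identical(nested, piped)   # TRUE, both are 6

# Extra arguments go after the piped-in first argument
rounded <- c(1.234, 5.678) |> round(digits = 1)   # 1.2 5.7
```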

Exploratory Data Analysis

Explore the data!

str(nola_str_tib)
spc_tbl_ [89 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Permit Number        : chr [1:89] NA NA NA NA ...
 $ Address              : chr [1:89] "1717 Robert C Blakes, SR Dr" "1006 Race St" "2634 Louisiana Ave" "3323 Rosalie Aly" ...
 $ Permit Type          : chr [1:89] "Short Term Rental Commercial Owner" "Short Term Rental Commercial Owner" "Short Term Rental Commercial Owner" "Short Term Rental Residential Owner" ...
 $ Residential Subtype  : chr [1:89] "N/A" "N/A" "N/A" "Residential Partial Unit" ...
 $ Current Status       : chr [1:89] "Pending" "Pending" "Pending" "Pending" ...
 $ Expiration Date      : chr [1:89] NA NA NA NA ...
 $ Bedroom Limit        : num [1:89] 5 5 3 1 3 1 2 2 1 2 ...
 $ Guest Occupancy Limit: num [1:89] 10 10 6 2 6 2 4 4 2 4 ...
 $ Operator Name        : chr [1:89] "Melissa Taranto" "Michael Heyne" "Michael Heyne" "Caroline Stas" ...
 $ License Holder Name  : chr [1:89] "Scott Taranto" "Boutique Hospitality" "Resonance Home LLC" "Caroline Stas" ...
 $ Application Date     : chr [1:89] "8/9/22" "8/9/22" "8/9/22" "8/9/22" ...
 $ Issue_Date           : chr [1:89] NA NA NA NA ...
 - attr(*, "spec")=
  .. cols(
  ..   `Permit Number` = col_character(),
  ..   Address = col_character(),
  ..   `Permit Type` = col_character(),
  ..   `Residential Subtype` = col_character(),
  ..   `Current Status` = col_character(),
  ..   `Expiration Date` = col_character(),
  ..   `Bedroom Limit` = col_double(),
  ..   `Guest Occupancy Limit` = col_double(),
  ..   `Operator Name` = col_character(),
  ..   `License Holder Name` = col_character(),
  ..   `Application Date` = col_character(),
  ..   Issue_Date = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 
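The str() output shows "N/A" read in as an ordinary string in Residential Subtype. If those entries should be treated as missing, one option is the na argument of read_csv(). The sketch below uses a small inline CSV built from a few values shown above instead of the real file, so it stands in for the actual data; toy is a hypothetical name:

```r
library(readr)

# Inline stand-in for the file; I() marks the string as literal data, not a path
csv_text <- I("Address,Residential Subtype,Bedroom Limit
1006 Race St,N/A,5
3323 Rosalie Aly,Residential Partial Unit,1
")

# Treat "N/A" as missing, in addition to the defaults "" and "NA"
toy <- read_csv(csv_text, na = c("", "NA", "N/A"), show_col_types = FALSE)

is.na(toy$`Residential Subtype`)   # TRUE FALSE
toy$`Bedroom Limit`                # 5 1
```

Whether "N/A" really means missing here, or is a meaningful category for commercial permits, is a judgment about the data and not something the code can decide.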

Explore the data!

summary(nola_str_tib)
 Permit Number        Address          Permit Type        Residential Subtype
 Length:89          Length:89          Length:89          Length:89          
 Class :character   Class :character   Class :character   Class :character   
 Mode  :character   Mode  :character   Mode  :character   Mode  :character   
                                                                             
                                                                             
                                                                             
 Current Status     Expiration Date    Bedroom Limit   Guest Occupancy Limit
 Length:89          Length:89          Min.   :1.000   Min.   : 2.000       
 Class :character   Class :character   1st Qu.:1.000   1st Qu.: 2.000       
 Mode  :character   Mode  :character   Median :2.000   Median : 4.000       
                                       Mean   :2.056   Mean   : 4.112       
                                       3rd Qu.:2.000   3rd Qu.: 4.000       
                                       Max.   :5.000   Max.   :10.000       
 Operator Name      License Holder Name Application Date    Issue_Date       
 Length:89          Length:89           Length:89          Length:89         
 Class :character   Class :character    Class :character   Class :character  
 Mode  :character   Mode  :character    Mode  :character   Mode  :character  
                                                                             
                                                                             
                                                                             

Select

  • Pick Columns
nola_str_tib %>%
  select(c(`Permit Type`, `Residential Subtype`))
Permit Type | Residential Subtype
Short Term Rental Commercial Owner | N/A
Short Term Rental Commercial Owner | N/A
Short Term Rental Commercial Owner | N/A
Short Term Rental Residential Owner | Residential Partial Unit
Short Term Rental Residential Owner | Residential Small Unit
  • Recall the [ function. Can you achieve a similar result with it?
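One possible answer to the question above, sketched on a small made-up tibble (toy and its values are hypothetical, standing in for nola_str_tib): passing a character vector of column names to [ picks the same columns.

```r
library(tibble)

# Hypothetical stand-in with the same kind of column names
toy <- tibble(
  `Permit Type` = c("Commercial", "Residential"),
  `Residential Subtype` = c("N/A", "Partial Unit"),
  `Bedroom Limit` = c(5, 1)
)

# Base R: leave the row index empty, name the columns to keep
picked <- toy[, c("Permit Type", "Residential Subtype")]

names(picked)   # "Permit Type" "Residential Subtype"
dim(picked)     # 2 2
```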

Filter

  • Pick Rows
nola_str_tib %>%
  filter(`Permit Type` == "Short Term Rental Residential Owner")
Permit Number | Address | Permit Type | Residential Subtype | Current Status | Expiration Date | Bedroom Limit | Guest Occupancy Limit | Operator Name | License Holder Name | Application Date | Issue_Date
 | 3323 Rosalie Aly | Short Term Rental Residential Owner | Residential Partial Unit | Pending |  | 1 | 2 | Caroline Stas | Caroline Stas | 8/9/22 | 
 | 1525 Melpomene St | Short Term Rental Residential Owner | Residential Small Unit | Pending |  | 3 | 6 | Craig Redgrave | Craig R Redgrave | 8/8/22 | 
22-RSTR-15568 | 3112 Octavia St | Short Term Rental Residential Owner | Residential Partial Unit | Issued | 8/4/23 | 1 | 2 | Philip Wheeler | Philip Barrett Wheeler | 8/5/22 | 8/5/22
  • Recall the [ function. Can you achieve a similar result with it?
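A hedged sketch of a base R answer, on a made-up data frame (df and its columns are hypothetical): a logical vector in the row position of [ keeps the matching rows. The one catch is missing values, which filter() drops but [ turns into all-NA rows unless the condition is wrapped in which().

```r
# Hypothetical stand-in data with one missing permit type
df <- data.frame(type = c("Commercial", "Residential", NA), bedrooms = c(5, 1, 2))

is_res <- df$type == "Residential"   # FALSE TRUE NA

# Plain logical indexing keeps an all-NA row for the NA condition
with_na_row <- df[is_res, ]          # 2 rows

# which() keeps only positions where the condition is TRUE, like filter()
res_only <- df[which(is_res), ]      # 1 row
```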

Mutate

  • Change/Add Columns
library(lubridate)

nola_str_tib <-
    nola_str_tib %>%
    mutate(`Application Date` = mdy(`Application Date`))
Application Date
2022-08-09
2022-08-09
2022-08-09
2022-08-09
2022-08-08
  • What would be a base R way to do this?
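One base R possibility, sketched on a made-up data frame (app_df and app_date are hypothetical names): assign back into the column with $ and parse the month/day/two-digit-year strings with as.Date() and an explicit format.

```r
# Hypothetical stand-in for the Application Date column
app_df <- data.frame(app_date = c("8/9/22", "8/8/22"))

# %m = month, %d = day, %y = two-digit year
app_df$app_date <- as.Date(app_df$app_date, format = "%m/%d/%y")

class(app_df$app_date)   # "Date"
app_df$app_date          # "2022-08-09" "2022-08-08"
```

Unlike mdy(), as.Date() needs the format spelled out, which is more typing but also makes the assumption about the date layout explicit.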

Chain Them!

library(lubridate)

# Assumes `Application Date` was already converted to a date (see Mutate)
nola_str_tib %>%
  # Backlogged: applied 15 or more days ago and still no issue date
  mutate(Backlogged = (today() - `Application Date` >= 15) & is.na(Issue_Date)) %>%
  filter(Backlogged) %>%
  select(Address, `Operator Name`, `License Holder Name`, `Application Date`)
Address | Operator Name | License Holder Name | Application Date
1717 Robert C Blakes, SR Dr | Melissa Taranto | Scott Taranto | 2022-08-09
1006 Race St | Michael Heyne | Boutique Hospitality | 2022-08-09
2634 Louisiana Ave | Michael Heyne | Resonance Home LLC | 2022-08-09
3323 Rosalie Aly | Caroline Stas | Caroline Stas | 2022-08-09
1525 Melpomene St | Craig Redgrave | Craig R Redgrave | 2022-08-08
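Because today() changes every day, the result of this chain changes too. For a reproducible check, the same logic can be run against a fixed reference date. The sketch below does that on a small made-up tibble; toy, as_of, and all values are hypothetical:

```r
library(dplyr)

# Hypothetical applications: one old and pending, one recent, one old but issued
toy <- tibble(
  Address = c("A St", "B Ave", "C Rd"),
  `Application Date` = as.Date(c("2022-08-09", "2022-08-30", "2022-08-01")),
  Issue_Date = as.Date(c(NA, NA, "2022-08-05"))
)

as_of <- as.Date("2022-08-31")   # fixed reference date instead of today()

backlog <- toy %>%
  mutate(
    days_waiting = as.numeric(as_of - `Application Date`),
    Backlogged = days_waiting >= 15 & is.na(Issue_Date)
  ) %>%
  filter(Backlogged) %>%
  select(Address, `Application Date`, days_waiting)

backlog$Address        # "A St"
backlog$days_waiting   # 22
```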

Some callouts

  • Within tidyverse functions, refer to tibble columns by name; there is no need for $.
  • When a column name has no spaces, you can omit the backticks (``).
  • Be careful about column lengths when chaining multiple operations.
  • Use () to specify the order of operations.

Merge Data - Left

Merge Data - Right

Merge Data - Inner
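The three merge types differ only in which rows survive when a key is missing from one table. A minimal sketch with dplyr's join functions on two made-up tables (permits, contacts, and their values are hypothetical, not from the course data):

```r
library(dplyr)

permits  <- tibble(operator = c("Ann", "Bo", "Cy"), bedrooms = c(2, 1, 3))
contacts <- tibble(operator = c("Ann", "Cy", "Di"),
                   phone = c("555-0101", "555-0102", "555-0103"))

# Left: keep every row of permits; Bo has no contact, so phone is NA
left <- left_join(permits, contacts, by = "operator")

# Right: keep every row of contacts; Di has no permit, so bedrooms is NA
right <- right_join(permits, contacts, by = "operator")

# Inner: keep only operators present in both tables (Ann and Cy)
inner <- inner_join(permits, contacts, by = "operator")

nrow(left)    # 3
nrow(right)   # 3
nrow(inner)   # 2
```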

Exercise 4

Sneak Peek

Data Visualisation

Thank You!