3.2 Complicated Data Types

R has many complicated data types, but the two we will encounter most often are dates/times and factors.

3.2.1 Dealing with Date and Time

Dates and times often cause students significant problems, because a good deal of machinery sits behind them to make them human-readable. For example,

date_of_collection <- 2022-08-31

date_of_collection
## [1] 1983

class(date_of_collection)
## [1] "numeric"

Because - is interpreted as subtraction rather than as a hyphen, R evaluates the expression as arithmetic: 2022 - 8 - 31 = 1983. You can get around this by enclosing the date in quotes, which stores it as a character string, but then you lose the ability to do arithmetic.

date_of_collection <- "2022-08-31"

class(date_of_collection)
## [1] "character"

date_of_collection + 2
## Error in date_of_collection + 2: non-numeric argument to binary operator

You have to tell R explicitly that you want to store the value as a date. I strongly recommend using the lubridate package.

library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

date_of_collection <- as_date("2022-08-31")

class(date_of_collection)
## [1] "Date"

Sys.Date() - date_of_collection
## Time difference of -11 days
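Dates do not always arrive in year-month-day order. As a minimal sketch, assuming the lubridate package is installed, the parsers below are named after the order of the components in the input string. The input strings are made up for illustration; only the date itself comes from the example above.

```r
library(lubridate)

# Reference value, parsed from an ISO-style string as in the example above.
reference <- as_date("2022-08-31")

# Each parser is named for the order of year (y), month (m) and day (d)
# in the input. These input strings are illustrative.
from_ymd <- ymd("20220831")
from_mdy <- mdy("08/31/2022")
from_dmy <- dmy("31-08-2022")

# All three describe the same calendar day.
all(c(from_ymd, from_mdy, from_dmy) == reference)

# An impossible date fails to parse and becomes NA (with a warning,
# silenced here so the result is easy to inspect).
impossible <- suppressWarnings(ymd("2022-02-30"))
is.na(impossible)
```

Picking the parser that matches the recorded format avoids silently swapping day and month.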

In R, dates are represented as the number of days since 1970-01-01.

unclass(date_of_collection)
## [1] 19235
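Because a date is just a day count underneath, ordinary arithmetic works on it. The following is a small sketch in base R (no extra packages assumed); the variable names other than date_of_collection are introduced here for illustration.

```r
# Base R only; the date is the one used in the example above.
date_of_collection <- as.Date("2022-08-31")

# The underlying value is a count of days from the 1970-01-01 origin.
days_since_epoch <- as.numeric(date_of_collection)

# Rebuilding a date from the count and the origin returns the same day.
rebuilt <- as.Date(days_since_epoch, origin = "1970-01-01")

# Adding a number moves the date forward by that many days.
one_week_later <- date_of_collection + 7

# Subtracting two dates gives a difftime, here measured in days.
elapsed <- one_week_later - date_of_collection
```

Here days_since_epoch is 19235, matching the unclass() output, and one_week_later is 2022-09-07.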

Times are even more complicated than dates, because a time needs a time zone associated with it, as well as potential offsets such as daylight saving time. I recommend the ymd_hms family of functions in lubridate for converting a character string into a date-time object. More on this later.


time_of_collection <- ymd_hms("2010-12-13 15:30:30")
time_of_collection
## [1] "2010-12-13 15:30:30 UTC"


# Same instant, displayed in a different time zone
with_tz(time_of_collection, "America/Los_Angeles")
## [1] "2010-12-13 07:30:30 PST"


# Same clock reading, but a different instant
force_tz(time_of_collection, "America/Chicago")
## [1] "2010-12-13 15:30:30 CST"
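The difference between the two functions is easiest to see by comparing each result with the original instant. This is a sketch, assuming lubridate is installed; the variable names shown_in_la, forced_to_chicago and parsed_in_chicago are introduced here for illustration.

```r
library(lubridate)

# Parsed as UTC, because no time zone was supplied.
time_of_collection <- ymd_hms("2010-12-13 15:30:30")

shown_in_la       <- with_tz(time_of_collection, "America/Los_Angeles")
forced_to_chicago <- force_tz(time_of_collection, "America/Chicago")

# with_tz() keeps the instant, so the gap to the original is zero hours.
gap_with <- as.numeric(difftime(shown_in_la, time_of_collection, units = "hours"))

# force_tz() keeps the clock reading. Chicago is 6 hours behind UTC in
# December, so 15:30 there happens 6 hours after 15:30 UTC.
gap_force <- as.numeric(difftime(forced_to_chicago, time_of_collection, units = "hours"))

# Supplying tz while parsing has the same effect as parsing and then forcing.
parsed_in_chicago <- ymd_hms("2010-12-13 15:30:30", tz = "America/Chicago")
same_instant <- parsed_in_chicago == forced_to_chicago
```

If the recorded clock times were local to the collection site, passing tz while parsing is usually the less error-prone route.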

Time zones are complicated. Over the last 100 years, places have changed their affiliation between major time zones, have opted out of (or into) daylight saving time (DST) in various years, or have adopted DST rule changes late or not at all. (The UK experimented with DST throughout 1971 only.) In a few countries (one is the Irish Republic), it is the summer time that is the ‘standard’ time, and a different name is used in winter. There can also be multiple changes during a year, for example for Ramadan. For these reasons, the time zone in which each time was recorded should be documented as part of the data collection and metadata processes.
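To make the point concrete, the sketch below uses base R to show that one zone name can carry different offsets from UTC at different times of the year. The zone and the two dates are illustrative choices, not taken from any data set.

```r
# Base R only. Noon on a winter day and on a summer day in the same zone.
winter <- as.POSIXct("2021-01-15 12:00:00", tz = "America/Chicago")
summer <- as.POSIXct("2021-07-15 12:00:00", tz = "America/Chicago")

# "%z" formats the signed offset from UTC as hours and minutes.
winter_offset <- format(winter, "%z")  # standard time
summer_offset <- format(summer, "%z")  # daylight saving time

# The same local clock reading lands on different UTC hours.
winter_utc <- format(winter, "%H:%M", tz = "UTC")
summer_utc <- format(summer, "%H:%M", tz = "UTC")
```

A bare clock time such as "12:00" is therefore ambiguous unless the zone name is recorded along with it.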

3.2.2 Factors

R uses factors to handle categorical variables, that is, variables that have a fixed and known set of possible values. Factors are also helpful for reordering character vectors to improve display. Internally, a factor is stored as integer codes that index a set of character labels (the levels), rather than as the characters themselves. For example, you can have day of the week as a factor variable.

day_of_week <- c('M', 'M', 'T', 'TH', 'W', 'SA', 'SU', 'TH')

(day_of_week <- factor(day_of_week))
## [1] M  M  T  TH W  SA SU TH
## Levels: M SA SU T TH W
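The integer storage can be inspected directly. A minimal sketch in base R, using the same vector as above; the names codes, labels and rebuilt are introduced here for illustration.

```r
day_of_week <- factor(c('M', 'M', 'T', 'TH', 'W', 'SA', 'SU', 'TH'))

# The labels, sorted alphabetically by default.
labels <- levels(day_of_week)

# The integer codes: each one is a position in the levels vector.
codes <- as.integer(day_of_week)

# Indexing the labels by the codes recovers the original strings.
rebuilt <- labels[codes]
```

Here codes is 1 1 4 5 6 2 3 5, because "M" is the first level, "T" the fourth, and so on.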

Components of a factor can be modified using simple assignment. However, values outside of its predefined levels are not permitted: assigning one produces NA with a warning. To use a new value, you need to add it to the levels first.

day_of_week[3] <- 'W'

day_of_week
## [1] M  M  W  TH W  SA SU TH
## Levels: M SA SU T TH W

day_of_week[4] <- 'F'
## Warning in `[<-.factor`(`*tmp*`, 4, value = "F"): invalid factor level, NA
## generated

levels(day_of_week) <- c(levels(day_of_week), "F")    # add new level
day_of_week[4] <- 'F'

str(day_of_week)
##  Factor w/ 7 levels "M","SA","SU",..: 1 1 6 7 6 2 3 5
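When the full set of categories is known in advance, one alternative is to declare all of the levels when the factor is created. This is a sketch in base R; all_days is a name introduced here, and 'MON' stands in for any value outside the declared levels.

```r
# Every code the variable may take, including ones not yet observed.
all_days <- c("M", "T", "W", "TH", "F", "SA", "SU")

day_of_week <- factor(c('M', 'M', 'T', 'TH', 'W', 'SA', 'SU', 'TH'),
                      levels = all_days)

# "F" is already a level, so this assignment raises no warning.
day_of_week[4] <- 'F'
any_missing <- anyNA(day_of_week)

# A value outside the declared levels still becomes NA (warning silenced
# here so that the effect can be inspected).
suppressWarnings(day_of_week[1] <- 'MON')
na_positions <- which(is.na(day_of_week))
```

Declaring the levels up front also turns a typo in the data into a visible NA instead of a silent new category.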

Reordering the levels is often useful, especially for plotting purposes.

day_of_week <- factor(c('M', 'M', 'T', 'TH', 'W', 'SA', 'SU', 'TH'))

str(day_of_week)
##  Factor w/ 6 levels "M","SA","SU",..: 1 1 4 5 6 2 3 5

day_of_week <- factor(day_of_week, levels=c("SU", "M", "T", "W", "TH", "F", "SA"))

str(day_of_week)
##  Factor w/ 7 levels "SU","M","T","W",..: 2 2 3 5 4 7 1 5
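The level order is what summaries and sorting follow, which is why it matters for plots and tables. A short sketch in base R with the same levels as above; the names counts, in_week_order and observed_only are introduced here for illustration.

```r
day_of_week <- factor(c('M', 'M', 'T', 'TH', 'W', 'SA', 'SU', 'TH'),
                      levels = c("SU", "M", "T", "W", "TH", "F", "SA"))

# table() reports categories in level order and keeps the unobserved
# level "F" with a count of zero.
counts <- table(day_of_week)

# sort() orders the values by level position, not alphabetically.
in_week_order <- as.character(sort(day_of_week))

# droplevels() removes levels that never occur in the data.
observed_only <- droplevels(day_of_week)
```

Whether to keep an empty level such as "F" depends on whether a zero count is meaningful for the analysis.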