3.2 Complicated Data Types
R has many complicated data types, but the two we will encounter most often are dates/times and factors.
3.2.1 Dealing with Date and Time
Dates and times often cause significant problems for students, because a lot of machinery sits behind them to make them readable. For example,
date_of_collection <- 2022-08-31
date_of_collection
## [1] 1983
class(date_of_collection)
## [1] "numeric"
Because - is interpreted as subtraction rather than as a hyphen, R treats the expression as arithmetic (2022 minus 8 minus 31). You can get around this by enclosing the date in quotes, which stores it as a character string, but that loses the ability to do arithmetic.
date_of_collection <- "2022-08-31"
class(date_of_collection)
## [1] "character"
date_of_collection + 2
## Error in date_of_collection + 2 : non-numeric argument to binary operator
You have to tell R explicitly that the value should be stored as a date. I strongly recommend using the lubridate package for this.
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
date_of_collection <- as_date("2022-08-31")
class(date_of_collection)
## [1] "Date"
Sys.Date() - date_of_collection
## Time difference of -11 days
In R, dates are represented as the number of days since 1970-01-01.
unclass(date_of_collection)
## [1] 19235
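Because a date is just a day count under the hood, ordinary arithmetic works on it. The following is a minimal sketch in base R (no lubridate required); the variable names are illustrative and do not appear elsewhere in this chapter.

```r
# A Date is stored as the number of days since 1970-01-01.
collected <- as.Date("2022-08-31")
day_count <- as.numeric(collected)            # 19235

# Rebuild the same date from the raw day count.
rebuilt <- as.Date(day_count, origin = "1970-01-01")

# Adding a number shifts the date by that many days.
one_week_later <- collected + 7               # "2022-09-07"

# Subtracting two dates gives a difftime measured in days.
elapsed <- as.Date("2022-12-25") - collected
as.numeric(elapsed, units = "days")           # 116
```

This is also why the character version fails: a string has no numeric value for R to add to.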
Times are even more complicated than dates, because they need an associated time zone and may be subject to offsets such as daylight saving time. In particular, I recommend the ymd_hms family of functions in lubridate for converting character strings into date-time objects. More on this later.
time_of_collection <- ymd_hms("2010-12-13 15:30:30")
time_of_collection
## [1] "2010-12-13 15:30:30 UTC"
# Changes printing
with_tz(time_of_collection, "America/Los_Angeles")
## [1] "2010-12-13 07:30:30 PST"
# Changes time
force_tz(time_of_collection, "America/Chicago")
## [1] "2010-12-13 15:30:30 CST"
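The difference between the two functions is easier to see by comparing the underlying instants rather than the printed strings. Here is a small sketch, assuming lubridate is installed; the variable names are made up for illustration.

```r
library(lubridate)

utc_time <- ymd_hms("2010-12-13 15:30:30")    # parsed as UTC by default

# with_tz(): same instant, shown on a different clock.
shown_in_la <- with_tz(utc_time, "America/Los_Angeles")
as.numeric(shown_in_la) == as.numeric(utc_time)        # TRUE

# force_tz(): same clock reading, but a different instant.
forced_chicago <- force_tz(utc_time, "America/Chicago")
difftime(forced_chicago, utc_time, units = "hours")    # Time difference of 6 hours
```

As a rule of thumb, with_tz is for reporting a correctly recorded time in another zone, while force_tz is for repairing a time that was read in with the wrong zone.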
Time zones are complicated. Over the last 100 years, places have changed their affiliation between major time zones, have opted out of (or in to) daylight saving time (DST) in various years, or have adopted DST rule changes late or not at all. (The UK experimented with DST throughout 1971, only.) In a few countries (one is the Irish Republic), it is the summer time that is the ‘standard’ time, and a different name is used in winter. There can also be multiple changes during a year, for example for Ramadan. The time zone and any DST conventions in use should be documented as part of the data collection and metadata processes.
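To see why the zone matters in practice, consider a clock change. This sketch uses base R and the America/Chicago zone as an arbitrary example; it assumes the system time zone database is available, and the variable names are illustrative.

```r
# 01:30 on the morning US clocks moved forward in 2022 (Chicago time).
before_change <- as.POSIXct("2022-03-13 01:30:00", tz = "America/Chicago")

# Add exactly one hour of elapsed time (3600 seconds).
after_change <- before_change + 3600

# %z prints the offset from UTC that applies at each instant.
format(before_change, "%H:%M %z")   # "01:30 -0600"
format(after_change,  "%H:%M %z")   # "03:30 -0500"
```

One hour of elapsed time moved the wall clock by two hours, which is only interpretable if the time zone was recorded with the data.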
3.2.2 Factors
R uses factors to handle categorical variables, that is, variables that have a fixed and known set of possible values. Factors are also helpful for reordering character vectors to improve display. Internally, a factor is stored as a vector of integer codes together with a set of level labels, rather than as character strings. For example, you can have day of the week as a factor variable.
day_of_week <- c('M', 'M', 'T', 'TH', 'W', 'SA', 'SU', 'TH')
(day_of_week <- factor(day_of_week))
## [1] M M T TH W SA SU TH
## Levels: M SA SU T TH W
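The integer storage can be inspected directly. The sketch below uses only base R; the name days is illustrative.

```r
days <- factor(c("M", "M", "T", "TH", "W", "SA", "SU", "TH"))

# Levels are the distinct values, sorted alphabetically by default.
levels(days)        # "M" "SA" "SU" "T" "TH" "W"

# Each element is stored as the position of its value within levels().
as.integer(days)    # 1 1 4 5 6 2 3 5

# Indexing the levels by the codes recovers the original labels.
identical(as.character(days), levels(days)[as.integer(days)])   # TRUE
```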
Components of a factor can be modified using simple assignment. However, values outside of its predefined levels are not permitted; to use a new value, you need to add it to the levels first.
day_of_week[3] <- 'W'
day_of_week
## [1] M M W TH W SA SU TH
## Levels: M SA SU T TH W
day_of_week[4] <- 'F'
## Warning in `[<-.factor`(`*tmp*`, 4, value = "F"): invalid factor level, NA
## generated
levels(day_of_week) <- c(levels(day_of_week), "F") # add new level
day_of_week[4] <- 'F'
str(day_of_week)
## Factor w/ 7 levels "M","SA","SU",..: 1 1 6 7 6 2 3 5
Reordering the levels is often useful, especially for plotting purposes.
day_of_week <- factor(c('M', 'M', 'T', 'TH', 'W', 'SA', 'SU', 'TH'))
str(day_of_week)
## Factor w/ 6 levels "M","SA","SU",..: 1 1 4 5 6 2 3 5
day_of_week <- factor(day_of_week, levels=c("SU", "M", "T", "W", "TH", "F", "SA"))
str(day_of_week)
## Factor w/ 7 levels "SU","M","T","W",..: 2 2 3 5 4 7 1 5
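The level order matters because summaries and plots follow it. A short base R sketch, with days as an illustrative name:

```r
days <- factor(c("M", "M", "T", "TH", "W", "SA", "SU", "TH"),
               levels = c("SU", "M", "T", "W", "TH", "F", "SA"))

# table() reports counts in level order, including the unused level "F".
counts <- table(days)
names(counts)              # "SU" "M" "T" "W" "TH" "F" "SA"
as.vector(counts)          # 1 2 1 1 2 0 1

# sort() orders the values by level, not alphabetically.
as.character(sort(days))   # "SU" "M" "M" "T" "W" "TH" "TH" "SA"

# droplevels() removes levels that never occur (here "F").
levels(droplevels(days))   # "SU" "M" "T" "W" "TH" "SA"
```

Calling barplot(counts) would draw the bars in this same calendar order, which is usually what you want for days of the week.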