4.1 Data Frames & Tibbles

In the previous chapter we saw how rectangular data is stored in data frames. The tidyverse uses tibbles instead.

A tibble, or tbl_df, is a modern reimagining of the data.frame, keeping what time has proven to be effective, and throwing out what is not. Tibbles are data.frames that are lazy and surly: they do less (i.e. they don’t change variable names or types, and don’t do partial matching) and complain more (e.g. when a variable does not exist). This forces you to confront problems earlier, typically leading to cleaner, more expressive code.
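The "do less, complain more" behaviour can be seen in a minimal sketch. The objects `df` and `tib` and their column names are made up for illustration and are not part of the chapter's data.

```r
library(tibble)

# Two illustrative objects with identical contents (hypothetical names)
df  <- data.frame(abc = 1:3, xyz = c("a", "b", "c"))
tib <- tibble(abc = 1:3, xyz = c("a", "b", "c"))

# data.frame: `$` does partial matching, so `a` silently resolves to `abc`
df$a

# data.frame: a column that does not exist returns NULL without any message
df$zzz

# tibble: no partial matching; this returns NULL and raises a warning
# about an unknown column
tib$a
```

With the data frame, a typo in a column name can go unnoticed; with the tibble, the warning points at the problem straight away.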

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ lubridate::as.difftime() masks base::as.difftime()
## ✖ lubridate::date()        masks base::date()
## ✖ tidyr::extract()         masks magrittr::extract()
## ✖ dplyr::filter()          masks stats::filter()
## ✖ lubridate::intersect()   masks base::intersect()
## ✖ dplyr::lag()             masks stats::lag()
## ✖ purrr::set_names()       masks magrittr::set_names()
## ✖ lubridate::setdiff()     masks base::setdiff()
## ✖ lubridate::union()       masks base::union()

data_tib <- tibble(
  `alphabet soup` = letters,
  `nums ints` = 1:26,
  `sample ints` = sample(100, 26)
)

data_df <- data.frame(
  `alphabet soup` = letters,
  `nums ints` = 1:26,
  `sample ints` = sample(100, 26)
)

glimpse(data_tib)
## Rows: 26
## Columns: 3
## $ `alphabet soup` <chr> "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k",…
## $ `nums ints`     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
## $ `sample ints`   <int> 46, 84, 69, 6, 42, 80, 25, 61, 56, 11, 16, 52, 99, 81,…
glimpse(data_df)
## Rows: 26
## Columns: 3
## $ alphabet.soup <chr> "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "…
## $ nums.ints     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
## $ sample.ints   <int> 19, 40, 87, 49, 13, 92, 38, 4, 59, 96, 32, 62, 43, 46, 3…

# Note the use of glimpse() instead of str(); the two functions are very similar.

Notice how data.frame() changes the variable names: by default it converts them to syntactically valid R names, so the spaces become dots. One advantage of tibbles is that column names need not be valid R variable names, as long as they are enclosed in backticks when you refer to them.
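The renaming can be reproduced and switched off in base R. The sketch below assumes the default `check.names = TRUE` behaviour of `data.frame()`; the object name `data_df_raw` is introduced here only for illustration.

```r
# make.names() is the base R helper that turns arbitrary strings
# into syntactically valid names; spaces are replaced by dots
make.names(c("alphabet soup", "nums ints"))

# With check.names = FALSE, data.frame() keeps the names as given
data_df_raw <- data.frame(
  `alphabet soup` = letters,
  `nums ints` = 1:26,
  check.names = FALSE
)
names(data_df_raw)

# Non-syntactic names must be wrapped in backticks when used with `$`
head(data_df_raw$`nums ints`)
```

Even with `check.names = FALSE` the result is still a plain data frame, so the other tibble behaviours described above do not apply to it.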

You can use base R functions to work with tibbles, because a tibble is itself a data frame. However, functions written for tibbles may not work with plain data frames.

data_tib[, 3]
## # A tibble: 26 × 1
##    `sample ints`
##            <int>
##  1            46
##  2            84
##  3            69
##  4             6
##  5            42
##  6            80
##  7            25
##  8            61
##  9            56
## 10            11
## # … with 16 more rows

data_tib[2:4, 1:3]
## # A tibble: 3 × 3
##   `alphabet soup` `nums ints` `sample ints`
##   <chr>                 <int>         <int>
## 1 b                         2            84
## 2 c                         3            69
## 3 d                         4             6
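One difference worth keeping in mind when mixing base R and tibbles is what `[` returns for a single column. The following is a small sketch with made-up objects (`df` and `tib` are illustrative names, not the chapter's data):

```r
library(tibble)

df  <- data.frame(x = 1:3, y = c("a", "b", "c"))
tib <- as_tibble(df)

# A tibble inherits from data.frame, which is why base R functions accept it
inherits(tib, "data.frame")

# Selecting one column of a data frame with `[` simplifies to a vector ...
class(df[, 1])

# ... while the same call on a tibble always returns a tibble
class(tib[, 1])

# To extract a vector from a tibble, use `[[` or `$`
tib[["x"]]

# To keep a data frame when selecting one column, use drop = FALSE
class(df[, 1, drop = FALSE])
```

Code that expects `x[, 1]` to be a vector may therefore behave differently when handed a tibble, and vice versa.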