Visualising Data with Grammar of Graphics

Nikhil Kaza

2024-08-18

Recap from yesterday

  • Installing R/RStudio
  • Different data types in rectangular datasets
  • Using functions from tidyverse packages to explore AQI data
  • Good data analysis habits
  • Asking for help
  • Joining, mutating, filtering datasets

Today’s plan

  • Visualisation principles
  • Grammar of graphics
  • Visualisation with ggplot
  • Advanced techniques

Visualisation Principles

Focus attention by highlighting

Focus attention by stripping

Source: Zan Armstrong (2021) @ Observable

Order matters

Source: Zan Armstrong (2021) @ Observable

Avoid legends

Legends rely on visual association, which is hard.

Source: https://headwaterseconomics.org/dataviz/west-wide-atlas/
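One common alternative to a legend is to label the marks directly. A minimal sketch with ggplot2, using the built-in Orange dataset as a stand-in; the dataset, the label text and the nudge value are illustrative assumptions and do not come from the figure above.

```r
library(ggplot2)

# Orange ships with base R: trunk circumference of five trees over time.
orange <- as.data.frame(Orange)
orange$Tree <- as.character(orange$Tree)

# Keep the last observation of each tree; the label is placed there.
last_obs <- orange[orange$age == max(orange$age), ]

p_direct <- ggplot(orange, aes(x = age, y = circumference, group = Tree)) +
  geom_line(aes(colour = Tree)) +
  # Put the label next to the end of each line, in the line's own colour.
  geom_text(
    data = last_obs,
    aes(label = paste("Tree", Tree), colour = Tree),
    hjust = 0, nudge_x = 20
  ) +
  # Leave room on the right for the labels, then drop the legend.
  scale_x_continuous(expand = expansion(mult = c(0.05, 0.15))) +
  theme(legend.position = "none")
```

Printing p_direct draws the plot. The reader no longer has to match colours between a legend and the lines.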

Channels - Data

Source: Munzner (2014)

Gestalt Principles

Small multiples

Source: Wilke (2019)

Compound figures

Source: Onda et al. (2020)

Spatial is not special

This map shows the locations of approximately 3,000 Walmart stores. The area of each hexagon represents the number of stores in the vicinity, while the color represents the median age of these stores. Older stores are red, and newer stores are blue.
Source: Mike Bostock @ Observable https://observablehq.com/@d3/hexbin-map

Accuracy is overrated

Source: Deetz and Adams (1921)

Precision even more so!

Source: Few (2017) https://www.perceptualedge.com/blog/?p=2596

Resources

Grammar of Graphics

ggplot2

Source: Healy (2018)

Framework

Mapping data (Channels)

Encoding data into visual cues to highlight comparisons

A non-exhaustive list

  • length
  • position
  • size
  • shape
  • area
  • angle
  • colour
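In ggplot2, channels are assigned with aes(). A minimal sketch, assuming the mpg dataset that ships with ggplot2 as a stand-in for any rectangular dataset; the choice of variables is purely illustrative.

```r
library(ggplot2)

# Each argument of aes() maps one column to one visual channel.
p_channels <- ggplot(mpg) +
  geom_point(aes(
    x = displ,       # position along x
    y = hwy,         # position along y
    colour = class,  # colour (hue)
    size = cyl,      # size
    shape = drv      # shape
  ))

# The mapping is stored on the layer and can be inspected.
names(p_channels$layers[[1]]$mapping)
```

Printing p_channels draws the plot. Mapping five variables at once is shown only to list the channels; it is rarely a good design.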

Visualisation with ggplot in R

Recall the dataset

Code
library(tidyverse)
library(here)
# Sys.setlocale("LC_TIME", "en_US.UTF-8") 
# Uncomment above line, if computer's locale is not in US.

AQI_tbl <- here("data", "raw", "AQI_March_Joint.csv") %>% 
    read_csv()

AQI_tbl <- AQI_tbl %>% 
    mutate(date_collected = ymd(date_collected)) %>%
    mutate(month_of_obs = month(date_collected, label = TRUE)) %>%
    mutate(AQI = as.numeric(AQI)) %>%
    # Recode AQI into its six categories; a missing AQI stays NA
    mutate(airquality = case_when(
           AQI <= 50 ~ "Good",
           AQI > 50 & AQI <= 100 ~ "Moderate",
           AQI > 100 & AQI <= 150 ~ "Unhealthy for Sensitive Groups",
           AQI > 150 & AQI <= 200 ~ "Unhealthy",
           AQI > 200 & AQI <= 300 ~ "Very Unhealthy",
           AQI > 300 ~ "Hazardous",
           .default = NA_character_)
           ) %>%
    mutate(sensor_ID = as.character(sensor_ID),
           airquality = factor(airquality, levels = c("Good", "Moderate", "Unhealthy for Sensitive Groups", "Unhealthy", "Very Unhealthy", "Hazardous"))
           )
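The same binning can be sketched with base R's cut(), which may be easier to check against the AQI breakpoints. The toy_aqi vector is made up for illustration; the breaks and labels mirror the case_when() rules above.

```r
aqi_labels <- c("Good", "Moderate", "Unhealthy for Sensitive Groups",
                "Unhealthy", "Very Unhealthy", "Hazardous")

# Intervals are closed on the right by default: (50, 100], (100, 150], ...
aqi_breaks <- c(-Inf, 50, 100, 150, 200, 300, Inf)

# Hypothetical AQI readings, including boundary values and a missing one.
toy_aqi <- c(12, 50, 51, 151, 301, NA)

toy_cat <- cut(toy_aqi, breaks = aqi_breaks, labels = aqi_labels)
as.character(toy_cat)
# "Good" "Good" "Moderate" "Unhealthy" "Hazardous" NA
```

cut() returns a factor whose levels follow the order of the labels, so the separate factor() step is not needed in this variant.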

Build it by layer

Code
AQI_tbl %>%
    ggplot() 

Initialises a ggplot object. With no layers or mappings yet, it draws only an empty panel.
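The layer structure can be inspected directly. A small sketch, assuming the mpg dataset from ggplot2 in place of AQI_tbl so that it runs on its own:

```r
library(ggplot2)

# A plot object with data attached but no layers yet.
p_empty <- ggplot(mpg)
length(p_empty$layers)  # 0

# Adding a geom with `+` appends one layer to the same object.
p_one <- p_empty + geom_point(aes(x = displ, y = hwy))
length(p_one$layers)    # 1
```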

Build it by layer

Code
AQI_tbl %>%
    ggplot() +
    geom_point(aes(x=date_collected, y=AQI, color = airquality)) +
    xlab("Collection Date")

  • Map date_collected to the x scale (position), AQI to the y scale (position), and airquality to the colour scale (hue)
  • The default statistic for geom_point is identity, so values are plotted as they are
  • The scaled values are then drawn with the point geometry
  • All other aspects of the visualisation have sensible defaults (such as Cartesian coordinates)
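To make those defaults visible, the same plot can be written out in full. This is a sketch on the mpg dataset from ggplot2 (an assumption, so that it runs without the AQI file); layer() is the lower-level function that geom_point() wraps.

```r
library(ggplot2)

# The short form, relying on defaults.
p_default <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, colour = drv))

# The long form, with statistic, position, scales and coordinates spelled out.
p_explicit <- ggplot(mpg) +
  layer(
    geom = "point", stat = "identity", position = "identity",
    mapping = aes(x = displ, y = hwy, colour = drv)
  ) +
  scale_x_continuous() +
  scale_y_continuous() +
  scale_colour_discrete() +
  coord_cartesian()

# Compare the data each plot computes for its first layer.
cols <- c("x", "y", "colour")
same <- isTRUE(all.equal(layer_data(p_default)[cols],
                         layer_data(p_explicit)[cols]))
same
```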

Build it by layer

Code
AQI_tbl %>%
    ggplot() +
    geom_point(aes(x=date_collected, y=AQI, color = airquality)) +
    scale_color_brewer(palette = "Set3") +
    scale_y_log10() +
    xlab('Collection Date')

  • Changes the default colour scale to a different palette (only shown for illustration)
  • Changes the y position scale to log10 (only shown for illustration)
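A scale transforms the data before it is drawn. A minimal sketch with a made-up data frame (toy, day and aqi are hypothetical names) shows what scale_y_log10() does to the y values:

```r
library(ggplot2)

# Hypothetical readings spanning several orders of magnitude.
toy <- data.frame(day = 1:4, aqi = c(1, 10, 100, 1000))

p_log <- ggplot(toy) +
  geom_point(aes(x = day, y = aqi)) +
  scale_y_log10()

# The built layer data holds the transformed values, not the raw ones.
as.numeric(layer_data(p_log)$y)  # 0 1 2 3
```

The axis labels still show the original units, while the spacing between points follows the log10 values.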

Color Palettes (A digression)

Beware of Color Blindness (A digression)

Color Scales

Green Blindness

Red Blindness

Blue Blindness

Desaturated
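One way to screen palettes before plotting is the metadata table in RColorBrewer, which flags the palettes its authors consider colour-blind friendly. A short sketch, assuming the RColorBrewer package is installed (the variable names are illustrative):

```r
library(RColorBrewer)

# brewer.pal.info lists every ColorBrewer palette with a `colorblind` flag.
qual <- brewer.pal.info[brewer.pal.info$category == "qual", ]

# Names of the qualitative palettes flagged as colour-blind friendly.
cb_safe <- rownames(qual)[qual$colorblind]
cb_safe
```

Any of these names can be passed to scale_color_brewer(palette = ...) or scale_fill_brewer(palette = ...). The viridis scales, such as scale_fill_viridis_d(), are another common choice.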

Build it by layer

Code
AQI_tbl %>%
    ggplot() +
    geom_point(aes(x=date_collected, y=AQI, color = airquality)) +
    geom_hline(yintercept = 100) +
    geom_hline(yintercept = 150) +
    geom_hline(yintercept = 200) +
    geom_hline(yintercept = 300) +
    xlab('Collection Date')

Build it by layer

Code
AQI_tbl %>%
    ggplot() +
    geom_point(aes(x=date_collected, y=AQI)) +
    geom_hline(yintercept = 100) +
    geom_hline(yintercept = 150) +
    geom_hline(yintercept = 200) +
    geom_hline(yintercept = 300) +
    facet_wrap(~omb, nrow = 2) +
    xlab('Collection Date')

Experiment with different geoms

Code
AQI_tbl %>%
    ggplot() +
    geom_boxplot(aes(x=date_collected, y = AQI, group = date_collected)) +
    geom_hline(yintercept = 100) +
    geom_hline(yintercept = 150) +
    geom_hline(yintercept = 200) +
    geom_hline(yintercept = 300) +
    facet_wrap(~omb, nrow = 2) +
    xlab('Collection Date')

Experiment with different geoms

Code
AQI_tbl %>%
    ggplot() +
    geom_smooth(aes(x=date_collected, y=AQI, group = sensor_ID, color = omb), method = "loess", se = FALSE) +
    xlab("Collection Date") + ylab("AQI")

  • Should the focus be on the average trend?

Or should it be on the outliers?

Code
AQI_tbl %>%
    mutate(week_date = week(date_collected)) %>%
    group_by(sensor_ID, week_date) %>%
    # Flag outliers within each sensor-week
    mutate(outlier_yn = rstatix::is_outlier(AQI)) %>%
    ungroup() %>%
    # sensor_ID was converted to character, so compare against a string
    filter(sensor_ID == "36451") %>%
    ggplot() +
    geom_smooth(aes(x=date_collected, y=AQI), method = "loess", se = FALSE) +
    geom_point(aes(x=date_collected, y=AQI, color = outlier_yn), size = 2) +
    scale_color_manual(values = c(`FALSE` = "gray", `TRUE` = "red"), guide = "none") +
    labs(title = "Sensor ID 36451") +
    xlab("Collection Date") + ylab("AQI")
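As far as I understand it, rstatix::is_outlier() applies the boxplot rule: a value is flagged when it lies more than 1.5 times the interquartile range below the first quartile or above the third. A base R sketch of that rule follows; flag_outliers and the aqi vector are hypothetical names and values for illustration, not part of rstatix.

```r
# Boxplot rule: flag values beyond `coef` * IQR from the quartiles.
flag_outliers <- function(x, coef = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[[2]] - q[[1]]
  x < q[[1]] - coef * iqr | x > q[[2]] + coef * iqr
}

# Hypothetical week of readings with one spike.
aqi <- c(40, 42, 45, 47, 50, 52, 55, 160)
which(flag_outliers(aqi))  # 8
```

Because the rule is applied within each sensor-week group, a reading counts as an outlier only relative to that sensor's other readings in the same week.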

Experiment with different mappings

Code
AQI_tbl %>%
    filter(sensor_ID == "36451") %>%
    ggplot() +
    geom_bar(aes(x=month_of_obs, fill=airquality), 
             stat = "count", position = "dodge") +
    scale_fill_viridis_d() +
    xlab("Month") + ylab("Count")

  • Should we be using fill colours?

Experiment with position

Code
AQI_tbl %>%
    filter(sensor_ID == "36451") %>%
    ggplot() +
    geom_bar(aes(x=month_of_obs, fill=airquality), 
             position = "identity", stat = "count") +
    scale_fill_brewer(palette = "Set2")+
    xlab("") + ylab("Count")
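The effect of the position argument shows up in the computed bar coordinates. A sketch on the mpg dataset from ggplot2 (an assumption, so it runs without the AQI file), comparing "identity" with "stack":

```r
library(ggplot2)

base <- ggplot(mpg, aes(x = class, fill = drv))

# Lower edge of every bar under each position adjustment.
ymin_identity <- layer_data(base + geom_bar(position = "identity"))$ymin
ymin_stack    <- layer_data(base + geom_bar(position = "stack"))$ymin

# With "identity" every bar starts at zero, so bars at the same x overlap
# and taller bars can hide shorter ones.
all(ymin_identity == 0)

# With "stack" some bars start where the previous one ends.
any(ymin_stack > 0)
```

With "dodge" the bars sit side by side instead, as in the previous slide.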

Experiment with different mappings

Code
AQI_tbl %>%
    mutate(week_date = week(date_collected)) %>%
    filter(sensor_ID == "36451") %>%
    ggplot() +
    geom_point(aes(x=week_date, shape=airquality), 
             stat = "count") +
    xlab("Week #") + ylab("Count")

Why is this a bad visualisation? (Hint: Gestalt principles)

Thank you!