Spatial Data in R

Nikhil Kaza

5/4/23

Recap from last week

  • Grammar of Graphics
  • Aesthetics, Mapping & Geoms
  • Scales
  • Facetting
  • Spatial is not special

Today’s plan

  • Thinking about space
  • sf for storing and processing spatial objects
  • tmap for visualising spatial objects
  • Advanced R with sf
  • Further work

Thinking about space

Conceptual rethink

Spatial is just a data type (mostly…)

library(tidyverse)
library(here)

sm_tbl <-
    here("data", "processed", "smartcard_20150401_subway.csv") %>%
    read_csv()
  •   Close to 4.4 million trips
  •   2.4 million unique individuals (card ids)
  •   285 subway station locations
CardID tripid station_name_D station_name_O lon_D lon_O lat_D lat_O dat_time_D dat_time_O
5690 1 宜山路 星中路 121.4226 121.3643 31.18858 31.16001 2015-04-01 12:12:06 2015-04-01 11:52:57
5690 2 星中路 宜山路 121.3643 121.4226 31.16001 31.18858 2015-04-01 18:39:48 2015-04-01 18:24:04
5951 1 南京东路 新村路 121.4801 121.4180 31.24006 31.26578 2015-04-01 14:55:19 2015-04-01 14:24:06
5951 2 新村路 南京东路 121.4180 121.4801 31.26578 31.24006 2015-04-01 16:21:04 2015-04-01 15:50:06
6032 1 锦绣路 广兰路 121.5357 121.6169 31.18973 31.21325 2015-04-01 14:42:53 2015-04-01 14:17:33
6032 2 广兰路 锦绣路 121.6169 121.5357 31.21325 31.18973 2015-04-01 15:34:24 2015-04-01 15:09:45

Spatial is just a data type (mostly…)

library(tidyverse)
library(here)

sm_tbl <-
    here("data", "processed", "smartcard_20150401_subway.csv") %>%
    read_csv()
CardID tripid station_name_D station_name_O lon_D lon_O lat_D lat_O dat_time_D dat_time_O
5690 1 宜山路 星中路 121.4226 121.3643 31.18858 31.16001 2015-04-01 12:12:06 2015-04-01 11:52:57
5690 2 星中路 宜山路 121.3643 121.4226 31.16001 31.18858 2015-04-01 18:39:48 2015-04-01 18:24:04
5951 1 南京东路 新村路 121.4801 121.4180 31.24006 31.26578 2015-04-01 14:55:19 2015-04-01 14:24:06
5951 2 新村路 南京东路 121.4180 121.4801 31.26578 31.24006 2015-04-01 16:21:04 2015-04-01 15:50:06
6032 1 锦绣路 广兰路 121.5357 121.6169 31.18973 31.21325 2015-04-01 14:42:53 2015-04-01 14:17:33
6032 2 广兰路 锦绣路 121.6169 121.5357 31.21325 31.18973 2015-04-01 15:34:24 2015-04-01 15:09:45

Visualising spatial data

Code
library(ggplot2)
library(ggthemes)

numtrips <- sm_tbl %>%
    filter(station_name_O != station_name_D) %>%
  mutate(hr = hour(dat_time_O),
         time_of_day = hr %>% cut(breaks=c(0,6,10,16,20,24), include.lowest = TRUE, labels=c("Midnight - 6AM", "6AM - 10AM", "10AM - 4PM", "4PM - 8PM", "8PM - Midnight"))
         ) %>%
  group_by(station_name_O, station_name_D, time_of_day) %>%
  summarise(lon_O = first(lon_O),
            lat_O = first(lat_O),
            lon_D = first(lon_D),
            lat_D = first(lat_D),
            totaltrips = n()
            )



numtrips %>%
 ggplot()+
  geom_segment(aes(x=lon_O, y=lat_O,xend=lon_D, yend=lat_D, alpha=totaltrips))+
  #Here is the magic bit that sets line transparency - essential to make the plot readable
  scale_alpha_continuous(range = c(0.005, 0.01), guide='none')+
  facet_wrap(time_of_day~.)+
  #Set black background, ditch axes
  scale_x_continuous("", breaks=NULL)+
  scale_y_continuous("", breaks=NULL) +
    theme_tufte()

Visualising spatial data

Code
numtrips <- sm_tbl %>%
    filter(station_name_O != station_name_D) %>%
  mutate(hr = hour(dat_time_O),
         time_of_day = hr %>% cut(breaks=c(0,6,10,16,20,24), include.lowest = TRUE, labels=c("Midnight - 6AM", "6AM - 10AM", "10AM - 4PM", "4PM - 8PM", "8PM - Midnight"))
         ) %>%
  group_by(station_name_D, time_of_day) %>%
  summarise(lon_D = first(lon_D),
            lat_D = first(lat_D),
            totaltrips = n()
            )



numtrips %>%
 ggplot()+
  geom_point(aes(x=lon_D, y=lat_D, size=totaltrips))+
  scale_size_continuous(range = c(0.01, 2), guide='none')+
  facet_wrap(time_of_day~.)+
  #Set black background, ditch axes
  scale_x_continuous("", breaks=NULL)+
  scale_y_continuous("", breaks=NULL) +
    theme_tufte()+
    labs(main = "Popularity of destination stations on Apr 1, 2015")

But basemaps provide context

Code
library(ggmap)

bbox <- c(left = 120.9892, bottom = 30.7151, right = 122.0228, top = 31.4177)
g1 <- get_stamenmap(bbox, zoom = 10, maptype = "toner-hybrid", color = "bw")


ggmap(g1)+
    geom_point(aes(x=lon_D, y=lat_D, size=totaltrips), data = numtrips)+
  scale_size_continuous(range = c(0.01, 2), guide='none', trans = )+
  facet_wrap(time_of_day~.)+
  #Set black background, ditch axes
  scale_x_continuous("", breaks=NULL)+
  scale_y_continuous("", breaks=NULL) +
    theme_tufte()

Easier with tmap (but need to get serious)

Code
library(tmap)
tmap_mode("view")

sm_tbl %>%
    distinct(lon_D, lat_D) %>%
    drop_na %>%
    sf::st_as_sf(coords = c("lon_D", "lat_D"), crs = 4326) %>%
    tm_shape()+
    tm_dots(size = .01, col="red")+
    tm_basemap("Stamen.Toner")

Introducing sf

Spatial with sf

Spatial with sf

  • represents simple features as records with a geometry list-column
  • represents natively in R all 17 simple feature types for all dimensions (XY, XYZ, XYM, XYZM)
  • interfaces to GEOS (projected) and s2geometry (unprojected)
  • interfaces to GDAL, supporting all driver options
  • reads from and writes to spatial databases such as PostGIS using DBI

sf ecosystem

sf: Objects with simple features

library(sf)
nc <- st_read(system.file("shape/nc.shp", package="sf"))
print(nc[9:15], n = 3)

Geometry

Polygons are complex (A digression)

Geometry is not that different from time

Promote a set of columns to a geometry column

scard_sf <- sm_tbl %>%
    drop_na(lon_D, lat_D) %>%
    sf::st_as_sf(coords = c("lon_D", "lat_D"), crs = 4326)
CardID tripid station_name_D station_name_O lon_O lat_O dat_time_D dat_time_O geometry
5690 1 宜山路 星中路 121.3643 31.16001 2015-04-01 12:12:06 2015-04-01 11:52:57 POINT (121.4226 31.18858)
5690 2 星中路 宜山路 121.4226 31.18858 2015-04-01 18:39:48 2015-04-01 18:24:04 POINT (121.3643 31.16001)
5951 1 南京东路 新村路 121.4180 31.26578 2015-04-01 14:55:19 2015-04-01 14:24:06 POINT (121.4801 31.24006)
5951 2 新村路 南京东路 121.4801 31.24006 2015-04-01 16:21:04 2015-04-01 15:50:06 POINT (121.418 31.26578)
6032 1 锦绣路 广兰路 121.6169 31.21325 2015-04-01 14:42:53 2015-04-01 14:17:33 POINT (121.5357 31.18973)
6032 2 广兰路 锦绣路 121.5357 31.18973 2015-04-01 15:34:24 2015-04-01 15:09:45 POINT (121.6169 31.21325)

Be careful with sticky geometry

scard_sf %>%
    filter(CardID == "6138") %>%
    select(station_name_O)
station_name_O geometry
虹桥火车站 POINT (121.4706 31.20117)
鲁班路 POINT (121.5686 31.25259)
云山路 POINT (121.4706 31.20117)
鲁班路 POINT (121.5776 31.25927)

Multiple sets of columns are good candidates for geometry

scard_sf <- sm_tbl %>%
                    sf::st_as_sf(coords = c("lon_D", "lat_D"), crs = 4326) 

scard_sf$orig_geom <- sm_tbl %>%
                    sf::st_as_sf(coords = c("lon_O", "lat_O"), crs = 4326) %>%
                    st_geometry()                   
CardID tripid geometry orig_geom
5690 1 POINT (121.4226 31.18858) POINT (121.3643 31.16001)
5690 2 POINT (121.3643 31.16001) POINT (121.4226 31.18858)
5951 1 POINT (121.4801 31.24006) POINT (121.418 31.26578)
5951 2 POINT (121.418 31.26578) POINT (121.4801 31.24006)
6032 1 POINT (121.5357 31.18973) POINT (121.6169 31.21325)
6032 2 POINT (121.6169 31.21325) POINT (121.5357 31.18973)

Create new geometries

scard_sf %>%
    mutate(buffer_geom = st_buffer(orig_geom, .05))
CardID orig_geom buffer_geom geometry
5690 POINT (121.3643 31.16001) POLYGON ((121.3643 31.16001... POINT (121.4226 31.18858)
5690 POINT (121.4226 31.18858) POLYGON ((121.4226 31.18858... POINT (121.3643 31.16001)
5951 POINT (121.418 31.26578) POLYGON ((121.418 31.26578,... POINT (121.4801 31.24006)
5951 POINT (121.4801 31.24006) POLYGON ((121.4801 31.24006... POINT (121.418 31.26578)
6032 POINT (121.6169 31.21325) POLYGON ((121.6169 31.21325... POINT (121.5357 31.18973)
6032 POINT (121.5357 31.18973) POLYGON ((121.5357 31.18973... POINT (121.6169 31.21325)

A reminder that (only active) geometry is sticky

scard_sf %>%
    select(station_name_D, dat_time_D) 
station_name_D dat_time_D geometry
宜山路 2015-04-01 12:12:06 POINT (121.4226 31.18858)
星中路 2015-04-01 18:39:48 POINT (121.3643 31.16001)
南京东路 2015-04-01 14:55:19 POINT (121.4801 31.24006)
新村路 2015-04-01 16:21:04 POINT (121.418 31.26578)

Use plyrs

scard_sf %>%
   filter(station_name_O != station_name_D) %>%
  mutate(hr = hour(dat_time_O),
         time_of_day = hr %>% cut(breaks=c(0,6,10,16,20,24), include.lowest = TRUE, labels=c("Midnight - 6AM", "6AM - 10AM", "10AM - 4PM", "4PM - 8PM", "8PM - Midnight"))
         )
CardID tripid hr time_of_day geometry
5690 1 11 10AM - 4PM POINT (121.4226 31.18858)
5690 2 18 4PM - 8PM POINT (121.3643 31.16001)
5951 1 14 10AM - 4PM POINT (121.4801 31.24006)
5951 2 15 10AM - 4PM POINT (121.418 31.26578)
6032 1 14 10AM - 4PM POINT (121.5357 31.18973)
6032 2 15 10AM - 4PM POINT (121.6169 31.21325)

… but worry about summarisation

scard_sf %>%
   filter(station_name_O != station_name_D) %>%
  mutate(hr = hour(dat_time_O),
         time_of_day = hr %>% cut(breaks=c(0,6,10,16,20,24), include.lowest = TRUE, labels=c("Midnight - 6AM", "6AM - 10AM", "10AM - 4PM", "4PM - 8PM", "8PM - Midnight"))
         ) %>%
    group_by(time_of_day) %>%
    summarise(totaltrips = n())
totaltrips time_of_day geometry
3 Midnight - 6AM MULTIPOINT ((121.4111 31.21...
10 6AM - 10AM MULTIPOINT ((121.3331 31.15...
18 10AM - 4PM MULTIPOINT ((121.3807 31.11...
18 4PM - 8PM MULTIPOINT ((121.7573 31.05...

Drop it!

scard_sf %>%
    st_drop_geometry()
CardID tripid station_name_D station_name_O lon_O lat_O dat_time_D dat_time_O orig_geom
5690 1 宜山路 星中路 121.3643 31.16001 2015-04-01 12:12:06 2015-04-01 11:52:57 POINT (121.3643 31.16001)
5690 2 星中路 宜山路 121.4226 31.18858 2015-04-01 18:39:48 2015-04-01 18:24:04 POINT (121.4226 31.18858)
5951 1 南京东路 新村路 121.4180 31.26578 2015-04-01 14:55:19 2015-04-01 14:24:06 POINT (121.418 31.26578)
5951 2 新村路 南京东路 121.4801 31.24006 2015-04-01 16:21:04 2015-04-01 15:50:06 POINT (121.4801 31.24006)

Particularly useful in summarisation

scard_sf %>%
   filter(station_name_O != station_name_D) %>%
  mutate(hr = hour(dat_time_O),
         time_of_day = hr %>% cut(breaks=c(0,6,10,16,20,24), include.lowest = TRUE, labels=c("Midnight - 6AM", "6AM - 10AM", "10AM - 4PM", "4PM - 8PM", "8PM - Midnight"))
         ) %>%
    st_drop_geometry() %>%
    group_by(time_of_day) %>%
    summarise(totaltrips = n())
time_of_day totaltrips
Midnight - 6AM 193219
6AM - 10AM 1576447
10AM - 4PM 1044716
4PM - 8PM 1345731
8PM - Midnight 188404

Visualisation with tmap

tmap again

Code
tmap_mode("plot")

numtrips <- sm_tbl %>%
    filter(station_name_O != station_name_D) %>%
  mutate(hr = hour(dat_time_O),
         time_of_day = hr %>% cut(breaks=c(0,6,10,16,20,24), include.lowest = TRUE, labels=c("Midnight - 6AM", "6AM - 10AM", "10AM - 4PM", "4PM - 8PM", "8PM - Midnight"))
         ) %>%
  group_by(station_name_D, time_of_day) %>%
  summarise(lon_D = first(lon_D),
            lat_D = first(lat_D),
            totaltrips = n()
            ) %>%
    st_as_sf(coords = c("lon_D", "lat_D"), crs=4326) 


numtrips %>%
    tm_shape()+
    tm_bubbles(size = "totaltrips", 
            col="red",
            border.col = "red",
            title.size = "# of trips",
            legend.size.is.portrait = TRUE
            )+
    tm_facets(by = "time_of_day") +
     tm_layout(
            legend.outside.size = 0.2,
            main.title= 'Popularity of trip destination', 
            main.title.position = c("center", 'top'))

tmap can create interactive maps

Code
tmap_mode('view')

opts <- tmap_options(basemaps = c(Canvas = c("Esri.WorldGrayCanvas","Stamen.Toner"), 
                                             Imagery = "Esri.WorldImagery"),
    overlays = c(Labels = paste0("http://services.arcgisonline.com/arcgis/rest/services/Canvas/",
                                 "World_Light_Gray_Reference/MapServer/tile/{z}/{y}/{x}")))

numtrips %>%
    filter(time_of_day == "4PM - 8PM") %>%
    tm_shape(name = "Stations")+
    tm_bubbles(size = "totaltrips", 
            col="red",
            border.col = "red",
            alpha = .5
            ) 

Advanced R with sf

Perhaps the geometry needs to be different

scard_geom <- sm_tbl %>%
    select(lon_O, lat_O, lon_D, lat_D) %>% 
    split(seq(nrow(.))) %>%
    map(function(row){unlist(row) %>% matrix(ncol = 2, byrow=T) %>% st_linestring()}) %>%
    st_as_sfc(crs = 4326)

scard_sf_line <- st_sf(sm_tbl, scard_geom)
CardID station_name_O station_name_D geometry
5690 星中路 宜山路 LINESTRING (121.3643 31.160...
5690 宜山路 星中路 LINESTRING (121.4226 31.188...
5951 新村路 南京东路 LINESTRING (121.418 31.2657...
5951 南京东路 新村路 LINESTRING (121.4801 31.240...
6032 广兰路 锦绣路 LINESTRING (121.6169 31.213...
6032 锦绣路 广兰路 LINESTRING (121.5357 31.189...

Let’s break it down (1)

sm_tbl %>%
    select(lon_O, lat_O, lon_D, lat_D) %>% 
    split(seq(nrow(.)))
$`1`
# A tibble: 1 × 4
  lon_O lat_O lon_D lat_D
  <dbl> <dbl> <dbl> <dbl>
1  121.  31.2  121.  31.2

$`2`
# A tibble: 1 × 4
  lon_O lat_O lon_D lat_D
  <dbl> <dbl> <dbl> <dbl>
1  121.  31.2  121.  31.2

$`3`
# A tibble: 1 × 4
  lon_O lat_O lon_D lat_D
  <dbl> <dbl> <dbl> <dbl>
1  121.  31.3  121.  31.2

$`4`
# A tibble: 1 × 4
  lon_O lat_O lon_D lat_D
  <dbl> <dbl> <dbl> <dbl>
1  121.  31.2  121.  31.3

$`5`
# A tibble: 1 × 4
  lon_O lat_O lon_D lat_D
  <dbl> <dbl> <dbl> <dbl>
1  122.  31.2  122.  31.2

$`6`
# A tibble: 1 × 4
  lon_O lat_O lon_D lat_D
  <dbl> <dbl> <dbl> <dbl>
1  122.  31.2  122.  31.2

Let’s break it down (2)

sm_tbl %>%
    select(lon_O, lat_O, lon_D, lat_D) %>% 
    split(seq(nrow(.))) %>%
    map(function(row){unlist(row) %>% matrix(ncol = 2, byrow=T) %>% st_linestring()}) 
$`1`
LINESTRING (121.3643 31.16001, 121.4226 31.18858)

$`2`
LINESTRING (121.4226 31.18858, 121.3643 31.16001)

$`3`
LINESTRING (121.418 31.26578, 121.4801 31.24006)

$`4`
LINESTRING (121.4801 31.24006, 121.418 31.26578)

$`5`
LINESTRING (121.6169 31.21325, 121.5357 31.18973)

$`6`
LINESTRING (121.5357 31.18973, 121.6169 31.21325)

Let’s break it down (2.1)

row <- tibble (lon_O = 121.364284201405, 
             lat_O = 31.1600086708404, 
             lon_D = 121.422563109174, 
             lat_D = 31.1885822807991)

unlist(row) %>% 
    matrix(ncol = 2, byrow=T)  
         [,1]     [,2]
[1,] 121.3643 31.16001
[2,] 121.4226 31.18858

Let’s break it down (2.2)

unlist(row) %>% 
    matrix(ncol = 2, byrow=T) %>%
    st_linestring()
LINESTRING (121.3643 31.16001, 121.4226 31.18858)

Let’s break it down (3)

sm_tbl %>%
    select(lon_O, lat_O, lon_D, lat_D) %>% 
    split(seq(nrow(.))) %>%
    map(function(row){unlist(row) %>% matrix(ncol = 2, byrow=T) %>% st_linestring()})  %>%
    st_as_sfc(crs = 4326)
Geometry set for 6 features 
Geometry type: LINESTRING
Dimension:     XY
Bounding box:  xmin: 121.3643 ymin: 31.16001 xmax: 121.6169 ymax: 31.26578
Geodetic CRS:  WGS 84
First 5 geometries:
LINESTRING (121.3643 31.16001, 121.4226 31.18858)
LINESTRING (121.4226 31.18858, 121.3643 31.16001)
LINESTRING (121.418 31.26578, 121.4801 31.24006)
LINESTRING (121.4801 31.24006, 121.418 31.26578)
LINESTRING (121.6169 31.21325, 121.5357 31.18973)

Let’s break it down (4)

scard_geom <- sm_tbl %>%
    select(lon_O, lat_O, lon_D, lat_D) %>% 
    split(seq(nrow(.))) %>%
    map(function(row){unlist(row) %>% matrix(ncol = 2, byrow=T) %>% st_linestring()})  %>%
    st_as_sfc(crs = 4326)


scard_sf_line <- st_sf(sm_tbl, scard_geom)
CardID station_name_O station_name_D geometry
5690 星中路 宜山路 LINESTRING (121.3643 31.160...
5690 宜山路 星中路 LINESTRING (121.4226 31.188...
5951 新村路 南京东路 LINESTRING (121.418 31.2657...
5951 南京东路 新村路 LINESTRING (121.4801 31.240...
6032 广兰路 锦绣路 LINESTRING (121.6169 31.213...
6032 锦绣路 广兰路 LINESTRING (121.5357 31.189...

Visualise it

scard_sf_line %>%
   tm_shape(name = "OD pair") +
    tm_lines(alpha = .1) 

Just because you can doesn’t mean you should

  • Is that the right visualisation?
  • Was it informative? In what context?
  • What kind of information does it show and what does it obscure?
  • How else to convey information?

Reading in external spatial objects

dist <- 
    here("data", "raw", "subway_smartcard", "districts.shp") %>%
    st_read()
Reading layer `districts' from data source 
  `/Users/kaza/Library/CloudStorage/Dropbox/shortcourses/data/raw/subway_smartcard/districts.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 16 features and 7 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 120.8527 ymin: 30.67792 xmax: 122.2425 ymax: 31.87469
Geodetic CRS:  WGS 84
Id city district citycode adcode Shape_Leng Shape_Area geometry
0 上海城区 青浦区 021 310118 1.9667873 0.0630654 MULTIPOLYGON (((121.0643 31...
0 上海城区 杨浦区 021 310110 0.3234397 0.0057430 MULTIPOLYGON (((121.5121 31...
0 上海城区 闵行区 021 310112 1.5004972 0.0351979 MULTIPOLYGON (((121.2635 31...
0 上海城区 宝山区 021 310113 0.9750052 0.0299090 MULTIPOLYGON (((121.5023 31...

Reading in external spatial objects

….. and many more

can be used to write to a disk

st_write(dist, dsn = here("data", "processed", "spatial", "districts.geojson"))

st_write(dist, dsn = here("data", "processed", "spatial", "districts.shp"))

tmap again

tmap_mode("view")

dist %>%
    tm_shape(name = "Districts")+
    tm_polygons(alpha = .3)

Further Work

Tell a story about

  • Individuals who took a large number of trips and their pattern
  • What can you say about diurnal variations relative popularity of certain stations
  • Is popularity spatially autocorrelated?
  • Is popularity conditional on diurnal variation in headways and service quality?
  • Does ignoring the route information affect the story?

Advanced Work

In the raw dataset things are a little/lot more messy

  • Each row corresponds to a swipe of a card and some swipes are not properly registered. Clean the data to eliminate the trips that do not have a ‘Origin’ swipe.

  • Instead of trips, we ought to think of tours; collection of trips that are chained together. What percentage of the trips are taken up by tours in this day?

  • Is there a temporal pattern to these tours?

  • Is there a temporal pattern to breaks in how the chains are constructed?

  • Tell a story about these tours based on the origins (in space and in time)

Thank you