Regions of Deprivation
Introduction
In addition to obvious networks, we can also use graph theory to think though some non-obvious applications. One place this often crops up in planning is to think about adjacency/proximity as a graph. See my post on employment centers and paper on delineating areas.
In this post, I will demonstrate how to use networks for other spatial analysis. We can think of polygons as nodes and two nodes are connected if they share some part of the boundary.
Acquire data
Much of data acquisition follows the post on tidycensus. So I won’t repeat much of it, except to provide an uncommented source code below. In this case, I am focusing on both the Carolinas.
library(tidycensus)
library(sf)
library(tidyverse)
library(tmap)
library(RColorBrewer)
library(igraph)
library(tigris)
library(rgeos)
census_api_key(census_apikey)
options(tigris_use_cache = TRUE)
NC_SC <- c('37', '45') # FIPS code for NC and SC.
pov <- reduce(
map(NC_SC, function(x) {
get_acs(geography = "tract", variables = "B17001_002", summary_var = 'B17001_001',
state = x, geometry = TRUE, year = 2016)
}),
rbind
) ## read up on these map and reduce functions in purrr. They are key to functional programming paradigm
cbsa <- core_based_statistical_areas() %>% st_as_sf() # Download Core Based Statistical Areas file
states_shps <- states() %>%
st_as_sf() %>%
filter(GEOID %in% NC_SC) # Limit to the Carolinas
MSA <- cbsa[states_shps,] %>% # Filter CBSA that is intersected/touched by the state boundary.
filter(LSAD=="M1") # Limit to Metropolitan areas
# Download CBSAs and MSAs for reference.
cbsa <- core_based_statistical_areas() %>% st_as_sf() # Download Core Based Statistical Areas file
states_shps <- states() %>%
st_as_sf() %>%
filter(GEOID %in% NC_SC) # Limit to the Carolinas
MSA <- cbsa[states_shps,] %>% # Filter CBSA that is intersected/touched by the state boundary.
filter(LSAD=="M1") # Limit to Metropolitan areas
pov <- reduce(
map(NC_SC, function(x) {
get_acs(geography = "tract", variables = "B17001_002", summary_var = 'B17001_001',
state = x, geometry = TRUE, year = 2016)
}),
rbind
) ## read up on these map and reduce functions in purrr. They are key to functional programming paradigm
pov_rate <- pov %>%
rename(population = summary_est) %>%
filter(population>0)%>%
mutate(pov_rate = estimate/population) %>%
select(GEOID, NAME, population, pov_rate)
## Unemployment rate
lf_m <- paste("B23001_", formatC(seq(4,67,7), width=3, flag="0"), "E", sep="")
lf_f <- paste("B23001_", formatC(seq(90,153,7), width=3, flag="0"), "E", sep="")
lf <- reduce(
map(NC_SC, function(x){
get_acs(geography='tract', variables = c(lf_m, lf_f), state=x, year=2016)
}),
rbind
)
unemp_m <- paste("B23001_", formatC(seq(8,71,7), width=3, flag="0"), "E", sep="")
unemp_f <- paste("B23001_", formatC(seq(94,157,7), width=3, flag="0"), "E", sep="")
unemp <- reduce(
map(NC_SC, function(x){
get_acs(geography='tract', variables = c(unemp_m, unemp_f), state=x, year=2016)
}),
rbind
)
lf_t <- lf %>%
group_by(GEOID) %>%
summarize(lf_est = sum(estimate, na.rm=T))
unemp_t <- unemp %>%
group_by(GEOID) %>%
summarize(unemp_est = sum(estimate, na.rm=T))
unemp_rate <- merge(lf_t, unemp_t, by='GEOID') %>%
filter(lf_est >0) %>%
mutate(unemp_rate = unemp_est/lf_est)
I arbitrarily define distress to be a combination of poverty and unemployment. In this instance, I am picking 20% and 10% as thresholds respectively.
tract_stats <- merge(pov_rate, unemp_rate, by='GEOID')
distressed_tracts <- tract_stats %>%
filter(unemp_rate > .10 & pov_rate > .20)
tmap_mode("view")
m1 <- tm_basemap("Stamen.TonerBackground")+
tm_shape(distressed_tracts) +
tm_fill(col='red', alpha=.5)+
tm_tiles("Stamen.TonerLabels")
# The following is only necessary to deal with HTML Widgets in Blogdown. In a normal Rmd document just calling m1 should display the map.
library(widgetframe)
frameWidget(tmap_leaflet(m1))
Space as a topological network
There are reasons to presume that clusters of distress is more problematic than isolated distress. We could use the contiguity of distressed tracts to our advantage to figure where these clusters are and how they are shaped.
It turns out that the contiguity calculations in sf
are still a little slower than older packages such as spdep
. So I will convert the simple features (sf classes) into a Spatial object (sp classes) and use older packages.
library(rgeos)
library(spdep)
dt <- as(distressed_tracts, "Spatial") # Convert to spatial
temp1 <- gUnarySTRtreeQuery(dt) # Construct list of polygons that are likely to intersect/touch by first looking at bounding boxes.
dt_nb <- poly2nb(dt, queen=FALSE, foundInBox=temp1, row.names = dt$GEOID) # Construct a neighborhood object
plot(dt)
plot(dt_nb, coordinates(dt), col='red', points=F, add=T) # Quickly visualise the graph
A Digression to Calculating (Graph) Neighbourhood Attributes
Occasionally, it becomes useful not only to look at attributes of the ‘focal’ geography, but also calculate some notion of aggregate neighbourhood attributes. From this, one could derive if the focal geography is an anomaly or if it is consistent with the neighbourhood trend. This becomes important when figuring out hotspots or clusters in Spatial Statistics. For the moment, let me restrict attention to calculating neighbourhood level attributes. Recall that 1 in a binary adjacency matrix represent the neighbours in a graph. So matrix multiplication with the relevant attribute will automatically yield aggregate neighbourhood attribute. Say for example, we want to calculate how many people live in the adjacent census tracts for each tract, we can do the following.
temp1 <- gUnarySTRtreeQuery(tract_stats %>% as("Spatial")) # Construct list of polygons that are likely to intersect/touch by first looking at bounding boxes.
tract_nb <- poly2nb(tract_stats, queen=TRUE, foundInBox=temp1, row.names = tract_stats$GEOID) %>% # Construct a neighborhood object. In this case, I am including neighbors that only touch on a corner using queen contiguity instead of rook continguity
nb2mat(zero.policy=TRUE, style="B") # Create a Binary Adjacency Matrix.
tract_stats <- tract_stats %>%
mutate(N_pop = tract_nb %*% population)
library(skimr)
tract_stats %>%
st_set_geometry(NULL) %>%
select("population", "N_pop") %>%
skim # see how this shakes out.
Name | Piped data |
Number of rows | 3251 |
Number of columns | 2 |
_______________________ | |
Column type frequency: | |
numeric | 2 |
________________________ | |
Group variables | None |
Table 1: Data summary
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
population | 0 | 1 | 4422.88 | 2054.36 | 7 | 3003.5 | 4172 | 5512.5 | 23972 | ▇▅▁▁▁ |
N_pop | 0 | 1 | 27758.46 | 10955.29 | 0 | 20044.5 | 26763 | 34139.0 | 91779 | ▂▇▂▁▁ |
Notice something interesting here. The neighbourhood population N_pop
has a minimum value 0, even when the minimum value of population is non-zero. This means that there are some census tracts, that do not have any neighbours. We explicitly allowed those kind using the zero.policy = TRUE
in the nb2mat
function. Just for kicks, lets visualise what tracts they are.
poly2nb(tract_stats, queen=TRUE, foundInBox=temp1, row.names = tract_stats$GEOID)
# Neighbour list object:
# Number of regions: 3251
# Number of nonzero links: 19752
# Percentage nonzero weights: 0.1868862
# Average number of links: 6.075669
# 1 region with no links:
# 250
# Looks like 250 th region has no links. This is confirmed by looking at the `N_pop`.
tract_stats[250,]
# Simple feature collection with 1 feature and 8 fields
# Geometry type: MULTIPOLYGON
# Dimension: XY
# Bounding box: xmin: -76.58726 ymin: 34.68385 xmax: -76.51952 ymax: 34.71414
# Geodetic CRS: NAD83
# GEOID NAME population
# 250 37031970200 Census Tract 9702, Carteret County, North Carolina 1480
# pov_rate lf_est unemp_est unemp_rate geometry N_pop
# 250 0.08108108 534 54 0.1011236 MULTIPOLYGON (((-76.54757 3... 0
m2 <-
tm_basemap() +
tm_shape(tract_stats[250,]) +
tm_fill(col='red')
frameWidget(tmap_leaflet(m2))
We can see why from the map. It turns out that the census tract is an island and the bridge that connects the island to the mainland is not in the geography of the census tract and so it is portrayed as disconnected.
Exercise
- What if you use a different neighbourhood relationship, for example, based on the thresholds distances of centroids? Hint: Instead of
poly2nb
usednearneigh
.
Occasionally it becomes useful to calculate not just an aggregate, but an average of the neighbourhood of the focal geography. To that instead of a binary adjacency matrix, we can construct a row-standardised weight matrix from the adjacency matrix (or directly)
tract_nb <- poly2nb(tract_stats, queen=TRUE, foundInBox=temp1, row.names = tract_stats$GEOID) %>% # Construct a neighborhood object. In this case, I am including neighbors that only touch on a corner using queen contiguity instead of rook continguity
nb2mat(zero.policy=TRUE, style="W") # Create a row standardised matrix. note the style argument.
tract_stats <- tract_stats %>%
mutate(N_pop_avg = tract_nb %*% population)
library(skimr)
tract_stats %>%
st_set_geometry(NULL) %>%
select("population", "N_pop", "N_pop_avg") %>%
skim # see how this shakes out.
Name | Piped data |
Number of rows | 3251 |
Number of columns | 3 |
_______________________ | |
Column type frequency: | |
numeric | 3 |
________________________ | |
Group variables | None |
Table 2: Data summary
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
population | 0 | 1 | 4422.88 | 2054.36 | 7 | 3003.50 | 4172 | 5512.50 | 23972.0 | ▇▅▁▁▁ |
N_pop | 0 | 1 | 27758.46 | 10955.29 | 0 | 20044.50 | 26763 | 34139.00 | 91779.0 | ▂▇▂▁▁ |
N_pop_avg | 0 | 1 | 4551.51 | 1206.86 | 0 | 3786.88 | 4469 | 5224.26 | 12012.5 | ▁▇▅▁▁ |
It is then straight forward to visualise which areas have higher population than their neighbours (on average). I am going to use 50% as an arbitrary threshold, and quickly visualise as following.
m2 <- tract_stats %>%
filter (population >= N_pop_avg * 1.5) %>%
tm_shape() +
tm_fill(col="red")+
tm_basemap("Stamen.TonerBackground")+
tm_tiles("Stamen.TonerLabels")
frameWidget(tmap_leaflet(m2))
The row standardised weight matrix gives all the neighbours of a focal geography equal weight (1/rowsum of ones). But there are occasions, when we want to weight the attribute (say unemployment rate) based on a different attribute (say labour force). We take advantage of the element by element multiplication and matrix multiplication to get this.
First we create the denominator, which is the sum of all labour force of the neighbours, just as before using matrix multiplication.
tract_nb <- poly2nb(tract_stats, queen=TRUE, foundInBox=temp1, row.names = tract_stats$GEOID) %>%
nb2mat(zero.policy=TRUE, style="B")
denom_temp <- (tract_nb %*% tract_stats$lf_est)
Here is the tricky bit. We need to get lf_est as non-zero only where tract_nb==1 in the same format as tract_nb. Fortunately multiplication by 0 helps us. *
is element by element multiplication, where as %*%
is matrix multiplication. In R, when you are doing element-by-element multiplication by need to have the two matrices to be the same dimension. However, tract_stats$lf_est
is a vector not a matrix, whereas tract_nb
is a matrix.
num_temp <- t(tract_nb * as.vector(tract_stats$lf_est))
weight_matrix <- as.matrix(num_temp) / as.vector(denom_temp)
weight_matrix[1:5, 1:5] # look at the subset of the matrix
# 1 2 3 4 5
# [1,] 0.0000000 0.1334121 0.0000000 0.2214671 0.0000000
# [2,] 0.1531089 0.0000000 0.2770765 0.2335983 0.0000000
# [3,] 0.0000000 0.1603053 0.0000000 0.2661104 0.0000000
# [4,] 0.1362880 0.1252601 0.2466361 0.0000000 0.1154113
# [5,] 0.0000000 0.0000000 0.0000000 0.1957047 0.0000000
rowSums(weight_matrix) %>% summary()
# Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
# 1 1 1 1 1 1 1
From the rowSums we know that most of the rows sum to 1, except for 1 row. That has a NA and that has to do with vision by 0. Check the denom_temp and see where the zero is occurring. It should occur for the tract that has no neighbours.
Now it is straightforward to get the labour force weighted unemployment rate of the neighbours
tract_stats <- tract_stats %>%
mutate(N_unemp_avg = drop(weight_matrix %*% unemp_rate)) %>%
select("unemp_rate", "N_unemp_avg")
library(tmap)
tmap_mode("plot")
tmap_arrange(
tm_shape(tract_stats) +
tm_polygons("unemp_rate", legend.format = list(digits=3), n=5, style="quantile", border.alpha=.2),
tm_shape(tract_stats)+
tm_polygons(col="N_unemp_avg", legend.format = list(digits=3), n=5, style="quantile", border.alpha=.2)
)
Exercise
- There is a much simpler way to get to this neighbourhood unemployment rate. Figure it out and cross check.
Distressed Tracts (Contd.)
distressed_tracts_graph <- dt_nb %>%
nb2mat(zero.policy=TRUE, style="B") %>% # Create a Binary Adjacency Matrix.
graph.adjacency(mode='undirected', add.rownames=NULL) # construct an undirected graph from the adjacency matrix.
cl <- clusters(distressed_tracts_graph) # This decomposes the graph into connected subgraphs.
sum(dt$GEOID!= names(cl$membership)) # this is 0. So we can safely assign the clustermembersip variable by rows rather than doing a merge.
# [1] 0
dt$clustermembership <- cl$membership
### Visualisation
# Adding the MSA boundaries for effect to see if the deprivation is primiarly a metropolitan phenomenon or outside the metropolitan areas
pal <- colorRampPalette(brewer.pal(8, "Dark2"))
numColors <- cl$membership %>% unique() %>% length()
m3<-
dt %>% st_as_sf() %>%
tm_shape()+
tm_fill("clustermembership", legend.show = FALSE, palette = pal(numColors))+
tm_shape(MSA)+
tm_fill(col='gray85', alpha=.5) +
tm_basemap("Stamen.TonerBackground")+
tm_tiles("Stamen.TonerLabels")
frameWidget(tmap_leaflet(m3))
Clusters function decomposes the graph in subgraphs, i.e. nodes belong to a subgraph, if there is a path between them. If there is not, they belong to different subgraphs.
ggplot() +
geom_histogram(aes(cl$csize))
From the above histogram it is relatively obvious that the many (64.67 %) of the distressed census tracts are are relatively isolated (cluster size is \(\le\)
2). Many of them are even in clusters of size 1. We could see if there is a geographic pattern in these isolated clusters.
smallclusters <- (1:cl$no)[cl$csize <=2]
m4 <-
dt %>% st_as_sf()%>%
filter(clustermembership %in% smallclusters) %>%
tm_shape()+
tm_fill(col='red')+
tm_shape(MSA)+
tm_fill(col='gray85', alpha=.5) +
tm_basemap("Stamen.TonerBackground")+
tm_tiles("Stamen.TonerLabels")
frameWidget(tmap_leaflet(m4))
Now that different tracts are assigned to different clusters, we can summarise the poverty rate in different clusters. Within in each component of the graph, we can also identify a community structure if you wish, using techniques from an earlier post. I leave these as exercises.
In some instances, it may be useful to see if the clusters are stringy’ or if they are ‘blobs.’ i.e., if the clusters follow linear features (such as roads, valleys, rivers etc.) or if they adjacency is shared among multiple polygons of the same component. To do this, we could rely on diameter of the graph, which is the longest path between any two nodes in graph. If the graph is stringy, it will have a large diameter.
In this instance, we want to compute the diameters of each component separately. (diameter of the graph is \(\infty\)
, as it is not fully connected. )
diameters_of_graphs <- distressed_tracts_graph %>%
decompose() %>%
sapply(function(x)diameter(x)) %>%
as_tibble()
names(diameters_of_graphs)[1] <- 'dia'
diameters_of_graphs$cl_membership <- 1:(distressed_tracts_graph %>% decompose() %>% length())
dt <- merge(dt, diameters_of_graphs, by.x='clustermembership', by.y='cl_membership')
# Quickly showing the stringy and blobby (real words) distressed regions
maxdia <- max(dt$dia,na.rm=T)
m5 <- dt %>%
st_as_sf() %>%
tm_shape()+
tm_fill(col='dia',
style="fixed",
breaks = c(0,4,10,maxdia),
palette = c('blue', NA, 'red'),
labels = c("Blobby", "Normal", "Stringy"))+
tm_basemap("Stamen.TonerBackground")+
tm_tiles("Stamen.TonerLabels")
frameWidget(tmap_leaflet(m5))
Most of large spatial concentrations of distress is primarily in the rural areas (Midlands of South Carolina and Eastern Carolina Plains). However, this may be misleading. There are concentrations of distress within the urban regions.
Exercise
- Limiting your analysis to Metropolitan Areas in Carolinas, what conclusions can you draw about the shape of the clusters?
- Does limiting your analysis to a single state, affect your results?
- Does changing the unit of analysis matter? Say e.g. instead of the tract, we use counties for the poverty and unemployment characteristics, how are the conclusions different?
Further analysis
Instead of categorising the areas of distress by dichotomising, one could use local Moran's I
to identify locations high values of a continuous variable (say poverty rate) surrounded by other high values areas (or low value areas) and test their statistical significance. local.moran
is a function in spdep and relies on the neighbourhood graph. I leave this as an exercise.
Conclusions
Borrowing some concepts from network analysis, helps us in many analytical tasks with regards to space. Todd BenDor and I argued that in some instances it may be beneficial to think about space as networks for modelling purposes too. Tools from graph theory have been incredibly useful to me in a number of different projects.