Regions of Deprivation

Introduction

In addition to obvious networks, we can also use graph theory to think through some non-obvious applications. One place this often crops up in planning is treating adjacency/proximity as a graph. See my post on employment centers and paper on delineating areas.

In this post, I will demonstrate how to use networks for other kinds of spatial analysis. We can treat polygons as nodes, with two nodes connected if their polygons share some part of a boundary.

Acquire data

Much of the data acquisition follows the post on tidycensus, so I won’t repeat it here beyond the source code below. In this case, I am focusing on both Carolinas.

library(tidycensus)
library(sf)
library(tidyverse)
library(tmap)
library(RColorBrewer)
library(igraph)
library(tigris)
library(rgeos)

census_api_key(census_apikey)
options(tigris_use_cache = TRUE)

NC_SC <- c('37', '45') # FIPS code for NC and SC.

pov <- reduce(
  map(NC_SC, function(x) {
    get_acs(geography = "tract", variables = "B17001_002",  summary_var = 'B17001_001',
            state = x, geometry = TRUE, year = 2016)
  }), 
  rbind
) ## Read up on the map and reduce functions in purrr; they are key to the functional programming paradigm.
cbsa <- core_based_statistical_areas() %>% st_as_sf()  # Download Core Based Statistical Areas file
states_shps <- states() %>% 
               st_as_sf() %>%
               filter(GEOID %in% NC_SC) # Limit to the Carolinas
 
MSA <- cbsa[states_shps,] %>%  # Filter CBSA that is intersected/touched by the state boundary.
            filter(LSAD=="M1")  # Limit to Metropolitan areas


pov_rate <- pov %>%
  rename(population = summary_est) %>%
  filter(population > 0) %>%               # Drop unpopulated tracts to avoid dividing by zero
  mutate(pov_rate = estimate / population) %>%
  select(GEOID, NAME, population, pov_rate)

## Unemployment rate
## Build the lists of ACS variable codes for the labour-force counts in table B23001.
## The labour-force count for each sex/age group recurs every 7th variable,
## starting at _004 for men and _090 for women.
lf_m <- paste("B23001_", formatC(seq(4, 67, 7), width = 3, flag = "0"), "E", sep = "")
lf_f <- paste("B23001_", formatC(seq(90, 153, 7), width = 3, flag = "0"), "E", sep = "")

lf <- reduce(
  map(NC_SC, function(x){
    get_acs(geography='tract', variables = c(lf_m, lf_f), state=x, year=2016)
  }),
  rbind
)

# The unemployed counts follow the same pattern, starting at _008 for men and _094 for women.
unemp_m <- paste("B23001_", formatC(seq(8, 71, 7), width = 3, flag = "0"), "E", sep = "")
unemp_f <- paste("B23001_", formatC(seq(94, 157, 7), width = 3, flag = "0"), "E", sep = "")

unemp <- reduce(
  map(NC_SC, function(x){
    get_acs(geography='tract', variables = c(unemp_m, unemp_f), state=x, year=2016)
  }),
  rbind
)

lf_t <- lf %>% 
  group_by(GEOID) %>%
  summarize(lf_est = sum(estimate, na.rm=T)) # Total labour force per tract

unemp_t <- unemp %>% 
  group_by(GEOID) %>%
  summarize(unemp_est = sum(estimate, na.rm=T)) # Total unemployed per tract

unemp_rate <- merge(lf_t, unemp_t, by='GEOID') %>% 
  filter(lf_est > 0) %>% # Avoid dividing by zero
  mutate(unemp_rate = unemp_est/lf_est)

I arbitrarily define distress as a combination of poverty and unemployment. In this instance, I am picking 20% (poverty rate) and 10% (unemployment rate) as the thresholds.

tract_stats <- merge(pov_rate, unemp_rate, by='GEOID')
distressed_tracts <- tract_stats %>%
                     filter(unemp_rate > .10 & pov_rate > .20)

tmap_mode("view")

m1 <- tm_basemap("Stamen.TonerBackground")+
      tm_shape(distressed_tracts) +
      tm_fill(col='red', alpha=.5)+
      tm_tiles("Stamen.TonerLabels")

# The following is only necessary to deal with HTML Widgets in Blogdown. In a normal Rmd document just calling m1 should display the map.     

library(widgetframe)

frameWidget(tmap_leaflet(m1))

Space as a topological network

There are reasons to presume that clusters of distress are more problematic than isolated distress. We can use the contiguity of distressed tracts to our advantage to figure out where these clusters are and how they are shaped.

It turns out that the contiguity calculations in sf are still a little slower than those in older packages such as spdep. So I will convert the simple features (sf classes) into a Spatial object (sp classes) and use the older packages.

library(rgeos)
library(spdep)


dt <- as(distressed_tracts, "Spatial") # Convert to spatial
temp1 <- gUnarySTRtreeQuery(dt) # Construct list of polygons that are likely to intersect/touch by first looking at bounding boxes.
dt_nb <- poly2nb(dt, queen=FALSE, foundInBox=temp1, row.names = dt$GEOID) # Construct a neighborhood object

plot(dt)
plot(dt_nb, coordinates(dt), col='red', points=F, add=T) # Quickly visualise the graph

A Digression on Calculating (Graph) Neighbourhood Attributes

Occasionally, it becomes useful not only to look at attributes of the ‘focal’ geography, but also to calculate some notion of aggregate neighbourhood attributes. From this, one can tell whether the focal geography is an anomaly or is consistent with the neighbourhood trend. This becomes important when identifying hotspots or clusters in spatial statistics. For the moment, let me restrict attention to calculating neighbourhood-level attributes. Recall that a 1 in a binary adjacency matrix represents a neighbour in the graph, so if \(A\) is the adjacency matrix and \(x\) is a vector of an attribute, row \(i\) of \(Ax\) is \(\sum_j A_{ij}x_j\), the attribute total over the neighbours of \(i\). Say, for example, we want to calculate, for each tract, how many people live in the adjacent census tracts. We can do the following.
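
A minimal toy sketch of this idea (the graph and the population values are made up):

A <- matrix(c(0, 1, 0,
              1, 0, 1,
              0, 1, 0), nrow = 3, byrow = TRUE) # Polygons 1-2 and 2-3 share a boundary
toy_pop <- c(100, 250, 400)
A %*% toy_pop # Neighbourhood totals: 250, 500, 250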

temp1 <- gUnarySTRtreeQuery(tract_stats %>% as("Spatial")) # Construct list of polygons that are likely to intersect/touch by first looking at bounding boxes.
tract_nb <- poly2nb(tract_stats, queen=TRUE, foundInBox=temp1, row.names = tract_stats$GEOID) %>% # Construct a neighborhood object. In this case, I am including neighbors that touch only at a corner, using queen contiguity instead of rook contiguity.
             nb2mat(zero.policy=TRUE, style="B") # Create a binary adjacency matrix.
              
tract_stats <- tract_stats %>%
                   mutate(N_pop = tract_nb %*% population)

library(skimr)

tract_stats %>% 
  st_set_geometry(NULL) %>%
  select("population", "N_pop") %>%
  skim # see how this shakes out.
# Table 1: Data summary
#
# Name                     Piped data
# Number of rows           3251
# Number of columns        2
# Column type frequency:
#   numeric                2
# Group variables          None
#
# Variable type: numeric
#
#   skim_variable n_missing complete_rate     mean       sd p0     p25   p50     p75  p100 hist
#   population            0             1  4422.88  2054.36  7  3003.5  4172  5512.5 23972 ▇▅▁▁▁
#   N_pop                 0             1 27758.46 10955.29  0 20044.5 26763 34139.0 91779 ▂▇▂▁▁

Notice something interesting here. The neighbourhood population N_pop has a minimum value of 0, even though the minimum value of population is non-zero. This means that there are some census tracts that do not have any neighbours. We explicitly allowed that kind of tract using zero.policy = TRUE in the nb2mat function. Just for kicks, let’s visualise which tracts they are.

poly2nb(tract_stats, queen=TRUE, foundInBox=temp1, row.names = tract_stats$GEOID)
# Neighbour list object:
# Number of regions: 3251 
# Number of nonzero links: 19752 
# Percentage nonzero weights: 0.1868862 
# Average number of links: 6.075669 
# 1 region with no links:
# 250

# Looks like the 250th region has no links. This is confirmed by looking at `N_pop`.

tract_stats[250,] 
# Simple feature collection with 1 feature and 8 fields
# Geometry type: MULTIPOLYGON
# Dimension:     XY
# Bounding box:  xmin: -76.58726 ymin: 34.68385 xmax: -76.51952 ymax: 34.71414
# Geodetic CRS:  NAD83
#           GEOID                                               NAME population
# 250 37031970200 Census Tract 9702, Carteret County, North Carolina       1480
#       pov_rate lf_est unemp_est unemp_rate                       geometry N_pop
# 250 0.08108108    534        54  0.1011236 MULTIPOLYGON (((-76.54757 3...     0

m2 <-
  tm_basemap() +
  tm_shape(tract_stats[250,]) +
  tm_fill(col='red')

frameWidget(tmap_leaflet(m2))

We can see why from the map. It turns out that the census tract is an island, and the bridge that connects the island to the mainland is not within the tract’s geography, so the tract is portrayed as disconnected.


Exercise

  • What if you use a different neighbourhood relationship, for example, one based on threshold distances between centroids? Hint: Instead of poly2nb, use dnearneigh; a starter sketch follows.
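
A starter sketch, not a full solution (the 10 km distance band and the UTM 17N projection are arbitrary choices of mine):

cent <- tract_stats %>%
  st_transform(32617) %>% # Project to UTM zone 17N so distances are in metres
  st_centroid() %>%       # Tract centroids
  st_coordinates()
tract_nb_dist <- dnearneigh(cent, d1 = 0, d2 = 10000, row.names = tract_stats$GEOID)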

Occasionally it becomes useful to calculate not just an aggregate, but an average over the neighbourhood of the focal geography. To do that, instead of a binary adjacency matrix, we can construct a row-standardised weight matrix from the adjacency matrix (or directly).

tract_nb <- poly2nb(tract_stats, queen=TRUE, foundInBox=temp1, row.names = tract_stats$GEOID) %>% # Construct a neighborhood object using queen contiguity, as before.
             nb2mat(zero.policy=TRUE, style="W") # Create a row-standardised matrix. Note the style argument.
              
tract_stats <- tract_stats %>%
                   mutate(N_pop_avg = tract_nb %*% population)

library(skimr)

tract_stats %>% 
  st_set_geometry(NULL) %>%
  select("population", "N_pop", "N_pop_avg") %>%
  skim # see how this shakes out.
# Table 2: Data summary
#
# Name                     Piped data
# Number of rows           3251
# Number of columns        3
# Column type frequency:
#   numeric                3
# Group variables          None
#
# Variable type: numeric
#
#   skim_variable n_missing complete_rate     mean       sd p0      p25   p50      p75    p100 hist
#   population            0             1  4422.88  2054.36  7  3003.50  4172  5512.50 23972.0 ▇▅▁▁▁
#   N_pop                 0             1 27758.46 10955.29  0 20044.50 26763 34139.00 91779.0 ▂▇▂▁▁
#   N_pop_avg             0             1  4551.51  1206.86  0  3786.88  4469  5224.26 12012.5 ▁▇▅▁▁

It is then straightforward to visualise which areas have a higher population than their neighbours (on average). I am going to use 50% above the neighbourhood average as an arbitrary threshold, and quickly visualise it as follows.

m2 <- tract_stats %>% 
  filter (population >= N_pop_avg * 1.5) %>%
  tm_shape() +
  tm_fill(col="red")+
  tm_basemap("Stamen.TonerBackground")+
  tm_tiles("Stamen.TonerLabels")

frameWidget(tmap_leaflet(m2))

The row-standardised weight matrix gives all the neighbours of a focal geography equal weight (1 divided by the number of neighbours, i.e. the row sum of ones). But there are occasions when we want to weight the attribute (say, unemployment rate) based on a different attribute (say, labour force). We take advantage of element-by-element multiplication and matrix multiplication to get this.

First we create the denominator, which is the sum of the labour force over all the neighbours, just as before, using matrix multiplication.

tract_nb <- poly2nb(tract_stats, queen=TRUE, foundInBox=temp1, row.names = tract_stats$GEOID) %>% 
             nb2mat(zero.policy=TRUE, style="B") 

denom_temp <- (tract_nb %*% tract_stats$lf_est)

Here is the tricky bit. We need a matrix in the same format as tract_nb in which lf_est appears only where tract_nb == 1. Fortunately, multiplication by 0 helps us here. * is element-by-element multiplication, whereas %*% is matrix multiplication. In R, element-by-element multiplication requires the two matrices to have the same dimensions; however, tract_stats$lf_est is a vector, not a matrix, whereas tract_nb is a matrix.

One of the great, and at the same time frustrating, things about R is that it recycles the vector to make it conformable. So we take advantage of this, but BE VERY CAREFUL.
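
A toy sketch of the recycling behaviour (values made up):

M <- matrix(1, nrow = 3, ncol = 3)
w <- c(1, 10, 100)
M * w    # Recycling runs down the columns, so row i of M is scaled by w[i]
t(M * w) # Transposing then scales column j by w[j] -- the form we need below

With that in mind, back to the actual matrices: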
num_temp <-  t(tract_nb * as.vector(tract_stats$lf_est))
weight_matrix <- as.matrix(num_temp) / as.vector(denom_temp)

weight_matrix[1:5, 1:5] # look at the subset of the matrix
#              1         2         3         4         5
# [1,] 0.0000000 0.1334121 0.0000000 0.2214671 0.0000000
# [2,] 0.1531089 0.0000000 0.2770765 0.2335983 0.0000000
# [3,] 0.0000000 0.1603053 0.0000000 0.2661104 0.0000000
# [4,] 0.1362880 0.1252601 0.2466361 0.0000000 0.1154113
# [5,] 0.0000000 0.0000000 0.0000000 0.1957047 0.0000000

rowSums(weight_matrix) %>% summary()
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
#       1       1       1       1       1       1       1

From the rowSums we know that all the rows sum to 1, except for one row, which is NA. That NA comes from division by 0. Check denom_temp and see where the zero occurs; it should be the tract that has no neighbours.
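
For instance:

which(denom_temp == 0) # Should flag the island tract (region 250) identified earlier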

Now it is straightforward to get the labour-force-weighted unemployment rate of the neighbours.

tract_stats <- tract_stats %>%
                   mutate(N_unemp_avg = drop(weight_matrix %*% unemp_rate)) %>%
                   select("unemp_rate", "N_unemp_avg") # sf keeps the geometry column automatically

library(tmap)

tmap_mode("plot")


tmap_arrange(
  tm_shape(tract_stats) +
    tm_polygons("unemp_rate", legend.format = list(digits=3), n=5, style="quantile", border.alpha=.2),
  
  tm_shape(tract_stats)+
    tm_polygons(col="N_unemp_avg", legend.format = list(digits=3), n=5, style="quantile", border.alpha=.2)
)

Exercise

  • There is a much simpler way to get this neighbourhood unemployment rate. Figure it out and cross-check.

All of the above calculates the neighbourhood values without including the value of the focal geography, so it is a ‘doughnut with a hole’ type of smoothing operator. If you want to remove the hole, simply set the diagonal elements of the neighbourhood matrix to 1.
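
A minimal sketch with the binary matrix from above (for a row-standardised version, re-standardise the rows afterwards):

tract_nb_self <- tract_nb # Copy the binary adjacency matrix
diag(tract_nb_self) <- 1  # Include the focal geography itself
# drop(tract_nb_self %*% x) now sums an attribute x over the neighbours AND the focal tract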

Distressed Tracts (Contd.)

distressed_tracts_graph <- dt_nb %>%  
  nb2mat(zero.policy=TRUE, style="B") %>% # Create a Binary Adjacency Matrix.
  graph.adjacency(mode='undirected', add.rownames=NULL) # construct an undirected graph from the adjacency matrix.

cl <- clusters(distressed_tracts_graph) # This decomposes the graph into connected subgraphs.

sum(dt$GEOID != names(cl$membership)) # This is 0, so we can safely assign the cluster membership variable by row order rather than doing a merge.
# [1] 0
dt$clustermembership <- cl$membership


### Visualisation

# Adding the MSA boundaries for reference, to see whether the deprivation is primarily a metropolitan phenomenon or occurs outside the metropolitan areas

pal <- colorRampPalette(brewer.pal(8, "Dark2"))
numColors <- cl$membership %>% unique() %>% length()

m3<-
 dt %>% st_as_sf() %>%
   tm_shape()+
   tm_fill("clustermembership", legend.show = FALSE, palette = pal(numColors))+
   tm_shape(MSA)+
   tm_fill(col='gray85', alpha=.5) +
   tm_basemap("Stamen.TonerBackground")+
   tm_tiles("Stamen.TonerLabels")


frameWidget(tmap_leaflet(m3))

The clusters function decomposes the graph into connected subgraphs (components): nodes belong to the same subgraph if there is a path between them; if there is not, they belong to different subgraphs.

ggplot() + 
  geom_histogram(aes(x = cl$csize))

From the above histogram it is relatively obvious that many (64.67%) of the distressed census tracts are relatively isolated (cluster size is \(\le\) 2). Many of them are even in clusters of size 1. We can check whether there is a geographic pattern to these isolated clusters.

smallclusters <- (1:cl$no)[cl$csize <=2]

m4 <-
  dt %>% st_as_sf()%>%
  filter(clustermembership %in% smallclusters) %>%
  tm_shape()+
  tm_fill(col='red')+
   tm_shape(MSA)+
   tm_fill(col='gray85', alpha=.5) +
   tm_basemap("Stamen.TonerBackground")+
   tm_tiles("Stamen.TonerLabels")



frameWidget(tmap_leaflet(m4))

Now that different tracts are assigned to different clusters, we can summarise the poverty rate within each cluster. Within each component of the graph, we can also identify a community structure, if you wish, using techniques from an earlier post. I leave these as exercises.

In some instances, it may be useful to see if the clusters are ‘stringy’ or if they are ‘blobby,’ i.e., whether the clusters follow linear features (such as roads, valleys, rivers, etc.) or whether adjacency is shared among many polygons of the same component. To do this, we can rely on the diameter of the graph, i.e. the longest of the shortest paths between any two nodes in the graph. If the graph is stringy, it will have a large diameter.

In this instance, we want to compute the diameter of each component separately. (The diameter of the whole graph is \(\infty\), as it is not fully connected.)

components_list <- decompose(distressed_tracts_graph) # One subgraph per connected component

diameters_of_graphs <- tibble(
  dia = sapply(components_list, diameter),
  cl_membership = seq_along(components_list) # Assumes components come back in the same order as cl$membership
)


dt <- merge(dt, diameters_of_graphs, by.x='clustermembership', by.y='cl_membership')

# Quickly showing the stringy and blobby (real words) distressed regions
maxdia <- max(dt$dia,na.rm=T)

m5 <- dt %>%
      st_as_sf() %>%
      tm_shape()+
      tm_fill(col='dia',
                  style="fixed",
                  breaks = c(0,4,10,maxdia),
                  palette = c('blue', NA, 'red'),
                  labels = c("Blobby", "Normal", "Stringy"))+
      tm_basemap("Stamen.TonerBackground")+
     tm_tiles("Stamen.TonerLabels")

frameWidget(tmap_leaflet(m5))

Most of the large spatial concentrations of distress are in rural areas (the Midlands of South Carolina and the Eastern Carolina Plains). However, this may be misleading; there are also concentrations of distress within urban regions.

Since census tracts are very small in urban centres within metropolitan areas, they are more likely to be clustered yet appear small and separated on the map. So the unequal sizes of the geographies affect the conclusions you can draw from the map. Perhaps use a different visualisation and further exploratory analysis to identify the truly disadvantaged.

Exercises

  • Limiting your analysis to the Metropolitan Areas in the Carolinas, what conclusions can you draw about the shape of the clusters?
  • Does limiting your analysis to a single state affect your results?
  • Does changing the unit of analysis matter? Say, for example, we use counties instead of tracts for the poverty and unemployment characteristics; how are the conclusions different? (A starter sketch follows.)
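
For the last exercise, a starter sketch of the county-level download (only the geography argument changes; the rest of the pipeline carries over):

pov_county <- reduce(
  map(NC_SC, function(x) {
    get_acs(geography = "county", variables = "B17001_002", summary_var = 'B17001_001',
            state = x, geometry = TRUE, year = 2016)
  }),
  rbind
)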

Further analysis

Instead of categorising the areas of distress by dichotomising, one could use local Moran’s I to identify locations where high values of a continuous variable (say, the poverty rate) are surrounded by other high-value areas (or low-value areas) and test their statistical significance. localmoran is a function in spdep and relies on the neighbourhood graph. I leave this as an exercise; a starter sketch is below.
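
A starter sketch, not a full solution. Note that, as written, tract_stats retains only the unemployment-rate columns by this point in the script, so I use unemp_rate here; re-merge pov_rate for the poverty version:

tract_lw <- poly2nb(tract_stats, queen = TRUE) %>%
  nb2listw(zero.policy = TRUE, style = "W") # Row-standardised spatial weights list
lisa <- localmoran(tract_stats$unemp_rate, tract_lw, zero.policy = TRUE)
head(lisa) # Local I, its expectation, variance, z-score and p-value for each tract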

Conclusions

Borrowing some concepts from network analysis helps us in many analytical tasks with regard to space. Todd BenDor and I argued that in some instances it may be beneficial to think about space as networks for modelling purposes too. Tools from graph theory have been incredibly useful to me in a number of different projects.
