Introduction to Urban Analytics with R

Nihil Kaza

4/20/23

Urban Analytics

Harmonising and analysing data from

stationary sensors 🚥
mobile sensors
- ground 🌍
- areal ✈️
administrative 📜
video 🎥

in the service of urban sustainability, resilience, health …

Today’s plan

What is R? What is Rstudio?
What is tidyverse?
Basic functions for manipulating data
Using R effectively
More data manipulation
A peek at next week

What is R? What is Rstudio ?

1 2 3 4 5 6

What is R?

1 2 3 4 5 6

R is an open-source (free!) scripting language for working with data.

It started out as a statistical analysis language but is so much more now.

Getting R & Rstudio

1 2 3 4 5 6

You need the R language

And also the Integrated Development Environment

Navigating RStudio

1 2 3 4 5 6

project files are here.
Visualisations show up here

imported data shows up here

code can go here

Navigating RStudio

1 2 3 4 5 6

project files are here

imported data shows up here

code can also
go here

What is tidyverse?

1 2 3 4 5 6

Using R with tidyverse

1 2 3 4 5 6

You use R via packages

…which contain functions

…which are just verbs

Tidyverse is a collection of R packages designed to work around a common philosophy.

Best way to learn the tidyverse

http://r4ds.had.co.nz/

Rectangular Data

Put each dataset in a tibble.
Put each variable in a column.
Each column should be a data type (integer, boolean, date, character etc.)
Each row is ‘nominally’ independent of another row
Each column is ‘nominally’ independent of another column

Today’s data

1 2 3 4 5 6

Real Time AQI is published by the Shanghai Environmental Monitoring Center (SEMC).

Today’s data

1 2 3 4 5 6

AQI_tbl

TIME	SITEID	AQI	GRADE	AIRQUALITY	PPITEMID
2014/7/1 0:00	201	78	2	良	9
2014/7/1 0:00	209	84	2	良	9
2014/7/1 0:00	185	69	2	良	9
2014/7/1 0:00	203	73	2	良	9
2014/7/1 0:00	215	73	2	良	9
2014/7/1 0:00	0	69	2	良	9

Contains 79365 observations with 6 variables.

Today’s data

1 2 3 4 5 6

stations_tbl

SITEID	SITENAME	ORDERID	ISCITY	ISDISTRICT	ISNATIONAL	ISACTIVE	SITENAMEEN	ADDRESS	ISPUBLIC	LON	LAT
215	虹口凉城	5	-1	-1	-1	-1	Hongkou Liangcheng	虹口区凉城路854号凉城社区卫生中心	-1	121.4718	31.29878
201	普陀监测站	1	-1	-1	-1	-1	Putuo Station	普陀区杏山路317号曹杨社区文化中心	-1	121.4060	31.23642
209	杨浦四漂	2	-1	-1	-1	-1	Yangpu Sipiao	杨浦区平凉路1398号	-1	121.5381	31.26435

Contains 10 observations with 12 variables.

Today’s data

1 2 3 4 5 6

AQI_tbl

variable	type	unique
TIME	chr	7215
SITEID	dbl	11
AQI	chr	365
GRADE	chr	7
AIRQUALITY	chr	7
PPITEMID	chr	7

stations_tbl

variable	type	unique
SITEID	dbl	10
SITENAME	chr	10
ORDERID	dbl	10
ISCITY	dbl	2
ISDISTRICT	dbl	1
ISNATIONAL	dbl	1
ISACTIVE	dbl	1
SITENAMEEN	chr	10
ADDRESS	chr	10
ISPUBLIC	dbl	1
LON	dbl	10
LAT	dbl	10

Today’s data

1 2 3 4 5 6

Today’s data

1 2 3 4 5 6

Useful operators & keyboard shortcuts

1 2 3 4 5 6

<-

“assign to”

Opt + - (Mac)
Alt + - (Windows/linux)

%>%

“and then”

Cmd + shift + m (Mac)
Cntrl + shift + m (Windows/Linux)

Syntax of a function

1 2 3 4 5 6

function(object, argument(s))

is the same as

object %>%

function(argument(s))

Read external data

AQI_tbl <- read_csv("wk1_intro/data/AQI.csv", 
                    locale = locale(encoding = "GBK")
                    )

is the same as

AQI_tbl <- "wk1_intro/data/AQI.csv" %>%
             read_csv(locale = locale(encoding = "GBK"))

AQI_tbl

TIME	SITEID	AQI	GRADE	AIRQUALITY	PPITEMID
2014/7/1 0:00	201	78	2	良	9
2014/7/1 0:00	209	84	2	良	9
2014/7/1 0:00	185	69	2	良	9
2014/7/1 0:00	203	73	2	良	9
2014/7/1 0:00	215	73	2	良	9
2014/7/1 0:00	0	69	2	良	9

Basic data manipulation

1 2 3 4 5 6

Common functions

1 2 3 4 5 6

mutate creates new columns (aka variables) or modify existing columns

filter keeps or discards rows (aka observations)

select keeps or discards columns

arrange sorts data set by certain variable(s)

count tallies data set by certain variable(s)

group_by/summarize aggregates data (pivot tables!)

str_* functions work easily with text

ymd* functions work easily for date/time

Mutate

1 2 3 4 5 6

mutate can modify existing columns

AQI_tbl %>% 
  mutate(TIME = lubridate::ymd_hm(TIME))

TIME	SITEID	AQI	GRADE	AIRQUALITY	PPITEMID
2014-07-01	201	78	2	良	9
2014-07-01	209	84	2	良	9
2014-07-01	185	69	2	良	9
2014-07-01	203	73	2	良	9

variable	type
TIME	dat
SITEID	dbl
AQI	chr
GRADE	chr
AIRQUALITY	chr
PPITEMID	chr

Mutate

1 2 3 4 5 6

can create new variables
the pipe %>% chains multiple functions together

AQI_tbl %>% 
  mutate(TIME = lubridate::ymd_hm(TIME)) %>%
  mutate(month_of_obs = month(TIME, label = T))

TIME	month_of_obs	SITEID	AQI
2014-07-01	Jul	201	78
2014-07-01	Jul	209	84
2014-07-01	Jul	185	69
2014-07-01	Jul	203	73

variable	type
TIME	dat
SITEID	dbl
AQI	chr
GRADE	chr
AIRQUALITY	chr
PPITEMID	chr
month_of_obs	fct

A digression into factors

1 2 3 4 5 6

Factors are categorical variables.
Sometimes what show up as characters or numbers should be factors

AQI_tbl %>%
    mutate(GRADE = factor(
        GRADE,
        levels = 1:6,
        labels = c(
            "Excellent",
            "Good",
            "Lightly Polluted",
            "Moderately Polluted",
            "Heavily Polluted",
            "Extermely Polluted"
        )
    ))

TIME	AQI	GRADE
2014/7/1 0:00	78	Good
2014/7/1 0:00	84	Good
2014/7/1 0:00	69	Good
2014/7/1 0:00	73	Good

Mutate

1 2 3 4 5 6

with& (AND), | (OR) there is enormous flexibility

AQI_tbl %>% 
  mutate(TIME = lubridate::ymd_hm(TIME)) %>%
  mutate(month_of_obs = month(TIME)) %>%
  mutate(season = case_when(
       month_of_obs >= 3 & month_of_obs < 6 ~ "spring",
       month_of_obs >=6 & month_of_obs < 9 ~ "summer",
       month_of_obs >=9 & month_of_obs < 12 ~ "autumn",
       month_of_obs = 12 | month_of_obs <3 ~ "winter",
       .default = "NA"
        )
         )

TIME	month_of_obs	season
2014-07-01	7	summer
2014-07-01	7	summer
2014-07-01	7	summer
2014-07-01	7	summer

Filter

1 2 3 4 5 6

filter keeps or discards rows (aka observations)

the == operator tests for equality

AQI_tbl %>% 
  filter(SITEID == 209)

Notice that 209 (numeric) is not “209” (character) in the code.

TIME	SITEID	AQI	GRADE	AIRQUALITY	PPITEMID
2014/7/1 0:00	209	84	2	良	9
2014/7/1 1:00	209	79	2	良	9
2014/7/1 2:00	209	74	2	良	9
2014/7/1 3:00	209	74	2	良	9
2014/7/1 4:00	209	73	2	良	9
2014/7/1 5:00	209	79	2	良	9

Filter

1 2 3 4 5 6

AQI_tbl %>% 
  filter(SITEID == 209 & 
           GRADE == "2")

is the same as

AQI_tbl %>% 
  filter(SITEID == 209) %>%
  filter(GRADE == "2")

Why is 2 in " " but not 209?

TIME	SITEID	AQI	GRADE	AIRQUALITY	PPITEMID
2014/7/1 0:00	209	84	2	良	9
2014/7/1 1:00	209	79	2	良	9
2014/7/1 2:00	209	74	2	良	9
2014/7/1 3:00	209	74	2	良	9
2014/7/1 4:00	209	73	2	良	9
2014/7/1 5:00	209	79	2	良	9

Filter

1 2 3 4 5 6

the %in% operator allows for multiple options in a list

AQI_tbl %>% 
  filter(SITEID %in% c(201, 207, 193))

TIME	SITEID	AQI	GRADE	AIRQUALITY	PPITEMID
2014/7/1 0:00	201	78	2	良	9
2014/7/1 0:00	207	72	2	良	9
2014/7/1 0:00	193	46	1	优	9
2014/7/1 1:00	201	75	2	良	9
2014/7/1 1:00	207	67	2	良	9
2014/7/1 1:00	193	54	2	良	9

Select

1 2 3 4 5 6

select keeps or discards columns (aka variables)

AQI_tbl %>% 
  select(TIME, SITEID, AQI, GRADE)

TIME	SITEID	AQI	GRADE
2014/7/1 0:00	201	78	2
2014/7/1 0:00	209	84	2
2014/7/1 0:00	185	69	2
2014/7/1 0:00	203	73	2
2014/7/1 0:00	215	73	2
2014/7/1 0:00	0	69	2

Select

1 2 3 4 5 6

can drop columns with -column

AQI_tbl %>% 
  select(-PPITEMID)

TIME	SITEID	AQI	GRADE	AIRQUALITY
2014/7/1 0:00	201	78	2	良
2014/7/1 0:00	209	84	2	良
2014/7/1 0:00	185	69	2	良
2014/7/1 0:00	203	73	2	良
2014/7/1 0:00	215	73	2	良
2014/7/1 0:00	0	69	2	良

Select

1 2 3 4 5 6

the pipe %>% chains multiple functions together

AQI_tbl %>% 
  select(-PPITEMID) %>% 
  filter(GRADE == "2") %>%
  filter(SITEID == 209)

is the same as

filter(filter(select(AQI_tbl, -PPITEMID),GRADE == "2"),SITEID == 209)

is the same as

temp_1 <- select(AQI_tbl, -PPITEMID)
temp_2 <- filter(temp1, GRADE == "2")

temp_2 %>% 
    filter(SITEID == 209)
rm(temp_1, temp_2)

TIME	SITEID	AQI	GRADE	AIRQUALITY
2014/7/1 0:00	209	84	2	良
2014/7/1 1:00	209	79	2	良
2014/7/1 2:00	209	74	2	良
2014/7/1 3:00	209	74	2	良
2014/7/1 4:00	209	73	2	良
2014/7/1 5:00	209	79	2	良

Arrange

1 2 3 4 5 6

arrange sorts the dataset by certain variable(s)

AQI_tbl %>% 
  mutate(TIME = ymd_hm(TIME)) %>%
  arrange(TIME)

TIME	SITEID	AQI	GRADE	AIRQUALITY
2014-07-01	201	78	2	良
2014-07-01	209	84	2	良
2014-07-01	185	69	2	良
2014-07-01	203	73	2	良
2014-07-01	215	73	2	良
2014-07-01	0	69	2	良

Arrange

1 2 3 4 5 6

use desc to reverse the sort

AQI_tbl %>% 
  mutate(TIME = ymd_hm(TIME)) %>%
  arrange(desc(TIME))

TIME	SITEID	AQI	GRADE	AIRQUALITY
2015-04-30 23:00:00	228	45	1	优
2015-04-30 23:00:00	195	43	1	优
2015-04-30 23:00:00	0	53	2	良
2015-04-30 23:00:00	201	60	2	良
2015-04-30 23:00:00	209	36	1	优
2015-04-30 23:00:00	185	58	2	良

Arrange

1 2 3 4 5 6

can sort by multiple variables

AQI_tbl %>% 
  arrange(SITEID, TIME)

TIME	AQI	GRADE	AIRQUALITY	PPITEMID
2014/10/1 0:00	36	1	优	4
2014/10/1 10:00	38	1	优	4
2014/10/1 11:00	38	1	优	4
2014/10/1 12:00	41	1	优	4
2014/10/1 13:00	44	1	优	4
2014/10/1 14:00	54	2	良	4

Count

1 2 3 4 5 6

count tallies data set by certain variable(s) (very useful for familiarizing yourself with data)

AQI_tbl %>% 
  count(GRADE)

GRADE	n
/	1345
1	25714
2	33233
3	12285
4	3764
5	2778

Count

1 2 3 4 5 6

can use sort = TRUE to order results

AQI_tbl %>% 
  count(GRADE, sort = TRUE)

GRADE	n
2	33233
1	25714
3	12285
4	3764
5	2778
/	1345
6	246

Group by / summarize

1 2 3 4 5 6

group_by/summarize aggregates data (pivot tables!)

group_by() identifies the grouping variable(s) and summarize() specifies the aggregation

AQI_tbl <-
    AQI_tbl %>% 
  mutate(SITEID = as.character(SITEID))%>%
  mutate(AQI = as.numeric(AQI))# Why do we need to do this?

AQI_tbl %>% 
  group_by(SITEID) %>%
  summarise(
      min_AQI = min(AQI, na.rm = T),
      max_AQI = max(AQI, na.rm=T))

SITEID	min_AQI	max_AQI
0	10	407
183	11	418
185	9	419
193	9	476
195	7	405
201	7	403

Group by / summarize

1 2 3 4 5 6

More complicated summary is easy with piping and other functions such as max, min, sum, n

AQI_tbl %>% 
  mutate(time_date = date(TIME))%>%
  group_by(SITEID, time_date) %>%
  summarise(max_AQI_day = max(AQI, na.rm=T)) %>%
  ungroup() %>%
  mutate(bad_airq_day = ifelse(max_AQI_day>100, 1, 0)) %>%
  select(SITEID, bad_airq_day) %>%
  group_by(SITEID) %>%
  summarize(count_bad_airq_days = sum(bad_airq_day))

SITEID	count_bad_airq_days
0	151
183	162
185	161
193	152
195	156
201	166

Using R effectively

1 2 3 4 5 6

Working in RStudio

1 2 3 4 5 6

project files are here

imported data shows up here

code can also
go here

Working in RStudio

1 2 3 4 5 6

Typing in the console

think of it like a post-it: useful for quick notes but disposable
actions are saved but code is not
one chunk of code is run at a time (Return)

Typing in a code file

script files have a .R extension
code is saved and sections of any size can be run (Cmd + Return)
do ~95% of your typing in a code file instead of the console!

Working with packages

1 2 3 4 5 6

packages need to be installed on each computer you use

# only need to do this once (per computer)
install.packages("tidyverse")

packages need to be loaded/attached with library() at the beginning of every session

# always put the necessary packages at the top of a code file
library(tidyverse)

can access help files by typing ??tidyverse or ?mutate in the console

Organizing with projects

1 2 3 4 5 6

highly recommend using projects to stay organized

keeps code files and data files together, allowing for easier file path navigation and better reproducible work habits

File -> New Project

more guidance: here and here

Organizing with projects

1 2 3 4 5 6

project files are here

imported data shows up here

code can also
go here

Organizing with projects

1 2 3 4 5 6

.
└── my_awesome_project
    ├── my_awesome_project.Rproj
    ├── run_analyses.R 
    ├── src
        ├── 01_first_script.R
        └── 02_second_script.R
    ├── R 
        ├── useful_function_1.R
        └── useful_function_2.R
    ├── writing
        ├── 01_chapter.Rmd
    ├── output
    ├── data
        ├── raw
        └── temp
        └── processed
   └──README.md

Always here

1 2 3 4 5 6

Jenny Bryan once said

    If the first line of your R script is

    setwd("C:\Users\jenny\path\that\only\I\have")

    I will come into your office and SET YOUR COMPUTER ON FIRE 🔥.
    
    

    If the first line of your R script is

    rm(list = ls())

    I will come into your office and SET YOUR COMPUTER ON FIRE 🔥.

I will happily supply her the matches and fuel

Always here

1 2 3 4 5 6

instead use the here function in the here package

library(here)
AQI_tbl <- here("data", "raw", "AQI.csv") %>%
    read_csv()

here converts the file path to

[1] "/Users/kaza/Library/CloudStorage/Dropbox/shortcourses/data/raw/AQI.csv"

which allows me to change computers, Operating Systems etc.
don’t have to worry about “/” or “\”

Ask for help!

1 2 3 4 5 6

use ? - help
use ?? - help.search

?mutate

Ask for help!

1 2 3 4 5 6

Ask for help! (Not like this)

1 2 3 4 5 6

I am importing fairly large csv’s (2 - 3 million rows). When I import these using read_csv, it fails. Anyone know why?

This question is missing the key information required to reproduce and troubleshoot the problem:

  How is the datafile being imported delimited(csv(comma-delimited), other)?

  What operating system is involved(Linux, Windows)? What locale?

  What version of R running? What functions and libraries are being used?

  The post does not provide an example of the R code, with the data that led to the problem

Working with the data

1 2 3 4 5 6

Importing data

1 2 3 4 5 6

Create and name a Project in Rstudio.
Create the specified directory structure in explorer/finder
Download the AQI.csv and stations.csv to the data/raw/ subdirectory
Create a 01_my_first_script.R with in the src subdirectory and open it in Rstudio

Importing data

1 2 3 4 5 6

Type the following code into the R file and run it.

library(here)
library(tidyverse)

AQI_tbl <- here("data", "raw", "AQI.csv") %>%
    read_csv()
stations_tbl <- here("data", "raw", "stations.csv") %>%
    read_csv()

Importing data

1 2 3 4 5 6

If needed, use read_excel from the readxl package to read Excel files

library(readxl)
AQI_tbl <- here("data", "raw", "AQI.xlsx") %>%
    read_excel(sheet = "2")

Analysis Tasks

1 2 3 4 5 6

Find all outlier observations in AQI

Which stations are likely to have these?
Which month/season these outliers are likely to occur?
Do these outliers occur in the morning, afternoon or in the evening?
Does the definition of outlier matter?

For each station

Calculate the daily, weekly, monthly, seasonal summaries (mean, median, std. dev etc.)
How does temporal aggregation affect the air quality story?

More data manipulation

1 2 3 4 5 6

Joining data

1 2 3 4 5 6

R has many useful functions for handling relational data

all you need is at least one key variable that connects data sets

left_join is most common, but there are more

Joining data

1 2 3 4 5 6

stations_tbl <- stations_tbl %>%
      mutate(SITEID = as.character(SITEID))
# Why do we need to do this?


AQI_tbl %>% 
  left_join(stations_tbl,
            by = c("SITEID" = "SITEID"))

TIME	SITEID	AQI	GRADE	AIRQUALITY	PPITEMID	SITENAME	ORDERID	ISCITY	ISDISTRICT	ISNATIONAL	ISACTIVE	SITENAMEEN	ADDRESS	ISPUBLIC	LON	LAT
2014/7/1 0:00	201	78	2	良	9	普陀监测站	1	-1	-1	-1	-1	Putuo Station	普陀区杏山路317号曹杨社区文化中心	-1	121.4060	31.23642
2014/7/1 0:00	209	84	2	良	9	杨浦四漂	2	-1	-1	-1	-1	Yangpu Sipiao	杨浦区平凉路1398号	-1	121.5381	31.26435
2014/7/1 0:00	185	69	2	良	9	卢湾师专附小	3	-1	-1	-1	-1	Luwan Shizhuan	卢湾区局门路478号	-1	121.4818	31.20429
2014/7/1 0:00	203	73	2	良	9	青浦淀山湖	4	0	-1	-1	-1	Qingpu Dianshanhu	青浦区淀峰渔民村1号	-1	120.9820	31.08969

Joining data

1 2 3 4 5 6

right_join is all rows in y, all columns in x and y.

Joining data

1 2 3 4 5 6

inner_join is rows with matching values in x and y, and all columns from x and y.

Joining data

1 2 3 4 5 6

semi_join is rows with matching values in x and y, and only columns in x

Joining data

1 2 3 4 5 6

full_join returns all rows and all columns from both x and y.

Manipulating Strings

1 2 3 4 5 6

str_* are various useful functions when paired with regular expressions.

stations_tbl %>%
    mutate(street_number = str_extract(ADDRESS, "[0-9]+"))

SITEID	ADDRESS	street_number
215	虹口区凉城路854号凉城社区卫生中心	854
201	普陀区杏山路317号曹杨社区文化中心	317
209	杨浦区平凉路1398号	1398
185	卢湾区局门路478号	478
203	青浦区淀峰渔民村1号	1
183	静安区武定西路1480号	1480
207	徐汇区桂林路100号上师大	100
193	浦东新区川环南路319号	319
228	浦东新区灵山路51号	51
195	浦东新区祖冲之路295号	295

Manipulating Strings

1 2 3 4 5 6

stations_tbl %>%
    mutate(count_numbers = str_count(ADDRESS, "[0-9]"))

SITEID	ADDRESS	count_numbers
215	虹口区凉城路854号凉城社区卫生中心	3
201	普陀区杏山路317号曹杨社区文化中心	3
209	杨浦区平凉路1398号	4
185	卢湾区局门路478号	3
203	青浦区淀峰渔民村1号	1
183	静安区武定西路1480号	4
207	徐汇区桂林路100号上师大	3
193	浦东新区川环南路319号	3
228	浦东新区灵山路51号	2
195	浦东新区祖冲之路295号	3

Regular Expressions

1 2 3 4 5 6

Practise with https://regexone.com/

```
   [abc]   Only a, b, or c
```
```
   [^abc]  Not a, b, nor c
```
```
   [a-z]   Characters a to z
```
```
   [0-9]   Numbers 0 to 9
```
```
   \w  Any Alphanumeric character
```
```
  \W  Any Non-alphanumeric character
```
```
   {m}     m Repetitions
```
```
   {m,n}   m to n Repetitions
```
```
   *   Zero or more repetitions
```
```
   +   One or more repetitions
```
```
   ?   Optional character
```

A peek at next week

1 2 3 4 5 6

:::

A peek at next week

1 2 3 4 5 6

Introduction to Urban Analytics with R

Urban Analytics

Today’s plan

What is R? What is Rstudio ?

What is R?

Getting R & Rstudio

Navigating RStudio

Navigating RStudio

What is tidyverse?

Using R with tidyverse

Best way to learn the tidyverse

Rectangular Data

Rectangular Data

Today’s data

Today’s data

Today’s data

Today’s data

Today’s data

Today’s data

Useful operators & keyboard shortcuts

Syntax of a function

Read external data

Basic data manipulation

Common functions

Mutate

Mutate

A digression into factors

Mutate

Filter

Filter

Filter

Select

Select

Select

Arrange

Arrange

Arrange

Count

Count

Group by / summarize

Group by / summarize

Using R effectively

Working in RStudio

Working in RStudio

Working with packages

Organizing with projects

Organizing with projects

Organizing with projects

Always here

Always here

Ask for help!

Ask for help!

Ask for help! (Not like this)

Working with the data

Importing data

Importing data

Importing data

Analysis Tasks

More data manipulation

Joining data

Joining data

Joining data

Joining data

Joining data

Joining data

Manipulating Strings

Manipulating Strings

Regular Expressions

A peek at next week

A peek at next week

Thank You!