Scraping Craigslist Posts
Getting Started
When data is unstructured, or only seemingly unstructured, we can still use R to impose some structure. In this post I am going to demonstrate how to scrape an HTML page, extract the relevant information and convert it into a table for further analysis. In this particular example, we will use Craigslist. The usual disclaimers about data retrieval and storage apply. Please consult a lawyer, especially if you use the data for non-research purposes, and keep an eye on the evolving landscape of media law on this topic.
Additional Resources
- Munzert, Simon, Christian Rubba, Peter Meißner, and Dominic Nyhuis. 2014. Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. 1st edition. Chichester, West Sussex, United Kingdom: Wiley.
- Pittard, Steve. 2019. Web Scraping with R.
Understanding the structure of the query
In this post, I am going to demonstrate how to scrape and assemble the rental listings in Boston. To start, it is useful to use your browser to access the webpage.
https://boston.craigslist.org/d/apts-housing-for-rent/search/apa
The results should look like this:
In the left-hand corner, you will notice forms that you can use to limit the search results. For example,
https://boston.craigslist.org/search/apa?postedToday=1&max_price=2000&availabilityMode=0&broker_fee=1&sale_date=all+dates
refers to a search with
- “posted today” is checked
- “maximum price” is “2000 usd”
- “availability” is set to “all dates”
- “no broker fee” is checked
- “open house date” is set to “all dates”
A few things to note.
- Cities are subdomains, i.e. if you need information about Raleigh, you need to use https://raleigh.craigslist.org/
- The second part of the URL, `/d/apts-housing-for-rent/search/apa`, is the same for all cities, but it will change if you want to scrape something other than apartments.
- The browser uses an `https` protocol instead of an `http` protocol. This is more secure, but it occasionally poses problems for accessing the web using `curl` instead of a browser. cURL is a command-line tool for getting or sending data, including files, using URL syntax, and it is often used explicitly within web-scraping scripts. We are going to use `rvest`, which takes care of this issue.
- The query arguments start after the `?`.
- Match the options checked in the browser to the variables in the URL and the values they take. For example, `postedToday` and `broker_fee` are set to 1 when the corresponding boxes are checked.
- Variables are concatenated using `&` in the URL string.
- It looks like the spaces in `all dates` are replaced with a `+`. Usually they are replaced by `%20`, the hexadecimal code for a space. It is useful to understand the hex codes for special characters (see the short example after this list).
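For instance, base R's `URLencode()` (from the `utils` package) illustrates how spaces and other special characters get percent-encoded; the strings below are made-up examples, not ones taken from Craigslist.
# Spaces and other special characters are percent-encoded in URLs
URLencode("all dates", reserved = TRUE)
# [1] "all%20dates"
# Reserved characters such as & also get their own hex codes
URLencode("cats & dogs", reserved = TRUE)
# [1] "cats%20%26%20dogs"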
Based on our understanding of the above URL, we can use conventional string manipulation to construct the query URL in R as follows.
library(tidyverse)
library(rvest)
library(tmap)
location <- 'boston'
bedrooms <- 2
bathrooms <- 2
min_sqft <- 900
baseurl <- paste0("https://", location, ".craigslist.org/search/apa")
# Build out the query
queries <- vector("character", 0) # Initialise a vector
queries[1] <- paste0("bedrooms=", bedrooms)
queries[2] <- paste0("bathrooms=", bathrooms)
queries[3] <- paste0("minSqft=", min_sqft)
(query_url <- paste0(baseurl,"?", paste(queries, collapse = "&")))
# [1] "https://boston.craigslist.org/search/apa?bedrooms=2&bathrooms=2&minSqft=900"
Exercise
Try out different URLs and understand how the query to the server works, i.e. try different cities, different types of ads and different arguments to the queries. It is useful to construct different URLs by typing them out so that you understand the syntax.
Figuring out the structure of the results
A URL request to the server usually results in an HTML page that is rendered in the browser. In other posts, we have seen how a URL may return JSON objects, which are much easier to turn into a structured dataset. But HTML pages usually come with some structure that is used to render the page, and sometimes we can take advantage of that structure to infer and extract what we need.
To examine the structure, we will have to use the developer tools in the web browser. In the rest of the tutorial, I am going to use Firefox as a browser, though analogous tools can be found for other browsers. You can open the Firefox Developer Tools from the menu by selecting Tools > Web Developer > Toggle Tools or use the keyboard shortcut Ctrl + Shift + I or F12 on Windows and Linux, or Cmd + Opt + I on macOS.
The most useful tool for this exercise is the page inspector, usually in the bottom left. You can right-click on any post in the Craigslist results page and use Inspect Element to understand what the structure looks like. Alternatively, you can point to various elements in the inspector and see the corresponding elements highlighted in the browser window.
In this particular instance, each posted ad sits inside a list: each ad is a list element `<li>` with the class `cl-search-result` and a `data-pid` attribute associated with it.
That is our way into scraping and structuring the data. To take advantage of the HTML elements, you need to become somewhat familiar with CSS selectors. You can use a fun game to get started.
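To get a feel for the selectors we will use below, here is a small self-contained sketch; the HTML fragment is made up for illustration, but it mimics the `li.cl-search-result` structure we just saw in the inspector.
# A made-up HTML fragment that mimics the structure of the results page
snippet <- read_html('<ul>
  <li class="cl-search-result" data-pid="111"><a class="posting-title">Sunny 2BR</a></li>
  <li class="cl-search-result" data-pid="222"><a class="posting-title">Quiet studio</a></li>
</ul>')
# "li.cl-search-result" selects <li> elements with that class
snippet %>% html_elements("li.cl-search-result") %>% html_attr("data-pid")
# [1] "111" "222"
# A descendant selector: the <a class="posting-title"> inside each result
snippet %>% html_elements("li.cl-search-result a.posting-title") %>% html_text()
# [1] "Sunny 2BR"    "Quiet studio"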
Selenium
Selenium is a library and tool for automating web browsers to perform a number of tasks. One of them is web scraping, to extract useful data and information that may otherwise be unavailable. Craigslist has recently made it impossible to use `rvest` alone to scrape the data, by requiring a browser with JavaScript capabilities to query the webpage.
In such cases, you can use `RSelenium` to scrape the data. `RSelenium` is an R binding for the Selenium framework, in which a web browser is controlled programmatically to navigate and interact with web pages. It is often useful to use a headless browser, i.e. a browser without a graphical user interface, especially when you are running the code on a server or in a cloud environment. For this to work, the relevant ports must not be blocked by firewalls. We use the `netstat` package to find a free port, and `wdman` is a package that helps manage the webdriver.
#install.packages("RSelenium")
library(RSelenium)
library(wdman)
library(netstat)
#remDr$close() # Close the browser if you had one started before.
rD <- rsDriver(browser = 'firefox',
               verbose = FALSE,
               chromever = NULL,
               port = free_port()) # Sometimes ports are blocked by the firewall; you can change the port number or check with your system administrator.
remDr <- rD[["client"]]
remDr$navigate(query_url) # Navigate to the url
Sys.sleep(2) # give the page time to fully load
html_page <- remDr$getPageSource()[[1]]
# Close the browser whenever you are done with it.
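When you are finished scraping, it is good practice to shut down both the browser client and the Selenium server so that the port is released; a minimal sketch using the objects created above:
remDr$close()     # close the browser window
rD$server$stop()  # stop the Selenium server process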
Using rvest to extract data from html
The workhorse functions we are going to use are `html_elements` and `html_element` from the `rvest` library. `html_elements` extracts every matching element, while `html_element` extracts exactly one element (the first match) for each node it is applied to. In this instance we want to extract all elements of the `cl-search-result` class:
#(raw_query <- read_html(query_url))
raw_query <- read_html(html_page)
## Select out the listing ads
raw_ads <- html_elements(raw_query, "li.cl-search-result")
raw_ads %>% head()
# {xml_nodeset (6)}
# [1] <li data-pid="7795984275" class="cl-search-result cl-search-view-mode-gal ...
# [2] <li data-pid="7795983719" class="cl-search-result cl-search-view-mode-gal ...
# [3] <li data-pid="7795982750" class="cl-search-result cl-search-view-mode-gal ...
# [4] <li data-pid="7795982286" class="cl-search-result cl-search-view-mode-gal ...
# [5] <li data-pid="7795981274" class="cl-search-result cl-search-view-mode-gal ...
# [6] <li data-pid="7795979820" class="cl-search-result cl-search-view-mode-gal ...
In the following bit of code, I extract the attributes that are part of each result row. In particular, I want to extract the id, title, price, date and locale. Notice how each of them requires some special manipulation to get into the right format.
ids <-
raw_ads %>%
html_attr('data-pid')
titles <-
raw_ads %>%
html_element("a.posting-title") %>%
html_text()
prices <-
raw_ads %>%
html_element("span.priceinfo") %>%
html_text() %>%
str_replace_all("\\$|,+", "") %>% # Regular expression that matches the special characters $ and , and replaces them with nothing.
as.numeric()
metadata <-
raw_ads%>%
html_element('div.meta') %>%
html_text()
Exercise
- Notice the different functions `html_attr` and `html_text` that are used in different situations. Can you explain which is used when?
- Examine when you need to use “.” within the selector passed to `html_element` and when you don't.
- Parse the metadata to extract the date of posting, number of bedrooms, sq.ft and locale (a starting point is sketched after this list).
- Use the SelectorGadget Chrome extension or bookmarklet to figure out more easily which elements to extract.
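As a starting point for the metadata exercise, here is a rough sketch. It assumes the text of `div.meta` contains the bedroom count and square footage in forms such as `2br` and `900ft2`; inspect the page yourself, since the separators and field order may differ.
# Pull rough bedroom and square-footage figures out of the meta text.
# The regular expressions assume patterns like "2br" and "900ft2";
# adjust them to whatever the inspector shows for your results page.
meta_parsed <- tibble(meta = metadata) %>%
  mutate(bedrooms = as.numeric(str_extract(meta, "\\d+(?=br)")),
         sqft = as.numeric(str_extract(meta, "\\d+(?=ft2)")))
head(meta_parsed)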
Extracting location attributes
The locations of the apartments/houses are a little more tricky because they are not embedded in the results page but in each individual ad. To access them, we need to visit the individual URL for each ad and then extract the location attributes.
To do this we use the `map_*` family of functions from the `purrr` package. These functions are effectively like a for loop: the code below applies `function(x)` to each element of `urls` and then combines the results into one table by row binding (`map_dfr`).
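Conceptually, `map_dfr(xs, f)` applies `f` to each element and row-binds the results; the toy example below (with a made-up function `f`) is only meant to illustrate the pattern.
# A toy function that returns a one-row tibble for each input
f <- function(x) tibble(id = x, squared = x^2)
# map_dfr() applies f to each element and row-binds the results ...
map_dfr(1:3, f)
# ... which is equivalent to this explicit for loop:
results <- list()
for (i in 1:3) results[[i]] <- f(i)
bind_rows(results)
# Both produce a 3-row tibble with columns id and squared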
urls <-
raw_ads %>%
html_element(".cl-app-anchor") %>%
html_attr("href")
latlongs <- map_dfr(urls, function(x){
remDr$navigate(x) # Navigate to the url
Sys.sleep(1) # give the page time to fully load
html_page2 <- remDr$getPageSource()[[1]]
xml2::read_html(html_page2) %>%
html_element("#map") %>%
html_attrs() %>%
t() %>%
as_tibble() %>%
select_at(vars(starts_with("data-"))) %>%
mutate_all(as.numeric)
}
)
latlongs
# # A tibble: 120 × 3
# `data-latitude` `data-longitude` `data-accuracy`
# <dbl> <dbl> <dbl>
# 1 42.2 -71.2 22
# 2 42.1 -71.2 22
# 3 42.2 -71.0 10
# 4 42.5 -71.2 22
# 5 42.4 -71.2 10
# 6 42.5 -71.0 22
# 7 42.5 -71.4 22
# 8 42.4 -71.0 10
# 9 42.4 -71.0 22
# 10 42.4 -71.1 10
# # ℹ 110 more rows
Exercise
- It is useful to go through each step of the function and see how the steps finally result in the latitudes and longitudes.
Geocoding addresses
Notice the `data-accuracy` field in the results. It is not clear whether the lat/longs provided by Craigslist are accurate, so it is useful to geocode the addresses to get a more reliable location. We first extract the address that appears on each ad page and then geocode it using the `tidygeocoder` package.
addresses <- map_dfr(urls, function(x){
remDr$navigate(x) # Navigate to the url
Sys.sleep(1) # give the page time to fully load and don't run afoul of rate limits
html_page2 <- remDr$getPageSource()[[1]]
id <- str_split(x, "(.*)/")[[1]][2] %>% tools::file_path_sans_ext() # The post id is the last piece of the url, without its file extension
title <- read_html(html_page2) %>%
html_element("span#titletextonly") %>%
html_text()
addr <- read_html(html_page2) %>%
html_element("h2.street-address") %>%
html_text()
return(bind_cols(pid = id, title = title, full_addr = addr))
}
)
head(addresses)
# # A tibble: 6 × 3
# pid title full_addr
# <chr> <chr> <chr>
# 1 7795984275 Working privacy pod, Patio / balcony, Undermount sink, L… 8 Upland…
# 2 7795983719 9 ft ceilings, Dual vanity, Gas grilling areas, Billiards 400 Foxb…
# 3 7795982750 Air Conditioning, FreeWeights, Refrigerator, Attached ga… 550 Libe…
# 4 7795982286 Build credit score with RentTrack, Moving services with … 1 Inwood…
# 5 7795981274 Tile backsplash, Smart Home + GiGstreem Internet, Billia… 36 River…
# 6 7795979820 Air conditioning, Walking trail, Garages, Loft-style hom… 1000 Cra…
library(tidygeocoder)
library(sf)
addresses_geocode <- addresses %>%
geocode_combine(
queries = list(list(method = 'census'), list(method = 'arcgis')),
global_params = list(address = 'full_addr'), cascade = TRUE)
# Now check the distances between the geocoded lat/longs and the Craigslist lat/longs
craigslist_table <- cbind(ids, titles, urls, latlongs, prices) %>%
as_tibble
craigslist_table <- craigslist_table %>%
left_join(addresses_geocode, by=c("ids" = "pid"))
craigslist_table
# # A tibble: 120 × 12
# ids titles urls `data-latitude` `data-longitude` `data-accuracy` prices
# <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 7795984… Worki… http… 42.2 -71.2 22 3216
# 2 7795983… 9 ft … http… 42.1 -71.2 22 3124
# 3 7795982… Air C… http… 42.2 -71.0 10 3419
# 4 7795982… Build… http… 42.5 -71.2 22 2870
# 5 7795981… Tile … http… 42.4 -71.2 10 3225
# 6 7795979… Air c… http… 42.5 -71.0 22 3333
# 7 7795978… Wood … http… 42.5 -71.4 22 2350
# 8 7795978… Stora… http… 42.4 -71.0 10 3100
# 9 7795936… 4 bed… http… 42.4 -71.0 22 4500
# 10 7795912… Davis… http… 42.4 -71.1 10 3200
# # ℹ 110 more rows
# # ℹ 5 more variables: title <chr>, full_addr <chr>, lat <dbl>, long <dbl>,
# # query <chr>
craigslist_table$dist_err <- diag(geosphere::distm(craigslist_table[,c("data-longitude", "data-latitude")], craigslist_table[,c("long", "lat")]))
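# Note: distm() builds the full n-by-n distance matrix and we keep only its
# diagonal. As an alternative sketch, geosphere::distHaversine() returns the
# row-wise distances directly and avoids building the full matrix:
# craigslist_table$dist_err <- geosphere::distHaversine(
#   craigslist_table[, c("data-longitude", "data-latitude")],
#   craigslist_table[, c("long", "lat")])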
craigslist_table %>% arrange(-`data-accuracy`) %>% glimpse()
# Rows: 120
# Columns: 13
# $ ids <chr> "7793439550", "7795984275", "7795983719", "7795982286…
# $ titles <chr> "2-3 bed, 1-2 bath condo- 2 units", "Working privacy …
# $ urls <chr> "https://boston.craigslist.org/gbs/apa/d/belmont-3-be…
# $ `data-latitude` <dbl> 42.3960, 42.1868, 42.0649, 42.4829, 42.5326, 42.4567,…
# $ `data-longitude` <dbl> -71.1820, -71.2033, -71.2441, -71.1574, -70.9612, -71…
# $ `data-accuracy` <dbl> 99, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 2…
# $ prices <dbl> 3400, 3216, 3124, 2870, 3333, 2350, 4500, 4200, 2995,…
# $ title <chr> "2-3 bed, 1-2 bath condo- 2 units", "Working privacy …
# $ full_addr <chr> NA, "8 Upland Woods Cir,, Norwood, MA 02062", "400 Fo…
# $ lat <dbl> NA, 42.21076, 42.04648, 42.52207, 42.55445, 42.43790,…
# $ long <dbl> NA, -71.19273, -71.23641, -71.13192, -70.97567, -71.4…
# $ query <chr> "", "census", "arcgis", "arcgis", "census", "census",…
# $ dist_err <dbl> NA, 2800.6538, 2142.9896, 4829.4004, 2702.8893, 4948.…
Notice all the missing addresses with ‘high accuracy’. Clearly the text addresses are incomplete data. However, can you trust the longitude and latitude created by Craigslist?
library(ggplot2)
ggplot() +
geom_point(aes(x = `data-accuracy`, y = `dist_err`), data = craigslist_table) +
labs(x = "Data Accuracy", y = "Distance Error")
Exercise
- What can you infer from this graph? How does this condition your analysis further?
Now that everything is combined into one table, we are ready to visualise the results.
tmap_mode("view")
craigslist_sf <- st_as_sf(craigslist_table, coords = c("data-longitude", "data-latitude"), crs = 4326)
m1 <-
tm_shape(craigslist_sf) +
tm_dots(col = "prices") +
tm_basemap("CartoDB.Positron")
library(widgetframe)
frameWidget(tmap_leaflet(m1))
Navigating to next page
The above code only extracts the first page of results. To extract the rest, we need to navigate to the next page and repeat the extraction. We can do this programmatically with Selenium by finding the next-page button element and clicking on it.
remDr$navigate(query_url)
nextbutton <- remDr$findElement(using = 'css selector', value = 'button.cl-next-page')
nextbutton$clickElement()
Sys.sleep(3)
html_page2 <- remDr$getPageSource()[[1]]
raw_ads <- html_elements(read_html(html_page2), "li.cl-search-result")
raw_ads %>% head()
# {xml_nodeset (6)}
# [1] <li data-pid="7791523666" class="cl-search-result cl-search-view-mode-gal ...
# [2] <li data-pid="7791436053" class="cl-search-result cl-search-view-mode-gal ...
# [3] <li data-pid="7791435513" class="cl-search-result cl-search-view-mode-gal ...
# [4] <li data-pid="7791435026" class="cl-search-result cl-search-view-mode-gal ...
# [5] <li data-pid="7791434008" class="cl-search-result cl-search-view-mode-gal ...
# [6] <li data-pid="7791432435" class="cl-search-result cl-search-view-mode-gal ...
Exercise
- Figure out how to loop through the various pages to extract all the results. Do not repeat the code again and again, but try to use loops or functions, and figure out when you should end your loop. The more quickly you ping the server with your queries, the sooner you will run afoul of rate limits and may raise some eyebrows in San Francisco. Use `Sys.sleep` to slow down the query rate (a skeleton is sketched after this list).
- Repeat this for a few other urban areas, such as Los Angeles and Raleigh.
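For the first exercise, here is a minimal skeleton. It assumes the next-page button keeps the `cl-next-page` class and simply stops after a fixed number of pages (`max_pages`, a name introduced here for illustration); a more careful loop would also detect when the button disappears or is disabled.
max_pages <- 5            # illustrative cap; choose your own stopping rule
all_ads <- vector("list", max_pages)
remDr$navigate(query_url)
for (page in seq_len(max_pages)) {
  Sys.sleep(3)            # let the page load and keep the query rate polite
  page_html <- remDr$getPageSource()[[1]]
  all_ads[[page]] <- html_elements(read_html(page_html), "li.cl-search-result")
  # Move on to the next page of results
  nextbutton <- remDr$findElement(using = 'css selector', value = 'button.cl-next-page')
  nextbutton$clickElement()
}
length(all_ads)           # one xml_nodeset per page scraped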
Cautions & Conclusions
Scraping the web to create a structured dataset is as much an art as it is a technical exercise. It is about figuring out the right combination of searches and manipulations that will get you the right result. And because there is no documentation or manual, it is imperative to experiment and fail, and you will fail often. Webpages change their structure over time, so code that works one week may not work the next. The webpages themselves are dynamic, and therefore the dataset that you generate may not be replicable. All of this should be taken into account when you are scraping the web for data and conducting your analyses.