Scraping Craigslist Posts

Boeing, G. and P. Waddell. 2017. “New Insights into Rental Housing Markets across the United States: Web Scraping and Analyzing Craigslist Rental Listings.” Journal of Planning Education and Research, 37 (4), 457-476. doi:10.1177/0739456X16664789

Getting Started

When the data are unstructured, or only seemingly unstructured, we can still use R to impose some structure. In this post I am going to demonstrate how to scrape an HTML page, extract the relevant information and convert it into a table for further analysis. In this particular example, we are going to use Craigslist. The usual disclaimers about data retrieval and storage apply. Please consult a lawyer, especially if you use the data for non-research purposes. It is also worth following the evolving landscape of media law on this topic.


Understanding the structure of the query

In this post, I am going to demonstrate how to scrape and assemble rental listings in Boston. To start, it is useful to access the webpage in your browser.

https://boston.craigslist.org/d/apts-housing-for-rent/search/apa

The results should look like this

In the left-hand corner, you will notice that there are forms that you can use to limit the search results. For example,

https://boston.craigslist.org/search/apa?postedToday=1&max_price=2000&availabilityMode=0&broker_fee=1&sale_date=all+dates

refers to a search with

  1. “posted today” is checked
  2. “maximum price” is “2000 usd”
  3. “availability” is set to “all dates”
  4. “no broker fee” is checked
  5. “open house date” is set to “all dates”

A few things to note.

  1. Cities are subdomains, i.e. if you need information about Raleigh, you need to use https://raleigh.craigslist.org/
  2. The second part of the path, /d/apts-housing-for-rent/search/apa, is the same for all cities but will change if you want to scrape something other than apartments.
  3. The browser uses the https protocol instead of http. This is more secure, but occasionally poses problems when accessing the web with curl instead of a browser. cURL is a command-line tool for getting or sending data, including files, using URL syntax, and it is often used within web-scraping scripts. We are going to use rvest, which takes care of this issue.
  4. The query arguments start after ?.
  5. Match the options checked in the browser to the variables in the URL and the values they take. For example, postedToday and broker_fee are set to 1 when the corresponding boxes are checked.
  6. Variables are concatenated with & in the URL string.
  7. It looks like the spaces in all dates are replaced with a +. Usually spaces are replaced by %20, the hexadecimal code for a space; it is useful to understand the hex codes for special characters, as shown in the short example below.
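To see the two encodings from the last point side by side, here is a minimal sketch using base R: utils::URLencode() produces the %20 form, while the + form is just a plain string substitution.

# Percent-encode the space as %20 (URLencode is in the base utils package)
utils::URLencode("all dates", reserved = TRUE)
# [1] "all%20dates"

# Replace spaces with "+", the form the Craigslist search form produces
gsub(" ", "+", "all dates", fixed = TRUE)
# [1] "all+dates"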

Based on our understanding of the above URL, we can use conventional string manipulation to construct the URL in R as follows.


library(tidyverse)
library(rvest)
library(tmap)

location <- 'boston'
bedrooms <- 2
bathrooms <- 2
min_sqft <- 900

baseurl <- paste0("https://", location, ".craigslist.org/search/apa")

# Build out the query
queries <- vector("character", 0) # Initialise a vector
queries[1] <- paste0("bedrooms=", bedrooms)
queries[2] <- paste0("bathrooms=", bathrooms)
queries[3] <- paste0("minSqft=", min_sqft)

(query_url <- paste0(baseurl,"?", paste(queries, collapse = "&")))
# [1] "https://boston.craigslist.org/search/apa?bedrooms=2&bathrooms=2&minSqft=900"

Exercise

Try out different URLs and understand how the query to the server works, i.e. try different cities, different types of ads and different arguments to the queries. It is useful to construct different URLs by typing them out so that you understand the syntax; a helper function such as the sketch below can speed up that experimentation.
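The function below is a sketch of such a helper (build_cl_url() is my own name, not part of any package); it assembles a search URL from a city, a search type and a named list of query arguments.

# Hypothetical helper, just a sketch for experimenting with different queries
build_cl_url <- function(city, search_type = "apa", params = list()) {
  base <- paste0("https://", city, ".craigslist.org/search/", search_type)
  if (length(params) == 0) return(base)
  query <- paste(names(params), unlist(params), sep = "=", collapse = "&")
  paste0(base, "?", query)
}

build_cl_url("raleigh", params = list(bedrooms = 2, max_price = 2000, postedToday = 1))
# [1] "https://raleigh.craigslist.org/search/apa?bedrooms=2&max_price=2000&postedToday=1"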


Figuring out the structure of the results

A URL request to the server usually results in an HTML page that is rendered in the browser. In other posts, we have seen how a URL may return JSON objects, which are much easier to turn into a structured dataset. But HTML pages usually carry some structure that is used to render the page, and sometimes we can take advantage of that structure to infer and extract what we need.

To examine the structure, we will have to use the developer tools in the web browser. In the rest of the tutorial, I am going to use Firefox as the browser, though analogous tools can be found for other browsers. You can open the Firefox Developer Tools from the menu by selecting Tools > Web Developer > Toggle Tools, or use the keyboard shortcut Ctrl + Shift + I (or F12) on Windows and Linux, or Cmd + Opt + I on macOS.

The most useful tool for this exercise is the page inspector, usually at the bottom left. You can right-click on any post in the Craigslist results page and use Inspect Element to see what the structure looks like. Alternatively, you can point to various elements in the inspector and see the corresponding elements highlighted in the browser window.

In this particular instance, each posted ad sits within a list, as a list element <li> with the class cl-search-result and a data-pid attribute associated with it.

That is our way into scraping and structuring the data. To take advantage of the HTML elements, you need to become somewhat familiar with CSS selectors. You can use a fun game to get started; a toy example of the selector syntax we will rely on is shown below.
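As a quick, self-contained illustration, here is a toy HTML fragment (made up to mimic the structure of the results page, not real Craigslist markup) parsed with rvest:

# Toy fragment mimicking the listing structure, just to show how CSS selectors work
toy <- read_html('
  <ul>
    <li class="cl-search-result" data-pid="111"><a class="posting-title">Sunny 2BR</a></li>
    <li class="cl-search-result" data-pid="222"><a class="posting-title">Quiet studio</a></li>
    <li class="banner">Not a listing</li>
  </ul>')

# "li.cl-search-result" selects <li> elements with that class; the banner is skipped
toy %>% html_elements("li.cl-search-result") %>% html_attr("data-pid")
# [1] "111" "222"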

Selenium

Selenium is a library and tool for automating web browsers to perform a number of tasks, one of which is web scraping to extract data and information that may otherwise be unavailable. Craigslist has recently made it impossible to scrape the results with rvest alone, by requiring a browser with JavaScript capabilities to query the webpage.

In such cases, you can use RSelenium to scrape the data. RSelenium is an R binding for the Selenium framework, through which a web browser can be controlled programmatically to navigate and interact with web pages. It is often useful to use a headless browser, i.e. a browser without a graphical user interface, for example when you are running the code on a server or in a cloud environment. For this to work, the relevant ports must not be blocked by firewalls. We use the netstat package to find a free port, and the wdman package helps manage the web driver.

#install.packages("RSelenium")
library(RSelenium)
library(wdman)
library(netstat)


#remDr$close() # Close the browser if you had one started before.

rD <- rsDriver(browser = 'firefox',
               verbose =F,
               chromever = NULL,
               port = free_port()) # Sometimes these ports are blocked by the firewall. You can change the port number to something else. Or check with your system administrator.
remDr <- rD[["client"]]

remDr$navigate(query_url) # Navigate to the url

Sys.sleep(2) # give the page time to fully load
html_page <- remDr$getPageSource()[[1]]
# remDr$close() # Close the browser whenever you are done with it.
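If you want the headless set-up mentioned above, the capability can be passed through to the driver. The snippet below is a sketch under the assumption that rsDriver() forwards extra arguments to remoteDriver(); the exact "moz:firefoxOptions" structure depends on your geckodriver and Firefox versions, so verify it on your machine.

# Sketch: requesting a headless Firefox session (assumption: extraCapabilities
# is forwarded to remoteDriver(); check against your RSelenium/geckodriver versions)
rD_headless <- rsDriver(browser = 'firefox',
                        verbose = FALSE,
                        chromever = NULL,
                        port = free_port(),
                        extraCapabilities = list(
                          "moz:firefoxOptions" = list(args = list('-headless'))))
remDr_headless <- rD_headless[["client"]]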

Using rvest to extract data from html

The workhorse functions we are going to use are html_elements and html_element from the rvest library. html_elements extracts every matching element, while html_element extracts exactly one element (the first match) for each element of its input. In this instance we want to extract all elements of the cl-search-result class.

#(raw_query <- read_html(query_url))

raw_query <- read_html(html_page)

## Select out the listing ads
raw_ads <- html_elements(raw_query, "li.cl-search-result")
raw_ads %>% head()
# {xml_nodeset (6)}
# [1] <li data-pid="7795984275" class="cl-search-result cl-search-view-mode-gal ...
# [2] <li data-pid="7795983719" class="cl-search-result cl-search-view-mode-gal ...
# [3] <li data-pid="7795982750" class="cl-search-result cl-search-view-mode-gal ...
# [4] <li data-pid="7795982286" class="cl-search-result cl-search-view-mode-gal ...
# [5] <li data-pid="7795981274" class="cl-search-result cl-search-view-mode-gal ...
# [6] <li data-pid="7795979820" class="cl-search-result cl-search-view-mode-gal ...

In the following bit of code, I extract the attributes that are part of each result row. In particular, I want to extract the id, title, price, date and locale. Notice how each of them requires some special manipulation to get into the right format.

ids <-
  raw_ads %>%
  html_attr('data-pid')

titles <-
  raw_ads %>%
   html_element("a.posting-title") %>%
   html_text()

prices <-
   raw_ads %>% 
     html_element("span.priceinfo") %>%
     html_text() %>%
     str_replace_all("\\$|,+", "") %>% # This is a function that includes a regular expression to extract a special symbol $ and , and replace them with nothing.
     as.numeric()

metadata <-
  raw_ads%>%
  html_element('div.meta') %>%
  html_text()

Exercise

  • Notice the different functions html_attr and html_text that are used in different situations. Can you explain which is used when?

  • Examine when you need to use “.” within the selector passed to html_element and when you don’t.

  • Parse the metadata to extract the date of posting, number of bedrooms, sq.ft and locale

  • Use the SelectorGadget Chrome extension or bookmarklet to figure out more easily which elements to extract.


We are treading dangerously here. Notice that we are assuming that each of these ads has the same attributes that we can extract, that they appear in the same places within the document, and that they are properly encoded. It is always a good idea to spot-check some of these results. In particular, an easy check is to see whether the lengths of these variables are the same, as in the sketch below. If they are, then more likely than not we can column-bind them into a table. If they are not, there is some error and it is likely that the prices won’t match up with the descriptions or the number of bedrooms. I will leave it as an exercise how you would spot and recover from these errors.
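Here is a minimal sketch of that sanity check, assuming the vectors created above (ids, titles, prices, metadata). Because html_element returns a missing node (and hence NA) when an ad lacks the selected element, the lengths should all match; a mismatch signals a scraping problem.

# All extracted vectors should have one entry per ad
c(ids = length(ids), titles = length(titles),
  prices = length(prices), metadata = length(metadata))

stopifnot(length(ids) == length(titles),
          length(ids) == length(prices),
          length(ids) == length(metadata))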

Extracting location attributes

The locations of the apartments/houses are a little trickier because they are not embedded in the results page but in each individual ad. To access them, we need to visit the individual URL for each ad and then extract the location attributes.

To do this we use the map_* family of functions from the purrr package. These functions are effectively for loops: map_dfr applies function(x) to each element of urls and combines the results into a single table by row binding. After the toy illustration below, the same pattern is applied to the ad urls.
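As a toy illustration of the pattern (not part of the scraping pipeline; map_dfr and tibble are available because tidyverse is already loaded), map_dfr applies a function to each element of a vector and row-binds the one-row tibbles it returns:

map_dfr(1:3, function(x) tibble(value = x, squared = x^2))
# # A tibble: 3 × 2
#   value squared
#   <int>   <dbl>
# 1     1       1
# 2     2       4
# 3     3       9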


urls <-
  raw_ads %>%
  html_element(".cl-app-anchor") %>%
  html_attr("href")
  

latlongs <- map_dfr(urls, function(x){
    remDr$navigate(x) # Navigate to the url
    Sys.sleep(1) # give the page time to fully load
    html_page2 <- remDr$getPageSource()[[1]]
    xml2::read_html(html_page2) %>%   
      html_element("#map") %>%
      html_attrs() %>%
      t() %>%
      as_tibble() %>%
      select_at(vars(starts_with("data-"))) %>%
      mutate_all(as.numeric)
  }
  )

latlongs
# # A tibble: 120 × 3
#    `data-latitude` `data-longitude` `data-accuracy`
#              <dbl>            <dbl>           <dbl>
#  1            42.2            -71.2              22
#  2            42.1            -71.2              22
#  3            42.2            -71.0              10
#  4            42.5            -71.2              22
#  5            42.4            -71.2              10
#  6            42.5            -71.0              22
#  7            42.5            -71.4              22
#  8            42.4            -71.0              10
#  9            42.4            -71.0              22
# 10            42.4            -71.1              10
# # ℹ 110 more rows

Exercise

  • It is useful to go through each step of the function to see how it finally results in the latitudes and longitudes.

Geocoding addresses

Notice the data-accuracy field in the results. It is not clear whether the lat/longs provided by Craigslist are accurate, so it is useful to geocode the addresses to get a more accurate location. We first extract the address from each ad page and then geocode it using the tidygeocoder package.


addresses <- map_dfr(urls, function(x){
    remDr$navigate(x) # Navigate to the url
    Sys.sleep(1) # give the page time to fully load and don't run afoul of rate limits
    html_page2 <- remDr$getPageSource()[[1]]
    
    id <- str_split(x, "(.*)/")[[1]][2] %>% tools::file_path_sans_ext() 
    
    title <- read_html(html_page2) %>%   
      html_element("span#titletextonly") %>%
      html_text()
    
    addr <- read_html(html_page2) %>%   
      html_element("h2.street-address") %>%
      html_text()
    
    return(bind_cols(pid = id, title = title, full_addr = addr))
    
  }
  )

head(addresses)
# # A tibble: 6 × 3
#   pid        title                                                     full_addr
#   <chr>      <chr>                                                     <chr>    
# 1 7795984275 Working privacy pod, Patio / balcony, Undermount sink, L… 8 Upland…
# 2 7795983719 9 ft ceilings, Dual vanity, Gas grilling areas, Billiards 400 Foxb…
# 3 7795982750 Air Conditioning, FreeWeights, Refrigerator, Attached ga… 550 Libe…
# 4 7795982286 Build credit score with RentTrack, Moving services with … 1 Inwood…
# 5 7795981274 Tile backsplash, Smart Home + GiGstreem Internet, Billia… 36 River…
# 6 7795979820 Air conditioning, Walking trail, Garages, Loft-style hom… 1000 Cra…

library(tidygeocoder)
library(sf)

addresses_geocode <- addresses %>%
  geocode_combine(
    queries = list(list(method = 'census'), list(method = 'arcgis')),
    global_params = list(address = 'full_addr'), cascade = TRUE)
  

# Now check how far the geocoded lat/longs are from the Craigslist-provided lat/longs
  


craigslist_table <- cbind(ids, titles, urls, latlongs, prices) %>% 
                    as_tibble

craigslist_table <- craigslist_table %>% 
  left_join(addresses_geocode, by=c("ids" = "pid"))

craigslist_table
# # A tibble: 120 × 12
#    ids      titles urls  `data-latitude` `data-longitude` `data-accuracy` prices
#    <chr>    <chr>  <chr>           <dbl>            <dbl>           <dbl>  <dbl>
#  1 7795984… Worki… http…            42.2            -71.2              22   3216
#  2 7795983… 9 ft … http…            42.1            -71.2              22   3124
#  3 7795982… Air C… http…            42.2            -71.0              10   3419
#  4 7795982… Build… http…            42.5            -71.2              22   2870
#  5 7795981… Tile … http…            42.4            -71.2              10   3225
#  6 7795979… Air c… http…            42.5            -71.0              22   3333
#  7 7795978… Wood … http…            42.5            -71.4              22   2350
#  8 7795978… Stora… http…            42.4            -71.0              10   3100
#  9 7795936… 4 bed… http…            42.4            -71.0              22   4500
# 10 7795912… Davis… http…            42.4            -71.1              10   3200
# # ℹ 110 more rows
# # ℹ 5 more variables: title <chr>, full_addr <chr>, lat <dbl>, long <dbl>,
# #   query <chr>

craigslist_table$dist_err <- diag(
  geosphere::distm(craigslist_table[, c("data-longitude", "data-latitude")],
                   craigslist_table[, c("long", "lat")])
) # distance (in metres) between each Craigslist point and its geocoded address


craigslist_table %>% arrange(-`data-accuracy`) %>% glimpse()
# Rows: 120
# Columns: 13
# $ ids              <chr> "7793439550", "7795984275", "7795983719", "7795982286…
# $ titles           <chr> "2-3 bed, 1-2 bath condo- 2 units", "Working privacy …
# $ urls             <chr> "https://boston.craigslist.org/gbs/apa/d/belmont-3-be…
# $ `data-latitude`  <dbl> 42.3960, 42.1868, 42.0649, 42.4829, 42.5326, 42.4567,…
# $ `data-longitude` <dbl> -71.1820, -71.2033, -71.2441, -71.1574, -70.9612, -71…
# $ `data-accuracy`  <dbl> 99, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 2…
# $ prices           <dbl> 3400, 3216, 3124, 2870, 3333, 2350, 4500, 4200, 2995,…
# $ title            <chr> "2-3 bed, 1-2 bath condo- 2 units", "Working privacy …
# $ full_addr        <chr> NA, "8 Upland Woods Cir,, Norwood, MA 02062", "400 Fo…
# $ lat              <dbl> NA, 42.21076, 42.04648, 42.52207, 42.55445, 42.43790,…
# $ long             <dbl> NA, -71.19273, -71.23641, -71.13192, -70.97567, -71.4…
# $ query            <chr> "", "census", "arcgis", "arcgis", "census", "census",…
# $ dist_err         <dbl> NA, 2800.6538, 2142.9896, 4829.4004, 2702.8893, 4948.…

Notice all the missing addresses with ‘high accuracy’. Clearly the text addresses are incomplete data. However, can you trust the longitude and latitude created by Craigslist?


library(ggplot2)

ggplot() +
  geom_point(aes(x = `data-accuracy`, y = `dist_err`), data = craigslist_table) +
  labs(x = "Data Accuracy", y = "Distance Error")

Exercise

  • What can you infer from this graph? How does this condition your analysis further?

Now we are ready to convert the table into a spatial object and visualise the results on a map.


tmap_mode("view")

craigslist_sf <- st_as_sf(craigslist_table, coords = c("data-longitude", "data-latitude"), crs = 4326)

m1 <-
  tm_shape(craigslist_sf) +
  tm_dots(col = "prices") +
  tm_basemap("CartoDB.Positron") 


library(widgetframe)
frameWidget(tmap_leaflet(m1))

The above code only extracts the first page of the results. To extract all of them, we need to navigate to each subsequent page and scrape it in turn. We can do this programmatically with Selenium by finding the “next page” button element and clicking on it.

remDr$navigate(query_url)
nextbutton <- remDr$findElement(using = 'css selector', value = 'button.cl-next-page')
nextbutton$clickElement()
Sys.sleep(3)
html_page2 <- remDr$getPageSource()[[1]]
raw_ads <- html_elements(read_html(html_page2), "li.cl-search-result")
raw_ads %>% head()
# {xml_nodeset (6)}
# [1] <li data-pid="7791523666" class="cl-search-result cl-search-view-mode-gal ...
# [2] <li data-pid="7791436053" class="cl-search-result cl-search-view-mode-gal ...
# [3] <li data-pid="7791435513" class="cl-search-result cl-search-view-mode-gal ...
# [4] <li data-pid="7791435026" class="cl-search-result cl-search-view-mode-gal ...
# [5] <li data-pid="7791434008" class="cl-search-result cl-search-view-mode-gal ...
# [6] <li data-pid="7791432435" class="cl-search-result cl-search-view-mode-gal ...

Exercise

  • Figure out how to loop through the various pages to extract all the results. Do not repeat the code again and again; try to use loops or functions. Figure out when you should end your loop. The more quickly you ping the server with your queries, the more likely you are to run afoul of rate limits and raise some eyebrows in San Francisco. Use Sys.sleep to slow down the query rate.

  • Repeat this for a few other urban areas, such as Los Angeles and Raleigh.


Cautions & Conclusions

Scraping the web to create a structured dataset is as much an art as it is a technical exercise. It is about figuring out the right combination of searches and manipulations that will get you the right result. Because there is no documentation or manual, it is imperative to experiment and to fail, and you will fail often. Webpages change their structure over time, and code that works one week may not work the next. The webpages themselves are dynamic, so the dataset you generate may not be replicable. All of this should be taken into account when you are scraping the web for data and conducting your analyses.
