Exploratory Data Analysis and Visualisation

Getting started with real data

In this tutorial, we are going to summarise the Air Quality Index (AQI) recorded at ozone monitoring stations in the US in 2017. The dataset is downloaded from the EPA website. A local copy of the dataset is here

library(tidyverse)
library(here)

ozone_airq_df <- here("tutorials_datasets", "aqi", "daily_44201_2017.csv") %>%
  read_csv() # Change file paths accordingly
dim(ozone_airq_df)
# [1] 402501     29
names(ozone_airq_df)
#  [1] "State Code"          "County Code"         "Site Num"           
#  [4] "Parameter Code"      "POC"                 "Latitude"           
#  [7] "Longitude"           "Datum"               "Parameter Name"     
# [10] "Sample Duration"     "Pollutant Standard"  "Date Local"         
# [13] "Units of Measure"    "Event Type"          "Observation Count"  
# [16] "Observation Percent" "Arithmetic Mean"     "1st Max Value"      
# [19] "1st Max Hour"        "AQI"                 "Method Code"        
# [22] "Method Name"         "Local Site Name"     "Address"            
# [25] "State Name"          "County Name"         "City Name"          
# [28] "CBSA Name"           "Date of Last Change"

If a library(.) command returns an error, it is likely that you do not have the package installed on your computer. Install it with, e.g., install.packages("tidyverse"). You only need to install a package once, but you need to load it with library() in every R session.

Zhu Wang: I am trying to create a library which uses some Fortran source files […]

Douglas Bates: Someone named Martin Maechler will shortly be sending you email regarding the distinction between ‘library’ and ‘package’ :-)

   -- Zhu Wang and Douglas Bates: R-help (May 2004)

If read_csv results in an error, it is very likely that you got the path to the file wrong. Make sure that the data file exists at the path specified. File paths are tricky, especially when you want your code to be portable across operating systems or even different computers. A few rules of thumb will save you some headaches later on.

  1. Use the file.path() function to create platform-independent file paths.
  2. Use relative paths.
  3. Assign the path string to a variable (e.g. input_file_path) and use that variable in functions (such as read_csv(input_file_path)).
  4. Better still, use the here function from the here package together with the R projects framework, as sketched below.
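
For instance, a minimal sketch of these rules (the directory names are the same ones used above):

input_file_path <- file.path("tutorials_datasets", "aqi", "daily_44201_2017.csv") # Rules 1-3: a platform-independent relative path, stored in a variable

input_file_path <- here("tutorials_datasets", "aqi", "daily_44201_2017.csv") # Rule 4: here() anchors the same path at the root of the R project

ozone_airq_df <- read_csv(input_file_path) # Rule 3: the function only sees the variable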

Exercise

  • Use the above rules to rewrite the code for reading the input data file.
  • If you have not used R project to start with, go back and follow the instructions about Rprojects and directory structure. Rewrite the code to read in the data from appropriate file path.

In the US, states and counties are uniquely identified by 2-digit and 3-digit FIPS (Federal Information Processing Standards) codes, respectively. Let's create a combined 5-character code so that we can merge datasets together.

ozone_airq_df$stcofips <- paste(formatC(ozone_airq_df$`State Code`, width = 2, flag = "0"),
                                formatC(ozone_airq_df$`County Code`, width = 3, flag = "0"),
                                sep = "")

head(ozone_airq_df$stcofips) #Inspect the outcome
# [1] "01003" "01003" "01003" "01003" "01003" "01003"
tail(ozone_airq_df$stcofips)
# [1] "80026" "80026" "80026" "80026" "80026" "80026"
unique(nchar(ozone_airq_df$stcofips)) # Making sure all fips codes are of length 5 characters
# [1] 5

What we really need is only a few columns, so let's extract them.

ozone_airq_df2 <- ozone_airq_df[,c('stcofips', 'Site Num', 'Latitude', 'Longitude', 'Date Local', 'State Name', 'County Name', 'CBSA Name', 'AQI')]
summary(ozone_airq_df2)
#    stcofips           Site Num            Latitude       Longitude      
#  Length:402501      Length:402501      Min.   :18.18   Min.   :-158.09  
#  Class :character   Class :character   1st Qu.:34.10   1st Qu.:-110.10  
#  Mode  :character   Mode  :character   Median :38.34   Median : -91.53  
#                                        Mean   :37.64   Mean   : -95.33  
#                                        3rd Qu.:41.06   3rd Qu.: -82.17  
#                                        Max.   :64.85   Max.   : -65.92  
#    Date Local          State Name        County Name         CBSA Name        
#  Min.   :2017-01-01   Length:402501      Length:402501      Length:402501     
#  1st Qu.:2017-04-12   Class :character   Class :character   Class :character  
#  Median :2017-06-30   Mode  :character   Mode  :character   Mode  :character  
#  Mean   :2017-06-30                                                           
#  3rd Qu.:2017-09-19                                                           
#  Max.   :2017-12-31                                                           
#       AQI        
#  Min.   :  0.00  
#  1st Qu.: 31.00  
#  Median : 38.00  
#  Mean   : 40.13  
#  3rd Qu.: 45.00  
#  Max.   :233.00
nrow(unique(ozone_airq_df2[,c('Latitude', 'Longitude')]))
# [1] 1265
length(unique(ozone_airq_df2$stcofips))
# [1] 786

i.e. there are 1,265 unique monitoring station locations in 786 counties, roughly 1.61 per county on average. However, there are about 3,100 counties in the US, suggesting that large parts of the US are not covered by monitoring stations.

It is useful to rewrite the last command as follows, using the pipe operator %>%

ozone_airq_df2$stcofips %>%
  unique() %>%
  length()
# [1] 786

Following this pipe style helps avoid a lot of headaches associated with orphan brackets, and makes the code cleaner and easier to read.

You can also follow the tidyverse syntax.

ozone_airq_df3 <- select(ozone_airq_df, stcofips, `Site Num`, Latitude, Longitude, `Date Local`, `State Name`, `County Name`, `CBSA Name`, AQI) # We only need the backticks because of the spaces in some column names.
all.equal(ozone_airq_df2, ozone_airq_df3)
# [1] TRUE
rm(ozone_airq_df3)

select is more powerful than [ because it has helper functions that select columns based on pattern matching. For example, select(ozone_airq_df2, ends_with("Name")) would select the State Name, County Name and CBSA Name columns.

The spaces in the column names are problematic, so let's rename the columns.

ozone_airq_df2 <- rename(ozone_airq_df2,
                         County_Name = `County Name`,
                         State_Name = `State Name`,
                         CBSA_Name = `CBSA Name`,
                         dateL = `Date Local`,
                         Site_Num = `Site Num`)

Exercise

  • Check to make sure you can select using new column names
  • What are the data types for different columns?
  • How many NAs are there in different columns in the dataset?
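
A couple of hints for the last two items; these are just two of several possible approaches:

sapply(ozone_airq_df2, class) # data type of each column

ozone_airq_df2 %>%
  summarise(across(everything(), ~ sum(is.na(.)))) # number of NAs in each column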

Workhorses of data manipulation

Order

This dataset has repeated observations per monitoring station. It would be nice if they were rearranged in a nested fashion, i.e. State | County | Monitoring Station | Date.

select(arrange(ozone_airq_df2, stcofips, desc(dateL)), stcofips, dateL, AQI) #select is only to demonstrate
# # A tibble: 402,501 × 3
#    stcofips dateL        AQI
#    <chr>    <date>     <dbl>
#  1 01003    2017-10-31    45
#  2 01003    2017-10-30    38
#  3 01003    2017-10-28    24
#  4 01003    2017-10-27    32
#  5 01003    2017-10-26    43
#  6 01003    2017-10-25    35
#  7 01003    2017-10-24    36
#  8 01003    2017-10-23    31
#  9 01003    2017-10-22    27
# 10 01003    2017-10-21    29
# # ℹ 402,491 more rows

We take advantage of the fact that FIPS codes are arranged in a natural order. Otherwise, we could also do

select(arrange(ozone_airq_df2, State_Name, County_Name, desc(dateL)), stcofips, dateL, AQI) #select is only to demonstrate
# # A tibble: 402,501 × 3
#    stcofips dateL        AQI
#    <chr>    <date>     <dbl>
#  1 01003    2017-10-31    45
#  2 01003    2017-10-30    38
#  3 01003    2017-10-28    24
#  4 01003    2017-10-27    32
#  5 01003    2017-10-26    43
#  6 01003    2017-10-25    35
#  7 01003    2017-10-24    36
#  8 01003    2017-10-23    31
#  9 01003    2017-10-22    27
# 10 01003    2017-10-21    29
# # ℹ 402,491 more rows

Exercise

  • Use the %>% operator to rewrite the above commands. Note that you can use the pipe operator even when a function has multiple arguments. Use the help resources to understand how to write code with the pipe operator in different situations, e.g. passing the argument in the second position of the function (see the hint after this list).
  • Print the top 10 AQI values and their corresponding sites and dates.
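
As a hint for the first item, %>% inserts the left-hand side as the first argument by default; a . placeholder sends it elsewhere. Two toy examples (the values are arbitrary):

c(3, 1, 2) %>% sort() # same as sort(c(3, 1, 2)); first position is the default
4 %>% seq(1, .)       # same as seq(1, 4); the . puts it in the second position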

Mutate

mutate creates new variables out of old ones. You’ll want to use mutate just about any time you want to create a new variable in your data frame or alter and overwrite an existing variable.

# Convert Longitude and Latitude to radians
mutate(ozone_airq_df2, 
                Latitude_rad = pi * Latitude / 180,
                Longitude_rad = pi * Longitude / 180) %>%
  select(contains('tude'))
# # A tibble: 402,501 × 4
#    Latitude Longitude Latitude_rad Longitude_rad
#       <dbl>     <dbl>        <dbl>         <dbl>
#  1     30.5     -87.9        0.532         -1.53
#  2     30.5     -87.9        0.532         -1.53
#  3     30.5     -87.9        0.532         -1.53
#  4     30.5     -87.9        0.532         -1.53
#  5     30.5     -87.9        0.532         -1.53
#  6     30.5     -87.9        0.532         -1.53
#  7     30.5     -87.9        0.532         -1.53
#  8     30.5     -87.9        0.532         -1.53
#  9     30.5     -87.9        0.532         -1.53
# 10     30.5     -87.9        0.532         -1.53
# # ℹ 402,491 more rows

If you want to mutate multiple columns at the same time, use the across function within mutate. In this example, we search for the string “tude” in the column names and apply the convert_to_rad function to each matching column.


convert_to_rad <- function(x){ifelse(!is.na(x), x/180*pi, NA)} # The ifelse is redundant here (arithmetic on NA already returns NA); it is used to demonstrate defensive programming.


mutate(ozone_airq_df2, across(contains("tude"), convert_to_rad, .names = "{.col}_rad")) %>% 
        select(contains('tude')) # select is for demo only.
# # A tibble: 402,501 × 4
#    Latitude Longitude Latitude_rad Longitude_rad
#       <dbl>     <dbl>        <dbl>         <dbl>
#  1     30.5     -87.9        0.532         -1.53
#  2     30.5     -87.9        0.532         -1.53
#  3     30.5     -87.9        0.532         -1.53
#  4     30.5     -87.9        0.532         -1.53
#  5     30.5     -87.9        0.532         -1.53
#  6     30.5     -87.9        0.532         -1.53
#  7     30.5     -87.9        0.532         -1.53
#  8     30.5     -87.9        0.532         -1.53
#  9     30.5     -87.9        0.532         -1.53
# 10     30.5     -87.9        0.532         -1.53
# # ℹ 402,491 more rows

Exercise

  • Note that mutate itself does not change ozone_airq_df2. What if you wanted to save these new variables? How would you achieve it?

  • Notice that the placement of the closing parenthesis is very important for this to work. What happens when you run the following instead, and why?

mutate(ozone_airq_df2, across(contains("tude")), convert_to_rad, .names = "{.col}_rad")

A useful thing to do is to create a unique id for each monitoring station. We can accomplish this using mutate.

ozone_airq_df2 <- mutate(ozone_airq_df2, SiteID = paste(stcofips, Site_Num, sep="_")) 

Exercise

  • Calculate the distance to the center of the US (39.8333333, -98.585522, in Kansas) from each site. The Haversine distance is given by

$$a = \sin^2\left(\frac{\Delta\phi}{2}\right) + \cos\phi_1 \cos\phi_2 \sin^2\left(\frac{\Delta\lambda}{2}\right)$$
$$d = 2R\,\mathrm{atan2}\left(\sqrt{a}, \sqrt{1-a}\right)$$

where $\phi$ is latitude and $\lambda$ is longitude (both in radians), and R is 6371 (the radius of the earth in km). Mind the conversion between degrees and radians for the trigonometric functions, and watch the parentheses. Note also the use of the two-argument arctan function.
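
A minimal sketch of this formula as an R function (the function name haversine_km and the column name dist_to_center are ours, for illustration):

haversine_km <- function(lat1, lon1, lat2, lon2, R = 6371){
  # convert degrees to radians before the trigonometric functions
  phi1 <- lat1 * pi / 180
  phi2 <- lat2 * pi / 180
  dphi <- (lat2 - lat1) * pi / 180
  dlambda <- (lon2 - lon1) * pi / 180
  a <- sin(dphi / 2)^2 + cos(phi1) * cos(phi2) * sin(dlambda / 2)^2
  2 * R * atan2(sqrt(a), sqrt(1 - a)) # note the two-argument arctan
}

mutate(ozone_airq_df2, dist_to_center = haversine_km(Latitude, Longitude, 39.8333333, -98.585522))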


Filter

Sometimes, instead of subsetting columns, you want to subset rows. filter is a useful command for this.

filter(ozone_airq_df2, stcofips == "06037") #Los Angeles County
# # A tibble: 4,574 × 10
#    stcofips Site_Num Latitude Longitude dateL      State_Name County_Name
#    <chr>    <chr>       <dbl>     <dbl> <date>     <chr>      <chr>      
#  1 06037    0002         34.1     -118. 2017-01-01 California Los Angeles
#  2 06037    0002         34.1     -118. 2017-01-02 California Los Angeles
#  3 06037    0002         34.1     -118. 2017-01-03 California Los Angeles
#  4 06037    0002         34.1     -118. 2017-01-04 California Los Angeles
#  5 06037    0002         34.1     -118. 2017-01-05 California Los Angeles
#  6 06037    0002         34.1     -118. 2017-01-06 California Los Angeles
#  7 06037    0002         34.1     -118. 2017-01-07 California Los Angeles
#  8 06037    0002         34.1     -118. 2017-01-08 California Los Angeles
#  9 06037    0002         34.1     -118. 2017-01-09 California Los Angeles
# 10 06037    0002         34.1     -118. 2017-01-10 California Los Angeles
# # ℹ 4,564 more rows
# # ℹ 3 more variables: CBSA_Name <chr>, AQI <dbl>, SiteID <chr>

Group By

Even though we rearranged the data, we have not yet split it to calculate group summaries. For example, if you wanted to find out which monitoring stations had readings for fewer than 30 days, you could do the following.

group_by(ozone_airq_df2, SiteID) %>%
  filter(n() < 30)
# # A tibble: 43 × 10
# # Groups:   SiteID [4]
#    stcofips Site_Num Latitude Longitude dateL      State_Name County_Name
#    <chr>    <chr>       <dbl>     <dbl> <date>     <chr>      <chr>      
#  1 17083    0117         39.1     -90.3 2017-10-03 Illinois   Jersey     
#  2 17083    0117         39.1     -90.3 2017-10-04 Illinois   Jersey     
#  3 17083    0117         39.1     -90.3 2017-10-05 Illinois   Jersey     
#  4 17083    0117         39.1     -90.3 2017-10-06 Illinois   Jersey     
#  5 17083    0117         39.1     -90.3 2017-10-07 Illinois   Jersey     
#  6 17083    0117         39.1     -90.3 2017-10-08 Illinois   Jersey     
#  7 17083    0117         39.1     -90.3 2017-10-09 Illinois   Jersey     
#  8 17083    0117         39.1     -90.3 2017-10-10 Illinois   Jersey     
#  9 17083    0117         39.1     -90.3 2017-10-11 Illinois   Jersey     
# 10 17083    0117         39.1     -90.3 2017-10-12 Illinois   Jersey     
# # ℹ 33 more rows
# # ℹ 3 more variables: CBSA_Name <chr>, AQI <dbl>, SiteID <chr>

group_by is quite powerful for calculating within-group summaries, and it can also be combined with mutate to change each row (a sketch of a grouped mutate follows the county summary below). For example, we can calculate the maximum AQI across all the monitors in a county for each day.

group_by(ozone_airq_df2, stcofips, dateL) %>%
  summarize(maxAQI=max(AQI))
# # A tibble: 242,122 × 3
# # Groups:   stcofips [786]
#    stcofips dateL      maxAQI
#    <chr>    <date>      <dbl>
#  1 01003    2017-02-28     19
#  2 01003    2017-03-01     36
#  3 01003    2017-03-02     42
#  4 01003    2017-03-03     48
#  5 01003    2017-03-04     49
#  6 01003    2017-03-05     48
#  7 01003    2017-03-06     45
#  8 01003    2017-03-07     44
#  9 01003    2017-03-08     44
# 10 01003    2017-03-09     44
# # ℹ 242,112 more rows

This represents the worst AQI recorded in the county for that day. The higher the AQI value, the greater the level of air pollution and the greater the health concern. For example, an AQI value of 50 represents good air quality with little potential to affect public health, while an AQI value over 300 represents hazardous air quality.

An AQI value of 100 generally corresponds to the national air quality standard for the pollutant (in our case, ozone), which is the level EPA has set to protect public health. AQI values below 100 are generally thought of as satisfactory. When AQI values are above 100, air quality is considered unhealthy, at first for certain sensitive groups of people, then for everyone as AQI values get higher.

county_summary_df <- group_by(ozone_airq_df2, stcofips, dateL) %>%
  summarize(maxAQI=max(AQI),
            State_Name=first(State_Name), # We can do this because we know that there is only one state corresponding to a stcofips, i.e. counties are embedded in the state.
            County_Name=first(County_Name), # We also need these variables to carry over to the next step.
            CBSA_Name=first(CBSA_Name)) %>%
  group_by(stcofips) %>%
  summarize(AQIgt100 = sum(maxAQI>=100), 
            numDays= n(), 
            percAQIgt100 = AQIgt100/numDays, 
            State_Name=first(State_Name), 
            County_Name=first(County_Name), 
            CBSA_Name=first(CBSA_Name)
            )

county_summary_df
# # A tibble: 786 × 7
#    stcofips AQIgt100 numDays percAQIgt100 State_Name County_Name CBSA_Name      
#    <chr>       <int>   <int>        <dbl> <chr>      <chr>       <chr>          
#  1 01003           2     235      0.00851 Alabama    Baldwin     Daphne-Fairhop…
#  2 01033           0     246      0       Alabama    Colbert     Florence-Muscl…
#  3 01049           0     359      0       Alabama    DeKalb      Fort Payne, AL 
#  4 01051           0     230      0       Alabama    Elmore      Montgomery, AL 
#  5 01055           0     245      0       Alabama    Etowah      Gadsden, AL    
#  6 01069           0     245      0       Alabama    Houston     Dothan, AL     
#  7 01073           2     360      0.00556 Alabama    Jefferson   Birmingham-Hoo…
#  8 01089           1     246      0.00407 Alabama    Madison     Huntsville, AL 
#  9 01097           3     241      0.0124  Alabama    Mobile      Mobile, AL     
# 10 01101           0     242      0       Alabama    Montgomery  Montgomery, AL 
# # ℹ 776 more rows
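
As mentioned above, group_by also combines with mutate: instead of collapsing each group to one row, a grouped mutate keeps every row and attaches the group-level value to it. A minimal sketch (the column name county_day_max is ours):

ozone_airq_df2 %>%
  group_by(stcofips, dateL) %>%
  mutate(county_day_max = max(AQI)) %>% # each row gets its county-day maximum
  ungroup() %>%
  select(stcofips, dateL, AQI, county_day_max)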

Exercise

  • Create similar summaries by state and by CBSA separately.
  • By state and by CBSA, create monthly summaries of the percentage of days with AQI greater than 100.

Exploratory data visualisation

The first thing to do with any dataset is to explore it and identify issues, edge cases, outliers and interesting correlations. Exploratory visualisation is an indispensable part of data analysis and is often overlooked. I cannot stress its importance enough.

“The simple graph has brought more information to the data analyst’s mind than any other device.”—John Tukey

We will use ggplot graphics to explore the datasets. See Hadley’s tutorial

Basic plotting

First, you need to tell ggplot what dataset to use. This is done using the ggplot(df) function, where df is a dataframe that contains all features needed to make the plot.

You can add whatever aesthetics you want to your ggplot (inside the aes() argument), such as the X and Y axes, by specifying the respective variables from the dataset. The variables that should drive colour, size, shape and stroke can also be specified here. Aesthetics specified this way are inherited by all the geom layers you add subsequently; alternatively, you can specify the aesthetics for individual layers. I prefer the latter method, even though it is repetitive.

Finally, you can add labels to axes and other features.


g1 <- ozone_airq_df2 %>% 
  filter(stcofips == "06059") %>% # Orange County
  ggplot() + 
  geom_point(aes(x=dateL, y=AQI, color=SiteID)) +
  geom_smooth(aes(x=dateL, y=AQI, color=SiteID), method="loess")+
  scale_colour_brewer(palette = "Set2") + 
  labs(x = "Month", y = "Air Quality Index")

g1

If you are producing an HTML file, you can use javascript-based visualisations that provide some interactivity to statistical graphics. For this we need a package called plotly.

Plotly seems to interact weirdly when MathJax v3 (used to render math symbols) is enabled on webpages. If you are running into problems, that is likely the cause. You can avoid ggplotly without losing much in this tutorial.
library(plotly)
ggplotly(g1)

There is also no reason to always use a cartesian coordinate system for the axes. Polar coordinates work just as well, especially when we have cycles. Though it is not of much use here, the following demonstrates coord_polar’s effect.


ozone_airq_df2 %>% 
  filter(stcofips == "06059") %>% # Orange County
  ggplot() + 
  coord_polar(theta = "x")+
  geom_point(aes(x=dateL, y=AQI, color=SiteID), alpha=.5, show.legend = FALSE) +
  geom_smooth(aes(x=dateL, y=AQI, color=SiteID), se=FALSE)+
  scale_colour_brewer(palette = "Dark2") + 
  labs(x = "Month", y = "Air Quality Index")

Facetting

Facetting plots subsets of the data and places them next to one another. This is inspired by Edward Tufte’s small multiples, and it allows us to see patterns in the data more clearly than otherwise. Normally, the axis scales on each graph are fixed, which means that they have the same size and range across the facets.

Let’s facet the data for various counties in California, while showing all the monitors within a county in a single graph.

 g1 <- ozone_airq_df2 %>% 
  filter(State_Name == "California") %>% 
  ggplot() + 
  geom_smooth(aes(x=dateL, y=AQI, color=SiteID), method="loess",  se=FALSE)+
  scale_colour_grey(end = 0)+
  facet_wrap(~stcofips)+
  labs(x = "Month", y = "Air Quality Index") + 
  theme_bw() + 
  theme(axis.text.x=element_blank(),
        legend.position="none")

g1

Exercise

  • Facet CA counties by CBSA (i.e. CBSAs in rows and counties in columns) and produce the same graph as above.
  • Using facetting as above, limit the x-axis to summer months.

If we were to show this in a single graph, we could do something like the following. This graph shows how ozone is worse during the summer months than during the rest of the year.

g1 <- ozone_airq_df2 %>% 
  filter(State_Name == "California")  %>% 
  ggplot() + 
  geom_smooth(aes(x=dateL, y=AQI, group=SiteID, color=stcofips), method="loess", se=FALSE)+
  labs(x = "Month", y = "Air Quality Index") + 
  theme_bw() 

ggplotly(g1)

The display is non-intuitive. Colors are a terrible way to convey information, even though we use them all the time, especially colors on a gradient to represent categorical data such as county. I much prefer the facetted plot.

Statistical summaries by group

g1 <- ozone_airq_df2 %>% 
  filter(stcofips == "06037") %>% 
  ggplot() + 
  geom_point(aes(x=Site_Num, y=AQI), show.legend = F)+
  stat_summary(aes(x=Site_Num, y=AQI), fun='median', colour = "red", size = 1)+
  stat_summary(aes(x=Site_Num, y=AQI),fun='mean', colour = "green", size = 1, shape=3)+
  labs(x = "Site", y = "Air Quality Index", title="Los Angeles County") + 
  theme_bw() 

g1

A violin or box plot is better.


g1 <- ozone_airq_df2 %>% 
  filter(stcofips == "06037") %>% 
  ggplot() + 
  geom_violin(aes(x=Site_Num, y=AQI), show.legend = F)+
  stat_summary(aes(x=Site_Num, y=AQI), fun='median', colour = "red", size = 1)+
  stat_summary(aes(x=Site_Num, y=AQI),fun='mean', size = 1, shape=3)+
  labs(x = "Site", y = "Air Quality Index", title="Los Angeles County") + 
  theme_bw() 

#ggplotly(g1)
g1

Exercise

  • Add 0.25 and 0.75 quantiles as points to the above plot.
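
A hint: stat_summary’s fun argument accepts any function that returns a single value, so anonymous wrappers around quantile work. A sketch that adds the layers to the violin plot g1 above (the shape codes are arbitrary choices):

g1 +
  stat_summary(aes(x = Site_Num, y = AQI),
               fun = function(x) quantile(x, 0.25), shape = 2, size = 1) +
  stat_summary(aes(x = Site_Num, y = AQI),
               fun = function(x) quantile(x, 0.75), shape = 6, size = 1)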

It would be useful to know which counties have the worst pollution. We can define worst pollution by the percentage of days on which the AQI from ozone is greater than or equal to 100. We can accomplish this with a bar plot, showing only the 25 most polluted counties.

g2 <- county_summary_df %>%
  top_n(25, percAQIgt100) %>%
  ggplot()+
  geom_col(aes(x=reorder(paste(County_Name, State_Name, sep=","), percAQIgt100), y=percAQIgt100, fill=State_Name)) + 
  scale_fill_brewer(palette = "Dark2", name="State") +
  coord_flip()+
  labs(y="Proportion of days with Ozone AQI greater than 100", x="County", title="Top 25 polluted counties")+
  theme_minimal()

ggplotly(g2, tooltip = c("y"))

Exercise

  • Use a different color palette than Dark2 in the above plot. Which one works better? Why?

Geographic mapping

To see exactly where these stations are, let us visualise the station locations. To do this we use the leaflet library. One disadvantage of leaflet is that coordinate system projections are not fully supported; spatial data should ideally be in the WGS84 lat/long coordinate system. You can use the Proj4Leaflet plugin to display alternate projections, but that requires basemap tiles that work with your projection. Fortunately, our dataset is already in the WGS84 lat/long coordinate system.

library(leaflet)

monitorlocations_df <- unique(ozone_airq_df[,c('Longitude', 'Latitude')]) 

# Note the reversal of the traditional Lat/Long ordering: Longitude goes first because it belongs on the X-axis. While it is not relevant here, as leaflet understands and parses them correctly, you might as well develop good habits. Many visualisation issues arise from mixing these up.

m <-  leaflet(monitorlocations_df) %>%
  addProviderTiles(providers$CartoDB.Positron, group = "Basemap") %>%
  addCircles(group = "Monitors")%>%
  addLayersControl(
    overlayGroups = c("Monitors", 'Basemap'),
    options = layersControlOptions(collapsed = FALSE)
      )

library(widgetframe)
widgetframe::frameWidget(m, height = 350, width = '95%')

Exercise

  • Show only the locations that have consistently high AQI values in the summer months. You have to make some decisions about what consistent and high mean.
  • What exploratory graphics will you use to demonstrate that there is a spatial autocorrelation in the measurements?

HTML widgets such as plotly and leaflet are a little finicky. They may not work well in many instances in an HTML file, and they certainly won’t work when you knit to a docx or pdf. In many instances, it may be beneficial to stick to static images produced by ggplot or other frameworks.

In particular, for the moment, do not continue a code chunk after you call an htmlwidget; instead, create a new one.
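
If you do fall back to static images, ggsave from ggplot2 writes a plot to disk; a minimal sketch (the file name and dimensions are placeholders):

ggsave("top25_counties.png", plot = g2, width = 8, height = 6, dpi = 300) # writes a static PNG of the bar plot above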

Conclusions

We covered a lot of ground in this tutorial. I do not expect all of this to make sense right away, nor do I expect you to be proficient by the end of it. I constantly refer back to old code to reuse it, make mistakes, correct them and solve small problems.

The main point is not to use analysis for its own sake, but to illustrate a story. Exploratory data analysis will get you a long way in telling that story. You should spend a lot of time cleaning the data, understanding it and figuring out the best way to communicate your insights.

Additional references

Always use the help commands. One of the strengths of R is its effective documentation, which includes examples and usage. Help on a particular command can be accessed with ?, e.g. ?group_by.

The following books are useful for reference.

  • Grolemund, Garrett and Hadley Wickham (2017). R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. Sebastopol, CA: O’Reilly Media. http://r4ds.had.co.nz/.

  • Venables, W.N. and Ripley, B.D. (2003). Modern Applied Statistics with S. New York, NY: Springer

In particular, Grolemund and Wickham is excellent for data munging operations, which usually take up a large portion of analytical work.
