Exercise 4 : Manipulating data with tidyverse

 

In this exercise you’ll learn how to import your data into R as a tibble object, how to usefully summarise and manipulate dataframes and finally how to export data from R. All of this material is covered in Chapter 4 of the Introduction to R book.  

  1. As in previous exercises, either create a new R script or continue with your previous R script in your RStudio Project. Again, make sure you include any metadata you feel is appropriate (title, description of task, date of creation etc) and don’t forget to comment out your metadata with a # at the beginning of the line.

 

  1. Download the data file ‘NC_AQI_data.csv’ from the Data link and save it to the data directory inside your RStudio project directory.

  2. Time for a quick description of the dataset to get your bearings. PurpleAir is a community system of PM (10, 2.5, 1.0) low-cost sensors measuring particulate matter. The PurpleAir system is composed of many sensors that are installed, controlled and maintained by members of the community.PurpleAir sensors measure airborne particulate matter (PM). Particulate matter describes solid particles suspended in air; this includes dust, smoke, and other organic and inorganic particles. PurpleAir sensors use laser particle counters to count the number of particles by particle sizes 0.3, 0.5, 1, 2.5, 5, and 10 \(\mu\)m, and use the count data to calculate mass concentrations of PM1.0, PM2.5, and PM10.The PM concentration can be calculated into Air quality index (AQI), which is used for reporting daily air quality ranging from 0-500. If AQI value is higher than 100, we may suppose that the local area is polluted. Other than PM concentration, PurpleAir sensors also measure the other meteorological factors like temperature and humidity.

In this sample dataset, we use the records of 144 PurpleAir sensors in North Carolina generated during 2022/02/20 to 2022/03/20. Within this dataset, the sensor_ID variable represents the ID of each PurpleAir sensor, the “date_2020”and “dateu” represent the date we the record collected. The “PM2.5_daily” variable represents the daily highest value of AQI index calculated by the PM2.5 concentration. The temperature variable measures the air temperature of the time when the sensor collected the highest AQI value.
 

  1. Please review the section Section 3.4.2 Import functions of the Introduction to R book or watch this video if you need any further information. Then import the “NC_AQI_data.csv”data.You may install tidyverse package at first and then apply the “read_csv” function to load the dataset. Please review the Section 4.1 Data Frames & Tibbles for the tidyverse package as well as the Section 4.2 Reading external data  

  2. Once you have imported your data file, to examine the contents of the tibble you may replicate the str() function to display the structure of the dataset and a neat summary of your variables. How many observations does this dataset have? How many variables are in this dataset? What type of variables are ID and dateu and date_2020? You can also use glimpse() to get a sense of the actual values in the dataset.  

  3. You may review the summary() function. This will provide you with some useful summary statistics for each variable. Notice how the type of output depends on whether the variable is a factor or a number. Another useful feature of the summary() function is that it will also count the number of missing values in each variable. Besides, for the numeric variables, you can also apply the summary() function to identify the distribution pattern of each variable, like finding potential outliers or exterme values. Which variables have missing values and how many? Besides, which variable potentially have outliers we need to remove in the data preprocessing session?
     

  4. We can revisit Section 3.2.1 and refresh our understanding of date and time in R. Notice that date_2020 and dateu are stored as character and numeric and we want to convert them into dates. Also notice that the formats are different. One is mm/dd/yy and other is yyyy-mm-dd. Using functions such as mdy() and ymd() from lubridate package confirm that date_2020 and dateu represent the same data.

  5. We will apply review the section Section 4.3 Wrangling Data and select the variables from this dataset. For instance, the variable “date_2020” and “dateu” are duplicate, we may only keep one. So please use the selectfunction to select ID, dateu, PM2.5_daily, and temperature variables and store it into a different object.  

  6. You may find the variable names in this dataset are not very clear and descriptive, given that, we could apply the rename() function in the tidyverse package to rename them. We may rename the “PM2.5_daily” as “AQI”, rename the “dateu” as “date_collected”; and rename “ID” as “sensor_ID”.  

  7. You may find there are some NA records in the “temperature” variable, we need to remove those records in our following analysis. There are many ways to remove the NA rows for a specific column. Generally, there are three options. The first option is to df[!is.na(df$col_name),], the second is way is to use subset method, subset(df, !is.na(col_name)). The third method is to use dplyr method df %>% drop_na(col_name). Now please try the three methods to remove the NA records on temperature. Please report how many row remained after removing the NA record.  

  8. Other than the NA records in temperature, you may find there are also some extreme values in the temperature variable like “-227F” now we suppose the air temperature values higher than 120F or below 0F would be extreme values, please refer to the Section 4.3 Wrangling Data, and use the filter function to remove those extreme temperature values. Now, how many variables remained? Now, for the AQI variable, there are also some outliers existing as 0, please replicate the filter function and remove those outliers? How many records remain?  

  9. The “lubridate” package is very useful in manipulating date variables. It can pick out the year,month,day,hour…values from the date column You may refer to this instruction for more information. Now, please use the month() function to pick up the month information from the “date_collected” variable. Then remove all records in February and only keep the records in March. How many records remain this time?  

  10. mutate() function is used to create new variables based on the operations on the existing variables.Please refer to the code in this Section 4.3 Wrangling Data and use the mutate function to create another variable representing the pollution levels based on the AQI value. If the AQI is higher than 100, we define it as “Polluted”, otherwise, we may define it “Normal”. How many records are identified as “Polluted”?  

  11. Section 4.4 Relational Data introduces how to join multiple tables together. Commonly, the left_join()can be used to join two datasets and keep all rows in the first dataset.

To do this, load the csv table called “NC_county”. In this table, firstly the “sensor_ID” variable represents for the PurpleAir sensor’s ID, the “County” variable represents the county where the sensor locates in.”stcofips” represents each county’s FIPS code (an unique identifier for each County in the US). For the rest variables, NC counties are categorised based on the Office of Management and Budget (“omb”) (urban center, urban outlying and non urban),unemployment rate (“unemp”) (high/medium/low), median income (“medinc”) (high/medium low) and location (“pz”) (NorthCentral, Piedmont etc.).

Please use the dataset we generated for question #11 as the first dataset, to left join the NC_county data table by left_joinfunction and “sensor_ID” variable. This table will be used for Exercise 5. Besides, based on the joint table, please filter to the “Durham” county byfilter(County=="Durham")and report the county’s location, unemployment rate, office of management and budget. Besides, how many records generated in Durham during 2022/03?  

  1. Before the end of the exercise, let’s write the data table generated in Q14 into an csv file and save it for our Exercise 5 by Section 3.7 Exporting data section. Here we name the output file as “AQI_March_Joint.csv”  

End of Exercise 4.