3.6 Summarising data frames

Now that we’re able to manipulate and extract data from our data frames, our next task is to start exploring and getting to know our data. In this section we’ll produce tables of useful summary statistics for the variables in our data frame, and in the next two chapters we’ll cover visualising our data with base R graphics and with the ggplot2 package.

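A really useful starting point is to produce some simple summary statistics of all of the variables in our nola_str data frame using the summary() function. We first convert the Permit.Type and Current.Status variables to factors so that summary() reports a count for each of their levels.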

nola_str$Permit.Type <- factor(nola_str$Permit.Type)
nola_str$Current.Status <- factor(nola_str$Current.Status)
summary(nola_str)
##  Permit.Number        Address                                       Permit.Type
##  Length:23          Length:23                                             : 1  
##  Class :character   Class :character   Commercial STR                     : 1  
##  Mode  :character   Mode  :character   Short Term Rental Commercial Owner :10  
##                                        Short Term Rental Residential Owner:11  
##                                                                                
##                                                                                
##                                                                                
##  Residential.Subtype Current.Status Expiration.Date    Bedroom.Limit  
##  Length:23                  : 1     Length:23          Min.   :1.000  
##  Class :character    Issued : 3     Class :character   1st Qu.:1.000  
##  Mode  :character    Pending:19     Mode  :character   Median :2.000  
##                                                        Mean   :2.227  
##                                                        3rd Qu.:3.000  
##                                                        Max.   :5.000  
##                                                        NA's   :1      
##  Guest.Occupancy.Limit Operator.Name      License.Holder.Name
##  Min.   : 2.000        Length:23          Length:23          
##  1st Qu.: 2.000        Class :character   Class :character   
##  Median : 4.000        Mode  :character   Mode  :character   
##  Mean   : 4.455                                              
##  3rd Qu.: 6.000                                              
##  Max.   :10.000                                              
##  NA's   :1                                                   
##  Application.Date    Issue_Date       
##  Length:23          Length:23         
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
##                                       
## 

For numeric variables (Bedroom.Limit and Guest.Occupancy.Limit) the minimum, first (lower) quartile, median, mean, third (upper) quartile and maximum are presented. For factor variables (i.e. Permit.Type and Current.Status) the number of observations in each of the factor levels is given. Notice that both factors include a level with a blank label that contains a single observation. If a variable contains missing data then the number of NA values is also reported. For character variables, only the length, class and mode of the vector are reported.
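
To see how summary() treats each type of variable in isolation, here is a minimal sketch using a small made-up data frame. The toy data frame and its column names are hypothetical and are not part of the nola_str data.

```r
# 'toy' is a hypothetical data frame used only for illustration
toy <- data.frame(
  id     = c("a", "b", "c", "d"),                                # character
  status = factor(c("Issued", "Pending", "Pending", "Issued")),  # factor
  beds   = c(1, 2, NA, 4),                                       # numeric with one NA
  stringsAsFactors = FALSE
)

num_summary <- summary(toy$beds)    # quartiles, mean, min, max and a count of NAs
fac_summary <- summary(toy$status)  # number of observations in each level
chr_summary <- summary(toy$id)      # length, class and mode only

num_summary
fac_summary
chr_summary
```

Each call returns an object with named elements, so an individual value can be extracted if needed, for example num_summary[["Median"]].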

If we want to summarise a smaller subset of variables in our data frame we can use our indexing skills in combination with the summary() function. Notice that we include all rows by leaving the row index empty.

summary(nola_str[, 4:7])
##  Residential.Subtype Current.Status Expiration.Date    Bedroom.Limit  
##  Length:23                  : 1     Length:23          Min.   :1.000  
##  Class :character    Issued : 3     Class :character   1st Qu.:1.000  
##  Mode  :character    Pending:19     Mode  :character   Median :2.000  
##                                                        Mean   :2.227  
##                                                        3rd Qu.:3.000  
##                                                        Max.   :5.000  
##                                                        NA's   :1
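
Positional indexes such as 4:7 will select different columns if the column order ever changes, so it can be safer to index by name or with a logical test. The sketch below shows both approaches on a made-up data frame (toy and its columns are hypothetical, not the nola_str data).

```r
# 'toy' is a hypothetical data frame used only for illustration
toy <- data.frame(
  id     = c("a", "b", "c", "d"),
  status = factor(c("Issued", "Pending", "Pending", "Issued")),
  beds   = c(1, 2, NA, 4),
  guests = c(2, 4, 6, 8),
  stringsAsFactors = FALSE
)

# select columns by name; the empty row index keeps all rows
summary(toy[, c("status", "beds")])

# select only the numeric columns using a logical index
is_num <- sapply(toy, is.numeric)
summary(toy[, is_num])
```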

We can also summarise a single variable.

summary(nola_str$Permit.Type)
##                                                          Commercial STR 
##                                   1                                   1 
##  Short Term Rental Commercial Owner Short Term Rental Residential Owner 
##                                  10                                  11

As you’ve seen above, the summary() function reports the number of observations in each level of our factor variables. Another useful function for generating tables of counts is the table() function, which can be used to build contingency tables of different combinations of factor levels. For example, to count the number of observations for each level of Permit.Type:

table(nola_str$Permit.Type)
## 
##                                                          Commercial STR 
##                                   1                                   1 
##  Short Term Rental Commercial Owner Short Term Rental Residential Owner 
##                                  10                                  11
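
One thing to keep in mind is that table() leaves out missing values by default, so the counts may not add up to the number of rows. The sketch below uses a made-up vector (status here is hypothetical, not a column of nola_str) to show how the useNA argument changes this.

```r
# hypothetical vector with one missing value
status <- c("Issued", "Pending", NA, "Pending")

counts_default <- table(status)                   # the NA is left out
counts_with_na <- table(status, useNA = "ifany")  # the NA gets its own cell

counts_default
counts_with_na
```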

We can extend this further by producing a table of counts for each combination of Permit.Type and Current.Status factor levels.

table(nola_str$Permit.Type, nola_str$Current.Status)
##                                      
##                                         Issued Pending
##                                       1      0       0
##   Commercial STR                      0      0       1
##   Short Term Rental Commercial Owner  0      1       9
##   Short Term Rental Residential Owner 0      2       9
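
A two-way table like this is often easier to read with totals or proportions added. As a rough sketch, the base R functions addmargins() and prop.table() can be applied to the result of table(). The type and status vectors below are made up for illustration and are not taken from nola_str.

```r
# hypothetical factors used only for illustration
type   <- factor(c("Commercial", "Residential", "Residential",
                   "Commercial", "Residential"))
status <- factor(c("Issued", "Pending", "Pending", "Pending", "Issued"))

tab <- table(type, status)

with_totals <- addmargins(tab)             # adds a 'Sum' row and column
row_props   <- prop.table(tab, margin = 1) # proportions within each row

with_totals
row_props
```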

A more flexible version of the table() function is the xtabs() function. The xtabs() function uses a formula notation (~) to build contingency tables, with the cross-classifying variables separated by a + symbol on the right-hand side of the formula. xtabs() also has a useful data = argument so you don’t have to include the data frame name when specifying each variable.

xtabs(~ Permit.Type + Current.Status, data = nola_str)
##                                      Current.Status
## Permit.Type                             Issued Pending
##                                       1      0       0
##   Commercial STR                      0      0       1
##   Short Term Rental Commercial Owner  0      1       9
##   Short Term Rental Residential Owner 0      2       9
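
One example of the extra flexibility is that xtabs() also accepts a variable on the left-hand side of the formula, in which case it sums that variable within each cell rather than counting rows. This is handy when the data are already aggregated. The sketch below assumes a made-up data frame of counts (agg and its columns type, status and n are hypothetical).

```r
# hypothetical pre-aggregated counts: one row per combination of levels
agg <- data.frame(
  type   = c("Commercial", "Commercial", "Residential", "Residential"),
  status = c("Issued", "Pending", "Issued", "Pending"),
  n      = c(1, 9, 2, 9)
)

# with a variable on the left-hand side, xtabs() sums it within each cell
tab <- xtabs(n ~ type + status, data = agg)
tab
```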