Chapter 2 Introducing the tidyverse

Data frames will be the basis for many analyses, and a set of packages called the tidyverse makes working with them exceptionally easy. For most of our data manipulation, the package dplyr (d[ata] plier) will be the go-to; for reading in our data, we will use readr. Conveniently, all of these packages can be installed together as the tidyverse, and loading the tidyverse loads most of the packages associated with it.

To install the packages, run install.packages("tidyverse"). Then load them by running library(tidyverse).

Before we can get to working with data, we will have to bring it into our environment. The data will be coming from a project from The Guardian called “The Counted”. The Guardian describes it as follows.

The Counted is a project by the Guardian – and you – working to count the number of people killed by police and other law enforcement agencies in the United States throughout 2015 and 2016, to monitor their demographics and to tell the stories of how they died.

This dataset lives in the counted.csv file. CSV stands for comma-separated values: within each row of the file, the values are separated by commas, and each comma marks where one column ends and the next begins.

readr has a function called read_csv() which will take a csv and turn it into a dataframe for us to work with.
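
As a small illustration of the format, the sketch below writes a tiny CSV to a temporary file and reads it back with read_csv(). The file contents and the names csv_lines, tmp, and toy are made up for this example and are not part of The Counted.

```r
library(readr)

# Hypothetical three-line CSV: a header row followed by two records
csv_lines <- c(
  "name,age,state",
  "Ada,36,GA",
  "Grace,45,OR"
)

# Write the lines to a temporary file so read_csv() has a path to read
tmp <- tempfile(fileext = ".csv")
writeLines(csv_lines, tmp)

# Each comma-separated field becomes a column; the header row supplies the names
toy <- read_csv(tmp)
toy
```

read_csv() guesses a type for each column from the values it finds, which is why age comes back as a number while name and state stay as text.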

# load tidyverse packages
library(tidyverse)


# read in the dataset
counted <- read_csv("data/the-counted/counted.csv")
## Parsed with column specification:
## cols(
##   name = col_character(),
##   age = col_double(),
##   gender = col_character(),
##   raceethnicity = col_character(),
##   armed = col_character(),
##   date = col_date(format = ""),
##   streetaddress = col_character(),
##   city = col_character(),
##   state = col_character(),
##   latitude = col_double(),
##   longitude = col_double(),
##   classification = col_character(),
##   lawenforcementagency = col_character()
## )
counted
## # A tibble: 2,226 x 13
##    name    age gender raceethnicity armed date       streetaddress city 
##    <chr> <dbl> <chr>  <chr>         <chr> <date>     <chr>         <chr>
##  1 Matt…    22 Male   Black         No    2015-01-01 1050 Carl Gr… Sava…
##  2 Lewi…    47 Male   White         Fire… 2015-01-02 4505 SW Mast… Aloha
##  3 Mich…    19 Male   White         No    2015-01-03 2600 Kaumual… Kaum…
##  4 John…    23 Male   Hispanic/Lat… No    2015-01-03 500 North Ol… Wich…
##  5 Tim …    53 Male   Asian/Pacifi… Fire… 2015-01-02 600 E Island… Shel…
##  6 Matt…    32 Male   White         Non-… 2015-01-04 630 Valencia… San …
##  7 Kenn…    22 Male   Hispanic/Lat… Fire… 2015-01-05 E Knox Rd an… Chan…
##  8 Mich…    39 Male   Hispanic/Lat… Other 2015-01-05 818 31st St   Evans
##  9 Patr…    25 Male   White         Knife 2015-01-06 800 Howard St Stoc…
## 10 Bria…    26 Male   Black         No    2015-01-06 1618 E 123rd… Los …
## # ... with 2,216 more rows, and 5 more variables: state <chr>,
## #   latitude <dbl>, longitude <dbl>, classification <chr>,
## #   lawenforcementagency <chr>

2.1 Selecting data

In the last section we went over how to select individual columns using the $ symbol and using bracket notation. Those methods can become quite cumbersome to work with. dplyr provides an alternative method for selecting individual columns. For this we can use the select() function. select() works quite intuitively. The first argument to the function is the dataframe which you select from. Every subsequent argument is the name or position of a column.

Note: The first argument for [almost] every function in the tidyverse is the data. This will be very helpful to remember when we start using the pipe %>%.

Say we wanted to select the name of every person killed in 2015 and 2016. This is quite simple.

select(counted, name)
## # A tibble: 2,226 x 1
##    name             
##    <chr>            
##  1 Matthew Ajibade  
##  2 Lewis Lembke     
##  3 Michael Kocher Jr
##  4 John Quintero    
##  5 Tim Elliott      
##  6 Matthew Hoffman  
##  7 Kenneth Buck     
##  8 Michael Rodriguez
##  9 Patrick Wetter   
## 10 Brian Pickett    
## # ... with 2,216 more rows

Using the same notation we can select multiple columns, for example name and age. The order in which you list the columns determines the order in which they appear in the output. Because name is listed first, followed by age, the first column is name.

select(counted, name, age)
## # A tibble: 2,226 x 2
##    name                age
##    <chr>             <dbl>
##  1 Matthew Ajibade      22
##  2 Lewis Lembke         47
##  3 Michael Kocher Jr    19
##  4 John Quintero        23
##  5 Tim Elliott          53
##  6 Matthew Hoffman      32
##  7 Kenneth Buck         22
##  8 Michael Rodriguez    39
##  9 Patrick Wetter       25
## 10 Brian Pickett        26
## # ... with 2,216 more rows

Simple enough? Try it out. Try selecting age, state, armed, and lawenforcementagency.

Solution:

select(counted, age, state, armed, lawenforcementagency)
## # A tibble: 2,226 x 4
##      age state armed              lawenforcementagency                   
##    <dbl> <chr> <chr>              <chr>                                  
##  1    22 GA    No                 Chatham County Sheriff's Office        
##  2    47 OR    Firearm            Washington County Sheriff's Office     
##  3    19 HI    No                 Kauai Police Department                
##  4    23 KS    No                 Wichita Police Department              
##  5    53 WA    Firearm            Mason County Sheriff's Office          
##  6    32 CA    Non-lethal firearm San Francisco Police Department        
##  7    22 AZ    Firearm            Chandler Police Department             
##  8    39 CO    Other              Evans Police Department                
##  9    25 CA    Knife              Stockton Police Department             
## 10    26 CA    No                 Los Angeles County Sheriff's Department
## # ... with 2,216 more rows
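
As noted at the start of this section, select() accepts column positions as well as names. The sketch below shows both forms on a hypothetical three-column tibble called toy (not the counted data), assuming only that dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data; the column names mirror the first three columns of `counted`
toy <- tibble(
  name   = c("A", "B"),
  age    = c(22, 47),
  gender = c("Male", "Female")
)

# Select by name and by position: columns 1 and 2 are name and age
by_name     <- select(toy, name, age)
by_position <- select(toy, 1, 2)

# Listing the columns in a different order reorders the output
reordered <- select(toy, age, name)
names(reordered)
```

Selecting by name is usually the safer choice, since positions change whenever columns are added or removed upstream.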

2.2 Filtering your data

Remember those logical statements in the last section? Those will be very useful now. We can constrain our dataset under a set of criteria to return a subset of the original data frame.

We can use our existing knowledge of vectors and data frames to create a subset of the data. Using our knowledge of logical vectors and bracket subsets, we can somewhat easily find all of the 20 year olds in our dataset. A solution to this would require us to create a logical vector to indicate the rows where the age is 20. Then we would have to supply that vector as an index to our dataframe to get our desired result. It would look like this:

index <- counted$age == 20

counted[index,]
## # A tibble: 50 x 13
##    name    age gender raceethnicity armed date       streetaddress city 
##    <chr> <dbl> <chr>  <chr>         <chr> <date>     <chr>         <chr>
##  1 Jani…    20 Female Black         Knife 2015-02-19 Bellefonte Dr Char…
##  2 <NA>     NA <NA>   <NA>          <NA>  NA         <NA>          <NA> 
##  3 Shaq…    20 Male   Black         Fire… 2015-03-02 1st Ave and … Joli…
##  4 Euge…    20 Male   White         Fire… 2015-03-17 13710 US Hwy… Onal…
##  5 Jami…    20 Male   White         Other 2015-03-19 Kneuman Rd    Sumas
##  6 Todd…    20 Male   Black         Fire… 2015-04-24 1505 E Main … Trin…
##  7 Terr…    20 Male   Black         Other 2015-04-27 9500 Evergre… Detr…
##  8 Fera…    20 Male   Arab-American No    2015-05-27 4600 E 15th … Long…
##  9 Tyro…    20 Male   Black         Fire… 2015-06-22 700 Saw Mill… Pitt…
## 10 Tyle…    20 Male   White         Fire… 2015-07-06 3300 SW 47th… Okla…
## # ... with 40 more rows, and 5 more variables: state <chr>,
## #   latitude <dbl>, longitude <dbl>, classification <chr>,
## #   lawenforcementagency <chr>

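filter() allows us to do this in a simpler manner. The first argument (surprise) is the data frame to be subsetted. Every following argument is a logical statement that is evaluated against the dataset, and only the rows for which the statement returns TRUE are kept. This differs slightly from the base R code above: age is missing for some rows, so counted$age == 20 evaluates to NA there, and bracket indexing returns a row of NAs for each of them (row 2 of the output above). filter() drops those rows, which is why the result below has 34 rows rather than 50.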

filter(counted, age == 20)
## # A tibble: 34 x 13
##    name    age gender raceethnicity armed date       streetaddress city 
##    <chr> <dbl> <chr>  <chr>         <chr> <date>     <chr>         <chr>
##  1 Jani…    20 Female Black         Knife 2015-02-19 Bellefonte Dr Char…
##  2 Shaq…    20 Male   Black         Fire… 2015-03-02 1st Ave and … Joli…
##  3 Euge…    20 Male   White         Fire… 2015-03-17 13710 US Hwy… Onal…
##  4 Jami…    20 Male   White         Other 2015-03-19 Kneuman Rd    Sumas
##  5 Todd…    20 Male   Black         Fire… 2015-04-24 1505 E Main … Trin…
##  6 Terr…    20 Male   Black         Other 2015-04-27 9500 Evergre… Detr…
##  7 Fera…    20 Male   Arab-American No    2015-05-27 4600 E 15th … Long…
##  8 Tyro…    20 Male   Black         Fire… 2015-06-22 700 Saw Mill… Pitt…
##  9 Tyle…    20 Male   White         Fire… 2015-07-06 3300 SW 47th… Okla…
## 10 Fred…    20 Male   Black         Fire… 2015-07-12 5130 E Ponce… Ston…
## # ... with 24 more rows, and 5 more variables: state <chr>,
## #   latitude <dbl>, longitude <dbl>, classification <chr>,
## #   lawenforcementagency <chr>
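
The two results differ in size (50 rows from bracket indexing, 34 from filter()) because some ages are missing. A minimal sketch of that behavior, using a hypothetical three-row tibble called toy in which one age is NA:

```r
library(dplyr)

# Hypothetical toy data: the second age is missing
toy <- tibble(
  name = c("A", "B", "C"),
  age  = c(20, NA, 31)
)

# Base R: the comparison yields TRUE, NA, FALSE,
# and indexing with an NA produces a row made up entirely of NA
index <- toy$age == 20
base_result <- toy[index, ]

# dplyr: filter() keeps only the rows where the condition is TRUE
dplyr_result <- filter(toy, age == 20)

nrow(base_result)   # 2: one real row plus one all-NA row
nrow(dplyr_result)  # 1
```

In base R, wrapping the condition in which(), as in toy[which(index), ], is one way to drop the NA rows as well.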

Now, we can add another condition on to this function call to get all of the female twenty year olds.

filter(counted, age == 20, gender == "Female")
## # A tibble: 1 x 13
##   name    age gender raceethnicity armed date       streetaddress city 
##   <chr> <dbl> <chr>  <chr>         <chr> <date>     <chr>         <chr>
## 1 Jani…    20 Female Black         Knife 2015-02-19 Bellefonte Dr Char…
## # ... with 5 more variables: state <chr>, latitude <dbl>, longitude <dbl>,
## #   classification <chr>, lawenforcementagency <chr>
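
Conditions separated by commas in filter() are combined with a logical AND, so a row must satisfy all of them. The sketch below, on a hypothetical tibble called toy, compares the comma form with an explicit & and contrasts both with | (OR).

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  age    = c(20, 20, 35),
  gender = c("Female", "Male", "Female")
)

# These two calls are equivalent: both conditions must be TRUE
with_comma <- filter(toy, age == 20, gender == "Female")
with_and   <- filter(toy, age == 20 & gender == "Female")

# With | a row is kept if either condition is TRUE
either <- filter(toy, age == 20 | gender == "Female")

nrow(with_comma)  # 1
nrow(either)      # 3
```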

2.3 Chaining functions

The true power of the tidyverse comes from its ability to chain functions one after another. This is enabled by the forward pipe operator %>%. The pipe takes the output of one function and provides it as the first argument to the following function. You’ve seen that the first argument of every function here has been the data; this is done purposefully to enable the use of the pipe.

As always, the most helpful way to wrap your head around this is to see it in action. Let’s take one of the lines of code we used above and adapt it to use a pipe. We will select the name column of our data again. Previously we wrote select(data_frame, col_name).

select(counted, name)
## # A tibble: 2,226 x 1
##    name             
##    <chr>            
##  1 Matthew Ajibade  
##  2 Lewis Lembke     
##  3 Michael Kocher Jr
##  4 John Quintero    
##  5 Tim Elliott      
##  6 Matthew Hoffman  
##  7 Kenneth Buck     
##  8 Michael Rodriguez
##  9 Patrick Wetter   
## 10 Brian Pickett    
## # ... with 2,216 more rows
counted %>% 
  select(name)
## # A tibble: 2,226 x 1
##    name             
##    <chr>            
##  1 Matthew Ajibade  
##  2 Lewis Lembke     
##  3 Michael Kocher Jr
##  4 John Quintero    
##  5 Tim Elliott      
##  6 Matthew Hoffman  
##  7 Kenneth Buck     
##  8 Michael Rodriguez
##  9 Patrick Wetter   
## 10 Brian Pickett    
## # ... with 2,216 more rows

This gets the basic point across but doesn’t adequately illustrate the power of the pipe. So let’s combine filter() and select() to get the names of all 20 year olds in our dataset. To do this we will first filter our dataset, then pipe the result to the select() function.

counted %>% 
  filter(age == 20) %>% 
  select(name)
## # A tibble: 34 x 1
##    name             
##    <chr>            
##  1 Janisha Fonville 
##  2 Shaquille Barrow 
##  3 Eugene Smith II  
##  4 Jamison Childress
##  5 Todd Dye         
##  6 Terrance Kellom  
##  7 Feras Morad      
##  8 Tyrone Harris    
##  9 Tyler Rogers     
## 10 Frederick Farmer 
## # ... with 24 more rows
glimpse(counted)
## Observations: 2,226
## Variables: 13
## $ name                 <chr> "Matthew Ajibade", "Lewis Lembke", "Micha...
## $ age                  <dbl> 22, 47, 19, 23, 53, 32, 22, 39, 25, 26, 3...
## $ gender               <chr> "Male", "Male", "Male", "Male", "Male", "...
## $ raceethnicity        <chr> "Black", "White", "White", "Hispanic/Lati...
## $ armed                <chr> "No", "Firearm", "No", "No", "Firearm", "...
## $ date                 <date> 2015-01-01, 2015-01-02, 2015-01-03, 2015...
## $ streetaddress        <chr> "1050 Carl Griffin Dr", "4505 SW Masters ...
## $ city                 <chr> "Savannah", "Aloha", "Kaumakani", "Wichit...
## $ state                <chr> "GA", "OR", "HI", "KS", "WA", "CA", "AZ",...
## $ latitude             <dbl> 32.06669, 45.48747, 21.93335, 37.69380, 4...
## $ longitude            <dbl> -81.16788, -122.89170, -159.64197, -97.28...
## $ classification       <chr> "Death in custody", "Gunshot", "Struck by...
## $ lawenforcementagency <chr> "Chatham County Sheriff's Office", "Washi...
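
To make the pipe's rewriting rule concrete, the sketch below writes the same filter-then-select step twice, once as nested function calls and once with %>%. The tibble toy is hypothetical and stands in for counted.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  name = c("A", "B", "C"),
  age  = c(20, 31, 20)
)

# Nested form: read from the inside out
nested <- select(filter(toy, age == 20), name)

# Piped form: read top to bottom; each result becomes
# the first argument of the next function
piped <- toy %>%
  filter(age == 20) %>%
  select(name)

identical(nested$name, piped$name)
```

Both forms compute the same thing; the piped version simply lists the steps in the order they happen, which becomes more valuable as chains grow longer.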

2.4 Creating / manipulating data

Within most datasets there are what are called latent variables. These are variables that can be created by manipulating one or more existing columns. The way we create new columns in a tidy workflow is by using the mutate() function. This function allows us to assign column values directly or by manipulating existing columns.

The argument structure for mutate() is quite simple. We first name the new column we are creating, then set it equal to an expression that produces its values. For example:

counted %>% 
  mutate(x = 1) %>% 
  select(x)
## # A tibble: 2,226 x 1
##        x
##    <dbl>
##  1     1
##  2     1
##  3     1
##  4     1
##  5     1
##  6     1
##  7     1
##  8     1
##  9     1
## 10     1
## # ... with 2,216 more rows

In this line of code we created a new column called x which is set to the value of 1 and then we select just that column.

Next we will use mutate() and functions from the lubridate package to get the month, day, and year of the date of the individual’s death. Because there is already a column called date which is of the date type (check with class(counted$date)), we can use the functions month(), day(), and year() from lubridate, each of which returns the number corresponding to the part of the date we are trying to extract.

library(lubridate)

counted %>%
  mutate(month = month(date),
         day = day(date),
         year = year(date)) %>% 
  # select date and our new columns 
  select(date, month, day, year)
## # A tibble: 2,226 x 4
##    date       month   day  year
##    <date>     <dbl> <int> <dbl>
##  1 2015-01-01     1     1  2015
##  2 2015-01-02     1     2  2015
##  3 2015-01-03     1     3  2015
##  4 2015-01-03     1     3  2015
##  5 2015-01-02     1     2  2015
##  6 2015-01-04     1     4  2015
##  7 2015-01-05     1     5  2015
##  8 2015-01-05     1     5  2015
##  9 2015-01-06     1     6  2015
## 10 2015-01-06     1     6  2015
## # ... with 2,216 more rows

The new year column can be used to estimate the year in which an individual was born. For this, we can subtract their age from the year of their death. The result is approximate, since someone who died before their birthday that year was born one year earlier. Note that we can use variables created earlier in the same mutate() call, as we do below. Since year is created in the mutate() call, our new birth_year variable must come after it.

counted_age <- counted %>%
  mutate(month = month(date),
         day = day(date),
         year = year(date),
         birth_year = year - age) %>% 
  select(age, birth_year)

counted_age
## # A tibble: 2,226 x 2
##      age birth_year
##    <dbl>      <dbl>
##  1    22       1993
##  2    47       1968
##  3    19       1996
##  4    23       1992
##  5    53       1962
##  6    32       1983
##  7    22       1993
##  8    39       1976
##  9    25       1990
## 10    26       1989
## # ... with 2,216 more rows
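
The ordering rule inside mutate() can be checked on a small example. The sketch below assumes a hypothetical two-row tibble called toy with an age column and a date column, and it uses year() from lubridate as in the code above.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy data with a Date column
toy <- tibble(
  age  = c(22, 47),
  date = as.Date(c("2015-01-01", "2016-06-15"))
)

# `year` is created first, so `birth_year` can refer to it
# later in the same mutate() call
toy_birth <- toy %>%
  mutate(year       = year(date),
         birth_year = year - age)

toy_birth$birth_year  # 1993 1969
```

Swapping the two arguments so that birth_year comes before year would fail here, because no column called year exists yet at that point.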

A useful function for viewing our data frames is arrange(), which allows us to sort the dataset by a given column. By default it sorts in ascending order.

counted_age %>% 
  arrange(birth_year)
## # A tibble: 2,226 x 2
##      age birth_year
##    <dbl>      <dbl>
##  1    87       1928
##  2    87       1929
##  3    85       1930
##  4    86       1930
##  5    86       1930
##  6    83       1932
##  7    84       1932
##  8    83       1933
##  9    82       1934
## 10    80       1936
## # ... with 2,216 more rows

To sort a data frame in descending order, wrap the column name in the desc() function (if the column is numeric you could also put a - in front of the column name).

counted_age %>% 
# arrange(desc(birth_year))
  arrange(-birth_year)
## # A tibble: 2,226 x 2
##      age birth_year
##    <dbl>      <dbl>
##  1     6       2009
##  2    10       2006
##  3    12       2004
##  4    13       2003
##  5    14       2002
##  6    15       2001
##  7    15       2001
##  8    15       2001
##  9    15       2000
## 10    15       2000
## # ... with 2,216 more rows
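
The equivalence of desc() and a leading minus sign for numeric columns can be verified on a small example. The sketch below uses a hypothetical four-row tibble called toy and also shows arrange() with two columns, where the second column breaks ties in the first.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  age        = c(15, 87, 15, 6),
  birth_year = c(2001, 1928, 2000, 2009)
)

# Two ways to sort a numeric column in descending order
with_desc  <- arrange(toy, desc(birth_year))
with_minus <- arrange(toy, -birth_year)

identical(with_desc$birth_year, with_minus$birth_year)

# Sort by age (ascending), then by birth_year (descending) within equal ages
by_two <- arrange(toy, age, desc(birth_year))
by_two
```

desc() is the more general of the two, since it also works on character columns, where a minus sign would raise an error.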