Chapter 2 Introducing the tidyverse
Data frames will be the basis for many analyses. A set of packages called the tidyverse makes working with them exceptionally easy. For most of our data manipulation, a package called dplyr (d[ata] plier) will be the go-to. For reading our data, we will use readr. What is nice is that we can install all of these packages together as the tidyverse, and when we load the tidyverse we load most of the packages associated with it.
To install them, run install.packages("tidyverse"). Load the package by running library(tidyverse).
Before we can get to working with data, we will have to bring it into our environment. The data will be coming from a project from The Guardian called “The Counted”. The Guardian describes it as follows.
The Counted is a project by the Guardian – and you – working to count the number of people killed by police and other law enforcement agencies in the United States throughout 2015 and 2016, to monitor their demographics and to tell the stories of how they died.
This dataset lives in the counted.csv file. csv stands for comma separated values: within each row, a comma separates one value from the next, which tells us where each column begins.
readr has a function called read_csv() which will take a csv file and turn it into a data frame for us to work with.
# load tidyverse packages
library(tidyverse)
# read in the dataset
counted <- read_csv("data/the-counted/counted.csv")
## Parsed with column specification:
## cols(
## name = col_character(),
## age = col_double(),
## gender = col_character(),
## raceethnicity = col_character(),
## armed = col_character(),
## date = col_date(format = ""),
## streetaddress = col_character(),
## city = col_character(),
## state = col_character(),
## latitude = col_double(),
## longitude = col_double(),
## classification = col_character(),
## lawenforcementagency = col_character()
## )
counted
## # A tibble: 2,226 x 13
## name age gender raceethnicity armed date streetaddress city
## <chr> <dbl> <chr> <chr> <chr> <date> <chr> <chr>
## 1 Matt… 22 Male Black No 2015-01-01 1050 Carl Gr… Sava…
## 2 Lewi… 47 Male White Fire… 2015-01-02 4505 SW Mast… Aloha
## 3 Mich… 19 Male White No 2015-01-03 2600 Kaumual… Kaum…
## 4 John… 23 Male Hispanic/Lat… No 2015-01-03 500 North Ol… Wich…
## 5 Tim … 53 Male Asian/Pacifi… Fire… 2015-01-02 600 E Island… Shel…
## 6 Matt… 32 Male White Non-… 2015-01-04 630 Valencia… San …
## 7 Kenn… 22 Male Hispanic/Lat… Fire… 2015-01-05 E Knox Rd an… Chan…
## 8 Mich… 39 Male Hispanic/Lat… Other 2015-01-05 818 31st St Evans
## 9 Patr… 25 Male White Knife 2015-01-06 800 Howard St Stoc…
## 10 Bria… 26 Male Black No 2015-01-06 1618 E 123rd… Los …
## # ... with 2,216 more rows, and 5 more variables: state <chr>,
## # latitude <dbl>, longitude <dbl>, classification <chr>,
## # lawenforcementagency <chr>
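To see what "comma separated" means in practice, here is a minimal sketch that writes a tiny, made-up csv to a temporary file and reads it back with read_csv(). The file contents and the object name toy are assumptions for illustration only; they are not part of The Counted.

```r
library(readr)

# A tiny, made-up csv written to a temporary file (not the real dataset)
tmp <- tempfile(fileext = ".csv")
writeLines(c("name,age,state",
             "Person A,22,GA",
             "Person B,47,OR"), tmp)

# The first line supplies the column names; each comma starts a new column
toy <- read_csv(tmp)
toy
```

As with counted, readr guesses a type for each column: here name and state are read as character and age as a number.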
2.1 Selecting data
In the last section we went over how to select individual columns using the $ symbol and bracket notation. Those methods can become quite cumbersome to work with. dplyr provides an alternative: the select() function. select() works quite intuitively. The first argument to the function is the data frame to select from. Every subsequent argument is the name or position of a column.
Note: The first argument for [almost] every function in the tidyverse is the data. This will be very helpful to remember when we start using the pipe %>%.
Say we wanted to select the name of every person killed in 2015 and 2016. This is quite simple.
select(counted, name)
## # A tibble: 2,226 x 1
## name
## <chr>
## 1 Matthew Ajibade
## 2 Lewis Lembke
## 3 Michael Kocher Jr
## 4 John Quintero
## 5 Tim Elliott
## 6 Matthew Hoffman
## 7 Kenneth Buck
## 8 Michael Rodriguez
## 9 Patrick Wetter
## 10 Brian Pickett
## # ... with 2,216 more rows
Using the same notation we can select multiple columns, for example name and age. The order in which you list the columns determines the order in which they appear in the output. Since name comes first, followed by age, the first column of the result is name.
select(counted, name, age)
## # A tibble: 2,226 x 2
## name age
## <chr> <dbl>
## 1 Matthew Ajibade 22
## 2 Lewis Lembke 47
## 3 Michael Kocher Jr 19
## 4 John Quintero 23
## 5 Tim Elliott 53
## 6 Matthew Hoffman 32
## 7 Kenneth Buck 22
## 8 Michael Rodriguez 39
## 9 Patrick Wetter 25
## 10 Brian Pickett 26
## # ... with 2,216 more rows
Simple enough? Try it out. Try selecting age, state, armed, and lawenforcementagency.
Solution:
select(counted, age, state, armed, lawenforcementagency)
## # A tibble: 2,226 x 4
## age state armed lawenforcementagency
## <dbl> <chr> <chr> <chr>
## 1 22 GA No Chatham County Sheriff's Office
## 2 47 OR Firearm Washington County Sheriff's Office
## 3 19 HI No Kauai Police Department
## 4 23 KS No Wichita Police Department
## 5 53 WA Firearm Mason County Sheriff's Office
## 6 32 CA Non-lethal firearm San Francisco Police Department
## 7 22 AZ Firearm Chandler Police Department
## 8 39 CO Other Evans Police Department
## 9 25 CA Knife Stockton Police Department
## 10 26 CA No Los Angeles County Sheriff's Department
## # ... with 2,216 more rows
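The examples so far select columns by name, but select() also accepts column positions. The sketch below uses a small, made-up tibble (toy is a hypothetical stand-in, not the real data) to show that both styles pick out the same columns.

```r
library(dplyr)

# A small, made-up tibble used only for illustration
toy <- tibble(name  = c("Person A", "Person B"),
              age   = c(22, 47),
              state = c("GA", "OR"))

# Select by name ...
by_name <- select(toy, name, state)

# ... or by position (columns 1 and 3)
by_position <- select(toy, 1, 3)

names(by_name)
names(by_position)
```

Names are usually the safer choice, since positions silently point at different columns if the layout of the data changes.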
2.2 Filtering your data
Remember those logical statements in the last section? Those will be very useful now. We can constrain our dataset under a set of criteria to return a subset of the original data frame.
We can use our existing knowledge of vectors and data frames to create a subset of the data. Using our knowledge of logical vectors and bracket subsets, we can somewhat easily find all of the 20 year olds in our dataset. A solution to this would require us to create a logical vector to indicate the rows where the age is 20. Then we would have to supply that vector as an index to our dataframe to get our desired result. It would look like this:
index <- counted$age == 20
counted[index,]
## # A tibble: 50 x 13
## name age gender raceethnicity armed date streetaddress city
## <chr> <dbl> <chr> <chr> <chr> <date> <chr> <chr>
## 1 Jani… 20 Female Black Knife 2015-02-19 Bellefonte Dr Char…
## 2 <NA> NA <NA> <NA> <NA> NA <NA> <NA>
## 3 Shaq… 20 Male Black Fire… 2015-03-02 1st Ave and … Joli…
## 4 Euge… 20 Male White Fire… 2015-03-17 13710 US Hwy… Onal…
## 5 Jami… 20 Male White Other 2015-03-19 Kneuman Rd Sumas
## 6 Todd… 20 Male Black Fire… 2015-04-24 1505 E Main … Trin…
## 7 Terr… 20 Male Black Other 2015-04-27 9500 Evergre… Detr…
## 8 Fera… 20 Male Arab-American No 2015-05-27 4600 E 15th … Long…
## 9 Tyro… 20 Male Black Fire… 2015-06-22 700 Saw Mill… Pitt…
## 10 Tyle… 20 Male White Fire… 2015-07-06 3300 SW 47th… Okla…
## # ... with 40 more rows, and 5 more variables: state <chr>,
## # latitude <dbl>, longitude <dbl>, classification <chr>,
## # lawenforcementagency <chr>
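filter() allows us to do this in a simpler manner. The first argument (surprise) is the data frame that will be subsetted. Every following argument is a logical statement that will be applied to the dataset. Every row for which the logical statement returns TRUE is returned. Note one difference from the base R code above: where age is missing the comparison is NA, which bracket subsetting returns as rows of NA values (50 rows above), whereas filter() drops those rows (34 rows below).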
filter(counted, age == 20)
## # A tibble: 34 x 13
## name age gender raceethnicity armed date streetaddress city
## <chr> <dbl> <chr> <chr> <chr> <date> <chr> <chr>
## 1 Jani… 20 Female Black Knife 2015-02-19 Bellefonte Dr Char…
## 2 Shaq… 20 Male Black Fire… 2015-03-02 1st Ave and … Joli…
## 3 Euge… 20 Male White Fire… 2015-03-17 13710 US Hwy… Onal…
## 4 Jami… 20 Male White Other 2015-03-19 Kneuman Rd Sumas
## 5 Todd… 20 Male Black Fire… 2015-04-24 1505 E Main … Trin…
## 6 Terr… 20 Male Black Other 2015-04-27 9500 Evergre… Detr…
## 7 Fera… 20 Male Arab-American No 2015-05-27 4600 E 15th … Long…
## 8 Tyro… 20 Male Black Fire… 2015-06-22 700 Saw Mill… Pitt…
## 9 Tyle… 20 Male White Fire… 2015-07-06 3300 SW 47th… Okla…
## 10 Fred… 20 Male Black Fire… 2015-07-12 5130 E Ponce… Ston…
## # ... with 24 more rows, and 5 more variables: state <chr>,
## # latitude <dbl>, longitude <dbl>, classification <chr>,
## # lawenforcementagency <chr>
Now, we can add another condition on to this function call to get all of the female twenty year olds.
filter(counted, age == 20, gender == "Female")
## # A tibble: 1 x 13
## name age gender raceethnicity armed date streetaddress city
## <chr> <dbl> <chr> <chr> <chr> <date> <chr> <chr>
## 1 Jani… 20 Female Black Knife 2015-02-19 Bellefonte Dr Char…
## # ... with 5 more variables: state <chr>, latitude <dbl>, longitude <dbl>,
## # classification <chr>, lawenforcementagency <chr>
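The base R subset and the filter() call above return different numbers of rows, and missing ages are the reason. The following sketch reproduces that difference on a small, made-up tibble (toy and its values are assumptions for illustration).

```r
library(dplyr)

# Made-up data: the second age is missing
toy <- tibble(name = c("Person A", "Person B", "Person C"),
              age  = c(20, NA, 31))

# Base R: NA == 20 is NA, and an NA index comes back as a row of NAs
base_result <- toy[toy$age == 20, ]
nrow(base_result)

# dplyr: filter() keeps only the rows where the condition is TRUE
dplyr_result <- filter(toy, age == 20)
nrow(dplyr_result)

# To look at the rows with a missing age, ask for them explicitly
missing_age <- filter(toy, is.na(age))
```

Here the base R result has one more row than the filter() result, because the missing age produces an extra row of NA values.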
2.3 Chaining functions
The true power of the tidyverse comes from its ability to chain functions one after another. This is all enabled by the forward pipe operator %>%. The pipe operator takes the output of a function and provides that output as the first argument of the following function. You’ve seen how the first argument for every function here has been the data. This is done purposefully to enable the use of the pipe.
As always, the most helpful way to wrap your head around this is to see it in action. Let’s take one of the lines of code we used above and adapt it to use a pipe. We will select the name column of our data again. Previously we wrote select(data_frame, col_name).
select(counted, name)
## # A tibble: 2,226 x 1
## name
## <chr>
## 1 Matthew Ajibade
## 2 Lewis Lembke
## 3 Michael Kocher Jr
## 4 John Quintero
## 5 Tim Elliott
## 6 Matthew Hoffman
## 7 Kenneth Buck
## 8 Michael Rodriguez
## 9 Patrick Wetter
## 10 Brian Pickett
## # ... with 2,216 more rows
counted %>%
  select(name)
## # A tibble: 2,226 x 1
## name
## <chr>
## 1 Matthew Ajibade
## 2 Lewis Lembke
## 3 Michael Kocher Jr
## 4 John Quintero
## 5 Tim Elliott
## 6 Matthew Hoffman
## 7 Kenneth Buck
## 8 Michael Rodriguez
## 9 Patrick Wetter
## 10 Brian Pickett
## # ... with 2,216 more rows
This gets the basic point across but doesn’t adequately illustrate the power. So let’s combine filter() and select() to get the names of all 20 year olds in our dataset. To do this we will first filter our dataset, then pipe the result to the select() function.
counted %>%
  filter(age == 20) %>%
  select(name)
## # A tibble: 34 x 1
## name
## <chr>
## 1 Janisha Fonville
## 2 Shaquille Barrow
## 3 Eugene Smith II
## 4 Jamison Childress
## 5 Todd Dye
## 6 Terrance Kellom
## 7 Feras Morad
## 8 Tyrone Harris
## 9 Tyler Rogers
## 10 Frederick Farmer
## # ... with 24 more rows
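Because the pipe only passes its left-hand side along as the first argument, a piped chain is another way of writing nested function calls. The sketch below checks this on a small, made-up tibble (toy is hypothetical).

```r
library(dplyr)

# Made-up data for illustration
toy <- tibble(name = c("Person A", "Person B", "Person C"),
              age  = c(20, 31, 20))

# Nested calls: read from the inside out
nested <- select(filter(toy, age == 20), name)

# Piped calls: read from top to bottom
piped <- toy %>%
  filter(age == 20) %>%
  select(name)

# Both approaches give the same names
identical(nested$name, piped$name)
```

The piped version lists the steps in the order they happen, which is the main reason it tends to be easier to read as chains get longer.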
glimpse(counted)
## Observations: 2,226
## Variables: 13
## $ name <chr> "Matthew Ajibade", "Lewis Lembke", "Micha...
## $ age <dbl> 22, 47, 19, 23, 53, 32, 22, 39, 25, 26, 3...
## $ gender <chr> "Male", "Male", "Male", "Male", "Male", "...
## $ raceethnicity <chr> "Black", "White", "White", "Hispanic/Lati...
## $ armed <chr> "No", "Firearm", "No", "No", "Firearm", "...
## $ date <date> 2015-01-01, 2015-01-02, 2015-01-03, 2015...
## $ streetaddress <chr> "1050 Carl Griffin Dr", "4505 SW Masters ...
## $ city <chr> "Savannah", "Aloha", "Kaumakani", "Wichit...
## $ state <chr> "GA", "OR", "HI", "KS", "WA", "CA", "AZ",...
## $ latitude <dbl> 32.06669, 45.48747, 21.93335, 37.69380, 4...
## $ longitude <dbl> -81.16788, -122.89170, -159.64197, -97.28...
## $ classification <chr> "Death in custody", "Gunshot", "Struck by...
## $ lawenforcementagency <chr> "Chatham County Sheriff's Office", "Washi...
2.4 Creating / manipulating data
Within most datasets there are what are called latent variables. These are variables that can be created by manipulating one or more columns. The way we create new columns in a tidy workflow is by using the mutate() function. This function allows us to assign column values directly or by manipulating existing columns.
The argument structure for mutate() is quite simple. We first name the new column we are creating, then set it equal to some expression. For example:
counted %>%
  mutate(x = 1) %>%
  select(x)
## # A tibble: 2,226 x 1
## x
## <dbl>
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
## 6 1
## 7 1
## 8 1
## 9 1
## 10 1
## # ... with 2,216 more rows
In this line of code we created a new column called x, set to the value 1 in every row, and then selected just that column.
Next we will use mutate() and functions from the lubridate package to get the month, day, and year of the date of the individual’s death. As there is already a column called date which is of the date type (check with class(counted$date)), we can use the functions month(), day(), and year() from lubridate, each of which returns the number corresponding to the part of the date we are trying to extract.
library(lubridate)
counted %>%
  mutate(month = month(date),
         day = day(date),
         year = year(date)) %>%
  # select date and our new columns
  select(date, month, day, year)
## # A tibble: 2,226 x 4
## date month day year
## <date> <dbl> <int> <dbl>
## 1 2015-01-01 1 1 2015
## 2 2015-01-02 1 2 2015
## 3 2015-01-03 1 3 2015
## 4 2015-01-03 1 3 2015
## 5 2015-01-02 1 2 2015
## 6 2015-01-04 1 4 2015
## 7 2015-01-05 1 5 2015
## 8 2015-01-05 1 5 2015
## 9 2015-01-06 1 6 2015
## 10 2015-01-06 1 6 2015
## # ... with 2,216 more rows
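To see what these lubridate functions return on their own, here is a minimal sketch applied to a single date. The object d is a hypothetical example value.

```r
library(lubridate)

# A single example date
d <- as.Date("2015-02-19")

month(d)  # 2
day(d)    # 19
year(d)   # 2015

# month() can also return the month name as an ordered factor
month(d, label = TRUE)
```

Inside mutate() the same functions are applied to the whole date column at once, giving one value per row.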
The new year column can be used to estimate the year in which an individual was born: we subtract their age from the year of their death. (The result can be off by one, since it ignores whether their birthday had already passed that year.) Note that we can use variables created earlier in the same mutate() call, as we do below. Since year is created in the mutate, our new birth_year variable must come after it.
counted_age <- counted %>%
  mutate(month = month(date),
         day = day(date),
         year = year(date),
         birth_year = year - age) %>%
  select(age, birth_year)
counted_age
## # A tibble: 2,226 x 2
## age birth_year
## <dbl> <dbl>
## 1 22 1993
## 2 47 1968
## 3 19 1996
## 4 23 1992
## 5 53 1962
## 6 32 1983
## 7 22 1993
## 8 39 1976
## 9 25 1990
## 10 26 1989
## # ... with 2,216 more rows
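The point about ordering inside mutate() can be sketched on a small, made-up tibble: each new column is available to the expressions that follow it in the same call. The names toy, death_year, and decade are assumptions for illustration.

```r
library(dplyr)

# Made-up ages and years of death
toy <- tibble(age        = c(22, 47),
              death_year = c(2015, 2016))

result <- toy %>%
  mutate(birth_year = death_year - age,             # created first
         decade     = floor(birth_year / 10) * 10)  # uses birth_year

result
```

Swapping the two lines inside mutate() would fail, because decade would refer to birth_year before it exists.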
A useful function for viewing our data frames is arrange(). It allows us to sort the dataset by a given column. By default it sorts in ascending order.
counted_age %>%
  arrange(birth_year)
## # A tibble: 2,226 x 2
## age birth_year
## <dbl> <dbl>
## 1 87 1928
## 2 87 1929
## 3 85 1930
## 4 86 1930
## 5 86 1930
## 6 83 1932
## 7 84 1932
## 8 83 1933
## 9 82 1934
## 10 80 1936
## # ... with 2,216 more rows
To sort a data frame in descending order, wrap the column name in the desc() function (if the column is numeric you could also put a - in front of the column name).
counted_age %>%
  # arrange(desc(birth_year))
  arrange(-birth_year)
## # A tibble: 2,226 x 2
## age birth_year
## <dbl> <dbl>
## 1 6 2009
## 2 10 2006
## 3 12 2004
## 4 13 2003
## 5 14 2002
## 6 15 2001
## 7 15 2001
## 8 15 2001
## 9 15 2000
## 10 15 2000
## # ... with 2,216 more rows
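arrange() also accepts more than one column, and desc() works on character columns where a leading - would not. A minimal sketch on made-up data (toy is hypothetical):

```r
library(dplyr)

# Made-up data for illustration
toy <- tibble(state = c("GA", "CA", "CA"),
              age   = c(22, 32, 26))

# Sort by state in descending order, then by age (ascending) within each state
sorted <- toy %>%
  arrange(desc(state), age)

sorted
```

Later columns only break ties in the earlier ones, so here the two CA rows end up ordered by age.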