Chapter 3 Aggregation

Now that we have a basic understanding of how to manipulate our dataset, the next step is to summarise it into a few useful metrics. These summary statistics will be useful when visualizing our data. Three key functions will be used: count(), group_by(), and summarise(). When used creatively, group_by() and summarise() can provide some extremely insightful metrics.

3.1 Counting

We will start with count(). Counting is the most basic (and useful) operation that we can do. To count the number of observations in a dataframe, just pass the tibble to the count() function, either directly or with a pipe (%>%).

library(tidyverse)
counted <- read_csv("data/the-counted/counted.csv")

## Parsed with column specification:
## cols(
##   name = col_character(),
##   age = col_double(),
##   gender = col_character(),
##   raceethnicity = col_character(),
##   armed = col_character(),
##   date = col_date(format = ""),
##   streetaddress = col_character(),
##   city = col_character(),
##   state = col_character(),
##   latitude = col_double(),
##   longitude = col_double(),
##   classification = col_character(),
##   lawenforcementagency = col_character()
## )

# count how many rows 
count(counted)

## # A tibble: 1 x 1
##       n
##   <int>
## 1  2226

Here we see that there are 2226 recorded deaths in this dataset. That is a lot. We can pass column names as additional arguments to count subsets of our data. So if we pass the gender column to count(), we can see how many deaths there are by reported gender. (I hesitate to use this column title; I would usually opt for sex as it is less ambiguous. Read the wiki page for more.)

counted %>% 
  count(gender)

## # A tibble: 3 x 2
##   gender             n
##   <chr>          <int>
## 1 Female           110
## 2 Male            2115
## 3 Non-conforming     1

Ah, so here we see that the creators of this dataset are aware of gender non-conforming identities. We can also see that there seems to be a large gender discrepancy. These large numbers are informative, but often percentages or proportions are used to provide a bit more context. To extract this information we will need to utilise group_by() and summarise(). But first, we will go over group_by() briefly.

3.2 Sub-setting

In the above example of count() we counted subsets of our data by telling count() which column to subset by. We can actually skip this step if we explicitly create a grouped data frame. This is done by using group_by(). As you probably guessed, we supply a column name to group by and create subsets from. On the surface this doesn’t do much. But if we group our dataframe and then look at the object’s class, we will see that it has the class grouped_df, or grouped data frame.

counted %>% 
  group_by(gender) %>% 
  class()

## [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"
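To make the "this doesn't do much on the surface" point concrete, here is a minimal sketch on a small made-up tibble (the `toy` data and its values are hypothetical, standing in for `counted`). It checks that grouping leaves the rows and columns untouched and only records which column defines the groups.

```r
library(dplyr)

# Hypothetical toy data standing in for `counted`
toy <- tibble(
  name   = c("A", "B", "C", "D"),
  gender = c("Male", "Female", "Male", "Male")
)

grouped <- group_by(toy, gender)

# Grouping changes no values: same number of rows, same columns
nrow(grouped) == nrow(toy)
identical(names(grouped), names(toy))

# The grouping column is stored as metadata on the object
group_vars(grouped)
n_groups(grouped)

# ungroup() removes that metadata again
ungrouped <- ungroup(grouped)
is_grouped_df(ungrouped)
```

The grouping only starts to matter once a function that respects groups, such as count() or summarise(), is applied to the grouped data frame.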

Now that we know how to group the data, let's try counting our groups.

counted %>% 
  group_by(gender) %>% 
  count()

## # A tibble: 3 x 2
## # Groups:   gender [3]
##   gender             n
##   <chr>          <int>
## 1 Female           110
## 2 Male            2115
## 3 Non-conforming     1

In the above code we get the same result as writing count(counted, gender), but in a slightly more verbose manner. But this is going to be useful in our summarisation!
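As a rough sketch of why the grouped form is useful, the example below uses a small made-up tibble (`toy` is hypothetical, not the real dataset). It counts rows per group with summarise() and n(), then turns those counts into proportions with mutate(). Treat it as one possible approach rather than the only way to do it.

```r
library(dplyr)

# Hypothetical toy data standing in for `counted`
toy <- tibble(gender = c("Male", "Female", "Male", "Male", "Female"))

# n() counts the rows in each group, much like count(toy, gender)
by_gender <- toy %>%
  group_by(gender) %>%
  summarise(n = n())

# With a single grouping column, summarise() returns an ungrouped
# tibble, so sum(n) here is the total across all groups
by_gender_prop <- by_gender %>%
  mutate(prop = n / sum(n))

by_gender_prop
```

Because summarise() accepts any expression that collapses a group to a single value, the same pattern extends to means, minimums, and other per-group metrics.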

3.3 Summarising

We know that this dataset records 2226 deaths, but how many were there in each year? Currently, our dataset only has a column called date, which contains the date of the event. From that variable we can extract the year.
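One way to extract the year is sketched below on a small made-up tibble (`toy` and its dates are hypothetical). It assumes the lubridate package is installed and uses its year() function, which the text above does not name, so treat that choice as an assumption. The year is added as a new column and then used as the grouping variable.

```r
library(dplyr)

# Hypothetical toy data; only the `date` column mirrors `counted`
toy <- tibble(
  name = c("A", "B", "C", "D"),
  date = as.Date(c("2015-01-02", "2015-06-30", "2015-12-31", "2016-03-15"))
)

per_year <- toy %>%
  # lubridate::year() pulls the year out of a Date (assumed helper)
  mutate(year = lubridate::year(date)) %>%
  group_by(year) %>%
  # n() counts the rows within each year
  summarise(n = n())

per_year
```

The same steps should apply to the real data by swapping `toy` for `counted`, since its date column was parsed as a date by read_csv().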
