Web-scraping for Campaigns
As the primaries approach, I am experiencing a mix of angst, FOMO, and excitement. One of my largest concerns is that progressive campaigns are stuck in an antiquated but nonetheless entrenched workflow: Google Sheets reign supreme in metric reporting. Here I want to present one use case (of a few more to come) where R can be leveraged by your data team.
In this post I show you how to scrape the most recent polling data from FiveThirtyEight, which aggregates these polls in an accessible way. This allows you as a Data Manager to provide a useful report to your Media Manager.
As always, please feel free to contact me on Twitter @josiahparry if you have any questions or want to discuss this further.
Polling use case
A very important metric to keep track of is how your candidate is polling. Are they gaining a lead in the polls or falling behind? This data is often reported via traditional news organizations or some other mediums. The supposed demi-God and mythical pollster Nate Silver’s organization FiveThirtyEight does a wonderful job aggregating polls. Their page National 2020 Democratic Presidential Primary Polls has a table of the most recent polls from many different pollsters.
In this use case we will acquire this data by web scraping with rvest. We will also go over ways to programmatically save poll results to a text file. Saving polling results allows you to present a long-term view of your candidate's growth during the quarter.
Understanding rvest
This use case provides a cursory overview of the rvest package. To learn more, go here.
Web scraping is the process of extracting data from a website. Websites are written in HTML and CSS, and there are a few aspects of these languages that are important to know for web scraping. HTML is written in a series of what are called tags. A tag is a set of characters wrapped in angle brackets, e.g. <img>.
With CSS (Cascading Style Sheets), web developers can give a unique identifier (ID) to a tag. Classes can also be assigned to a tag; think of these as groups. With web scraping we can target a particular part of a website by its HTML tag and perhaps its class or ID. rvest provides a set of functions to make this simpler.
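To make this concrete, here is a minimal sketch on an invented HTML snippet (the snippet and its class/ID names are made up for illustration):

```r
library(rvest)

# A tiny, invented HTML snippet to demonstrate the three selector types
page <- read_html('
  <div id="results" class="panel">
    <span class="candidate">Biden</span>
    <span class="candidate">Sanders</span>
  </div>')

html_node(page, "div")          # select by tag name
html_node(page, "#results")     # select by ID (note the leading #)
html_nodes(page, ".candidate")  # select every node with this class

# Pull the text out of the selected nodes
html_nodes(page, ".candidate") %>%
  html_text()
```

The same handful of selector patterns covers most scraping tasks: tag, `#id`, and `.class`.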
Example
For this example we will be scraping FiveThirtyEight’s aggregated poll table. The table can be found at https://projects.fivethirtyeight.com/2020-primaries/democratic/national/.
Before we begin, we must always prepare our workspace. Mise en place.
library(rvest)
## Loading required package: xml2
library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0 ✔ purrr 0.3.0.9000
## ✔ tibble 2.1.1 ✔ dplyr 0.7.8
## ✔ tidyr 0.8.2 ✔ stringr 1.4.0
## ✔ readr 1.2.1 ✔ forcats 0.3.0
## ── Conflicts ─────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ readr::guess_encoding() masks rvest::guess_encoding()
## ✖ dplyr::lag() masks stats::lag()
## ✖ purrr::pluck() masks rvest::pluck()
The first thing we have to do is specify which page we will be scraping. html_session() simulates a session in an HTML browser; by providing a URL to html_session() we gain access to the underlying code of that page. Create an object called session by providing the FiveThirtyEight URL to html_session().
session <- html_session("https://projects.fivethirtyeight.com/2020-primaries/democratic/national/")
The next and most important step is to identify which piece of HTML code contains the table. The easiest way to do this is to open up the webpage in Chrome and open up the Inspect Elements view (on Mac - ⌘ + Shift + C). Now that this is open, click the select element button at the top left corner of the inspection pane. Now hover over the table.
You will see that the HTML element is highlighted. We can see that it is a table tag. Additionally, we see that it has two classes, polls-table and tracker. To specify a class in a selector, we prefix the class name with a period, e.g. .class-name. If there are multiple classes, we append the second class name directly to the first, e.g. .first-class.second-class. Be aware that these selectors can be quite finicky and a bit difficult to figure out; you might need to do some googling or play around with the selector.
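As a quick sanity check of the compound selector, here is a sketch on an invented snippet where only one of two tables carries both classes:

```r
library(rvest)

# Invented snippet: the first table has both classes, the second only one
page <- read_html('
  <table class="polls-table tracker"><tr><td>A</td></tr></table>
  <table class="polls-table"><tr><td>B</td></tr></table>')

length(html_nodes(page, ".polls-table"))          # 2 - matches either table
length(html_nodes(page, ".polls-table.tracker"))  # 1 - requires BOTH classes
```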
To access the content of this HTML element, we must specify it with the proper selector. html_node() is used for this: provide the HTML session and the CSS selector to html_node() to extract the HTML element.
session %>%
html_node(".polls-table.tracker")
## {xml_node}
## <table class="polls-table tracker">
## [1] <thead class="hide-mobile" id="table-header"><tr>\n<th class="new">< ...
## [2] <tbody>\n<tr class="visible-row" data-id="97723">\n<!-- Shared--><td ...
Here we see that this returns an object of class xml_node. This object contains the HTML code but is still not entirely workable. Since what we want to extract is an HTML table, we can use the handy html_table(). Note that if this weren't a table but rather text, you would use html_text().
session %>%
html_node(".polls-table.tracker") %>%
html_table()
Take note of the extremely informative error. It appears we might have to deal with mismatching columns.
session %>%
html_node(".polls-table.tracker") %>%
html_table(fill = TRUE) %>%
head()
## Dates Pollster Sample
## 1 • Jun 6-10, 2019503 RV Jun 6-10, 2019 A-Quinnipiac University
## 2 • Jun 3-9, 201917,012 LV Jun 3-9, 2019 B-Morning Consult
## 3 • May 29-Jun 5, 20192,271 RV May 29-Jun 5, 2019 B+Ipsos
## 4 • May 29-Jun 5, 20192,525 A May 29-Jun 5, 2019 B+Ipsos
## 5 • Jun 2-4, 2019500 RV Jun 2-4, 2019 BYouGov
## 6 • Jun 2-4, 2019550 A Jun 2-4, 2019 BYouGov
## Sample Biden Sanders Harris Warren O'Rourke Buttigieg Booker Klobuchar
## 1 503 RV 30% 19% 7% 15% 3% 8% 1%
## 2 17,012 LV 37% 19% 7% 11% 4% 7% 3%
## 3 2,271 RV 31% 14% 6% 9% 3% 5% 2%
## 4 2,525 A 30% 15% 6% 8% 4% 5% 2%
## 5 500 RV 27% 15% 9% 12% 2% 10% 2%
## 6 550 A 27% 16% 8% 11% 2% 9% 2%
## Castro Yang Gillibrand Hickenlooper Gabbard Delaney Inslee Ryan Bullock
## 1 1% 0% 1% 0% 0% 0% 0% 0% 1%
## 2 2% 1% 1% 1% 1% 1% 1% 1% 1%
## 3 2% 1% 1% 0% 1% 1% 0% 0% 1%
## 4 1% 0% 1% 0% 1% 1% 0% 0% 1%
## 5 1% 0% 1% 0% 1% 1% 1% 0% 0%
## 6 1% 0% 1% 0% 1% 1% 1% 0% 0%
## de Blasio Bennet Williamson Gravel Swalwell Moulton Messam H. Clinton
## 1 0% 0% 0% 0% 0% 0% 0% 0%
## 2 1% 0% 1% 0% 0% 0%
## 3 0% 1% 0% 0% 0% 0% 0% 0%
## 4 0% 1% 0% 0% 0% 0% 0% 0%
## 5 1% 2% 0% 0% 0% 0% 0% 0%
## 6 1% 2% 0% 0% 0% 0% 0% 0%
## Bloomberg M. Obama Brown Kerry Abrams Holder McAuliffe Winfrey Ojeda
## 1
## 2
## 3
## 4
## 5
## 6
## Trump Cuomo Avenatti Kennedy Patrick Zuckerberg Pelosi Garcetti Newsom
## 1
## 2
## 3
## 4
## 5
## 6
## Steyer Schultz Kaine Johnson Kucinich Lee Scott Sinema Warner NA
## 1
## 2
## 3
## 4
## 5
## 6
## NA
## 1 Biden30%Sanders19%Warren15%Buttigieg8%Harris7%O'Rourke3%Booker1%Klobuchar1%Yang1%Ryan1%Gillibrand0%Castro0%Gabbard0%Inslee0%Hickenlooper0%Delaney0%Williamson0%Messam0%Swalwell0%Moulton0%Bennet0%Bullock0%de Blasio0%Gravel0%
## 2 Biden37%Sanders19%Warren11%Buttigieg7%Harris7%O'Rourke4%Booker3%Klobuchar2%Bennet1%Bullock1%Castro1%Delaney1%Gabbard1%Gillibrand1%Hickenlooper1%Inslee1%Yang1%Ryan1%de Blasio0%Moulton0%Swalwell0%Williamson0%
## 3 Biden31%Sanders14%Warren9%Harris6%Buttigieg5%O'Rourke3%Booker2%Klobuchar2%Castro1%Gabbard1%Hickenlooper1%Yang1%Ryan1%de Blasio1%Gillibrand0%Bullock0%Inslee0%Delaney0%Williamson0%Messam0%Swalwell0%Moulton0%Bennet0%Gravel0%
## 4 Biden30%Sanders15%Warren8%Harris6%Buttigieg5%O'Rourke4%Booker2%Klobuchar1%Gabbard1%Hickenlooper1%Yang1%Ryan1%de Blasio1%Castro0%Gillibrand0%Bullock0%Inslee0%Delaney0%Williamson0%Messam0%Swalwell0%Moulton0%Bennet0%Gravel0%
## 5 Biden27%Sanders15%Warren12%Buttigieg10%Harris9%Booker2%de Blasio2%O'Rourke2%Bullock1%Delaney1%Gabbard1%Hickenlooper1%Klobuchar1%Yang1%Bennet0%Castro0%Gillibrand0%Gravel0%Inslee0%Messam0%Moulton0%Ryan0%Swalwell0%Williamson0%
## 6 Biden27%Sanders16%Warren11%Buttigieg9%Harris8%Booker2%de Blasio2%O'Rourke2%Bullock1%Delaney1%Gabbard1%Hickenlooper1%Klobuchar1%Yang1%Bennet0%Castro0%Gillibrand0%Gravel0%Inslee0%Messam0%Moulton0%Ryan0%Swalwell0%Williamson0%
This is much better! But visual inspection shows that the column headers are not properly matched. There are a few things to sort out: there are two date columns, there are commas and percent signs where numeric columns should be, the column headers are a little messy, and the table isn't a tibble (this last point is just personal preference).
We will handle the final two issues first, as they are the easiest to deal with. The function clean_names() from janitor will handle the column headers, and as_tibble() will coerce the data.frame into a proper tibble. Save this semi-clean tibble into an object called polls.
polls <- session %>%
html_node(".polls-table.tracker") %>%
html_table(fill = TRUE) %>%
janitor::clean_names() %>%
as_tibble()
polls
## # A tibble: 116 x 59
## x dates pollster sample sample_2 biden sanders harris warren
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 • Jun … Jun 6-1… A-Qui… 503 RV 30% 19% 7%
## 2 • Jun … Jun 3-9… B-Mor… 17,012 LV 37% 19% 7%
## 3 • May … May 29-… B+Ips… 2,271 RV 31% 14% 6%
## 4 • May … May 29-… B+Ips… 2,525 A 30% 15% 6%
## 5 • Jun … Jun 2-4… BYouG… 500 RV 27% 15% 9%
## 6 • Jun … Jun 2-4… BYouG… 550 A 27% 16% 8%
## 7 • Jun … Jun 1-2… C+Har… 431 RV 35% 16% 4%
## 8 • May … May 27-… B-Mor… 16,587 LV 38% 19% 7%
## 9 • May … May 28-… A-CNN… 412 RV 32% 18% 8%
## 10 • May … May 29-… C+Har… 471 RV 36% 17% 8%
## # … with 106 more rows, and 50 more variables: o_rourke <chr>,
## # buttigieg <chr>, booker <chr>, klobuchar <chr>, castro <chr>,
## # yang <chr>, gillibrand <chr>, hickenlooper <chr>, gabbard <chr>,
## # delaney <chr>, inslee <chr>, ryan <chr>, bullock <chr>,
## # de_blasio <chr>, bennet <chr>, williamson <chr>, gravel <chr>,
## # swalwell <chr>, moulton <chr>, messam <chr>, h_clinton <chr>,
## # bloomberg <chr>, m_obama <chr>, brown <chr>, kerry <chr>,
## # abrams <chr>, holder <chr>, mc_auliffe <chr>, winfrey <chr>,
## # ojeda <chr>, trump <chr>, cuomo <chr>, avenatti <chr>, kennedy <chr>,
## # patrick <chr>, zuckerberg <chr>, pelosi <chr>, garcetti <chr>,
## # newsom <chr>, steyer <chr>, schultz <chr>, kaine <chr>, johnson <chr>,
## # kucinich <chr>, lee <chr>, scott <chr>, sinema <chr>, warner <chr>,
## # na <chr>, na_2 <chr>
We want to shift the column names over by one. Unfortunately there is no elegant way to do this (that I am aware of). We can see that the first column is completely useless, so it can be removed. Once that column is removed, we can reset the names so that they are properly aligned.
We will start by creating a vector of the original column names.
col_names <- names(polls)
col_names
## [1] "x" "dates" "pollster" "sample"
## [5] "sample_2" "biden" "sanders" "harris"
## [9] "warren" "o_rourke" "buttigieg" "booker"
## [13] "klobuchar" "castro" "yang" "gillibrand"
## [17] "hickenlooper" "gabbard" "delaney" "inslee"
## [21] "ryan" "bullock" "de_blasio" "bennet"
## [25] "williamson" "gravel" "swalwell" "moulton"
## [29] "messam" "h_clinton" "bloomberg" "m_obama"
## [33] "brown" "kerry" "abrams" "holder"
## [37] "mc_auliffe" "winfrey" "ojeda" "trump"
## [41] "cuomo" "avenatti" "kennedy" "patrick"
## [45] "zuckerberg" "pelosi" "garcetti" "newsom"
## [49] "steyer" "schultz" "kaine" "johnson"
## [53] "kucinich" "lee" "scott" "sinema"
## [57] "warner" "na" "na_2"
Unfortunately this presents another issue: once a column is deselected, there will be one more column name than there are columns. So we will need to select all but the last element of the original names. We will create a vector called new_names.
# identify the integer number of the last column
last_col <- length(col_names) - 1
# create a vector which will be used for the new names
new_names <- col_names[1:last_col]
Now we can implement the hacky solution. Here we deselect the first column and reset the names using setNames(). Next, we use mutate_at() to remove the percent sign from every candidate column and coerce those columns to integers, specifying within vars() which variables not to mutate.
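The pattern is easier to see on a toy tibble (the data below are made up): mutate_at() applies the same function to every column except those excluded inside vars().

```r
library(tidyverse)

# Made-up data mimicking the scraped columns: percent strings plus metadata
toy <- tibble(pollster = c("A", "B"),
              biden    = c("30%", "27%"),
              sanders  = c("19%", "15%"))

# Strip the % and coerce every column except pollster to integer
toy %>%
  mutate_at(vars(-pollster), ~as.integer(str_remove(., "%")))
```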
polls %>%
select(-1) %>%
setNames(new_names)%>%
select(-1) %>%
mutate_at(vars(-c("dates", "pollster", "sample", "sample_2")),
~as.integer(str_remove(., "%")))
## Warning in (structure(function (..., .x = ..1, .y = ..2, . = ..1) : NAs
## introduced by coercion
## # A tibble: 116 x 57
## dates pollster sample sample_2 biden sanders harris warren o_rourke
## <chr> <chr> <chr> <chr> <int> <int> <int> <int> <int>
## 1 Jun … A-Quinn… 503 RV 30 19 7 15 3
## 2 Jun … B-Morni… 17,012 LV 37 19 7 11 4
## 3 May … B+Ipsos 2,271 RV 31 14 6 9 3
## 4 May … B+Ipsos 2,525 A 30 15 6 8 4
## 5 Jun … BYouGov 500 RV 27 15 9 12 2
## 6 Jun … BYouGov 550 A 27 16 8 11 2
## 7 Jun … C+Harri… 431 RV 35 16 4 5 4
## 8 May … B-Morni… 16,587 LV 38 19 7 10 4
## 9 May … A-CNN/S… 412 RV 32 18 8 7 5
## 10 May … C+Harri… 471 RV 36 17 8 5 4
## # … with 106 more rows, and 48 more variables: buttigieg <int>,
## # booker <int>, klobuchar <int>, castro <int>, yang <int>,
## # gillibrand <int>, hickenlooper <int>, gabbard <int>, delaney <int>,
## # inslee <int>, ryan <int>, bullock <int>, de_blasio <int>,
## # bennet <int>, williamson <int>, gravel <int>, swalwell <int>,
## # moulton <int>, messam <int>, h_clinton <int>, bloomberg <int>,
## # m_obama <int>, brown <int>, kerry <int>, abrams <int>, holder <int>,
## # mc_auliffe <int>, winfrey <int>, ojeda <int>, trump <int>,
## # cuomo <int>, avenatti <int>, kennedy <int>, patrick <int>,
## # zuckerberg <int>, pelosi <int>, garcetti <int>, newsom <int>,
## # steyer <int>, schultz <int>, kaine <int>, johnson <int>,
## # kucinich <int>, lee <int>, scott <int>, sinema <int>, warner <int>,
## # na <int>
Now we must tidy the data. We will use tidyr::gather() to transform the data from wide to long. In short, gather() takes the column headers (the key argument) and creates a new variable from the values of the columns (the value argument). In this case, we will create a new column called candidate from the column headers and a second column called points, which holds each candidate's polling percentage. We deselect any columns that we do not want gathered.
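gather()'s behavior is easiest to see on a toy example (made-up data): the two candidate columns collapse into a candidate/points pair, while dates is kept as an identifier.

```r
library(tidyverse)

toy <- tibble(dates   = c("Jun 2-4, 2019", "Jun 6-10, 2019"),
              biden   = c(27L, 30L),
              sanders = c(15L, 19L))

# key = candidate, value = points; -dates keeps that column as an id
toy %>%
  gather(candidate, points, -dates)
# 4 rows: one per date/candidate combination
```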
polls %>%
select(-1) %>%
setNames(new_names)%>%
select(-1) %>%
mutate_at(vars(-c("dates", "pollster", "sample", "sample_2")),
~as.integer(str_remove(., "%"))) %>%
gather(candidate, points, -dates, -pollster, -sample, -sample_2)
## Warning in (structure(function (..., .x = ..1, .y = ..2, . = ..1) : NAs
## introduced by coercion
## # A tibble: 6,148 x 6
## dates pollster sample sample_2 candidate points
## <chr> <chr> <chr> <chr> <chr> <int>
## 1 Jun 6-10, 2019 A-Quinnipiac Univers… 503 RV biden 30
## 2 Jun 3-9, 2019 B-Morning Consult 17,012 LV biden 37
## 3 May 29-Jun 5, 20… B+Ipsos 2,271 RV biden 31
## 4 May 29-Jun 5, 20… B+Ipsos 2,525 A biden 30
## 5 Jun 2-4, 2019 BYouGov 500 RV biden 27
## 6 Jun 2-4, 2019 BYouGov 550 A biden 27
## 7 Jun 1-2, 2019 C+HarrisX 431 RV biden 35
## 8 May 27-Jun 2, 20… B-Morning Consult 16,587 LV biden 38
## 9 May 28-31, 2019 A-CNN/SSRS 412 RV biden 32
## 10 May 29-30, 2019 C+Harris Interactive 471 RV biden 36
## # … with 6,138 more rows
There are a few more housekeeping tasks needed to improve this data set. sample_2 is rather uninformative. On the FiveThirtyEight website there is a key describing what these values represent (A = adults, RV = registered voters, V = voters, LV = likely voters); this should be spelled out in our data set. In addition, the sample column ought to be cast to an integer. And finally, those messy dates will need to be cleaned. My approach requires creating a function to handle that cleaning. First, the simple stuff.
To handle the first two steps, we will continue our function chain and save the result to a new variable, polls_tidy.
polls_tidy <- polls %>%
select(-1) %>%
setNames(new_names)%>%
select(-1) %>%
mutate_at(vars(-c("dates", "pollster", "sample", "sample_2")),
~as.integer(str_remove(., "%"))) %>%
gather(candidate, points, -dates, -pollster, -sample, -sample_2) %>%
mutate(sample_2 = case_when(
sample_2 == "RV" ~ "Registered Voters",
sample_2 == "LV" ~ "Likely Voters",
sample_2 == "A" ~ "Adults",
sample_2 == "V" ~ "Voters"
),
sample = as.integer(str_remove(sample, ",")))
## Warning in (structure(function (..., .x = ..1, .y = ..2, . = ..1) : NAs
## introduced by coercion
polls_tidy
## # A tibble: 6,148 x 6
## dates pollster sample sample_2 candidate points
## <chr> <chr> <int> <chr> <chr> <int>
## 1 Jun 6-10, 2019 A-Quinnipiac Univ… 503 Registered Vo… biden 30
## 2 Jun 3-9, 2019 B-Morning Consult 17012 Likely Voters biden 37
## 3 May 29-Jun 5,… B+Ipsos 2271 Registered Vo… biden 31
## 4 May 29-Jun 5,… B+Ipsos 2525 Adults biden 30
## 5 Jun 2-4, 2019 BYouGov 500 Registered Vo… biden 27
## 6 Jun 2-4, 2019 BYouGov 550 Adults biden 27
## 7 Jun 1-2, 2019 C+HarrisX 431 Registered Vo… biden 35
## 8 May 27-Jun 2,… B-Morning Consult 16587 Likely Voters biden 38
## 9 May 28-31, 20… A-CNN/SSRS 412 Registered Vo… biden 32
## 10 May 29-30, 20… C+Harris Interact… 471 Registered Vo… biden 36
## # … with 6,138 more rows
Date cleaning
Next we must clean the dates field. When working with a messy column, I find that writing a single function to handle the cleaning is one of the most effective approaches. Here we will create a function which takes a value from the dates field and returns a cleaned date. There are two cases I identified: polls conducted within a single month, and polls that spanned two months. The start and end dates are separated by a single hyphen (-). If we split the date at the hyphen, either both halves will contain a month, or the second half will be a bare day number, in which case we have to carry the month over from the first half. The year can then be appended and the whole thing parsed as a date using the lubridate package. For more on lubridate, see here.
The function returns one date at a time. Its two arguments are date and .return, where .return indicates whether the first or second date should be returned. The internals rely heavily on the stringr package (see R for Data Science Chapter 14). The switch() at the end of the function determines which date is returned (see Advanced R Chapter 5).
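If switch() is unfamiliar, here is a minimal sketch of the pattern used at the end of the function (the function name and branch values are invented): the string in .return picks which named branch is evaluated.

```r
which_date <- function(.return = "first") {
  switch(.return,
         first  = "start date",
         second = "end date")
}

which_date("first")   # "start date"
which_date("second")  # "end date"
which_date("other")   # NULL - an unmatched string returns NULL invisibly
```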
clean_date <- function(date, .return = "first") {
# take date and split at the comma to get the year and the month-day combo
date_split <- str_split(date, ",") %>%
# remove from list / coerce to vector
unlist() %>%
# remove extra white space
str_trim()
# extract the year
date_year <- date_split[2]
# split the month day portion and coerce to vector
dates <- unlist(str_split(date_split[1], "-"))
# paste the month day and year together then parse as date using `mdy()`
first_date <- paste(dates[1], date_year) %>%
lubridate::mdy()
second_date <- ifelse(!str_detect(dates[2], "[A-z]+"),
yes = paste(str_extract(dates[1], "[A-z]+"),
dates[2],
date_year),
no = paste(dates[2], date_year)) %>%
lubridate::mdy()
switch(.return,
first = return(first_date),
second = return(second_date)
)
}
# test on a date
clean_date(polls_tidy$dates[10], .return = "first")
## [1] "2019-05-29"
clean_date(polls_tidy$dates[10], .return = "second")
## [1] "2019-05-30"
We can use this new function to create two new columns, poll_start and poll_end, using mutate(). Following this, we deselect the original dates column, remove any observations missing a points value, remove duplicates using distinct(), and save the result to polls_clean.
polls_clean <- polls_tidy %>%
  # clean_date() handles one value at a time, so apply it row by row
  rowwise() %>%
  mutate(poll_start = clean_date(dates, "first"),
         poll_end = clean_date(dates, "second")) %>%
  ungroup() %>%
  select(-dates) %>%
  filter(!is.na(points)) %>%
  distinct()
polls_clean
## # A tibble: 1,955 x 7
## pollster sample sample_2 candidate points poll_start poll_end
## <chr> <int> <chr> <chr> <int> <date> <date>
## 1 A-Quinnipiac … 503 Registered… biden 30 2019-06-06 2019-06-10
## 2 B-Morning Con… 17012 Likely Vot… biden 37 2019-06-03 2019-06-09
## 3 B+Ipsos 2271 Registered… biden 31 2019-05-29 2019-06-05
## 4 B+Ipsos 2525 Adults biden 30 2019-05-29 2019-06-05
## 5 BYouGov 500 Registered… biden 27 2019-06-02 2019-06-04
## 6 BYouGov 550 Adults biden 27 2019-06-02 2019-06-04
## 7 C+HarrisX 431 Registered… biden 35 2019-06-01 2019-06-02
## 8 B-Morning Con… 16587 Likely Vot… biden 38 2019-05-27 2019-06-02
## 9 A-CNN/SSRS 412 Registered… biden 32 2019-05-28 2019-05-31
## 10 C+Harris Inte… 471 Registered… biden 36 2019-05-29 2019-05-30
## # … with 1,945 more rows
Visualization
The cleaned data can be aggregated and visualized.
avg_polls <- polls_clean %>%
group_by(candidate) %>%
summarise(avg_points = mean(points, na.rm = TRUE),
min_points = min(points, na.rm = TRUE),
max_points = max(points, na.rm = TRUE),
n_polls = n() - sum(is.na(points))) %>% # identify how many polls candidate is in
# remove candidates who appear in 50 or fewer polls: i.e. HRC
filter(n_polls > 50) %>%
arrange(-avg_points)
avg_polls
## # A tibble: 16 x 5
## candidate avg_points min_points max_points n_polls
## <chr> <dbl> <dbl> <dbl> <int>
## 1 biden 32.1 9 66 108
## 2 sanders 19.4 4 42 110
## 3 harris 8.62 2 38 110
## 4 warren 8.05 2 43 109
## 5 o_rourke 6.50 2 21 105
## 6 buttigieg 4.88 0 21 80
## 7 booker 3.30 0 9 105
## 8 klobuchar 1.77 0 5 94
## 9 yang 1.13 0 3 62
## 10 castro 1.10 0 12 97
## 11 gillibrand 0.929 0 9 98
## 12 hickenlooper 0.728 0 2 81
## 13 gabbard 0.723 0 3 83
## 14 delaney 0.530 0 8 83
## 15 bullock 0.509 0 1 55
## 16 inslee 0.429 0 2 77
avg_polls %>%
mutate(candidate = fct_reorder(candidate, avg_points)) %>%
ggplot(aes(candidate, avg_points)) +
geom_col() +
theme_minimal() +
coord_flip() +
labs(title = "Polls Standings", x = "", y = "%")
Creating historic polling data
It may become useful to have a running history of how candidates have been polling. We can use R to write a csv file of the data from FiveThirtyEight. But what happens when the polls update? How can we keep both the previous data and the new data? We will work through an example using a combination of bind_rows() and distinct(). I want to emphasize that this is not a good practice if you need to scale to hundreds of thousands of rows; it works here because the data are inherently small.
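The mechanics on a toy example (made-up rows): the overlapping row appears in both tibbles but survives only once after distinct().

```r
library(tidyverse)

old <- tibble(candidate = c("biden", "sanders"), points = c(30L, 19L))
new <- tibble(candidate = c("sanders", "warren"), points = c(19L, 15L))

# Stack the two tibbles, then drop fully duplicated rows
bind_rows(old, new) %>%
  distinct()
# 3 rows: the duplicated sanders row is dropped
```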
To start, I have created a sample dataset which contains 80% of these polls (maybe less by the time you do this!). Note that it is probably best to version control this file or keep multiple copies as a failsafe.
The approach we will take is to read in the historic polls data set and bind its rows with the polls_clean data we have scraped. Next we remove duplicate rows using distinct().
old_polls <- read_csv("https://raw.githubusercontent.com/JosiahParry/r-4-campaigns/master/data/polls.csv")
## Parsed with column specification:
## cols(
## pollster = col_character(),
## sample = col_double(),
## sample_2 = col_character(),
## candidate = col_character(),
## points = col_double(),
## poll_start = col_date(format = ""),
## poll_end = col_date(format = "")
## )
old_polls
## # A tibble: 1,564 x 7
## pollster sample sample_2 candidate points poll_start poll_end
## <chr> <dbl> <chr> <chr> <dbl> <date> <date>
## 1 C+HarrisX 370 Registered… klobuchar 2 2019-06-06 2019-06-10
## 2 C+HarrisX 448 Registered… gillibra… 1 2019-06-06 2019-06-10
## 3 B-Morning Con… 11627 Likely Vot… harris 13 2019-06-06 2019-06-10
## 4 B-Morning Con… 699 Registered… delaney 0 2019-06-06 2019-06-10
## 5 C+HarrisX 743 Registered… williams… 1 2019-06-06 2019-06-10
## 6 A-Quinnipiac … 559 Registered… gabbard 0 2019-06-06 2019-06-10
## 7 B-Morning Con… 14250 Likely Vot… gillibra… 2 2019-06-06 2019-06-10
## 8 A-Quinnipiac … 559 Registered… gillibra… 0 2019-06-06 2019-06-10
## 9 B-Morning Con… 14250 Likely Vot… harris 6 2019-06-06 2019-06-10
## 10 A+Monmouth Un… 330 Registered… warren 8 2019-06-06 2019-06-10
## # … with 1,554 more rows
updated_polls <- bind_rows(old_polls, polls_clean) %>%
distinct()
updated_polls
## # A tibble: 1,955 x 7
## pollster sample sample_2 candidate points poll_start poll_end
## <chr> <dbl> <chr> <chr> <dbl> <date> <date>
## 1 C+HarrisX 370 Registered… klobuchar 2 2019-06-06 2019-06-10
## 2 C+HarrisX 448 Registered… gillibra… 1 2019-06-06 2019-06-10
## 3 B-Morning Con… 11627 Likely Vot… harris 13 2019-06-06 2019-06-10
## 4 B-Morning Con… 699 Registered… delaney 0 2019-06-06 2019-06-10
## 5 C+HarrisX 743 Registered… williams… 1 2019-06-06 2019-06-10
## 6 A-Quinnipiac … 559 Registered… gabbard 0 2019-06-06 2019-06-10
## 7 B-Morning Con… 14250 Likely Vot… gillibra… 2 2019-06-06 2019-06-10
## 8 A-Quinnipiac … 559 Registered… gillibra… 0 2019-06-06 2019-06-10
## 9 B-Morning Con… 14250 Likely Vot… harris 6 2019-06-06 2019-06-10
## 10 A+Monmouth Un… 330 Registered… warren 8 2019-06-06 2019-06-10
## # … with 1,945 more rows
Now you have a cleaned data set which has been integrated with the recently scraped data. Write it to a csv using write_csv() for later use.
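A minimal sketch of that last step, continuing the session above (the file path is an assumption; adjust it to your project layout):

```r
# Overwrite the history file with the deduplicated, up-to-date polls.
# "data/polls.csv" is a hypothetical path - point this at your own file.
write_csv(updated_polls, "data/polls.csv")

# Next time the script runs, read this file back in as old_polls and
# repeat the bind_rows()/distinct() update.
```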