Water Quality Analysis
This short tutorial was developed for a talk / workshop that Phil Bowsher gave at the EPA. It serves as a quick example of using the tidyverse for spatial analysis, modeling, and interactive mapping.
The source code and data can be found in the JosiahParry/epa-water-quality repository on GitHub.
Data Cleaning
This section outlines the process for cleaning data taken from EPA.gov.
There are two datasets: the first contains water quality sample measurements taken at a given site; the second contains information about that site, such as its latitude and longitude. We will need to combine these datasets.
In order to join any two datasets there must be at least one common field. In this case it is `site_id`.
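As a minimal illustration of a key-based join (the tables, values, and column names below are made up for demonstration), rows are matched wherever the key agrees:

```r
library(dplyr)

# two hypothetical tables that share the site_id key
samples   <- tibble(site_id = c("A", "B"), ph  = c(7.1, 6.8))
locations <- tibble(site_id = c("A", "B"), lat = c(45.0, 44.2))

# keep only rows with a matching site_id in both tables
inner_join(samples, locations, by = "site_id")
```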
The code chunk below:

- Loads the tidyverse
- Creates variables that store the URLs of the CSV files
- Reads the datasets and standardizes the column names using the `clean_names()` function from janitor
```r
library(tidyverse)

# identify water quality csv
water_url <- "https://www.epa.gov/sites/production/files/2014-10/nla2007_chemical_conditionestimates_20091123.csv"

# site info csv (w/ lat lon data)
site_url <- "https://www.epa.gov/sites/production/files/2014-01/nla2007_sampledlakeinformation_20091113.csv"

# read sites
sites <- read_csv(site_url) %>%
  janitor::clean_names()

# read water
water_q <- read_csv(water_url) %>%
  janitor::clean_names()
```
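Before joining, it can be worth confirming that the key actually appears in both tables. A quick sanity check (not part of the original walkthrough):

```r
# shared column names between the two tables; site_id should be among them
intersect(names(sites), names(water_q))
```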
Now that we have these datasets, we need to join them. In this case we will join three tables:
- Water Quality dataset
- Site location data
- State abbreviation and region data
We first take only a few columns of interest from the `sites` dataset. This is then piped (`%>%`) into an `inner_join()` (all columns from x and y, keeping only rows where there is a match between x and y). The resulting table is then passed into a `left_join()` (all columns from x and y, returning all rows from x). In this join the y table is created on the fly from the built-in R objects `state.abb` and `state.region`. Then, a `select()` statement is used to rename some columns and keep only the columns of interest. Finally, the tibble is written to the `data` directory (run `dir.create("data")` if the directory does not exist).
```r
# join together
clean <- select(sites, lon_dd, lat_dd, lakename, site_id, state_name, st) %>%
  inner_join(water_q, by = "site_id") %>%
  # join a table that has region info
  left_join(
    tibble(st = state.abb,
           region = state.region),
    by = c("st.y" = "st")
  ) %>%
  # select only data of interest
  select(contains("_cond"), ptl, ntl, chla, st = st.x, region,
         lon_dd = lon_dd.x, lat_dd = lat_dd.x, lakename)

# write_csv(clean, "data/water_quality.csv")
```
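To actually write the file, one approach is to create the directory first with base R's `dir.create()` (this step is implied but not shown in the original):

```r
# create the data directory if it does not already exist, then write the csv
if (!dir.exists("data")) dir.create("data")
write_csv(clean, "data/water_quality.csv")
```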
Exploratory Analysis
```r
water <- read_csv("https://raw.githubusercontent.com/JosiahParry/epa-water-quality/master/data/water_quality.csv")

glimpse(water)
```
```
## Observations: 1,442
## Variables: 14
## $ ptl_cond      <chr> "1:LEAST DISTURBED", "2:INTERMEDIATE DISTURBANCE",…
## $ ntl_cond      <chr> "1:LEAST DISTURBED", "2:INTERMEDIATE DISTURBANCE",…
## $ chla_cond     <chr> "1:LEAST DISTURBED", "1:LEAST DISTURBED", "1:LEAST…
## $ turb_cond     <chr> "1:LEAST DISTURBED", "1:LEAST DISTURBED", "1:LEAST…
## $ anc_cond      <chr> "1:LEAST DISTURBED", "1:LEAST DISTURBED", "1:LEAST…
## $ salinity_cond <chr> "1:LEAST DISTURBED", "1:LEAST DISTURBED", "1:LEAST…
## $ ptl           <dbl> 6, 36, 22, 36, 22, 43, 50, 43, 50, 18, 46, 18, 46,…
## $ ntl           <dbl> 151, 695, 469, 695, 469, 738, 843, 738, 843, 344, …
## $ chla          <dbl> 0.24, 3.84, 20.88, 3.84, 20.88, 16.96, 12.86, 16.9…
## $ st            <chr> "MT", "SC", "SC", "SC", "SC", "TX", "TX", "TX", "T…
## $ region        <chr> "West", "South", "South", "South", "South", "South…
## $ lon_dd        <dbl> -114.02184, -79.98379, -79.98379, -79.98379, -79.9…
## $ lat_dd        <dbl> 48.97903, 33.03606, 33.03606, 33.03606, 33.03606, …
## $ lakename      <chr> "Lake Wurdeman", "Crane Pond", "Crane Pond", "Cran…
```
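As a quick aside (this step is an addition, not part of the original analysis), it can help to see how the observations break down by region before plotting:

```r
# number of observations per region
count(water, region, sort = TRUE)
```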
Use ggplot2 to explore the relationship between numeric variables.
```r
water %>%
  ggplot(aes(ptl, ntl)) +
  geom_point(alpha = .25) +
  theme_minimal()
```
Notice the fanning nature of the chart; this suggests a log-normal distribution. Apply log transformations on both axes via `scale_x_log10()` and `scale_y_log10()`.
```r
water %>%
  ggplot(aes(ptl, ntl)) +
  geom_point(alpha = .25) +
  theme_minimal() +
  scale_x_log10() +
  scale_y_log10()
```
Wonderful! Now there is a clear linear trend. Try applying a linear regression to the data using `geom_smooth()`.
```r
water %>%
  ggplot(aes(ptl, ntl)) +
  geom_point(alpha = .25) +
  theme_minimal() +
  scale_x_log10() +
  scale_y_log10() +
  geom_smooth(method = "lm")
```
This is great! But what are the values of our coefficients? Fit a model.
```r
mod <- lm(log10(ntl) ~ log10(ptl), data = water)

summary(mod)
```
```
## 
## Call:
## lm(formula = log10(ntl) ~ log10(ptl), data = water)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.28645 -0.16097 -0.01698  0.15165  1.07904 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.98542    0.01664  119.34   <2e-16 ***
## log10(ptl)   0.54567    0.01014   53.82   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2707 on 1440 degrees of freedom
## Multiple R-squared:  0.668,  Adjusted R-squared:  0.6677
## F-statistic:  2897 on 1 and 1440 DF,  p-value: < 2.2e-16
```
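Because both variables are on the log10 scale, the slope can be read as an elasticity. A small interpretation sketch (this step is an addition to the original):

```r
# a tenfold increase in ptl is associated with roughly a
# 10^0.546, i.e. about 3.5-fold, increase in ntl
10^coef(mod)[["log10(ptl)"]]
```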
How does this change for a single region? We can filter the data.
```r
mod_west <- lm(log10(ntl) ~ log10(ptl), data = filter(water, region == "West"))

summary(mod_west)
```
```
## 
## Call:
## lm(formula = log10(ntl) ~ log10(ptl), data = filter(water, region ==
##     "West"))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.18505 -0.15729 -0.00431  0.14501  1.18044 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.88402    0.02912   64.71   <2e-16 ***
## log10(ptl)   0.51157    0.01854   27.60   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2827 on 375 degrees of freedom
## Multiple R-squared:  0.6701, Adjusted R-squared:  0.6692
## F-statistic: 761.7 on 1 and 375 DF,  p-value: < 2.2e-16
```
There is some variation here. What about all of the other regions? We can use `purrr` to create multiple models.
```r
region_mods <- water %>%
  nest(-region) %>%
  mutate(mod = map(.x = data, .f = ~lm(log10(.x$ntl) ~ log10(.x$ptl))),
         # create a nested tibble for model coefs
         results = map(mod, broom::tidy),
         # create nested tibble for model metrics
         summary = map(mod, broom::glance))
```
We can unnest the different nested tibbles. For the coefficients, unnest the `results` column.
```r
unnest(region_mods, results)
```
```
## # A tibble: 8 x 6
##   region        term          estimate std.error statistic   p.value
##   <chr>         <chr>            <dbl>     <dbl>     <dbl>     <dbl>
## 1 West          (Intercept)      1.88     0.0291      64.7 1.47e-205
## 2 West          log10(.x$ptl)    0.512    0.0185      27.6 2.52e- 92
## 3 South         (Intercept)      1.99     0.0336      59.3 3.22e-186
## 4 South         log10(.x$ptl)    0.509    0.0203      25.1 1.16e- 80
## 5 Northeast     (Intercept)      2.10     0.0324      65.0 5.09e-122
## 6 Northeast     log10(.x$ptl)    0.417    0.0294      14.2 1.21e- 30
## 7 North Central (Intercept)      2.15     0.0280      76.7 2.15e-290
## 8 North Central log10(.x$ptl)    0.533    0.0154      34.5 4.18e-138
```
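For model-level metrics such as R-squared, you can unnest the `summary` column instead, following the same pattern (a sketch, not shown in the original):

```r
# compare fit quality across regions
unnest(region_mods, summary) %>%
  select(region, r.squared, adj.r.squared)
```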
Mapping Data
To map the data we can take advantage of `leaflet` and `sf`. We will create a simple features object, which has a column containing geometry information. We use `st_as_sf()` to convert to a spatial object. Use the `coords` argument to specify which columns contain the longitude and latitude (in that order).
```r
library(sf)
```

```
## Linking to GEOS 3.6.1, GDAL 2.1.3, PROJ 4.9.3
```
```r
water_sf <- water %>%
  st_as_sf(coords = c("lon_dd", "lat_dd"))

class(water_sf)
```

```
## [1] "sf"         "tbl_df"     "tbl"        "data.frame"
```
## [1] "sf" "tbl_df" "tbl" "data.frame"
You can see that this is still a data frame, but it is also of class `sf`.
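One optional refinement (not in the original code): `st_as_sf()` also accepts a `crs` argument, so you can declare the coordinates as WGS84 longitude/latitude (EPSG:4326) at conversion time, which is what web mapping tools expect:

```r
# same conversion, with the coordinate reference system declared up front
water_sf <- st_as_sf(water, coords = c("lon_dd", "lat_dd"), crs = 4326)
```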
We can use this sf object to plot some markers with leaflet.
```r
library(leaflet)

leaflet(water_sf) %>%
  addTiles() %>%
  addMarkers()
```
This creates a marker for each measurement, but it would be nice to have a popup message associated with each one. We can create the message with `mutate()` and `glue()`. Note that `<br>` is an HTML tag that creates a new line.
```r
water_sf %>%
  mutate(msg = glue::glue(
    "Name: {lakename}<br>
    Chlorophyll a: {chla}<br>
    Nitrogen: {ntl}<br>
    Phosphorus: {ptl}"
  )) %>%
  leaflet() %>%
  addTiles() %>%
  addMarkers(popup = ~msg)
```
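With roughly 1,400 markers, the map can get crowded when zoomed out. If that is a concern, leaflet's `markerClusterOptions()` can group nearby markers (an optional tweak, not part of the original example):

```r
water_sf %>%
  mutate(msg = glue::glue("Name: {lakename}")) %>%
  leaflet() %>%
  addTiles() %>%
  # cluster nearby markers so the map stays readable at low zoom levels
  addMarkers(popup = ~msg, clusterOptions = markerClusterOptions())
```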