Python & R in production — the API way
In my previous post I discussed how we can alter the R & Python story to be predicated on APIs as a way to bridge the language divide. The R & Python love story feels almost like unrequited love (h/t). Much of the development towards integrating the two languages has been heavily focused on the R user experience. While the developments with respect to reticulate have been enormous and cannot go understated, it might be worthwhile exploring another way in which R & Python and, for that matter, Python & R can be utilized together.
By shifting from language based tools that call the other language and translate their objects like reticulate and rpy2, to APIs we can develop robust language agnostic data science pipelines. I want to provide two motivating examples that explore the interplay between R & Python.
Calling Python from R (without reticulate)
When we talk about R & Python we typically are referring to reticulate, whether that be through python code chunks, the {tensorflow} package, or reticulate itself. However, as discussed in my previous post, another way that we can do this is via API. Flask can be used to create RESTful APIs.
On the RStudio Connect demo server there is a Flask app which provides historical stock prices for a few tickers. We can create a simple windowed summary visualization utilizing the Flask app, httr, dplyr, and ggplot2. Let’s break this down. First we use the httr library to send an HTTP request to the Flask app.
library(httr)
library(tidyverse)
flask_call <- "https://colorado.rstudio.com/rsc/flask-stock-service/stocks/AAPL/history"
aapl <- GET(flask_call) %>%
content(as = "text", encoding = "utf-8") %>%
jsonlite::fromJSON() %>%
as_tibble()
By sending this HTTP request, we are then kicking off a Python process which returns our dataset. We can then use dplyr to aggregate our dataset as we normally would.
aapl_yearly <- aapl %>%
group_by(year = lubridate::year(date)) %>%
summarise(avg_adj = mean(adjusted))
head(aapl_yearly)
#> # A tibble: 6 x 2
#> year avg_adj
#> <dbl> <dbl>
#> 1 2010 24.9
#> 2 2011 34.8
#> 3 2012 56.1
#> 4 2013 53.1
#> 5 2014 84.3
#> 6 2015 113.
Finally we utilize ggplot2 to create the simple visualization.
ggplot(aapl_yearly, aes(year, avg_adj)) +
geom_line() +
labs(title = "AAPL stock growth", x = "", y = "Average Adjusted") +
scale_y_continuous(labels = scales::dollar) +
theme_minimal()
That was simple, right? In the above code chunks we utilized both R and python while only interacting and writing R code. That’s the brilliance of this approach.
Calling R from Python
The less often discussed part of this love story—hence unrequited love story—is how can Python users utilize R within their own workflows. Often machine learning engineers will use Python in combination with Scikit Learn to create their models. To illustrate how we can let both R and Python users shine I wanted to adapt the wonderful Bike Prediction example from the Solutions Engineering team at RStudio.
The Bike Prediction project is an example of orchestrating a number of data science artifacts into a holistic system on RStudio Connect that all work in unity. This example could just as well have been written entirely with Python. It could even be written as a combination of both R and Python. And that is what I’d like to illustrate.
The bike prediction app utilizes a custom R package and the power of dbplyr to perform scheduled ETL jobs. It is effective, efficient, and already deployed. Say one has a colleague who would like to create a new machine learning model using the same data how can we enable them to do so? The example works within the context of its own R Markdown that retrains the model. Rather than making a one time export of the data from the ETL process, we can make the data available consistently through a RESTful API hosted here.
The training and testing data have been made available through a plumber API that is hosted on RStudio Connect. With the data being available through an API, all that is needed to interact with it is the requests
library. Everything else is as one would anticipate!
import requests
import pandas as pd
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error
In the below code chunk we call the Plumber API using an HTTP request which kicks off an R process. That R process utilizes dbplyr and lubridate to extract and partition data for training and testing.
# Fetch data from Plumber API
test_dat_raw = requests.get("https://colorado.rstudio.com/rsc/bike-data-api/testing-data")
test_dat = pd.read_json(test_dat_raw.text)
train_dat_raw = requests.get("https://colorado.rstudio.com/rsc/bike-data-api/training-data")
train_dat = pd.read_json(train_dat_raw.text)
Now that the data have been processed by R and loaded as a pandas dataframe the model training can continue as standard.
# partition data and one hot encode day of week
train_x = pd.concat([train_dat[["hour", "month", "lat", "lon"]], pd.get_dummies(train_dat.dow)], axis = 1)
train_y = train_dat["n_bikes"]
# instantiate xgb model object
model = XGBRegressor()
# fit the model with the training data
model.fit(train_x,train_y)
# predict the target on the test dataset
test_dow = pd.get_dummies(pd.Categorical(test_dat.dow, categories = ['Friday', 'Monday', 'Saturday', 'Sunday', 'Thursday', 'Tuesday', 'Wednesday']))
test_x = pd.concat([test_dat[["hour", "month", "lat", "lon"]], test_dow], axis = 1)
test_y = test_dat.n_bikes
# predict the target on the test dataset
predict_test = model.predict(test_x)
# MSE on test dataset
mean_squared_error(test_y,predict_test, squared = False)
#> 4.502217132673415
Through the API both R and Python were able to flourish all the while building extensible infrastructure that can be utilized beyond their own team. The API approach enables the R and Python user to extend their tools beyond their direct team without having to adopt a new toolkit.
Adopting APIs for cross language collaboration
While data scientists may usually think of APIs as something that they use to interact with SaaS products or extract data, they are also a tool that can be utilized to build out the data science infrastructure of a team. Through Flask, Plumber, and other libraries that turn code into RESTful APIs, data scientists can bridge language divides with exceptional ease. I think we ought to begin to transition the ways in which we think about language divides. We ought to utilize the universal language of HTTP more thoroughly. By creating these APIs we not only can aid other data scientists, but entirely other teams. A React JS web development can then tap into your API to either serve up predictions, extract data, send files, or whatever else you can dream up. Let’s not limit ourselves to one language. Let’s build out APIs to enable all languages to thrive.
Disclaimer: This is a personal opinion and not endorsed or published by RStudio. My statements represent no one but myself—sometimes not even that.