Medium Data and Production API Pipeline
“[P]arsing huge json strings is difficult and inefficient.”[1] If you have an API that needs to receive a large amount of json, sending it as plain text will be slow.
Q: How can we improve that? A: Compression.
Background
An API is an application programming interface. APIs are how machines talk to other machines. APIs are useful because they are language agnostic, meaning that the same API request from Python, R, or JavaScript will work and return the same results. To send data to an API we use a POST request. The data that we send is usually required to be in json format.
Context
Problem: With large data, API POST requests can become extremely slow and take up a lot of storage space. This can cause a bottleneck.
Solution: Compress your data and send a file instead of sending plain text json.
Standard approach
Interacting with an API from R is usually done with the {httr} package. Imagine you want to send a dataframe to an API as json. We can do that by calling httr::POST(), providing a dataframe to the body, and encoding it to json by setting encode = "json".
First let’s load our libraries:
library(httr) # interacts with apis
library(jsonlite) # works with json (for later)
library(nycflights13) # data for posting
Next, let’s create a sample POST() request to illustrate how posting a dataframe as json works.
b_url <- "http://httpbin.org/post" # an easy to work with sample API POST endpoint
POST(url = b_url,
     body = list(x = cars),
     encode = "json")
## Response [http://httpbin.org/post]
## Date: 2020-09-05 23:53
## Status: 200
## Content-Type: application/json
## Size: 4.81 kB
## {
## "args": {},
## "data": "{\"x\":[{\"speed\":4,\"dist\":2},{\"speed\":4,\"dist\":10},{\"spee...
## "files": {},
## "form": {},
## "headers": {
## "Accept": "application/json, text/xml, application/xml, */*",
## "Accept-Encoding": "deflate, gzip",
## "Content-Length": "1150",
## "Content-Type": "application/json",
## ...
Alternative approach
An alternative approach would be to write our dataframe as json to a compressed gzip file. The process will be to:
- Create a temporary file which will store our compressed json.
- Create a gzip file connection to write the temporary file as a gzip.
- Upload the temporary file to the API.
- Remove the temporary file.
Writing to a temporary gzipped file looks like:
# create the tempfile
tmp <- tempfile()
# create a gzfile connection (to enable writing gz)
gz_tmp <- gzfile(tmp)
# write json to the gz file connection
write_json(cars, gz_tmp)
# close the gz file connection
close(gz_tmp)
Let’s read the temp file to see what it contains.
# read the temp file for illustration
readLines(tmp)
## [1] "[{\"speed\":4,\"dist\":2},{\"speed\":4,\"dist\":10},{\"speed\":7,\"dist\":4},{\"speed\":7,\"dist\":22},{\"speed\":8,\"dist\":16},{\"speed\":9,\"dist\":10},{\"speed\":10,\"dist\":18},{\"speed\":10,\"dist\":26},{\"speed\":10,\"dist\":34},{\"speed\":11,\"dist\":17},{\"speed\":11,\"dist\":28},{\"speed\":12,\"dist\":14},{\"speed\":12,\"dist\":20},{\"speed\":12,\"dist\":24},{\"speed\":12,\"dist\":28},{\"speed\":13,\"dist\":26},{\"speed\":13,\"dist\":34},{\"speed\":13,\"dist\":34},{\"speed\":13,\"dist\":46},{\"speed\":14,\"dist\":26},{\"speed\":14,\"dist\":36},{\"speed\":14,\"dist\":60},{\"speed\":14,\"dist\":80},{\"speed\":15,\"dist\":20},{\"speed\":15,\"dist\":26},{\"speed\":15,\"dist\":54},{\"speed\":16,\"dist\":32},{\"speed\":16,\"dist\":40},{\"speed\":17,\"dist\":32},{\"speed\":17,\"dist\":40},{\"speed\":17,\"dist\":50},{\"speed\":18,\"dist\":42},{\"speed\":18,\"dist\":56},{\"speed\":18,\"dist\":76},{\"speed\":18,\"dist\":84},{\"speed\":19,\"dist\":36},{\"speed\":19,\"dist\":46},{\"speed\":19,\"dist\":68},{\"speed\":20,\"dist\":32},{\"speed\":20,\"dist\":48},{\"speed\":20,\"dist\":52},{\"speed\":20,\"dist\":56},{\"speed\":20,\"dist\":64},{\"speed\":22,\"dist\":66},{\"speed\":23,\"dist\":54},{\"speed\":24,\"dist\":70},{\"speed\":24,\"dist\":92},{\"speed\":24,\"dist\":93},{\"speed\":24,\"dist\":120},{\"speed\":25,\"dist\":85}]"
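To confirm that nothing is lost along the way, the compressed file can be read back and parsed into a dataframe. The following is a minimal sketch rather than part of the original workflow: it repeats the write step so that it runs on its own, and the names tmp_rt and cars_back are illustrative.

```r
library(jsonlite)

# write cars as gzip-compressed json (same steps as above)
tmp_rt <- tempfile()
gz_con <- gzfile(tmp_rt)
write_json(cars, gz_con)
close(gz_con)

# open the file for reading through a gzfile connection and parse the json
gz_in <- gzfile(tmp_rt, "rt")
cars_back <- fromJSON(paste(readLines(gz_in), collapse = ""))
close(gz_in)

# the parsed result is a dataframe with the original rows and columns
dim(cars_back)
names(cars_back)

# remove the temporary file
unlink(tmp_rt)
```

Column types can differ after the round trip (whole numbers may come back as integers), so compare values rather than relying on identical().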
POSTing a file
To post a file we use the function httr::upload_file(). The argument we provide is the file path, which in this case is stored in the tmp object.
POST(b_url, body = list(x = upload_file(tmp)))
## Response [http://httpbin.org/post]
## Date: 2020-09-05 23:53
## Status: 200
## Content-Type: application/json
## Size: 870 B
## {
## "args": {},
## "data": "",
## "files": {
## "x": "data:text/plain;base64,H4sIAAAAAAAAA4XSPQ6DMAwF4L3HyMyQ+C8JV6m6wdCt...
## },
## "form": {},
## "headers": {
## "Accept": "application/json, text/xml, application/xml, */*",
## "Accept-Encoding": "deflate, gzip",
## ...
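In the response above the uploaded file is labeled text/plain, presumably because the temporary file has no extension from which a type could be guessed. upload_file() also accepts a type argument, so the content type can be declared explicitly. This is a sketch under assumptions: the .json.gz extension, the "application/gzip" type, and the names tmp_gz and gz_upload are choices made for illustration, and whether a given API uses the declared type depends on that API.

```r
library(httr)
library(jsonlite)

# write cars as gzip-compressed json, this time with a .json.gz extension
tmp_gz <- tempfile(fileext = ".json.gz")
gz_con <- gzfile(tmp_gz)
write_json(cars, gz_con)
close(gz_con)

# declare the content type explicitly instead of relying on a guess
gz_upload <- upload_file(tmp_gz, type = "application/gzip")
gz_upload$type

# the upload object is then passed to POST() exactly as before, e.g.
# POST(b_url, body = list(x = gz_upload))

# remove the temporary file
unlink(tmp_gz)
```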
Comparing R object to gzip
Now, you may be asking, is this really that big of a difference? It actually is. Notice that the first response, where we POSTed the cars dataframe, had a size of 4.81 kB. The response for the compressed file was only 870 B. That’s a whole lot smaller.
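We can compare the object size to the file size for another look. Both values are in bytes. Note that object.size() measures the in-memory R object rather than its json serialization, so this is only a rough comparison; the json body of the first request was 1150 bytes according to its Content-Length header.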
cat(" cars: ", object.size(cars), "\n",
"compressed cars: ", file.size(tmp))
## cars: 1648
## compressed cars: 210
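For a like-for-like comparison, the json text itself can be measured before and after compression. This is a small sketch with illustrative variable names (json_txt, json_bytes, gz_bytes, tmp_size); it writes the json through the same kind of gzfile connection used earlier.

```r
library(jsonlite)

# serialize cars to a json string and count its bytes
json_txt <- as.character(toJSON(cars))
json_bytes <- nchar(json_txt, type = "bytes")

# write the same json to a gzip file and take the file size
tmp_size <- tempfile()
gz_con <- gzfile(tmp_size)
writeLines(json_txt, gz_con)
close(gz_con)
gz_bytes <- file.size(tmp_size)
unlink(tmp_size)

# report both sizes and the ratio between them
cat("json text:    ", json_bytes, "bytes\n")
cat("gzipped json: ", gz_bytes, "bytes\n")
cat("compressed to:", round(100 * gz_bytes / json_bytes, 1), "% of the original\n")
```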
Benchmarking
Let’s extend this example to some larger datasets and benchmark the results. We’ll use data from nycflights13. In this example we’ll send two datasets to an API as the parameters metadata and data. Generally, metadata is smaller than the data, so for this example we’ll send 1,000 rows as the metadata and 10,000 rows as the data. We’ll call on the weather and flights datasets from nycflights13.
small_weather <- dplyr::sample_n(weather, 1000)
small_flights <- dplyr::sample_n(flights, 10000)
Making it functional
As always, I recommend making your repetitive tasks into functions. Here we will create two functions: one for posting the data as gzip files and one for posting it as pure json. These will be labeled post_gz() and post_json() respectively.
These functions will take two parameters: metadata and data.
Define post_gz()
post_gz <- function(metadata, data) {
  # write metadata to temp file
  tmp_meta <- tempfile("metadata")
  gz_temp_meta <- gzfile(tmp_meta)
  write_json(metadata, gz_temp_meta)
  close(gz_temp_meta)
  # write data to temp file
  tmp_data <- tempfile("data")
  gz_temp_data <- gzfile(tmp_data)
  write_json(data, gz_temp_data)
  close(gz_temp_data)
  # post
  q <- POST(b_url,
            body = list(
              metadata = upload_file(tmp_meta),
              data = upload_file(tmp_data)
            ))
  # remove temp files
  unlink(tmp_meta)
  unlink(tmp_data)
  # return a character string for benchmarking purposes
  "Posted..."
}
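post_gz() repeats the same write steps for each argument and only removes the temporary files if the request completes. One possible refactoring, sketched here on the assumption that the behavior should otherwise stay the same, moves the write step into a helper and uses on.exit() for cleanup. The names write_gz_json() and post_gz2() are hypothetical, and the url argument is an addition for illustration.

```r
library(httr)
library(jsonlite)

# hypothetical helper: write x as gzip-compressed json, return the file path
write_gz_json <- function(x, prefix = "file") {
  path <- tempfile(prefix)
  con <- gzfile(path)
  write_json(x, con)
  close(con)
  path
}

# hypothetical variant of post_gz() that cleans up even when POST() errors
post_gz2 <- function(metadata, data, url = "http://httpbin.org/post") {
  tmp_meta <- write_gz_json(metadata, "metadata")
  on.exit(unlink(tmp_meta), add = TRUE)
  tmp_data <- write_gz_json(data, "data")
  on.exit(unlink(tmp_data), add = TRUE)

  POST(url,
       body = list(
         metadata = upload_file(tmp_meta),
         data = upload_file(tmp_data)
       ))
  # return a character string, as post_gz() does
  "Posted..."
}
```

Because the cleanup is registered with on.exit(), the temporary files are removed when the function exits, whether or not the request succeeds.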
Define post_json().
post_json <- function(metadata, data) {
  q <- POST(b_url,
            body = list(
              metadata = metadata,
              data = data
            ),
            encode = "json")
  "Posted..."
}
Now that these functions have been defined, let’s compare their performance using the bench package. We’ll run each function 50 times to get a good understanding of their respective performance.
bm <- bench::mark(
  post_gz(small_weather, small_flights),
  post_json(small_weather, small_flights),
  iterations = 50
)
bm
## # A tibble: 2 x 6
## expression min median `itr/sec` mem_alloc
## <bch:expr> <bch:> <bch:> <dbl> <bch:byt>
## 1 post_gz(small_weather, small_flights) 1.01s 2.11s 0.315 14.8MB
## 2 post_json(small_weather, small_flights) 10.52s 19.07s 0.0428 23.1MB
## # … with 1 more variable: `gc/sec` <dbl>
ggplot2::autoplot(bm)
## Loading required namespace: tidyr