Medium Data and Production API Pipeline

“[P]arsing huge json strings is difficult and inefficient.”1 If you have an API that needs to receive a large amount of JSON, sending it as plain text will be slow.

Q: How can we improve that? A: Compression.

Background

An API is an application programming interface. APIs are how machines talk to other machines. APIs are useful because they are language agnostic, meaning that the same API request from Python, R, or JavaScript will work and return the same results. To send data to an API we use a POST request. The data that we send is usually required to be in JSON format.

Context

Problem: With large data, API POST requests can become extremely slow and take up a lot of storage space. This can cause a bottleneck.

Solution: Compress your data and send a file instead of sending plain-text JSON.

Standard approach

Interacting with an API from R is usually done with the {httr} package. Imagine you want to send a dataframe to an API as JSON. We can do that by calling httr::POST(), providing the dataframe in the body, and encoding it as JSON by setting encode = "json".

First let’s load our libraries:

library(httr)          # interacts with apis
library(jsonlite)      # works with json (for later)
library(nycflights13)  # data for posting 

Next, let’s create a sample POST() request to illustrate how posting a dataframe as json works.

b_url <- "http://httpbin.org/post" # an easy to work with sample API POST endpoint

POST(url = b_url, 
     body = list(x = cars),
     encode = "json")
## Response [http://httpbin.org/post]
##   Date: 2020-09-05 23:53
##   Status: 200
##   Content-Type: application/json
##   Size: 4.81 kB
## {
##   "args": {}, 
##   "data": "{\"x\":[{\"speed\":4,\"dist\":2},{\"speed\":4,\"dist\":10},{\"spee...
##   "files": {}, 
##   "form": {}, 
##   "headers": {
##     "Accept": "application/json, text/xml, application/xml, */*", 
##     "Accept-Encoding": "deflate, gzip", 
##     "Content-Length": "1150", 
##     "Content-Type": "application/json", 
## ...

Alternative approach

An alternative approach would be to write our dataframe as json to a compressed gzip file. The process will be to:

  1. Create a temporary file which will store our compressed json.
  2. Create a gzip file connection to write the temporary file as a gzip.
  3. Upload the temporary file to the API.
  4. Remove the temporary file.

Writing to a temporary gzipped file looks like:

# create the tempfile 
tmp <- tempfile()

# create a gzfile connection (to enable writing gz)
gz_tmp <- gzfile(tmp)

# write json to the gz file connection
write_json(cars, gz_tmp)

# close the gz file connection
close(gz_tmp)

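Let’s read the temp file to see what it contains. readLines() detects the gzip compression and decompresses on the fly, so we see the original JSON text.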

# read the temp file for illustration 
readLines(tmp)
## [1] "[{\"speed\":4,\"dist\":2},{\"speed\":4,\"dist\":10},{\"speed\":7,\"dist\":4},{\"speed\":7,\"dist\":22},{\"speed\":8,\"dist\":16},{\"speed\":9,\"dist\":10},{\"speed\":10,\"dist\":18},{\"speed\":10,\"dist\":26},{\"speed\":10,\"dist\":34},{\"speed\":11,\"dist\":17},{\"speed\":11,\"dist\":28},{\"speed\":12,\"dist\":14},{\"speed\":12,\"dist\":20},{\"speed\":12,\"dist\":24},{\"speed\":12,\"dist\":28},{\"speed\":13,\"dist\":26},{\"speed\":13,\"dist\":34},{\"speed\":13,\"dist\":34},{\"speed\":13,\"dist\":46},{\"speed\":14,\"dist\":26},{\"speed\":14,\"dist\":36},{\"speed\":14,\"dist\":60},{\"speed\":14,\"dist\":80},{\"speed\":15,\"dist\":20},{\"speed\":15,\"dist\":26},{\"speed\":15,\"dist\":54},{\"speed\":16,\"dist\":32},{\"speed\":16,\"dist\":40},{\"speed\":17,\"dist\":32},{\"speed\":17,\"dist\":40},{\"speed\":17,\"dist\":50},{\"speed\":18,\"dist\":42},{\"speed\":18,\"dist\":56},{\"speed\":18,\"dist\":76},{\"speed\":18,\"dist\":84},{\"speed\":19,\"dist\":36},{\"speed\":19,\"dist\":46},{\"speed\":19,\"dist\":68},{\"speed\":20,\"dist\":32},{\"speed\":20,\"dist\":48},{\"speed\":20,\"dist\":52},{\"speed\":20,\"dist\":56},{\"speed\":20,\"dist\":64},{\"speed\":22,\"dist\":66},{\"speed\":23,\"dist\":54},{\"speed\":24,\"dist\":70},{\"speed\":24,\"dist\":92},{\"speed\":24,\"dist\":93},{\"speed\":24,\"dist\":120},{\"speed\":25,\"dist\":85}]"
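As a quick sanity check, here is a minimal sketch of a full round trip: write the dataframe as gzipped JSON, read it back, and confirm the values match. The names tmp_check and cars_back are just illustrative. Note that jsonlite may bring the numbers back as integers rather than doubles, so the comparison uses all.equal() instead of identical().

```r
library(jsonlite)

# write the built-in cars dataframe as gzipped JSON (same steps as above)
tmp_check <- tempfile()
gz_con <- gzfile(tmp_check)
write_json(cars, gz_con)
close(gz_con)

# readLines() decompresses on the fly; fromJSON() parses the JSON text
cars_back <- fromJSON(readLines(tmp_check))

# compare the columns (all.equal() tolerates integer vs. double storage)
all.equal(cars_back$speed, cars$speed)
all.equal(cars_back$dist, cars$dist)

# clean up the temporary file
unlink(tmp_check)
```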

POSTing a file

To post a file we use the function httr::upload_file(). The argument we provide is the path to the file, which in this case is stored in the tmp object.

POST(b_url, body = list(x = upload_file(tmp)))
## Response [http://httpbin.org/post]
##   Date: 2020-09-05 23:53
##   Status: 200
##   Content-Type: application/json
##   Size: 870 B
## {
##   "args": {}, 
##   "data": "", 
##   "files": {
##     "x": "data:text/plain;base64,H4sIAAAAAAAAA4XSPQ6DMAwF4L3HyMyQ+C8JV6m6wdCt...
##   }, 
##   "form": {}, 
##   "headers": {
##     "Accept": "application/json, text/xml, application/xml, */*", 
##     "Accept-Encoding": "deflate, gzip", 
## ...
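The "files" field in the response is the uploaded gzip file, echoed back as base64 text. The sketch below imitates that offline, without calling the API: it encodes a gzipped JSON file as base64 and then decodes it the way a receiver might. The object names (tmp_up, echoed, tmp_down, received) are made up for illustration, and a real API would receive the raw bytes directly rather than base64.

```r
library(jsonlite)

# build a gzipped JSON file like the one uploaded above
tmp_up <- tempfile()
gz_con <- gzfile(tmp_up)
write_json(cars, gz_con)
close(gz_con)

# read the compressed bytes as-is and encode them as base64,
# imitating the value shown in the "files" field
raw_gz <- readBin(tmp_up, what = "raw", n = file.size(tmp_up))
echoed <- base64_enc(raw_gz)

# gzip data starts with the bytes 1f 8b 08, which encode to "H4sI"
substr(echoed, 1, 4)

# receiver side: decode the base64, write the bytes to disk,
# then read through the gzip compression and parse the JSON
tmp_down <- tempfile()
writeBin(base64_dec(echoed), tmp_down)
received <- fromJSON(readLines(tmp_down))

# clean up the temporary files
unlink(c(tmp_up, tmp_down))
```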

Comparing R object to gzip

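Now, you may be asking: is this really that big of a difference? It actually is. In the first response, where we POSTed the cars dataframe as JSON, the response size was 4.81 kB. The response for the compressed file was only 870 B. That’s a whole lot smaller. Because httpbin echoes the request back to us, the response size is a rough proxy for how much we sent.

We can compare the object size to the file size for another look. Both values below are in bytes. Keep in mind that object.size() reports the in-memory size of the R object, not the size of its JSON text; according to the Content-Length header of the first request, the JSON body was 1,150 bytes.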

cat(" cars: ", object.size(cars), "\n",
    "compressed cars: ", file.size(tmp))
##  cars:  1648 
##  compressed cars:  210
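For a more like-for-like comparison, we can measure the JSON text itself against the gzipped file. This is a small sketch using the same cars data; tmp_size, json_bytes, and gz_size are illustrative names, and the exact numbers may differ slightly from the Content-Length above because no outer list wrapper is included here.

```r
library(jsonlite)

# the JSON text that would be sent uncompressed
json_txt <- toJSON(cars)
json_bytes <- nchar(json_txt, type = "bytes")

# the same JSON written through a gzip connection
tmp_size <- tempfile()
gz_con <- gzfile(tmp_size)
write_json(cars, gz_con)
close(gz_con)
gz_size <- file.size(tmp_size)

# report both sizes in bytes and the compression ratio
cat(" json text: ", json_bytes, "\n",
    "gzipped json: ", gz_size, "\n",
    "ratio: ", round(gz_size / json_bytes, 2), "\n")

# clean up the temporary file
unlink(tmp_size)
```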

Benchmarking

Let’s extend this example to some larger datasets and benchmark the results. We’ll use data from nycflights13. In this example we’ll send two datasets to an API as the parameters metadata and data. Metadata is generally smaller than the data, so we’ll send 1,000 rows of the weather dataset as the metadata and 10,000 rows of the flights dataset as the data.

small_weather <- dplyr::sample_n(weather, 1000)
small_flights <- dplyr::sample_n(flights, 10000)

Making it functional

As always, I recommend turning your repetitive tasks into functions. Here we will create two functions: post_gz(), which posts the data as gzip files, and post_json(), which posts it as plain JSON.

These functions will take two parameters: metadata and data.

Define post_gz().

post_gz <- function(metadata, data) {
  
  # write metadata to temp file
  tmp_meta <- tempfile("metadata")
  gz_temp_meta <- gzfile(tmp_meta)
  write_json(metadata, gz_temp_meta)
  close(gz_temp_meta)
  
  # write data to temp file
  tmp_data <- tempfile("data")
  gz_temp_data <- gzfile(tmp_data)
  write_json(data, gz_temp_data)
  close(gz_temp_data)
  
  # post 
  q <- POST(b_url, 
       body = list(
         metadata = upload_file(tmp_meta),
         data = upload_file(tmp_data)
       ))
  
  # remove temp files
  unlink(tmp_meta)
  unlink(tmp_data)
  
  # return a character for purposes of benchmarking
  "Posted..."
}
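The two "write to temp file" steps in post_gz() are nearly identical, so they could be pulled into a helper of their own. The sketch below shows one way to do that; write_gz_json() is a hypothetical name, not part of {httr} or {jsonlite}, and the on.exit() call is my addition so that the connection is closed even if writing fails.

```r
library(jsonlite)

# hypothetical helper: write `x` as gzipped JSON to a temp file
# and return the path of that file
write_gz_json <- function(x, prefix = "file") {
  path <- tempfile(prefix)
  con <- gzfile(path)
  on.exit(close(con))  # close the connection even if write_json() errors
  write_json(x, con)
  path
}

# usage: one call per dataset
meta_path <- write_gz_json(head(cars), "metadata")
data_path <- write_gz_json(cars, "data")
file.size(c(meta_path, data_path))

# remove the temp files when done
unlink(c(meta_path, data_path))
```

Inside post_gz(), the returned paths could then be passed straight to upload_file().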

Define post_json().

post_json <- function(metadata, data) {
  q <- POST(b_url, 
       body = list(
         metadata = metadata,
         data = data
       ),
       encode = "json") 
  
  "Posted..."
}

Now that these functions have been defined, let’s compare their performance using the {bench} package. We’ll run each function 50 times to get a good understanding of their respective performance.

bm <- bench::mark(
  post_gz(small_weather, small_flights),
  post_json(small_weather, small_flights),
  iterations = 50
  )

bm
## # A tibble: 2 x 6
##   expression                                 min median `itr/sec` mem_alloc
##   <bch:expr>                              <bch:> <bch:>     <dbl> <bch:byt>
## 1 post_gz(small_weather, small_flights)    1.01s  2.11s    0.315     14.8MB
## 2 post_json(small_weather, small_flights) 10.52s 19.07s    0.0428    23.1MB
## # … with 1 more variable: `gc/sec` <dbl>
ggplot2::autoplot(bm)
## Loading required namespace: tidyr
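To put the benchmark in one number, we can divide the median timings from the table above. The values below are copied by hand from that output, so treat this as a rough back-of-the-envelope calculation for this particular run rather than a general result.

```r
# median timings copied from the benchmark table above, in seconds
median_gz <- 2.11
median_json <- 19.07

# how many times faster the gzip approach was at the median
speedup <- median_json / median_gz
round(speedup, 1)
```

In this run the gzip approach was about 9 times faster at the median.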