[Not so] generic functions

Lately I have been doing more of my spatial analysis work in R with the help of the sf package. One shapefile I was working with had some horrendously named columns, and naturally, I tried to clean them using the clean_names() function from the janitor package. But lo, an egregious error occurred. To this end, I officially filed my complaint as an issue. The solution presented was to simply create a method for sf objects.

Yeah, methods, how tough can those be? Apparently the process isn’t at all difficult. But figuring out the process? That was difficult. This post will explain how I went about the process for converting the clean_names() function into a generic (I’ll explain this in a second), and creating a method for sf and tbl_graph objects.

The Jargon

Okay, I want to address the jargon. What the hell is a generic function, and what is a method? But first, I want to give a quick tl;dr on what a function is. I define as function as bit of code that takes an input, changes it in some way, and produces an output. Even simpler, a function takes an input and creates an output.

Generic Functions

Now, what is a generic function? My favorite definition that I’ve seen so far comes from LispWorks Ltd (their website is a historic landmark, I recommend you give it a look for a reminder of what the internet used to be). They define a generic function as

a function whose behavior depends on the classes or identities of the arguments supplied to it.

This means that we have to create a function that looks at the class of an object and perform an operation based on the object class. That means if there is "numeric" or "list" object, they will be treated differently. These are called methods. Note: you can find the class of an object by using the class() function on any object.

Methods

To steal from LispWorks Ltd again, a method is

part of a generic function which provides information about how that generic function should behave [for] certain classes.

This means that a method is part of a generic function and has to be defined separately. Imagine we have a generic function called f with methods for list and numeric objects. The way that we would denote these methods is by putting a period after the function name and indicating the type of object the function is to be used on. These would look like f.list and f.numeric respectively.

But to save time you can always create a default method which will be dispatched (used) on any object that it hasn’t been explicitly told how to operate on (by a specific method).

Now that the intuition of what generic functions and methods R, we can begin the work of actually creating them. This tutorial will walk through the steps I took in changing the clean_names() from a standard function into a generic function with methods for sf objects and tbl_graph objects from the sf and tidygraph packages respectively.

A brief overview of the process:

  1. Define the generic function
  2. Create a default method
  3. Create additional methods

A quick note: The code that follows is not identical to that of the package. I will be changing it up to make it simpler to read and understand what is happening.

The Generic Method

The first step, as described above, is to create a generic function. Generic functions are made by creating a new function with the body containing only a call to the UseMethod() function. The only argument to this is the name of your generic function—this should be the same as the name of the function you are making. This tells R that you are creating a generic function. Additionally, you should add any arguments that will be necessary for your function. Here, there are two arguments: dat and case. These indicate the data to be cleaned and the preferred style for them to be cleaned according to.

I am not setting any default values for dat to make it required, whereas I am setting case to "snake".

clean_names <- function(dat, case = "snake") {
  UseMethod("clean_names")
}

Now we have created a generic function. But this function doesn’t know how to run on any given object types. In other words, there are no methods associated with it. To illustrate this try using the clean_names() function we just defined on objects of different types.

clean_names(1) # numeric 
clean_names("test") # character 
clean_names(TRUE) # logical 
## [1] "no applicable method for 'clean_names' applied to an object of class \"c('double', 'numeric')\""
## [1] "no applicable method for 'clean_names' applied to an object of class \"character\""
## [1] "no applicable method for 'clean_names' applied to an object of class \"logical\""

The output of these calls say no applicable method for 'x' applied to an object of [class]. In order to prevent this from happening, we can create a default method. A default method will always be used if the function doesn’t have a method for the provided object type.

The Default Method

Remember that methods are indicated by writing function.method. It is also important to note that the method should indicate an object class. To figure out what class an object is you can use the class() function. For example class(1) tells you that the number 1 is “numeric”.

In this next step I want to create a default method that will be used on every object that there isn’t a method explicitly for. To do this I will create a function called clean_names.default.

As background, the clean_names() function takes a data frame and changes column headers to fit a given style. clean_names() in the development version is based on the function make_clean_names() which takes a character vector and makes each value match a given style (the default is snake, and you should only use snake case because everything else is wrong * sarcasm * ).

To prevent us from loading the entire janitor package and overwriting our version of the clean_names() function, we can import the make_clean_names() function directly from GitHub by reading the file directly.

source("https://raw.githubusercontent.com/sfirke/janitor/master/R/make_clean_names.R")

Now let’s see how this function works. For this we will use the ugliest character vector I have ever seen from the tests for clean_names() (h/t @sfirke for making this).

ugly_names <- c(
  "sp ace", "repeated", "a**^@", "%", "*", "!",
  "d(!)9", "REPEATED", "can\"'t", "hi_`there`", "  leading spaces",
  "€", "ação", "Farœ", "a b c d e f", "testCamelCase", "!leadingpunct",
  "average # of days", "jan2009sales", "jan 2009 sales"
)

ugly_names
##  [1] "sp ace"            "repeated"          "a**^@"            
##  [4] "%"                 "*"                 "!"                
##  [7] "d(!)9"             "REPEATED"          "can\"'t"          
## [10] "hi_`there`"        "  leading spaces"  "€"                
## [13] "ação"              "Farœ"              "a b c d e f"      
## [16] "testCamelCase"     "!leadingpunct"     "average # of days"
## [19] "jan2009sales"      "jan 2009 sales"

Now to see how this function works:

make_clean_names(ugly_names)
##  [1] "sp_ace"                 "repeated"              
##  [3] "a"                      "percent"               
##  [5] "x"                      "x_2"                   
##  [7] "d_9"                    "repeated_2"            
##  [9] "cant"                   "hi_there"              
## [11] "leading_spaces"         "x_3"                   
## [13] "acao"                   "faroe"                 
## [15] "a_b_c_d_e_f"            "test_camel_case"       
## [17] "leadingpunct"           "average_number_of_days"
## [19] "jan2009sales"           "jan_2009_sales"

Très magnifique!

The body of the default method will take column names from a dataframe, clean them, and reassign them. Before we can do this, a dataframe is needed!

# create a data frame with 20 columns
test_df <- as_tibble(matrix(sample(100, 20), ncol = 20))
## Warning: `as_tibble.matrix()` requires a matrix with column names or a `.name_repair` argument. Using compatibility `.name_repair`.
## This warning is displayed once per session.
# makes the column names the `ugly_names` vector
names(test_df) <- ugly_names

# print the data frame.
test_df
## # A tibble: 1 x 20
##   `sp ace` repeated `a**^@`   `%`   `*`   `!` `d(!)9` REPEATED `can"'t`
##      <int>    <int>   <int> <int> <int> <int>   <int>    <int>    <int>
## 1       68       14      47   100    51    74      37        1       94
## # … with 11 more variables: `hi_\`there\`` <int>, ` leading spaces` <int>,
## #   `€` <int>, ação <int>, Farœ <int>, `a b c d e f` <int>,
## #   testCamelCase <int>, `!leadingpunct` <int>, `average # of days` <int>,
## #   jan2009sales <int>, `jan 2009 sales` <int>

The process for writing this function is:

  • take a dataframe
  • take the old column names and clean them
  • reassign the column names as the new clean names
  • return the object
clean_names.default <- function(dat, case = "snake") { 
  # retrieve the old names
  old_names <- names(dat)
  # clean the old names
  new_names <- make_clean_names(old_names, case = case)
  # assign the column names as the clean names vector
  names(dat) <- new_names
  # return the data
  return(dat)
  }

Now that the default method has been defined. Try running the function on our test dataframe!

clean_names(test_df)
## # A tibble: 1 x 20
##   sp_ace repeated     a percent     x   x_2   d_9 repeated_2  cant hi_there
##    <int>    <int> <int>   <int> <int> <int> <int>      <int> <int>    <int>
## 1     68       14    47     100    51    74    37          1    94       45
## # … with 10 more variables: leading_spaces <int>, x_3 <int>, acao <int>,
## #   faroe <int>, a_b_c_d_e_f <int>, test_camel_case <int>,
## #   leadingpunct <int>, average_number_of_days <int>, jan2009sales <int>,
## #   jan_2009_sales <int>

Oh, my gorsh. Look at that! We can try replicating this with a named vector to see how the default method dispatched on unknown objects!

# create a vector with 20 elements
test_vect <- c(1:20)

# name each element with the ugly_names vector 
names(test_vect) <- ugly_names

# try cleaning!
clean_names(test_vect)
##                 sp_ace               repeated                      a 
##                      1                      2                      3 
##                percent                      x                    x_2 
##                      4                      5                      6 
##                    d_9             repeated_2                   cant 
##                      7                      8                      9 
##               hi_there         leading_spaces                    x_3 
##                     10                     11                     12 
##                   acao                  faroe            a_b_c_d_e_f 
##                     13                     14                     15 
##        test_camel_case           leadingpunct average_number_of_days 
##                     16                     17                     18 
##           jan2009sales         jan_2009_sales 
##                     19                     20

It looks like this default function works super well with named objects! Now, we will broach the problem I started with, sf objects.

sf method

This section will go over the process for creating the sf method. If you have not ever used the sf package, I suggest you give it a try! It makes dataframe objects with spatial data associated with it. This allows you to perform many of the functions from the tidyverse to spatial data.

Before getting into it, I want to create a test object to work with. I will take the test_df column, create longitude and latitude columns, and then convert it into an sf object. The details of sf objects is out of the scope of this post.

library(sf)
## Linking to GEOS 3.7.2, GDAL 2.4.2, PROJ 5.2.0
test_sf <- test_df %>%
  # create xy columns
  mutate(long = -80, 
         lat = 40) %>% 
  # convert to sf object 
  st_as_sf(coords = c("long", "lat"))

# converting geometry column name to poor style
names(test_sf)[21] <- "Geometry"

# telling sf which column is now the geometry
st_geometry(test_sf) <- "Geometry"

test_sf
## Simple feature collection with 1 feature and 20 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -80 ymin: 40 xmax: -80 ymax: 40
## epsg (SRID):    NA
## proj4string:    NA
## # A tibble: 1 x 21
##   `sp ace` repeated `a**^@`   `%`   `*`   `!` `d(!)9` REPEATED `can"'t`
##      <int>    <int>   <int> <int> <int> <int>   <int>    <int>    <int>
## 1       68       14      47   100    51    74      37        1       94
## # … with 12 more variables: `hi_\`there\`` <int>, ` leading spaces` <int>,
## #   `€` <int>, ação <int>, Farœ <int>, `a b c d e f` <int>,
## #   testCamelCase <int>, `!leadingpunct` <int>, `average # of days` <int>,
## #   jan2009sales <int>, `jan 2009 sales` <int>, Geometry <POINT>

The sf object has been created. But now how does our default method of the clean_names() function work on this object? There is only one way to know, try it.

clean_names(test_sf)

Error in st_geometry.sf(x) : attr(obj, "sf_column") does not point to a geometry column. Did you rename it, without setting st_geometry(obj) <- "newname"?

Notice how it fails. sf noticed that I changed the name of the geometry column without explicitly telling it I did so. Since the geometry column is almost always the last column of an sf object, we can use the make_clean_names() function on every column but the last one! To do this we will use the rename_at() function from dplyr. This function allows you rename columns based on their name or position, and a function that renames it (in this case, make_clean_names()).

For this example dataset, say I wanted to clean the first column. How would I do that? Note that the first column is called sp ace. How this works can be seen in a simple example. In the below function call we are using the rename_at() function (for more, go here), selecting the first column name, and renaming it using the make_clean_names() function.

rename_at(test_df, .vars = vars(1), .funs = make_clean_names)
## # A tibble: 1 x 20
##   sp_ace repeated `a**^@`   `%`   `*`   `!` `d(!)9` REPEATED `can"'t`
##    <int>    <int>   <int> <int> <int> <int>   <int>    <int>    <int>
## 1     68       14      47   100    51    74      37        1       94
## # … with 11 more variables: `hi_\`there\`` <int>, ` leading spaces` <int>,
## #   `€` <int>, ação <int>, Farœ <int>, `a b c d e f` <int>,
## #   testCamelCase <int>, `!leadingpunct` <int>, `average # of days` <int>,
## #   jan2009sales <int>, `jan 2009 sales` <int>

Notice how only the first column has been cleaned. It went from sp ace to sp_ace. The goal is to replicate this for all columns except the last one.

To write the sf method, the above line of code can be adapted to select columns 1 through the number of columns minus 1 (so geometry isn’t selected). In order to make this work, we need to identify the second to last column—this will be supplied as the ending value of our selected variables.

clean_names.sf <- function(dat, case = "snake") {
  # identify last column that is not geometry
  last_col_to_clean <- ncol(dat) - 1
  # create a new dat object
  dat <- rename_at(dat, 
                   # rename the first up until the second to last
                   .vars = vars(1:last_col_to_clean), 
                   # clean using the make_clean_names
                   .funs = make_clean_names)
  return(dat)
}

Voilà! Our first non-default method has been created. This means that when an sf object is supplied to our generic function clean_names() it looks at the class of the object—class(sf_object)—notices it’s an sf object, then dispatches (uses) the clean_names.sf() method instead of the default.

clean_names(test_sf)
## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(last_col_to_clean)` instead of `last_col_to_clean` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
## Simple feature collection with 1 feature and 20 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -80 ymin: 40 xmax: -80 ymax: 40
## epsg (SRID):    NA
## proj4string:    NA
## # A tibble: 1 x 21
##   sp_ace repeated     a percent     x   x_2   d_9 repeated_2  cant hi_there
##    <int>    <int> <int>   <int> <int> <int> <int>      <int> <int>    <int>
## 1     68       14    47     100    51    74    37          1    94       45
## # … with 11 more variables: leading_spaces <int>, x_3 <int>, acao <int>,
## #   faroe <int>, a_b_c_d_e_f <int>, test_camel_case <int>,
## #   leadingpunct <int>, average_number_of_days <int>, jan2009sales <int>,
## #   jan_2009_sales <int>, Geometry <POINT>

Here we see that it worked exactly as we hoped. Every column but the last has been altered. This allows sf to name it’s geometry columns whatever it would like without disrupting it.

Shortly after this addition was added to the package I became aware of another type of object that had problems using clean_names(). This is the tbl_graph object from the tidygraph package from Thomas Lin Pederson.

tbl_graph method

In issue #252 @gvdr noted that calling clean_names() on a tbl_graph doesn’t execute. Thankfully @Tazinho noted that you could easily clean the column headers by using the rename_all() function from dplyr.

Here the solution was even easier than above. As a reminder, in order to make the tbl_graph method, we need to specify the name of the generic followed by the object class.

clean_names.tbl_graph <- function(dat, case = "snake") { 
  # rename all columns
  dat <- rename_all(dat, make_clean_names)
  return(dat)
  }

In order to test the function, we will need a graph to test it on. This example draws on the example used in the issue.

library(tidygraph)
# create test graph to test clean_names
test_graph <- play_erdos_renyi(0, 0.5) %>% 
  # attach test_df as columns 
  bind_nodes(test_df)

test_graph
## # A tbl_graph: 1 nodes and 0 edges
## #
## # A rooted tree
## #
## # Node Data: 1 x 20 (active)
##   `sp ace` repeated `a**^@`   `%`   `*`   `!` `d(!)9` REPEATED `can"'t`
##      <int>    <int>   <int> <int> <int> <int>   <int>    <int>    <int>
## 1       68       14      47   100    51    74      37        1       94
## # … with 11 more variables: `hi_\`there\`` <int>, ` leading spaces` <int>,
## #   `€` <int>, ação <int>, Farœ <int>, `a b c d e f` <int>,
## #   testCamelCase <int>, `!leadingpunct` <int>, `average # of days` <int>,
## #   jan2009sales <int>, `jan 2009 sales` <int>
## #
## # Edge Data: 0 x 2
## # … with 2 variables: from <int>, to <int>

Here we see that there is a graph with only 1 node and 0 edges (relations) with bad column headers (for more, visit the GitHub page). Now we can test this as well.

clean_names(test_graph)
## # A tbl_graph: 1 nodes and 0 edges
## #
## # A rooted tree
## #
## # Node Data: 1 x 20 (active)
##   sp_ace repeated     a percent     x   x_2   d_9 repeated_2  cant hi_there
##    <int>    <int> <int>   <int> <int> <int> <int>      <int> <int>    <int>
## 1     68       14    47     100    51    74    37          1    94       45
## # … with 10 more variables: leading_spaces <int>, x_3 <int>, acao <int>,
## #   faroe <int>, a_b_c_d_e_f <int>, test_camel_case <int>,
## #   leadingpunct <int>, average_number_of_days <int>, jan2009sales <int>,
## #   jan_2009_sales <int>
## #
## # Edge Data: 0 x 2
## # … with 2 variables: from <int>, to <int>

It worked as anticipated!

Review (tl;dr)

In the preceding sections we learned what generic functions and methods are. How to create a generic function, a default method, and methods for objects of different classes.

  • generic function: “A generic function is a function whose behavior depends on the classes or identities of the arguments supplied to it”
  • generic function method: “part of a generic function and which provides information about how that generic function should behave [for] certain classes”

The process to create a function with a method is to:

  1. Create a generic function with:
    • f_x <- function() { UseMethod("f_x") }
  2. Define the default method with:
    • f_x.default <- function() { do something }
  3. Define object class specific methods with:
    • f_x.class <- function() { do something else}

Notes

If you have not yet encountered the janitor package it will help you tremendously with various data cleaning processes. Clearly, clean_names() is my favorite function as it helps me enforce my preferred style (and the only). If you are not aware of “proper” R style, I suggest you read the style guide in Advanced R.

While on the subject of Advanced R, I suggest you read the “Creating new methods and generics” section of it. I struggled comprehending it at first because I didn’t even know what a method was. However, if after reading this you feel like you want more, that’s the place to go.

I’d like to thank @sfirke for being exceptionally helpful in guiding my contributions to the janitor package.