Finding an SPSS {haven}

My education as a social scientist—undergratuate studies in sociology and anthropology—was largely focused on theory and the application of theory to social problems. For the most part, I taught myself how to apply those methods through R. I was fortunate enough to have avoided ever using SPSS. Perhaps that is good. Perhaps it is not. The use of R in the social sciences is increasing and I will go as far as to say that that is great news. However, there are still holdouts.

Very recently I came across data exported from SPSS in the wild. In the work that I get to engage in at the Boston Area Research Initiative (BARI) we receive data from various sources, whether these be municipal organizations, the police department, or non-profits, and as Tolstoy said:

In this post I want to illustrate two of the main pain points I encountered while working with data from SPSS.I also go rather deep into the weeds about R data structures and how we can manipulate them. While this is “about” SPSS data, folks looking for a better understanding of R object may benefit from this as well (specifically Gripe 2).

{haven}

I am not sure where the inspiration for the name “haven” came from. But I am sure that its name does indeed speak for itself. haven enables R programmers that are SPSS—Stata and SAS as well—illiterate to work with data that is not naturally intended for R use. There are two key behaviors of SPSS data that I was unaware of and that plagued me. I break these down into three gripes below.

Gripes

I maintain three gripes about SPSS data.

  1. Factor values are represented numerically and the values that they represent are stored in metadata.
  2. Column names are vague. The values they represent are stored as a label (metadata).
  3. Missing values can be represented innumerably. Each column can have user defined missing values, i.e. 9999.

Now, I must say that I have found it best practice to try and combat my own gripes.

In regards to the former two gripes, this is not unheard of behavior nor is it rare. Associating numeric values with related values—oh, I don’t know…think of a user ID and an email address—is in essence the core of relational database systems (RDBMS). One may even have the gall to argue that RDBMS power many of the tools I use and take for granted. I would most likely be willing to concede that point.

The third gripe can be quite easily countered if I am to be frank. Missing data is in of itself data. An NA is an NA is an NA may not be a generalizable statement. Providing values such as 9999 in place of a missing value in some cases may be a relic of antiquity where missingness could not be handled by software. Or, perhaps we can frame missingness other ways. Let’s imagine we are conducting a study and we want to keep track of why there was missing data. This could have been from a non-response, or a withdrawal, or erroneously entered data, or any other number of reasons. Keeping a record of that missing data may useful.

Gripe 1: numbers representing characters (or labels)

Sometimes I really would like stringsAsFactors = TRUE. Working with survey data tends to be one of those times. R has a robust method of creating, handling, and manipulating factors1 and because of this, we aren’t required to numerically encode our data. This may be a personal preference, but I really like to be reminded of what I am working with and seeing the factor levels clearly spelled out for me is quite nice.

Since the data I am working with at BARI is confidential, I’ve found some SPSS data hosted by the University of Bath2 to illustrate this with.

Reading in data is rather straightforward. SPSS data come with a .sav file extension. Following standard readr convention we can read a .sav file with haven::read_sav().

Note: the syntax above is used for referencing an exported function from a namespace (package name). The syntax is pkgname::function().

The below line reads in the sav file from the University of Bath.

Note: by wrapping an object assignment expression, such as the below, in parentheses the object is then printed (I just recently figured this out).

library(haven)

# read in sav file
(noise <- haven::read_sav("http://staff.bath.ac.uk/pssiw/stats2/noisedata.sav"))
## # A tibble: 20 x 6
##    SUBJECT     GENDER  NONE   LOW MEDIUM  HIGH
##      <dbl>  <dbl+lbl> <dbl> <dbl>  <dbl> <dbl>
##  1       1 1 [male]      46    55     68    44
##  2       2 1 [male]      44    61     72    51
##  3       3 1 [male]      51    59     79    38
##  4       4 1 [male]      34    45     51    37
##  5       5 1 [male]      55    58     45    45
##  6       6 1 [male]      29    64     74    39
##  7       7 1 [male]      34    60     79    46
##  8       8 1 [male]      44    57     66    51
##  9       9 1 [male]      50    61     59    40
## 10      10 1 [male]      41    63     68    50
## 11      11 2 [female]    46    58     72    47
## 12      12 2 [female]    47    55     74    55
## 13      13 2 [female]    38    44     70    46
## 14      14 2 [female]    38    47     61    53
## 15      15 2 [female]    50    20     58    51
## 16      16 2 [female]    46    51     74    42
## 17      17 2 [female]    44    54     78    46
## 18      18 2 [female]    38    48     69    51
## 19      19 2 [female]    40    50     70    47
## 20      20 2 [female]    46    52     66    41

The above shows GENDER codes as a numeric value. But if you print out the tibble to your console you will see labels. So in this case, where there is a 1 under GENDER, the printed console shows [male], and the same is true for 2 and [female]. We can get a sense of this by viewing the structure of the tibble.

str(noise)
## Classes 'tbl_df', 'tbl' and 'data.frame':    20 obs. of  6 variables:
##  $ SUBJECT: num  1 2 3 4 5 6 7 8 9 10 ...
##   ..- attr(*, "format.spss")= chr "F8.2"
##  $ GENDER : 'haven_labelled' num  1 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "format.spss")= chr "F8.2"
##   ..- attr(*, "labels")= Named num  1 2
##   .. ..- attr(*, "names")= chr  "male" "female"
##  $ NONE   : num  46 44 51 34 55 29 34 44 50 41 ...
##   ..- attr(*, "format.spss")= chr "F8.2"
##  $ LOW    : num  55 61 59 45 58 64 60 57 61 63 ...
##   ..- attr(*, "format.spss")= chr "F8.2"
##  $ MEDIUM : num  68 72 79 51 45 74 79 66 59 68 ...
##   ..- attr(*, "format.spss")= chr "F8.2"
##  $ HIGH   : num  44 51 38 37 45 39 46 51 40 50 ...
##   ..- attr(*, "format.spss")= chr "F8.2"

Above we can see that GENDER has an attribute labels which is a named numeric vector. The unique values are 1 and 2 representing “male” and “female” respectively. I struggle to keep this mental association. I’d prefer to have this shown explicitly. Fortunately, haven provides the function haven::as_factor() which will convert these pesky integer columns to their respective factor values. We just need to pass the data frame as the only argument—sure, there are other arguments if you want to get fancy.

# selecting just 2 columns and 2 rows for simlicity.
small_lil_df <- slice(noise, 1, 20) %>% 
  select(1, 2)

# convert coded response to response text
haven::as_factor(small_lil_df)
## # A tibble: 2 x 2
##   SUBJECT GENDER
##     <dbl> <fct> 
## 1       1 male  
## 2      20 female

And now we can just forget those integers ever existed in the firstplace!

Gripe 2: uninformative column names

There are three key rules to tidy data.

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

Each variable forms a column. Got it. So this means that any thing that can help describe our observation should be a column. Say we have a table of survey respondents. In this case each row should be a respondent and each column should be a variable (or feature, or predictor, or x, or whatever) associate with that respondent. This could be something like age, birth date, or the respondents response to a survey question.

In the tidyverse style guide Hadley Wickham writes

Generally, variable names should be nouns and function names should be verbs. Strive for names that are concise and meaningful (this is not easy!).

I personaly try to extend this to column names as well. Feature names are important so I, as a researcher, can remember what is what. From my encounters with SPSS data, I’ve found that feature names can be rather uninformative e.g. “Q12.” Much like factors, columns may have associated information hidden somewhere within them.

We read in the personality data set from the Universit of Bath below.

# read in sav file with column labels
(personality <- haven::read_sav("http://staff.bath.ac.uk/pssiw/stats2/personality.sav"))
## # A tibble: 459 x 44
##    PERS01 PERS02 PERS03 PERS04 PERS05 PERS06 PERS07 PERS08 PERS09 PERS10
##     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
##  1      5      4      5      1      4      3      3      1      2      3
##  2      1      1      5      2      1      2      5      1      5      1
##  3      4      1      5      3      3      4      5      3      1      4
##  4      4      2      5      1      4      3      4      4      4      5
##  5      2      3      5      1      2      4      5      2      3      3
##  6      1      1      5      4      3      4      4      2      1      4
##  7      3      2      5      1      2      1      1      2      5      4
##  8      5      2      4      2      4      1      4      3      3      5
##  9      5      1      4      3      2      1      4      4      2      3
## 10      4      1      5      1      4      3      4      1      5      4
## # … with 449 more rows, and 34 more variables: PERS11 <dbl>, PERS12 <dbl>,
## #   PERS13 <dbl>, PERS14 <dbl>, PERS15 <dbl>, PERS16 <dbl>, PERS17 <dbl>,
## #   PERS18 <dbl>, PERS19 <dbl>, PERS20 <dbl>, PERS21 <dbl>, PERS22 <dbl>,
## #   PERS23 <dbl>, PERS24 <dbl>, PERS25 <dbl>, PERS26 <dbl>, PERS27 <dbl>,
## #   PERS28 <dbl>, PERS29 <dbl>, PERS30 <dbl>, PERS31 <dbl>, PERS32 <dbl>,
## #   PERS33 <dbl>, PERS34 <dbl>, PERS35 <dbl>, PES36 <dbl>, PERS37 <dbl>,
## #   PERS38 <dbl>, PERS39 <dbl>, PERS40 <dbl>, PERS41 <dbl>, PERS42 <dbl>,
## #   PERS43 <dbl>, PERS44 <dbl>

The first thing that I notice is rather apalling: each column represents a person. oof, untidy. But this issue isn’t what brought me to these data. If you print the data to the console, you see something similar as what is above. If you view (View(df)) the data, the story is different. There is associated information underneath each column header.

str(personality[,1:5])
## Classes 'tbl_df', 'tbl' and 'data.frame':    459 obs. of  5 variables:
##  $ PERS01: num  5 1 4 4 2 1 3 5 5 4 ...
##   ..- attr(*, "label")= chr "talkative"
##   ..- attr(*, "format.spss")= chr "F1.0"
##  $ PERS02: num  4 1 1 2 3 1 2 2 1 1 ...
##   ..- attr(*, "label")= chr "finds fault"
##   ..- attr(*, "format.spss")= chr "F1.0"
##  $ PERS03: num  5 5 5 5 5 5 5 4 4 5 ...
##   ..- attr(*, "label")= chr "does a thorough job"
##   ..- attr(*, "format.spss")= chr "F1.0"
##  $ PERS04: num  1 2 3 1 1 4 1 2 3 1 ...
##   ..- attr(*, "label")= chr "depressed"
##   ..- attr(*, "format.spss")= chr "F1.0"
##  $ PERS05: num  4 1 3 4 2 3 2 4 2 4 ...
##   ..- attr(*, "label")= chr "original"
##   ..- attr(*, "format.spss")= chr "F1.0"

Yikes. These labels seem to be character judgements! Whatever the labels represent, I want them, and I want them as the column headers.

From looking at the structure of the data frame we can glean that each column has a label.

Warning: I’m going to walk through a fair bit of the theory and under the hood work to make these labels column names. Scroll to the bottom of this section to find the function definition.

Aside: R theory:

  • Each column of a data frame is actually just a vector. Each vector can have it’s own attributes (as above).
  • A data frame is actually just a list of equal length vectors (same number of observations).3
  • “All objects can have arbitrary additional attributes, used to store metadata about the object.”4
  • We can fetch list elements using [[ notation, e.g. my_list[[1]]
    • purrr::pluck() is an alternative to using [[ for grabbing the underlying elements inside of a data structure. This means we can use pluck(my_list, 1) in place of my_list[[1]]

Okay, but how does one actually get the label from the vector? The first step is to actually grab the vector. Below I use purrr::pluck()to pull the first column. Note that slice() is used for grabing specific row indexes. The below code is equivalent to personality[1:10,][[1]]. I prefer using the purrr functions because they are more legible.

col_1 <- slice(personality, 1:10) %>% 
  pluck(1)

We can access all of the vectors attributes with, you guessed it, the attributes() function.

attributes(col_1)
## $label
## [1] "talkative"
## 
## $format.spss
## [1] "F1.0"

This returns a named list. We can access (or set) specific attributes using the attr() function. The two arguments we must supply are x, the object, and which, which attribute we seek. In this case the values are col_1 and label respectively.

attr(col_1, "label")
## [1] "talkative"

purrr yet again makes working with list objects easy. purrr exports a function factory5 called purrr::attr_getter(). This function generates a function which accesses specific attributes. We can create a function get_label() using attr_getter() all we have to do is tell it which attribute we would like.

# generate a get_label f(x) via purr
get_label <- purrr::attr_getter("label")

get_label(col_1)
## [1] "talkative"

Well, lovely. Let’s just use this on our personality data frame.

get_label(personality)
## NULL

Ope. Welp. That didn’t work. We should just give up 🤷.

The reason this didn’t work is because we tried to use get_label() on a tibble which didn’t have the “label” attribute. We can verify this by looking at the list names of the attributes of personality.

names(attributes(personality))
## [1] "names"     "row.names" "class"

But what about the attribtues of each column? We can iterate over each column using map() to look at the attributes. Below I iterate over the first five columns. More on map()6.

map(select(personality, 1:5), ~names(attributes(.)))
## $PERS01
## [1] "label"       "format.spss"
## 
## $PERS02
## [1] "label"       "format.spss"
## 
## $PERS03
## [1] "label"       "format.spss"
## 
## $PERS04
## [1] "label"       "format.spss"
## 
## $PERS05
## [1] "label"       "format.spss"

Well, now that we’ve iterated over the columns and illustrated that the attributes live there, why not iterate over the columns and use get_label()?

# use get_label to retrieve column labels
map_chr(select(personality, 1:5), get_label)
##                PERS01                PERS02                PERS03 
##           "talkative"         "finds fault" "does a thorough job" 
##                PERS04                PERS05 
##           "depressed"            "original"

Again, yikes @ the labels. Let’s store these results into a vector so we can rename the original columns.

pers_labels <- map_chr(personality, get_label)

We can now change the names using setNames() from base R. We will then make the column headers tidy (personal definition of tidy column names) using janitor::clean_names().

setNames(personality, pers_labels) %>% 
  janitor::clean_names() %>% 
  select(1:5)
## # A tibble: 459 x 5
##    talkative finds_fault does_a_thorough_job depressed original
##        <dbl>       <dbl>               <dbl>     <dbl>    <dbl>
##  1         5           4                   5         1        4
##  2         1           1                   5         2        1
##  3         4           1                   5         3        3
##  4         4           2                   5         1        4
##  5         2           3                   5         1        2
##  6         1           1                   5         4        3
##  7         3           2                   5         1        2
##  8         5           2                   4         2        4
##  9         5           1                   4         3        2
## 10         4           1                   5         1        4
## # … with 449 more rows

In the case that a column doesn’t have a label, get_label() will return NULL and then setNames() will fail. To work around this, you can use the name of the column rather than the label value. Below is a function definition which handles this for you and, optionally, lets you specify which columns to rename based on a regex pattern. I think we’ve done enough list manipulation for the day. If you have questions about the function definition I’d be happy to work through it one on one with you via twitter DMs.

label_to_colname <- function(df, pattern) {
  get_label <- purrr::attr_getter("label")
  col_labels <- purrr::map(df, get_label)

  col_labels[unlist(map(col_labels, is.null))]  <- names(col_labels[unlist(purrr::map(col_labels, is.null))])

  if (missing(pattern)) {
    names_to_replace <- rep(TRUE, ncol(df))
  } else {
    names_to_replace <- stringr::str_detect(names(col_labels), pattern)
  }
  
  colnames(df)[names_to_replace] <- janitor::make_clean_names(unlist(col_labels[names_to_replace]))

  haven::zap_label(df)
}

label_to_colname(personality) %>% 
  select(1:5)
## # A tibble: 459 x 5
##    talkative finds_fault does_a_thorough_job depressed original
##        <dbl>       <dbl>               <dbl>     <dbl>    <dbl>
##  1         5           4                   5         1        4
##  2         1           1                   5         2        1
##  3         4           1                   5         3        3
##  4         4           2                   5         1        4
##  5         2           3                   5         1        2
##  6         1           1                   5         4        3
##  7         3           2                   5         1        2
##  8         5           2                   4         2        4
##  9         5           1                   4         3        2
## 10         4           1                   5         1        4
## # … with 449 more rows
# heh.
label_to_colname(personality, "PERS37") %>% 
  select(37)
## # A tibble: 459 x 1
##    sometimes_rude
##             <dbl>
##  1              4
##  2              1
##  3              3
##  4              3
##  5              5
##  6              1
##  7              2
##  8              2
##  9              1
## 10              1
## # … with 449 more rows

Gripe 3: user defined missing values

I’ll keep this one short. If there are user defined missing values in a .sav file, you can encode these as NA by setting the user_na arugment to TRUE.

# there aren't any missing values but you get the idea
noise <- haven::read_sav("http://staff.bath.ac.uk/pssiw/stats2/noisedata.sav",
                         user_na = TRUE)

And if for any reason that did not suffice, you can replace missing values with replace_na()7.

Take aways

  • All data is messy in it’s own way.
  • haven::read_sav() will read SPSS data.
  • haven::as_factor() will apply column labels in place of the numeric values (if present).
  • Replace user defined NA values by setting user_na = TRUE i.e. haven::read_sav("filepath.sav", user_na = TRUE)
  • All R objects can have attributes.
  • You can access attributes using attributes() or attr().
  • Data frames are made of vectors.
  • Data frames are actually just lists masquerading as rectangles.

  1. Check out forcats for working with factors.

  2. Anthony Horowitz wrote a rather fun murder mystery novel titled Magpie Murders which takes place in a small town outside of bath. I recommend it.

  3. http://adv-r.had.co.nz/Data-structures.html#data-frames

  4. http://adv-r.had.co.nz/Data-structures.html#attributes

  5. A function factory is an object which creates other function objects.

  6. https://adv-r.hadley.nz/functionals.html#map

  7. https://tidyr.tidyverse.org/reference/replace_na.html