An absolute beginner’s guide to creating data frames for a Stack Overflow [r] question

For better or worse, I spend some time each day at Stack Overflow [r], reading and answering questions. If you do the same, you have probably noticed certain features that recur in questions. It’s as though everyone is copying from one source, perhaps the one at the top of the search results. And it seems that the highest-ranked is not always the best.

Nowhere is this more apparent to me than in the way many users create data frames. So here is my introductory guide to how not to create data frames, aimed at beginners writing their first questions.

1. No need for vectors

There is no need to create vectors first and then add them as columns:

x <- 1:2
y <- 3:4

df <- data.frame(x, y)

# just do this!
df <- data.frame(x = 1:2, y = 3:4)

If you really need the columns as vectors, they can always be obtained using df$x or df$y.
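
As a small illustration of that point, the sketch below builds the data frame in one step and then pulls the columns back out as vectors. The names example_df, x_values and y_values are made up for this example.

```r
# Build the data frame directly, with no intermediate vectors
example_df <- data.frame(x = 1:2, y = 3:4)

# Pull columns back out as plain vectors when needed
x_values <- example_df$x       # dollar syntax
y_values <- example_df[["y"]]  # double-bracket syntax, handy when the name is stored in a variable

print(x_values)
print(y_values)
```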

While we’re here, that df thing…

2. …df is not a great variable name

Sure, you can call a variable df and R will know when you mean that variable and when you mean the function df(), which is the density function of the F distribution in the stats package. But why risk the confusion, when you could just call it something else? Like df1. Or mydata. Or example.
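
To see where the confusion can come from, here is a minimal sketch. It refers to stats::df explicitly, which is what the bare name df resolves to in a session where no object called df has been created. The name msg is made up for the example.

```r
# stats::df is the density function of the F distribution
print(is.function(stats::df))

# If the data frame was never created (or was created under another name),
# code such as df$x ends up subsetting the function instead.
# tryCatch() captures the resulting error message rather than stopping the script.
msg <- tryCatch(
  stats::df$x,
  error = function(e) conditionMessage(e)
)
print(msg)
```

With a name such as df1 or mydata, the same mistake would instead produce an "object not found" error, which is easier to diagnose.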

3. No need to convert from a matrix

Here’s another rather bizarre way to make a data frame that I often see:

df1 <- matrix(1:4, ncol = 2, nrow = 2)
df1 <- as.data.frame(df1)

# or perhaps to name columns
df1 <- matrix(1:4, 2, 2, dimnames = list(c(1, 2), c("x", "y")))
df1 <- as.data.frame(df1)

This is again better achieved simply by using data.frame():

df1 <- data.frame(x = 1:2, y = 3:4)

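Using a matrix is especially problematic when you want to mix variable types, which is possible in data frames but not in matrices. Here, our numbers become characters in the matrix and hence factors in the data frame (in R versions before 4.0.0, where stringsAsFactors defaults to TRUE; from 4.0.0 onwards they stay as characters, which is still not the numbers you wanted):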

df1 <- matrix(c(1:2, letters[1:2]), 2, 2, dimnames = list(c(1, 2), c("x", "y")))
df1 <- as.data.frame(df1)

# oh look, your numbers are now factors, that's not what you want
str(df1)

'data.frame':	2 obs. of  2 variables:
 $ x: Factor w/ 2 levels "1","2": 1 2
  ..- attr(*, "names")= chr  "1" "2"
 $ y: Factor w/ 2 levels "a","b": 1 2
  ..- attr(*, "names")= chr  "1" "2"
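
A quick way to see what is going on is to inspect the types at each step. This is a sketch only; the object names (m, from_matrix, direct) are invented for the example.

```r
# A matrix holds a single type, so mixing numbers and letters coerces everything to character
m <- matrix(c(1:2, letters[1:2]), 2, 2, dimnames = list(NULL, c("x", "y")))
print(typeof(m))

# Converting the matrix carries the coerced type along:
# x is a factor before R 4.0.0 and a character vector from 4.0.0 onwards, never a number
from_matrix <- as.data.frame(m)
print(class(from_matrix$x))

# data.frame() keeps each column's own type
direct <- data.frame(x = 1:2, y = letters[1:2])
print(class(direct$x))
```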

Which brings us to…

4. …No strings as factors

Ever.

df1 <- data.frame(x = 1:2, y = letters[1:2], stringsAsFactors = FALSE)
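
A short check, as a sketch: with stringsAsFactors = FALSE the y column stays a character vector. From R 4.0.0 this is the default, but spelling it out keeps an example reproducible on older installations. If a column really is categorical, it can be converted explicitly with factor(); the name df2 is made up here.

```r
df1 <- data.frame(x = 1:2, y = letters[1:2], stringsAsFactors = FALSE)
print(class(df1$y))

# Opt in to a factor for one specific column, rather than for every string
df2 <- df1
df2$y <- factor(df2$y)
print(levels(df2$y))
```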

5. Consider the alternatives and use the inbuilt help

You might consider the newer tibble, in which strings are never factors, amongst other advantages such as pretty printing with information about variables. The syntax is just the same:

library(tibble)
df1 <- tibble(x = 1:2, y = 3:4)
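
The example above uses only numeric columns, so here is a sketch with a string column. It assumes the tibble package is installed and skips quietly if it is not; tb is a made-up name.

```r
# Only run if the tibble package is available
if (requireNamespace("tibble", quietly = TRUE)) {
  tb <- tibble::tibble(x = 1:2, y = letters[1:2])

  # The string column stays a character vector
  print(class(tb$y))

  # Printing a tibble shows each column's type under its name
  print(tb)
}
```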

And when you know the command name (data.frame, for example), help is only “?” + command_name away. It isn’t always the best documentation, but it does generally tell you all you need to know.
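
For instance, a minimal sketch of looking things up for data.frame. The help page only opens in an interactive session, and arg_names is a made-up name.

```r
# Open the help page; typing ?data.frame at the console is the shorthand for this
if (interactive()) {
  print(help("data.frame"))
}

# List the arguments the function accepts without leaving the console
arg_names <- names(formals(data.frame))
print(arg_names)
```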

7 thoughts on “An absolute beginner’s guide to creating data frames for a Stack Overflow [r] question”

  1. I disagree about tibble. Adding an unnecessary dependency for a Stack Overflow question adds an extra layer of complexity, given that tibble is a wrapper for data.frame. My preferred solution would be to prefix the code with options(stringsAsFactors=FALSE).

    Also, I think df as a name for a data.frame is perfectly fine, and is comprehensible in almost every instance. tidyverse devs also seem to disagree with you.

    If anything, df1 is a worse name, as it invites “df1”, “df2”, etc, which are bad names because they imply difference without making clear what each data.frame is.

  2. A couple of reasonable options: tibble::tribble() allows entering tabular data in a fairly natural format. Pasting the output of dput() is helpful for real-world data.

  3. Thanks both for the article AND reading/responding to people’s questions. I found a lot of illuminating and helpful information for if and when I finally get around to posting one of my (many) questions.

  4. Lots of good info. But: “No strings as factors. Ever.”… even when the data is static and when the main purpose of the string variable is to serve as a category variable in a regression (which requires it to be a factor)?
