If you’ve been introduced to R as a simple way to do data analysis, you might have come across this strange operator, %>%. It’s called a pipe because it passes data from one function to another. Here’s an example of subsetting and transforming data using the pipe from the magrittr package:
library(magrittr)
dat <- airquality %>%
  subset(Ozone > 40) %>%
  transform(Celsius = (Temp - 32) * (5/9)) %>%
  head()
dat
##    Ozone Solar.R Wind Temp Month Day  Celsius
## 1     41     190  7.4   67     5   1 19.44444
## 29    45     252 14.9   81     5  29 27.22222
## 30   115     223  5.7   79     5  30 26.11111
## 40    71     291 13.8   90     6   9 32.22222
## 62   135     269  4.1   84     7   1 28.88889
## 63    49     248  9.2   85     7   2 29.44444
The first line can be read, “I’m going to make a new object called dat and it’s going to start with the airquality data frame”. The %>% at the end of the first line pipes the data frame to the next line, which is the subset function. If you look at the documentation for subset, the first argument is x, an “object to be subsetted”. The %>% takes the data frame immediately before it and places it in the first argument of the function immediately following it. So airquality becomes the object to be subsetted in the subset function.
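To make that placement rule concrete, here is a minimal sketch comparing the piped call with the ordinary call it stands for. It assumes magrittr is installed, and the object names piped and direct are only illustrative.

```r
library(magrittr)

# Piped form: airquality is supplied as the first argument (x) of subset()
piped <- airquality %>% subset(Ozone > 40)

# Ordinary form: the same call with the data frame written out explicitly
direct <- subset(airquality, Ozone > 40)

# Compare the two results; they should contain the same rows and columns
all.equal(piped, direct)
```

If the comparison holds, the pipe has done nothing more than move airquality into the first argument slot.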
Since the pipe has already assigned a data frame to the first argument of subset, the next argument in the function is a logical expression that is used to select rows to keep (i.e., subset the data frame). I want to keep all rows where the ozone values are above 40.
Once the concept sinks in, you can easily read the rest of the code. The output of subset is piped to the first argument of transform. The expression that I wrote inside transform follows it as the next argument, and the output of transform is passed on to the first argument of head.
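One way to check your reading of the chain is to write it out as nested calls, which is roughly what the pipe spares you from typing. This is only a sketch in base R, and the object name nested is illustrative.

```r
# Nested form of the same chain, read from the innermost call outwards:
# subset() runs first, then transform(), then head()
nested <- head(
  transform(
    subset(airquality, Ozone > 40),
    Celsius = (Temp - 32) * (5/9)
  )
)
nested
```

The nested version has to be read inside out, whereas the piped version reads top to bottom in the order the steps actually run.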
So why use the pipe? For one thing, you avoid reassigning the data frame every time you change it. Here’s the subset/transformation from above without the pipes.
dat <- subset(airquality, Ozone > 40)
dat <- transform(dat, Celsius = (Temp - 32) * (5/9))
dat <- head(dat)
dat
##    Ozone Solar.R Wind Temp Month Day  Celsius
## 1     41     190  7.4   67     5   1 19.44444
## 29    45     252 14.9   81     5  29 27.22222
## 30   115     223  5.7   79     5  30 26.11111
## 40    71     291 13.8   90     6   9 32.22222
## 62   135     269  4.1   84     7   1 28.88889
## 63    49     248  9.2   85     7   2 29.44444
With the pipe, you not only avoid reassigning the data frame at every step, you also don’t have to type the data frame object as the first argument in each function.
Admittedly, the amount of typing being saved is minimal. The other main reason to use pipes is the benefit of chaining dplyr functions together. Those functions were written with the pipe in mind.
library(dplyr)
dat <- airquality %>%
  filter(Ozone > 40) %>%
  mutate(Celsius = (Temp - 32) * (5/9)) %>%
  head()
dat
##   Ozone Solar.R Wind Temp Month Day  Celsius
## 1    41     190  7.4   67     5   1 19.44444
## 2    45     252 14.9   81     5  29 27.22222
## 3   115     223  5.7   79     5  30 26.11111
## 4    71     291 13.8   90     6   9 32.22222
## 5   135     269  4.1   84     7   1 28.88889
## 6    49     248  9.2   85     7   2 29.44444
Once you get used to using the pipe, you gain the ability to quickly read a chain of dplyr functions. And this can speed up your work considerably.
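As an illustration of how a longer chain reads, here is a sketch that summarises ozone by month. It goes beyond the example above: group_by, summarise and arrange are other dplyr verbs, and the names monthly and mean_ozone are made up for this example. It assumes dplyr is installed.

```r
library(dplyr)

monthly <- airquality %>%
  filter(!is.na(Ozone)) %>%                  # drop rows with missing ozone readings
  group_by(Month) %>%                        # treat each month as its own group
  summarise(mean_ozone = mean(Ozone)) %>%    # one average per month
  arrange(desc(mean_ozone))                  # highest average first
monthly
```

Each line is one step, so the chain can be read top to bottom as a short recipe.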