The RStudio pipe operator
(R Fundamentals)
Data analysis often involves many steps. A typical journey from raw data to results might involve filtering cases, transforming values, summarising data, and then running a statistical test. But how can we link all these steps together, while keeping our code efficient and readable? Enter the pipe, R’s most important operator for data processing.
(What does the pipe do?)
The pipe operator, written as %>%, has been a longstanding feature of the magrittr package for R. It takes the output of one function and passes it into another function as an argument. This allows us to link a sequence of analysis steps.
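As a minimal sketch of that hand-off, the snippet below compares a nested call with its piped equivalent. It assumes the magrittr package is installed; the vector `x` and the variable names are made up for illustration.

```r
library(magrittr)  # provides the %>% operator

x <- c(1, 4, 9, 16)  # hypothetical input values

# Nested call: has to be read from the inside out
nested <- round(mean(sqrt(x)), 1)

# Piped call: each result becomes the first argument of the next function
piped <- x %>% sqrt() %>% mean() %>% round(1)

identical(nested, piped)  # TRUE
```

Both versions run the same three functions in the same order; only the direction in which we read them changes.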
To visualise this process, imagine a factory with different machines placed along a conveyor belt. Each machine is a function that performs a stage of our analysis, like filtering or transforming data. The pipe therefore works like a conveyor belt, transporting the output of one machine to another for further processing.
A tasty analysis procedure. Image: Shutterstock
We can see exactly how this works in a real example using the mtcars dataset. This dataset comes with base R, and contains data about the specs and fuel efficiency of various cars. The code below groups the data by the number of cylinders in each car, and then returns the mean miles-per-gallon of each group. Make sure to install the tidyverse suite of packages before running this code, since it includes both the pipe and the group_by and summarise functions.
library(tidyverse)

result <- mtcars %>%
  group_by(cyl) %>%
  summarise(meanMPG = mean(mpg))
The pipe operator feeds the mtcars dataframe into the group_by function, and then the output of group_by into summarise. The outcome of this process is stored in the tibble result, shown below.
Mean miles-per-gallon of vehicles in the mtcars dataset, grouped by number of engine cylinders.
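To inspect these group means directly, the pipeline can be rerun and printed. This is a small sketch that loads only dplyr (an assumption made to keep the example light; dplyr is part of the tidyverse and re-exports the pipe), and the rounding is added purely for display.

```r
library(dplyr)  # provides group_by(), summarise() and %>%

result <- mtcars %>%
  group_by(cyl) %>%
  summarise(meanMPG = mean(mpg))

print(result)              # one row per cylinder count: 4, 6 and 8
round(result$meanMPG, 1)   # 26.7 19.7 15.1
```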
Although this example is very simple, it demonstrates the basic pipe workflow. To go even further, I’d encourage playing around with this. Perhaps swap and add new functions to the ‘pipeline’ to gain more insight into the data. Doing this is the best way to understand how to work with the pipe. But why should we use it in the first place?
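One possible way to extend the pipeline is sketched below. The horsepower threshold, the extra `cars` column, and the name `by_cyl` are arbitrary choices made for illustration; the functions filter, n, arrange and desc all come from dplyr.

```r
library(dplyr)

by_cyl <- mtcars %>%
  filter(hp > 100) %>%                               # keep cars above 100 horsepower
  group_by(cyl) %>%                                  # THEN group by cylinder count
  summarise(meanMPG = mean(mpg), cars = n()) %>%     # THEN summarise each group
  arrange(desc(meanMPG))                             # THEN sort by fuel efficiency

print(by_cyl)
```

Each new step is one extra line, and removing or reordering a step means touching only that line.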
(Why should we use the pipe?)
The pipe has a huge advantage over any other method of processing data in R: it makes processes easy to read. If we read %>% as “then”, the code from the previous section is very easy to digest as a set of instructions in plain English:
Load tidyverse packages
To get our result, take the mtcars dataframe, THEN
Group its entries by number of cylinders, THEN
Compute the mean miles-per-gallon of each group
This is far more readable than if we were to express this process in another way. The two options below are different ways of expressing the previous code, but both are worse for a few reasons.
# Option 1: Store each step in the process sequentially
result <- group_by(mtcars, cyl)
result <- summarise(result, meanMPG = mean(mpg))

# Option 2: Chain the functions together by nesting them
result <- summarise(
  group_by(mtcars, cyl),
  meanMPG = mean(mpg))
Option 1 gets the job done, but overwriting our output dataframe result in every line is problematic. For one, doing this for a procedure with lots of steps isn’t efficient and creates unnecessary repetition in the code. This repetition also makes it harder to identify exactly what is changing on each line in some cases.
Option 2 is even less practical. Nesting each function we want to use gets ugly fast, especially for long procedures. It’s hard to read, and harder to debug. This approach also makes it tough to see the order of steps in the analysis, which is bad news if you want to add new functionality later.
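All three styles compute the same thing, so the choice between them is about readability rather than results. The sketch below checks this, assuming dplyr is installed; the names `piped`, `stepwise` and `nested` are hypothetical.

```r
library(dplyr)

# Pipe version
piped <- mtcars %>%
  group_by(cyl) %>%
  summarise(meanMPG = mean(mpg))

# Option 1: sequential overwriting
stepwise <- group_by(mtcars, cyl)
stepwise <- summarise(stepwise, meanMPG = mean(mpg))

# Option 2: nested calls
nested <- summarise(group_by(mtcars, cyl), meanMPG = mean(mpg))

# Compare as plain data frames so only the values matter
all.equal(as.data.frame(piped), as.data.frame(stepwise))  # TRUE
all.equal(as.data.frame(piped), as.data.frame(nested))    # TRUE
```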
It’s easy to see how using the pipe can substantially improve most R scripts. It makes analyses more readable, removes repetition, and simplifies the process of adding and modifying code. Is there anything it can’t do?
(What are the pipe’s limitations?)
Although it’s immensely handy, the pipe isn’t useful in every situation. Here are a few of its limitations:
- Because it chains functions in a linear order, the pipe is less applicable to problems that include multidirectional relationships.
- The pipe can only transport one object at a time, meaning it’s not so suited to functions that need multiple inputs or produce multiple outputs.
- It doesn’t work with functions that use the current environment, nor functions that use lazy evaluation. Hadley Wickham’s book “R for Data Science” has a couple of examples of these.
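The one-object constraint can be illustrated with a function that takes several inputs. In the sketch below, the pipe carries only the data frame; the formula is supplied as an ordinary argument, and magrittr's `.` placeholder tells the pipe where the piped object should go when it is not the first argument. The choice of lm and of the formula mpg ~ wt is an assumption made for illustration.

```r
library(magrittr)

# lm() expects the formula first and the data second, so the piped
# data frame is routed to the `data` argument with the `.` placeholder
model <- mtcars %>% lm(mpg ~ wt, data = .)

coef(model)  # intercept and slope of the fitted line
```

This works because there is still only one object travelling along the pipe. A step that needs two piped inputs, or that returns two separate results, has to be handled outside the chain.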
These things are to be expected. Just as you’d struggle to build a house with a single tool, no lone feature will solve all your programming problems. But for what it’s worth, the pipe is still pretty versatile. Although this piece focused on the basics, there’s plenty of scope for using the pipe in advanced or creative ways. I’ve used it in a variety of scripts, data-focused and not, and it’s made my life easier in each instance.
(Bonus pipe tips!)
Thanks for reading this far. As a reward, here are some bonus pipe tips and resources:
- Fed up of awkwardly typing %>%? The slightly easier keyboard shortcut CTRL + SHIFT + M will insert a pipe in RStudio!
- Need style guidance about how to format pipes? Check out this helpful section from ‘R Style Guide’ by Hadley Wickham.
- Want to learn a bit more about the history of pipes in R? Check out this blog post from Adolfo Álvarez.
The pipe is great. It turns your code into a list of readable instructions and has lots of other practical benefits. Now that you know about the pipe, use it and watch your code turn into a narrative.
Translated from: https://towardsdatascience.com/an-introduction-to-the-pipe-in-r-823090760d64