Python Pipelines and Data Sharing

 

By Chris Musselle and Kate Ross-Smith

For a conference on the R language, EARL London 2015 saw a surprising number of discussions about Python. I like to think that at least some of this was because, the day before the conference, we ran a 3-hour workshop outlining various strategies for integrating Python and R.

This is the first in a series of three blog posts that:

  • outline the basic strategy for integrating Python and R;
  • run through the different steps involved in this process; and
  • give a real example of how and why you would want to do this.

This post kicks everything off by:

  • covering the reasons why you may want to include both languages in a pipeline;
  • introducing ways of running R and Python from the command line; and
  • showing how you can accept inputs as arguments and write outputs to various file formats.

Why “And” not “Or”?

From a quick internet search for articles about “R Python”, of the top 10 results, only 2 discuss the merits of using both R and Python rather than pitting them against each other. This is understandable; from their inception, both have had very distinctive strengths and weaknesses. Historically, though, the split has been one of educational background: statisticians have preferred the approach that R takes, whereas programmers have made Python their language of choice. However, with the growing breed of data scientists, this distinction blurs:

Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician. — twitter @josh_wills

With the wealth of distinct library resources provided by each language, there is a growing need for data scientists to be able to leverage their relative strengths. For example:

Python tends to outperform R in such areas as:

  • Web scraping and crawling: though rvest has simplified web scraping and crawling within R, Python’s beautifulsoup and Scrapy are more mature and deliver more functionality.
  • Database connections: though R has a large number of options for connecting to databases, Python’s SQLAlchemy offers this in a single package and is widely used in production environments.

Whereas R outperforms Python in such areas as:

  • Statistical analysis options: though Python’s combination of SciPy, pandas and statsmodels offers a great set of statistical analysis tools, R is built specifically around statistical analysis applications and so provides a much larger collection of such tools.
  • Interactive graphics/dashboards: bokeh, plotly and intuitics have all recently extended the use of Python graphics onto web browsers, but getting an example up and running using shiny and shinydashboard in R is faster, and often requires less code.

Further, as data science teams now have a relatively wide range of skills, the language of choice for any application may come down to prior knowledge and experience. For some applications – especially in prototyping and development – it is faster for people to use the tool that they already know.

Flat File “Air Gap” Strategy

In this series of posts we are going to consider the simplest strategy for integrating the two languages, and step through it with some examples. Using a flat file as an air gap between the two languages requires the following steps.

  1. Refactor your R and Python scripts to be executable from the command line and accept command line arguments.
  2. Output the shared data to a common file format.
  3. Execute one language from the other, passing in arguments as required.
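Seen from the Python side, the three steps above might look roughly like the sketch below. It rests on assumptions: the script name analysis.R, the file names and the column names are all hypothetical, and the R script is assumed to read its input and output paths from its command line arguments. The command is only built, not executed, so the sketch runs without R installed.

```python
import csv
import tempfile
from pathlib import Path


def write_shared_csv(rows, path):
    """Step 2: write the shared data to a common flat-file format (CSV)."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


def build_r_command(script, input_path, output_path):
    """Steps 1 and 3: the command that would run the R script with arguments."""
    # "Rscript" must be on the PATH for this command to work when executed.
    return ["Rscript", str(script), str(input_path), str(output_path)]


# Hypothetical data and file names, for illustration only.
work_dir = Path(tempfile.mkdtemp())
input_csv = work_dir / "input.csv"
output_csv = work_dir / "output.csv"

write_shared_csv([{"id": 1, "value": 2.5}, {"id": 2, "value": 4.0}], input_csv)
command = build_r_command("analysis.R", input_csv, output_csv)
print(command)
```

Keeping the file-writing and command-building steps in separate functions makes it easy to inspect the intermediate CSV before anything is handed to the other language.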

Pros

  • Simplest method, so commonly the quickest
  • Can view the intermediate outputs easily
  • Parsers already exist for many common file formats: CSV, JSON, YAML

Cons

  • Need to agree upfront on a common schema or file format
  • Can become cumbersome to manage intermediate outputs and paths if the pipeline grows.
  • Reading and writing to disk can become a bottleneck if data becomes large.

Command Line Scripting

Running scripts from the command line in a Windows or Linux-like terminal environment works in much the same way for both R and Python. The command to be run is broken down into the following parts:

<command_to_run> <path_to_script> <any_additional_arguments>

where:

  • <command_to_run> is the executable to run (Rscript for R code and python for Python code);
  • <path_to_script> is the full or relative file path to the script being executed. Note that if there are any spaces in the path name, the whole file path must be enclosed in double quotes;
  • <any_additional_arguments> is a list of space-delimited arguments passed to the script itself. Note that these will be passed in as strings.

So for example, an R script is executed by opening up a terminal environment and running the following:

Rscript path/to/myscript.R arg1 arg2 arg3

A Few Gotchas

  • For the commands Rscript and python, these executables must already be on your path. Otherwise, the full path to their location on the file system must be supplied.
  • Path names with spaces create problems, especially on Windows, and so must be enclosed in double quotes so they are recognised as a single file path.
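When the command is launched from Python rather than typed into a terminal, both gotchas can be checked with the standard library. The sketch below is an illustration only: the helper name and the example argument are made up, and the running Python interpreter stands in for Rscript so that it works where R is not installed. `shutil.which` reports whether an executable is on the path, and passing the command to `subprocess.run` as a list keeps an argument containing spaces intact without manual quoting.

```python
import shutil
import subprocess
import sys


def find_on_path(name):
    """Return the full path to an executable on the PATH, or None if absent."""
    return shutil.which(name)


# Gotcha 1: check that the executables can be found before relying on them.
for name in ("Rscript", "python"):
    location = find_on_path(name)
    print(name, "->", location or "not on the PATH; use its full path instead")

# Gotcha 2: a hypothetical argument containing spaces. Because the command
# is a list, it reaches the child process as a single argument.
spaced_argument = "path with spaces/my data.csv"
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(len(sys.argv[1:]))", spaced_argument],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout.strip())  # number of arguments the child process received
```

The double quotes described above are still needed when the same command is typed directly into a terminal.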

Accessing Command Line Arguments in R

In the above example, where arg1, arg2 and arg3 are the arguments passed to the R script being executed, these are accessible using the commandArgs function.

## myscript.R

# Fetch command line arguments
myArgs <- commandArgs(trailingOnly = TRUE)

# myArgs is a character vector of all arguments
print(myArgs)
print(class(myArgs))

By setting trailingOnly = TRUE, the vector myArgs only contains arguments that you added on the command line. If left as FALSE (the default), the vector also includes the name of the R executable and the options passed to it, such as the path to the script being run.

Accessing Command Line Arguments in Python

For a Python script executed by running the following on the command line

python path/to/myscript.py arg1 arg2 arg3

the arguments arg1, arg2 and arg3 can be accessed from within the Python script by first importing the sys module. This module holds parameters and functions that are system specific; however, we are only interested here in the argv attribute. This argv attribute is a list of all the arguments passed to the script currently being executed. The first element in this list is always the script name (whether this is a full file path depends on the operating system and on how the script was invoked).

If you only wish to keep the arguments passed to the script, you can use list slicing to select all but the first element.

import sys

# Using a slice, selects all but the first element
my_args = sys.argv[1:]

As with the above example for R, recall that all arguments are passed in as strings, and so will need converting to the expected types as necessary.
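As a sketch of that conversion step, suppose (purely for illustration) that a script expects an input file path, a row count and a threshold, in that order. The argument names and their meanings are assumptions, not part of the scripts discussed here.

```python
def convert_args(args):
    """Convert a list of string arguments into the types the script expects."""
    if len(args) != 3:
        raise ValueError("expected: <input_path> <n_rows> <threshold>")
    input_path = args[0]        # file paths stay as strings
    n_rows = int(args[1])       # "10" becomes the integer 10
    threshold = float(args[2])  # "0.5" becomes the float 0.5
    return input_path, n_rows, threshold


# What sys.argv[1:] would hold for:
#   python path/to/myscript.py data/input.csv 10 0.5
example_args = ["data/input.csv", "10", "0.5"]
print(convert_args(example_args))
```

For scripts with more than a handful of arguments, the standard library's argparse module can perform the same conversions declaratively.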

Writing Outputs to a File

You have a few options when sharing data between R and Python via an intermediate file. In general for flat files, CSVs are a good format for tabular data, while JSON or YAML are best if you are dealing with more unstructured data (or metadata), which could contain a variable number of fields or more nested data structures.

All these are very common data serialisation formats, and parsers already exist in both languages. In R the following packages are recommended for each format:

  • readr for CSV files
  • jsonlite for JSON files
  • yaml for YAML files

And in Python:

  • csv for CSV files
  • json for JSON files
  • PyYAML for YAML files

The csv and json modules are part of the Python standard library, distributed with Python itself, whereas PyYAML will need installing separately. All R packages will also need installing in the usual way.
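To make the Python side concrete, here is a minimal sketch that writes tabular data to CSV and nested metadata to JSON using only the standard library, then reads both back. The file names, field names and values are invented for illustration.

```python
import csv
import json
import tempfile
from pathlib import Path

out_dir = Path(tempfile.mkdtemp())

# Tabular data: one row per record, with the same fields in every row.
table = [
    {"name": "a", "score": 1.5},
    {"name": "b", "score": 2.0},
]
csv_path = out_dir / "results.csv"
with open(csv_path, "w", newline="") as handle:
    writer = csv.DictWriter(handle, fieldnames=["name", "score"])
    writer.writeheader()
    writer.writerows(table)

# Less structured metadata: nested, with a variable number of fields.
metadata = {"source": "example", "params": {"threshold": 0.5, "tags": ["x", "y"]}}
json_path = out_dir / "metadata.json"
with open(json_path, "w") as handle:
    json.dump(metadata, handle, indent=2)

# Reading back: the csv module returns every value as a string,
# whereas json restores numbers, lists and nested dictionaries.
with open(csv_path, newline="") as handle:
    rows_read = list(csv.DictReader(handle))
with open(json_path) as handle:
    metadata_read = json.load(handle)

print(rows_read)
print(metadata_read)
```

On the R side, the same files could then be read with the packages listed above, for example readr's read_csv and jsonlite's fromJSON.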

Summary

So passing data between R and Python (and vice-versa) can be done in a single pipeline by:

  • using the command line to transfer arguments, and
  • transferring data through a commonly-structured flat file.

However, in some instances, having to use a flat file as an intermediate data store can be both cumbersome and detrimental to performance. In the next post, we will look at how R and Python can directly call each other and return the output in memory.

Translated from: https://www.pybloggers.com/2015/10/integrating-python-and-r-into-a-data-analysis-pipeline-part-1/
