Python Parallel Plot Library

HiPlot is Facebook’s Python library for visualizing high-dimensional data tables, released this January. It is best known for its sophisticated interactive parallel plot.

Before anything else, take a look at their compelling demo video. It shows how interactive the plot is, which is something we appreciate when doing exploratory data analysis (EDA). Then play around with their demo app on sample data. Do not forget to select a range on an axis and drag it to check the interactivity.

HiPlot is not just good-looking; it also has the following four welcome characteristics:

  • Very easy to implement

Drawing a parallel plot with the hiplot module takes literally one line of code and is almost a no-brainer.

  • Highly interactive

As you can see in the demo video above, the plot is highly interactive. A few mouse clicks let you dive deep into any subset of the data.

  • Runs fast

Despite its rich appearance, HiPlot takes little time to render a large dataset as a parallel plot. We will see this later.

  • Native HTML rendering function

HiPlot provides a native function that turns the parallel plot into HTML code (hooray!). The resulting HTML page can be saved as an .html file or served from Flask with almost no additional rendering effort. In the exercise below, I run it on Heroku through Flask.

Thanks to these benefits, I believe HiPlot is one of the first-choice tools for EDA in a data analysis project, before jumping into other, more time-consuming visualizations.

Let’s take a look at each of the benefits one by one.

(Getting Started with Iris Data)

Let’s see how easy it is to get started with HiPlot using the famous iris data set (https://archive.ics.uci.edu/ml/datasets/iris).

Installing HiPlot is as easy as installing any ordinary module. Just use pip:

pip install -U hiplot

To use an external CSV file, you do not even need a Pandas DataFrame. HiPlot provides a native method, Experiment.from_csv(), that builds the parallel plot directly from a CSV file. When you do need a DataFrame, use Experiment.from_dataframe() instead. All of this works fine in a Jupyter Notebook.

import hiplot as hip
iris_hiplot = hip.Experiment.from_csv('iris.csv')
iris_hiplot.display()

And here’s what you will see:

Iris data HiPlot
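
If you do not have iris.csv at hand, a quick way to try the same idea is to build the rows in memory. The sketch below is not from the original walkthrough: it parses three iris-style rows, hard-coded here for illustration, with the standard library, and hands them to Experiment.from_iterable(), which accepts a list of dictionaries. The helper names load_rows and show_rows are made up for this example.

```python
import csv
import io

# Three rows in the iris layout, hard-coded for illustration (not the full data set).
SAMPLE_CSV = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
"""


def load_rows(text):
    """Parse CSV text into a list of dicts, converting numeric fields to float."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        row = {}
        for key, value in record.items():
            try:
                row[key] = float(value)
            except ValueError:
                row[key] = value  # keep non-numeric columns (e.g. species) as text
        rows.append(row)
    return rows


def show_rows(rows):
    """Display the rows as a parallel plot (needs hiplot; meant for a notebook)."""
    import hiplot as hip  # imported lazily so load_rows works without hiplot installed

    return hip.Experiment.from_iterable(rows).display()


rows = load_rows(SAMPLE_CSV)
```

Calling show_rows(rows) in a notebook cell should then draw one line per row, just as from_csv() does for the full file.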

(HiPlot is Highly Interactive)

You are going to love HiPlot once you start playing around.

As you already saw in the demo video above, the chart is interactive. Here are some examples of how to use it.

The gray rectangles mark the ranges I selected on each variable. We can also move a range by dragging it.

Pointing at a record of interest with the mouse immediately highlights it in the chart.

(HiPlot Runs Fast)

To test how long the parallel plot takes to appear for larger data, I used the Kaggle FIFA 19 complete player dataset. It comes from the video game “FIFA 19”, which, as the name suggests, is a soccer (or, as you may call it, football) game where you can play as the manager of an actual soccer team or as an actual soccer player. The dataset lists all playable characters with their skill levels (e.g. how good they are at heading, free kicks, etc.). It has 18,207 rows and 89 columns. The first row is Messi from Argentina.

For the performance test, I intentionally made the data 10 times larger: I concatenated the FIFA 19 dataset 10 times and ended up with 182,070 rows. Usually, visualizing a large data set takes time to run and causes huge latency after the chart shows up. Interactivity? Forget it and rerun the script! What about HiPlot?

I ran HiPlot on the 182,070-row x 89-column dataset without any edits. It took 81 seconds to process and an additional 5 minutes to display the graph. I think that is short enough for data of this size.

import pandas as pd

# Load the FIFA 19 data; the first CSV column holds the original row index.
fifa = pd.read_csv('fifa19_data.csv', index_col='Unnamed: 0')
# Stack 10 copies of the table to get 182,070 rows.
fifa_extended = pd.concat([fifa] * 10, axis=0, ignore_index=True, sort=False)
fifa_hiplot = hip.Experiment.from_dataframe(fifa_extended)
fifa_hiplot.display()

This is how messy it looks, by the way…
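
To reproduce a timing like the one quoted above, a small stopwatch helper is enough. The following is a minimal sketch rather than the measurement code used for this post: the timed helper and the stand-in rows are assumptions made for illustration, and it only measures time spent in Python, not the time the browser needs to draw the chart.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock seconds spent inside the with-block under results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}

# Stand-in workload: 18,207 small records repeated 10 times, mirroring the row count above.
with timed("build rows", timings):
    rows = [{"id": i, "score": i % 100} for i in range(18_207)] * 10

print(f"build rows: {timings['build rows']:.3f} s for {len(rows):,} rows")
```

Wrapping hip.Experiment.from_dataframe(fifa_extended) in the same with-block would time the processing step on your own machine. The display time in the notebook still has to be judged by eye.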

What is even better about HiPlot is that data size does not hurt its interactivity much. After you select a range on an axis, it shows only the lines that fall within that range. The plot re-renders in about a second, which is surprisingly quick compared to other visualization tools.

Testing range selection.
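
Conceptually, selecting ranges on several axes keeps only the records that fall inside every selected range. The sketch below shows that filtering logic in plain Python. It is an illustration of the idea, not HiPlot’s actual implementation, and the player records and the select_ranges helper are invented for the example.

```python
def select_ranges(rows, ranges):
    """Keep rows whose values lie inside every (low, high) range given per column."""
    return [
        row
        for row in rows
        if all(low <= row[col] <= high for col, (low, high) in ranges.items())
    ]


# Made-up records with two of the FIFA 19 column names.
players = [
    {"Name": "A", "Age": 21, "Finishing": 60},
    {"Name": "B", "Age": 27, "Finishing": 88},
    {"Name": "C", "Age": 33, "Finishing": 91},
]

# Equivalent to brushing Age to 25-35 and Finishing to 85-90 on the chart.
selected = select_ranges(players, {"Age": (25, 35), "Finishing": (85, 90)})
print([p["Name"] for p in selected])
```

The same filter can be used to pull the selected subset out of your own data once the chart has shown you which ranges are interesting.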

(Native HTML Rendering Function)

When we want to share the plot with someone else, we could run the code in a notebook and copy-and-paste the plot, but that removes the graph’s interactivity.

HiPlot has a native HTML rendering function, Experiment.to_html(), which returns the HTML of a page with the plot embedded, using just one line of code.

For simplicity, let me use the original rows and fewer columns of the FIFA 19 dataset.

fifa = pd.read_csv('fifa19_data.csv',index_col='Unnamed: 0')
fifa_small = fifa[['Age','Nationality','Value','Height', 'Weight', 'Crossing', 'Finishing', 'HeadingAccuracy', 'ShortPassing', 'Volleys', 'Dribbling', 'Curve', 'FKAccuracy', 'LongPassing', 'BallControl', 'Acceleration']]
fifa_hiplot = hip.Experiment.from_dataframe(fifa_small)
fifa_hiplot.display()

The following line writes a new .html file containing the plot to your local directory.

_ = fifa_hiplot.to_html("fifa_hiplot.html")

Now you can share the .html file directly with your team members and let them play around with the plot.

Rendered as ‘fifa_hiplot.html’

Since Experiment.to_html() returns HTML code, it is just as easy to hook it up to a web server and deploy the plot externally.

Let me try it with Flask and Heroku. As I explained at length elsewhere, I needed four files pushed to a GitHub repository that is synced with the Heroku app.

  • fifa19_data.csv: the FIFA 19 dataset file.
  • hiplot_fifa.py: Python code that defines and starts the Flask app.
import hiplot as hip
import pandas as pd
from flask import Flask

app = Flask(__name__)

@app.route('/')
def fifa_experiment():
    fifa = pd.read_csv('fifa19_data.csv',index_col='Unnamed: 0')
    fifa_hiplot = hip.Experiment.from_dataframe(fifa[['Age','Nationality','Value','Height', 'Weight', 'Crossing','Finishing', 'HeadingAccuracy', 'ShortPassing', 'Volleys', 'Dribbling', 'Curve', 'FKAccuracy', 'LongPassing', 'BallControl', 'Acceleration']])

    return fifa_hiplot.to_html()

if __name__ == "__main__":
    app.run()
  • requirements.txt: the requirements file Heroku uses to install the necessary modules.
gunicorn==19.9.0
pandas==0.24.2
hiplot==0.1.12
Flask==1.1.1
  • Procfile: the start-up command Heroku runs when it starts the app.
web: gunicorn hiplot_fifa:app

Sync the code files and set up the app on Heroku. Opening the web app shows the following HiPlot graph page, which is accessible from anywhere through its domain name.

HiPlot plot deployed through Flask and served by Heroku.
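
One thing to notice about the route in hiplot_fifa.py is that it re-reads the CSV and rebuilds the plot on every request. A possible refinement, which is not part of the app deployed above, is to render the HTML once and reuse the string. The sketch below shows that idea with functools.lru_cache. The names build_plot_html and render_once are hypothetical, the column list is shortened, and a stand-in renderer is used so the caching can be checked without the data file.

```python
from functools import lru_cache


def build_plot_html(csv_path="fifa19_data.csv"):
    """Read the CSV and return the HiPlot page as an HTML string (needs pandas and hiplot)."""
    import hiplot as hip
    import pandas as pd

    fifa = pd.read_csv(csv_path, index_col="Unnamed: 0")
    columns = ["Age", "Nationality", "Value", "Height", "Weight"]
    return hip.Experiment.from_dataframe(fifa[columns]).to_html()


def render_once(render):
    """Wrap a zero-argument function so its result is computed once and then reused."""
    return lru_cache(maxsize=None)(render)


# Stand-in renderer that records how often it actually runs.
calls = []


def fake_render():
    calls.append(1)
    return "<html>plot</html>"


cached_render = render_once(fake_render)
first = cached_render()
second = cached_render()  # served from the cache, fake_render is not called again
print(len(calls))
```

In the Flask app, this would mean creating cached_page = render_once(build_plot_html) at module level and returning cached_page() from the route. The trade-off is that the page does not pick up changes to the CSV until the app restarts.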

(Ending Note)

In this post, I introduced HiPlot, currently the best parallel plot option for starting EDA with an overview of how the variables interact.

It is easy to run, highly interactive, fast, and easy to share through an HTML file or HTML code.

Although data science projects certainly need further investigation, such as looking at the shape of the value distributions and imputing missing values, HiPlot is a great help in starting off data preprocessing and understanding.

Translated from: https://towardsdatascience.com/introduction-to-best-parallel-plot-python-library-hiplot-8387f5786d97
