Spark: Background and Fundamentals

Compiled from Spark: The Definitive Guide
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Apache Spark: a unified computing engine and a set of libraries for parallel data processing on computer clusters.

The Philosophy of Apache Spark

Unified

Spark supports a wide range of data analytics tasks, from simple data loading and SQL queries to machine learning and streaming computation, over the same computing engine and with a consistent set of APIs.

Computing engine

Spark limits its scope to being a computing engine: it handles loading data and performing computation on it, but it does not permanently store the data itself.

Libraries

The libraries provide a unified API for common data analysis tasks. They include SQL and structured data (Spark SQL), machine learning (MLlib), stream processing (Spark Streaming and the newer Structured Streaming), and graph analytics (GraphX), along with a large ecosystem of third-party libraries (connectors to various storage systems, machine learning algorithms, and so on).

Context: The Big Data Problem

For a long time, computers got faster year after year through increases in processor speed. As a result, applications ran faster without any changes to their code, which fostered an ever larger ecosystem of applications, most of them single-threaded. These applications, in turn, kept increasing the amount of computation and data that computers had to handle.

Unfortunately, this hardware trend stopped around 2005. Because of limits such as heat dissipation, hardware developers stopped making individual processors faster and switched to adding more CPU cores running in parallel. This meant that applications suddenly had to add parallelism in order to run faster, which set the stage for new programming models such as Apache Spark.

There is another key trend: even though processor speeds stalled, the technologies for storing and collecting data did not slow down. The cost of storing 1 TB of data drops roughly twofold about every 14 months. In addition, the technologies for collecting data (sensors, cameras, public datasets, and so on) keep falling in cost while their resolution keeps improving.

The result is that collecting data has become extremely inexpensive, but processing it often requires large, parallel computations, typically on clusters of machines. Moreover, traditional software does not automatically adapt to this new environment and its demands, which created the need for new programming models. This is the context in which Apache Spark was created.

Spark’s Basic Architecture

A cluster, or group, of computers, pools the resources of many machines together, giving us the ability to use all the cumulative resources as if they were a single computer. Now, a group of machines alone is not powerful, you need a framework to coordinate work across them. Spark does just that, managing and coordinating the execution of tasks on data across a cluster of computers.

Spark Applications

Spark Applications consist of a driver process and a set of executor processes. The driver process runs your main() function, sits on a node in the cluster, and is responsible for three things: maintaining information about the Spark Application; responding to a user’s program or input; and analyzing, distributing, and scheduling work across the executors (discussed momentarily). The driver process is absolutely essential—it’s the heart of a Spark Application and maintains all relevant information during the lifetime of the application.

The executors are responsible for actually carrying out the work that the driver assigns them. This means that each executor is responsible for only two things: executing code assigned to it by the driver, and reporting the state of the computation on that executor back to the driver node.

NOTE

Spark, in addition to its cluster mode, also has a local mode. The driver and executors are simply processes, which means that they can live on the same machine or different machines. In local mode, the driver and executors run (as threads) on your individual computer instead of a cluster. We wrote this book with local mode in mind, so you should be able to run everything on a single machine.

Here are the key points to understand about Spark Applications at this point:

  • Spark employs a cluster manager that keeps track of the resources available.
  • The driver process is responsible for executing the driver program’s commands across the executors to complete a given task.

Spark’s Language APIs

Scala
Spark is primarily written in Scala, making it Spark’s “default” language. This book will include Scala code examples wherever relevant.

Java
Even though Spark is written in Scala, Spark’s authors have been careful to ensure that you can write Spark code in Java. This book will focus primarily on Scala but will provide Java examples where relevant.

Python
Python supports nearly all constructs that Scala supports. This book will include Python code examples whenever we include Scala code examples and a Python API exists.

SQL
Spark supports a subset of the ANSI SQL 2003 standard. This makes it easy for analysts and non-programmers to take advantage of the big data powers of Spark. This book includes SQL code examples wherever relevant.
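
As an illustration (not from the book), here is a hedged sketch of issuing SQL from Scala; the view name numbers and the data are made up for this example:

// in Scala
// Register a tiny DataFrame as a temporary view so it can be queried with SQL.
spark.range(10).toDF("number").createOrReplaceTempView("numbers")

// Run an ANSI SQL query against the view; the result comes back as a DataFrame.
val evens = spark.sql("SELECT number FROM numbers WHERE number % 2 = 0")
evens.show()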

R

Spark has two commonly used R libraries: one as a part of Spark core (SparkR) and another as an R community-driven package (sparklyr). We cover both of these integrations in Chapter 32.

Spark’s APIs

Spark has two fundamental sets of APIs: the low-level “unstructured” APIs, and the higher-level structured APIs.

The SparkSession

As discussed in the beginning of this chapter, you control your Spark Application through a driver process called the SparkSession. The SparkSession instance is the way Spark executes user-defined manipulations across the cluster. There is a one-to-one correspondence between a SparkSession and a Spark Application. In Scala and Python, the variable is available as spark when you start the console. Let’s go ahead and look at the SparkSession in both Scala and/or Python:

spark

In Scala, you should see something like the following:

res0: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@…
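
If you are not working in the console (for example, in a standalone application), a SparkSession is not created for you automatically. Here is a minimal sketch, not taken from this excerpt, of building one explicitly with a hypothetical application name and a local master:

// in Scala
import org.apache.spark.sql.SparkSession

// Build (or reuse) a SparkSession; "local[*]" runs Spark in local mode on all available cores.
val spark = SparkSession.builder()
  .appName("definitive-guide-notes")  // hypothetical application name
  .master("local[*]")
  .getOrCreate()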

DataFrames

A DataFrame is the most common Structured API and simply represents a table of data with rows and columns. The list that defines the columns and the types within those columns is called the schema.
A spreadsheet sits on one computer in one specific location, whereas a Spark DataFrame can span thousands of computers.

Python/R DataFrames (with some exceptions) exist on one machine rather than multiple machines. However, because Spark has language interfaces for both Python and R, it’s quite easy to convert Pandas (Python) DataFrames to Spark DataFrames, and R DataFrames to Spark DataFrames.

Spark’s distributed collections of data: Datasets, DataFrames, SQL Tables, and Resilient Distributed Datasets (RDDs).
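
The transformation and action examples later in this section refer to a DataFrame named myRange; in the book’s running example it is created from a simple range of numbers:

// in Scala
// A DataFrame with a single column called "number" holding the values 0 through 999.
val myRange = spark.range(1000).toDF("number")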

Partition

To allow every executor to perform work in parallel, Spark breaks up the data into chunks called partitions. A partition is a collection of rows that sit on one physical machine in your cluster. A DataFrame’s partitions represent how the data is physically distributed across the cluster of machines during execution. If you have one partition, Spark will have a parallelism of only one, even if you have thousands of executors. If you have many partitions but only one executor, Spark will still have a parallelism of only one because there is only one computation resource.
An important thing to note is that with DataFrames you do not (for the most part) manipulate partitions manually or individually. You simply specify high-level transformations of data in the physical partitions, and Spark determines how this work will actually execute on the cluster. Lower-level APIs do exist (via the RDD interface), and we cover those in Part III.

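As a side note (not part of the book’s text here), the partitioning of a DataFrame can be inspected through the RDD interface and changed with repartition; a sketch, assuming the myRange DataFrame defined above:

// in Scala
// How many partitions currently back the DataFrame.
println(myRange.rdd.getNumPartitions)

// Ask Spark to redistribute the data into 8 partitions (repartition is itself a wide transformation).
val myRange8 = myRange.repartition(8)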

Transformations

In Spark, the core data structures are immutable, meaning they cannot be changed after they’re created. This might seem like a strange concept at first: if you cannot change it, how are you supposed to use it? To “change” a DataFrame, you need to instruct Spark how you would like to modify it to do what you want. These instructions are called transformations. Let’s perform a simple transformation to find all even numbers in our current DataFrame:
// in Scala
val divisBy2 = myRange.where("number % 2 = 0")

# in Python
divisBy2 = myRange.where("number % 2 = 0")

Notice that these return no output. This is because we specified only an abstract transformation, and Spark will not act on transformations until we call an action (we discuss this shortly).
Transformations are the core of how you express your business logic using Spark. There are two types of transformations: those that specify narrow dependencies, and those that specify wide dependencies.

Transformations consisting of narrow dependencies (we’ll call them narrow transformations) are those for which each input partition will contribute to only one output partition. In the preceding code snippet, the where statement specifies a narrow dependency, where only one partition contributes to at most one output partition, as you can see in Figure 2-4.

Figure 2-4. A narrow dependency

A wide dependency (or wide transformation) style transformation will have input partitions contributing to many output partitions. You will often hear this referred to as a shuffle, whereby Spark will exchange partitions across the cluster. With narrow transformations, Spark will automatically perform an operation called pipelining, meaning that if we specify multiple filters on DataFrames, they’ll all be performed in-memory. The same cannot be said for shuffles. When we perform a shuffle, Spark writes the results to disk. Wide transformations are illustrated in Figure 2-5.

Figure 2-5. A wide dependency
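
For contrast with the narrow where transformation above, here is a hedged sketch (not from this excerpt) of a wide transformation: sorting must compare rows from every input partition, so it shuffles data across the cluster once an action triggers execution:

// in Scala
// sort is a wide transformation; the shuffle only happens when an action is eventually called.
val sorted = myRange.sort("number")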

Lazy Evaluation

Lazy evaluation means that Spark will wait until the very last moment to execute the graph of computation instructions. In Spark, instead of modifying the data immediately when you express some operation, you build up a plan of transformations that you would like to apply to your source data. By waiting until the last minute to execute the code, Spark compiles this plan from your raw DataFrame transformations to a streamlined physical plan that will run as efficiently as possible across the cluster. This provides immense benefits because Spark can optimize the entire data flow from end to end. An example of this is something called predicate pushdown on DataFrames. If we build a large Spark job but specify a filter at the end that only requires us to fetch one row from our source data, the most efficient way to execute this is to access the single record that we need. Spark will actually optimize this for us by pushing the filter down automatically.
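
You can inspect the plan Spark compiles without triggering any execution; a small sketch using the divisBy2 DataFrame defined earlier (the exact plan output depends on your Spark version):

// in Scala
// Print the physical plan Spark would use for this chain of transformations; nothing runs yet.
divisBy2.explain()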

Actions

Transformations allow us to build up our logical transformation plan. To trigger the computation, we run an action. An action instructs Spark to compute a result from a series of transformations. The simplest action is count, which gives us the total number of records in the DataFrame:
divisBy2.count()
The output of the preceding code should be 500. Of course, count is not the only action. There are three kinds of actions:
  • Actions to view data in the console
  • Actions to collect data to native objects in the respective language
  • Actions to write to output data sources
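
To make the three kinds of actions concrete, a short sketch (not from this excerpt; the output path is hypothetical):

// in Scala
// 1. View data in the console.
divisBy2.show(5)

// 2. Collect data back to the driver as native Scala objects (an Array of Rows).
val rows = divisBy2.collect()

// 3. Write to an output data source.
divisBy2.write.mode("overwrite").parquet("/tmp/divisBy2-output")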