This article is translated from the Spark 2.2.0 Cluster Mode Overview (http://spark.apache.org/docs/latest/cluster-overview.html).

1. Components

A Spark application consists of a set of independent processes on the cluster, coordinated by the SparkContext object (the SparkContext is created in the driver program).

When running on a cluster, the SparkContext can connect to several types of cluster managers (Spark's own standalone cluster manager, Mesos, or YARN), which allocate resources across applications. Once connected, Spark acquires executors on nodes in the cluster; executors are processes that run computations and store data for the application. Next, Spark sends the application code (a JAR or Python files) to the executors. Finally, the SparkContext sends tasks to the executors to run.
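To make this flow concrete, here is a minimal PySpark sketch of the driver side; the master URL, host name, and application name are placeholder values, not anything from the original article.

```python
# Minimal sketch of the driver side (master URL and app name are placeholders).
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("cluster-overview-demo")      # hypothetical application name
        .setMaster("spark://master-host:7077"))   # standalone cluster manager URL (placeholder host)

# Creating the SparkContext connects to the cluster manager, which allocates
# executors for this application on the worker nodes.
sc = SparkContext(conf=conf)

# An action like sum() is broken into tasks that the SparkContext sends to
# the executors; the application code itself was shipped to them earlier.
print(sc.parallelize(range(100)).sum())

sc.stop()
```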

[Figure: Spark cluster mode overview - the driver program's SparkContext talks to a cluster manager, which provides executors on the worker nodes]

 

A few useful things to note about this architecture:

1. Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads. The benefit of this design is that it isolates applications from each other, both on the scheduling side (each driver schedules its own tasks) and on the executor side (tasks from different applications run in separate Java virtual machines). However, it also means that data cannot be shared across applications without writing it to an external storage system.

2. Spark is agnostic to the underlying cluster manager: as long as it can acquire executor processes and these can communicate with each other, it does not matter which manager is used.

3. The driver program must listen for and accept incoming connections from its executors throughout its lifetime. The driver must therefore be network-addressable from the worker nodes.

4. Because the driver schedules tasks on the cluster, it should run close to the worker nodes, preferably on the same local area network. If you need to send requests to the cluster from far away, it is better to open an RPC (Remote Procedure Call, a protocol for requesting a service from a remote program without knowing the underlying network details) to the driver and have it submit operations from close to the worker nodes, rather than running the driver far away from them.

 

2. Cluster Manager Types

The system currently supports the following cluster managers (the master URL forms used to select one are sketched after this list):

• Standalone - a simple cluster manager that ships with Spark.

• Apache Mesos - a general-purpose cluster manager that can also run Hadoop MapReduce and service applications.

• Hadoop YARN - the resource manager in Hadoop 2.

• Kubernetes (experimental)
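Each of these is selected through the master URL passed when the application is created or submitted. The forms below are the standard master URL schemes; the host names and ports are placeholders.

```python
# Master URL schemes for the supported cluster managers (hosts and ports are placeholders):
#   local[4]                    - run locally with 4 worker threads (no cluster manager)
#   spark://master-host:7077    - Spark standalone cluster manager
#   mesos://mesos-host:5050     - Apache Mesos
#   yarn                        - Hadoop YARN (cluster location is read from the Hadoop configuration)
#   k8s://https://k8s-host:443  - Kubernetes (experimental)
from pyspark import SparkConf

conf = SparkConf().setAppName("pick-a-cluster-manager").setMaster("yarn")
```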

 

3. Submitting Applications

Applications can be submitted using the spark-submit script. See http://spark.apache.org/docs/latest/submitting-applications.html for details.
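As a sketch, a small Python application and a typical spark-submit invocation might look like the following; the file name, master URL, and host are placeholders, and --master / --deploy-mode are standard spark-submit options.

```python
# my_app.py - a minimal application to submit (file name and URLs are placeholders).
#
# A typical submission from the command line:
#   spark-submit --master spark://master-host:7077 --deploy-mode client my_app.py
# With --deploy-mode cluster (on YARN, for example) the driver itself is
# launched inside the cluster instead of on the submitting machine.
from pyspark import SparkContext

sc = SparkContext(appName="submit-example")   # the master is normally supplied by spark-submit
print(sc.parallelize([1, 2, 3, 4]).count())
sc.stop()
```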

 

4. Monitoring

The monitoring web UI is at http://<driver-node>:4040. See http://spark.apache.org/docs/latest/monitoring.html for more monitoring options.

 

5. Job Scheduling

Spark gives you control over resource allocation both across applications (at the cluster manager level) and within an application (when multiple computations run in the same SparkContext). See http://spark.apache.org/docs/latest/job-scheduling.html for details.
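As one example of within-application scheduling, jobs can be assigned to named fair-scheduler pools once spark.scheduler.mode is set to FAIR; the pool name and master URL below are placeholders.

```python
# Sketch of within-application scheduling (pool name and master are placeholders).
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("scheduling-demo")
        .setMaster("local[2]")                    # placeholder master
        .set("spark.scheduler.mode", "FAIR"))     # enable fair scheduling between jobs

sc = SparkContext(conf=conf)

# Jobs submitted from this thread now go to the "reporting" pool.
sc.setLocalProperty("spark.scheduler.pool", "reporting")
print(sc.parallelize(range(10)).count())

# Unset the property to return to the default pool.
sc.setLocalProperty("spark.scheduler.pool", None)

sc.stop()
```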

 

6. Glossary

The following terms are used to refer to cluster concepts:

• Application - User program built on Spark. Consists of a driver program and executors on the cluster.

• Application jar - A jar containing the user's Spark application. In some cases users will want to create an "uber jar" containing their application along with its dependencies. The user's jar should never include Hadoop or Spark libraries; these will be added at runtime.

• Driver program - The process running the main() function of the application and creating the SparkContext.

• Cluster manager - An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN).

• Deploy mode - Distinguishes where the driver process runs. In "cluster" mode, the framework launches the driver inside of the cluster. In "client" mode, the submitter launches the driver outside of the cluster.

• Worker node - Any node that can run application code in the cluster.

• Executor - A process launched for an application on a worker node, which runs tasks and keeps data in memory or disk storage across them. Each application has its own executors.

• Task - A unit of work that will be sent to one executor.

• Job - A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect); you'll see this term used in the driver's logs.

• Stage - Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); you'll see this term used in the driver's logs.
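To tie the last few terms together, the sketch below (master, app name, and data are made up) runs one action: collect() triggers a single job, the shuffle introduced by reduceByKey splits that job into two stages, and each stage runs as a set of tasks on the executors.

```python
# Sketch relating job, stage, and task (master, app name, and data are made up).
from pyspark import SparkContext

sc = SparkContext("local[2]", "glossary-demo")

words = sc.parallelize(["spark", "yarn", "spark", "mesos"])
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)  # transformations only; no job yet

# collect() is an action: it triggers one job, which the shuffle from
# reduceByKey splits into two stages, each executed as a set of tasks.
print(counts.collect())

sc.stop()
```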