Spark Local Testing and Verification: Debugging Spark in Local Mode

Reposted

coolfengsy 2024-05-07 12:22:04

Tags: Spark local testing and verification, spark, big data, jar. Category: Spark, Big Data

1. Spark deployment modes

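1.1 Local mode

Local mode runs Spark on a single machine and is typically used for practice and testing on your own computer. The master can be set in the following ways.

local: all computation runs in a single thread with no parallelism. This is the usual choice for running test code or practicing locally.
local[K]: runs the computation with K worker threads; for example, local[4] runs 4 worker threads. K is usually set to the number of CPU cores to make full use of the machine's compute capacity.
local[*]: sets the number of threads to the number of logical cores on the machine.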
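
The three master strings above can be sketched as a small lookup. This is only an illustration in plain Python, not Spark's own parsing code: the function name `local_thread_count` is hypothetical, `os.cpu_count()` stands in for the core count that Spark reads on the JVM, and other forms of the local master string are deliberately not handled.

```python
import os
import re


def local_thread_count(master: str) -> int:
    """Return the number of worker threads implied by a local master string."""
    if master == "local":
        return 1  # a single thread, no parallelism
    match = re.fullmatch(r"local\[(\*|\d+)\]", master)
    if match is None:
        raise ValueError(f"not a supported local master string: {master!r}")
    value = match.group(1)
    if value == "*":
        # os.cpu_count() can return None on unusual platforms; fall back to 1
        return os.cpu_count() or 1
    return int(value)
```

For example, `local_thread_count("local[4]")` returns 4, while `local_thread_count("local[*]")` depends on the machine it runs on.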

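# --cluster is kept from the original; stock Apache spark-submit does not define this option, so it presumably belongs to a vendor wrapper.
/bin/spark-submit \
--cluster cluster_name \
--master local[*] \
...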

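All of these local modes run on a single machine and are typically used for practice and testing. Real large-scale computation needs the cluster modes described next.

1.2 Cluster mode

Cluster mode runs across many machines. It comes in three variants, which differ in who manages resource scheduling. (Think of the resource manager as a quartermaster: wherever resources are needed, it is responsible for allocating them.)

1.2.1 Standalone mode

In this mode Spark manages and schedules resources itself. It divides the machines in the cluster into master and worker machines. There is usually a single master, which plays the quartermaster role, while the workers do the actual computation.

Example of using standalone mode: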

/bin/spark-submit \
--cluster cluster_name \
--master spark://host:port \
...

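--master specifies the address and port of the master machine, which is presumably where the --master option gets its name.

1.2.2 Mesos mode

This one is straightforward: if Mesos manages resource scheduling, use Mesos mode. For example: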

/bin/spark-submit \
--cluster cluster_name \
--master mesos://host:port \
...

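1.2.3 YARN mode

Likewise, if YARN manages resource scheduling, use YARN mode. Because Spark often has to share a cluster with MapReduce, YARN is commonly used to manage resources, which is why most production environments use YARN mode. YARN mode is further divided into yarn cluster mode and yarn client mode:

yarn cluster: the mode commonly used in production. All resource scheduling and computation run on the cluster, and the client only submits the application. The driver runs inside the Application Master, so once the job is submitted the client can be closed and the job keeps running on YARN. For this reason yarn cluster mode is not suitable for interactive jobs.
yarn client: the Spark driver runs on the local machine, while the computation tasks run on the cluster. This lets the Spark application interact with the client. The Application Master only requests executors from YARN; the client communicates with the allocated containers to schedule their work, so the client must stay connected.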
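
To summarize section 1, the sketch below maps a --master value to the resource manager it selects and a --deploy-mode value to where the driver runs. It is a plain-Python illustration under stated assumptions: the function names `cluster_manager` and `driver_location` are hypothetical, and only the master forms mentioned in this article are recognized.

```python
def cluster_manager(master: str) -> str:
    """Classify a --master value into the modes described in this article."""
    if master == "local" or master.startswith("local["):
        return "local"
    if master.startswith("spark://"):
        return "standalone"
    if master.startswith("mesos://"):
        return "mesos"
    if master == "yarn":
        return "yarn"
    raise ValueError(f"master form not covered by this sketch: {master!r}")


def driver_location(deploy_mode: str) -> str:
    """Say where the driver runs for a given --deploy-mode value."""
    if deploy_mode == "client":
        # The submitting process hosts the driver and must stay alive.
        return "submitting machine"
    if deploy_mode == "cluster":
        # The driver runs inside the cluster; the client may disconnect.
        return "inside the cluster"
    raise ValueError(f"unknown deploy mode: {deploy_mode!r}")
```

For instance, `cluster_manager("spark://host:7077")` returns `"standalone"` (the port here is only an example value), and `driver_location("cluster")` returns `"inside the cluster"`.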

2. Submitting jobs

spark-submit \
--master yarn \
--queue test_queue \
--deploy-mode client \
--driver-memory 10g \
--executor-cores 2 \
--executor-memory 1G \
--num-executors 8  \
--class com.data.Test  \
hdfs:user/test.jar arg1 arg2


spark-sql \
--queue test_queue \
--deploy-mode client \
--num-executors 10 \
--executor-memory 10g \
--executor-cores 5


spark-shell \
--queue test_queue
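
It can help to estimate how much a command asks of the queue before submitting it. The sketch below multiplies out the executor settings from the spark-submit and spark-sql commands above. The helper name `executor_totals` is hypothetical, and the figure covers only the requested executor memory: it ignores the per-executor memory overhead that YARN adds on top, and it ignores the driver, which in client mode runs on the submitting machine.

```python
def executor_totals(num_executors: int, executor_cores: int, executor_memory_gb: int):
    """Return (total cores, total executor memory in GB) requested from the cluster."""
    total_cores = num_executors * executor_cores
    total_memory_gb = num_executors * executor_memory_gb
    return total_cores, total_memory_gb


# spark-submit example: 8 executors, 2 cores each, 1G each
print(executor_totals(8, 2, 1))    # (16, 8)

# spark-sql example: 10 executors, 5 cores each, 10g each
print(executor_totals(10, 5, 10))  # (50, 100)
```

The spark-sql session therefore asks for roughly six times the memory and three times the cores of the spark-submit job, which is worth checking against what the queue can actually provide.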

3. Parameter reference

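Parameter	Description	Example
--master	Address of the master, i.e. where the submitted job runs	--master yarn
--deploy-mode	Where the driver program runs	client: the driver runs on the client side; cluster: the driver runs on a worker node
--queue	YARN queue that the job is submitted to	--queue test
--num-executors	Number of executors to launch; default 2; used on YARN	--num-executors 100 (if set too high, the queue may not be able to provide enough resources)
--executor-memory	Memory per executor; default 1G	--executor-memory 10G
--executor-cores	Number of cores per executor; used on YARN or standalone	--executor-cores 2
--class	Main class of the application, mainly for Java or Scala	--class com.data.Test
--jars	JARs that the Spark job depends on, comma-separated	--jars hoodie-hive-0.4.7.jar,hoodie-common-0.4.7.jar
--py-files	Python files the job depends on	--py-files test.py
--driver-cores	Number of cores for the driver; default 1	--driver-cores 2
--driver-memory	Memory for the driver; default 1G	--driver-memory 5G
--conf key=value	Sets a Spark property	--conf spark.executor.memoryOverhead=4G
--packages	Maven coordinates of JARs to include on the driver and executor classpaths, written as `groupId:artifactId:version`; downloaded automatically on first run	--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0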
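
When the same options are reused across jobs, one possible approach is to assemble the command line from a dictionary. The sketch below does this in plain Python. The helper `build_spark_submit` is hypothetical and not part of Spark, the option values are copied from the examples in this article, and the script only prints the command rather than running it.

```python
import shlex


def build_spark_submit(app, app_args, options, conf=None):
    """Build a spark-submit argument list from option and property dictionaries."""
    argv = ["spark-submit"]
    for name, value in options.items():
        argv += [f"--{name}", str(value)]          # e.g. --master yarn
    for key, value in (conf or {}).items():
        argv += ["--conf", f"{key}={value}"]       # e.g. --conf key=value
    argv.append(app)                               # application JAR or script
    argv.extend(app_args)                          # arguments for the application
    return argv


cmd = build_spark_submit(
    "hdfs:user/test.jar",
    ["arg1", "arg2"],
    options={
        "master": "yarn",
        "deploy-mode": "client",
        "queue": "test_queue",
        "num-executors": 8,
        "executor-cores": 2,
        "executor-memory": "1G",
        "class": "com.data.Test",
    },
    conf={"spark.executor.memoryOverhead": "4G"},
)
print(shlex.join(cmd))
```

Keeping the command as a list makes it safe to hand to `subprocess.run(cmd)` without shell quoting issues; `shlex.join` is used here only to display it.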