Background: the deployment modes of a Spark cluster
Cluster modes
I. Standalone mode
1. Extract the archive (then rename the extracted directory to spark-2.4.5, matching the SPARK_HOME used below)
[root@Linux121 servers]# tar -zxvf spark-2.4.5-bin-without-hadoop-scala-2.12.tgz
2. Configure the environment variables (e.g. in /etc/profile; reload the profile after editing)
export SPARK_HOME=/opt/lagou/servers/spark-2.4.5
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
3. Edit the configuration files
[root@Linux121 spark-2.4.5]# cd conf/
[root@Linux121 conf]# ls
docker.properties.template metrics.properties.template spark-env.sh.template
fairscheduler.xml.template slaves.template
log4j.properties.template spark-defaults.conf.template
[root@Linux121 conf]# cp spark-env.sh.template spark-env.sh
[root@Linux121 conf]# cp slaves.template slaves
[root@Linux121 conf]# cp spark-defaults.conf.template spark-defaults.conf
[root@Linux121 conf]# cp log4j.properties.template log4j.properties
Add the hostnames of all worker nodes to the slaves file:
# A Spark Worker will be started on each of the machines listed below.
Linux121
Linux122
Linux123
Edit spark-env.sh:
export JAVA_HOME=/opt/lagou/servers/jdk1.8.0_301
export HADOOP_HOME=/opt/lagou/servers/hadoop-2.9.2
export HADOOP_CONF_DIR=/opt/lagou/servers/hadoop-2.9.2/etc/hadoop
export SPARK_DIST_CLASSPATH=$(/opt/lagou/servers/hadoop-2.9.2/bin/hadoop classpath)
export SPARK_MASTER_HOST=Linux121
export SPARK_MASTER_PORT=7077
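Optionally, spark-env.sh can also cap the resources each Worker offers to executors. A sketch with illustrative values (not from the original setup; tune to each node's actual hardware):

```shell
# Illustrative spark-env.sh additions -- the values below are assumptions.
export SPARK_WORKER_CORES=2      # CPU cores this Worker may hand to executors
export SPARK_WORKER_MEMORY=2g    # memory this Worker may hand to executors
```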
spark-defaults.conf:
spark.master spark://Linux121:7077
spark.eventLog.enabled true
spark.eventLog.dir hdfs://Linux121:9000/Spark-EventLog
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory 512m
Note: the event-log directory must exist in HDFS before applications start, so create it first:
hdfs dfs -mkdir -p /Spark-EventLog
4. Distribute Spark to the other cluster nodes
scp -r spark-2.4.5/ Linux122:$PWD
scp -r spark-2.4.5/ Linux123:$PWD
Also add the environment variables on each node (as in step 2).
5. Start the cluster
[root@Linux121 sbin]# ./start-all.sh
Open http://linux121:8080/ to view the Master web UI.
Because Hadoop's sbin directory also contains a start-all.sh, rename the Hadoop scripts to avoid the name clash:
[root@Linux121 sbin]# mv start-all.sh start-all-hadoop.sh
[root@Linux121 sbin]# mv stop-all.sh stop-all-hadoop.sh
6. Test the cluster
Note: in Standalone mode, stop YARN — one resource manager is enough, and running two invites conflicts. If HDFS is configured (the event log above uses it), HDFS must be running.
Method 1
[root@Linux121 ~]# run-example SparkPi 10
Pi is roughly 3.142891142891143
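For intuition, SparkPi's estimate can be mimicked with a tiny Monte Carlo sketch in plain awk (no Spark involved; the sample count and seed below are arbitrary):

```shell
# Sample random points in the unit square; the fraction landing inside the
# quarter circle approximates pi/4. Same idea as SparkPi, minus Spark.
awk 'BEGIN {
  srand(42); n = 100000; hits = 0
  for (i = 0; i < n; i++) {
    x = rand(); y = rand()
    if (x*x + y*y <= 1.0) hits++
  }
  printf "Pi is roughly %f\n", 4.0 * hits / n
}'
```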
Method 2
[root@Linux121 ~]# spark-shell
Spark context Web UI available at http://Linux121:4040
Spark context available as 'sc' (master = spark://Linux121:7077, app id = app-20220216234157-0004).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.4.5
/_/
Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_301)
Type in expressions to have them evaluated.
Type :help for more information.
scala> 1+2
res0: Int = 3
scala> val lines = sc.textFile("/wcinput/tmp.txt")
lines: org.apache.spark.rdd.RDD[String] = /wcinput/tmp.txt MapPartitionsRDD[5] at textFile at <console>:24
scala> lines.flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).collect().foreach(println)
(JAVA,5)
(Linux,4)
(hadoop,7)
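The RDD chain above (flatMap → map → reduceByKey) has a direct plain-shell analogue, handy for sanity-checking expected output on small files (the sample input here is made up):

```shell
# Split lines into words, sort, count duplicates, and print (word,count)
# pairs -- the same shape as the spark-shell result above.
printf 'JAVA Linux hadoop\nhadoop JAVA\n' \
  | tr ' ' '\n' \
  | LC_ALL=C sort \
  | uniq -c \
  | awk '{ print "(" $2 "," $1 ")" }'
```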
7. Standalone resource configuration
You can customize the cores and memory Spark uses to match the resources of your cluster nodes.
Run a job in client mode
spark-submit --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.12-2.4.5.jar 2000
Run a job in cluster mode
spark-submit --class org.apache.spark.examples.SparkPi --deploy-mode cluster $SPARK_HOME/examples/jars/spark-examples_2.12-2.4.5.jar 2000
Client mode suits debugging; cluster mode suits production.
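Per-job resources can also be set on the spark-submit command line instead of in the config files. A sketch with illustrative flag values (the numbers are examples, not recommendations):

```shell
# spark-submit resource flags: driver memory, per-executor memory, and a
# cap on total executor cores for this one job (Standalone mode).
spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --deploy-mode cluster \
  --driver-memory 512m \
  --executor-memory 1g \
  --total-executor-cores 4 \
  $SPARK_HOME/examples/jars/spark-examples_2.12-2.4.5.jar 100
```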
8. Enable the history server
Apply the following configuration and distribute it to every node.
spark-defaults.conf
spark.eventLog.enabled true
spark.eventLog.dir hdfs://Linux121:9000/Spark-EventLog
spark.eventLog.compress true
spark-env.sh
export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=18080 -Dspark.history.retainedApplications=50 -Dspark.history.fs.logDirectory=hdfs://Linux121:9000/Spark-EventLog"
Start the history server
start-history-server.sh
http://linux121:18080
9. Configure HA (high availability)
Use ZooKeeper to coordinate Master failover.
Start the ZooKeeper cluster
[root@Linux121 shell]# sh zk.sh start
start Linux121 zookeeper server...
ZooKeeper JMX enabled by default
Using config: /opt/lagou/servers/zookeeper-3.4.14/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
start Linux122 zookeeper server...
ZooKeeper JMX enabled by default
Using config: /opt/lagou/servers/zookeeper-3.4.14/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
start Linux123 zookeeper server...
ZooKeeper JMX enabled by default
Using config: /opt/lagou/servers/zookeeper-3.4.14/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
Edit the configuration file and distribute it to every node in the cluster
spark-env.sh
Note: comment out the following two lines
#export SPARK_MASTER_HOST=Linux121
#export SPARK_MASTER_PORT=7077
Add the following configuration
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=Linux121:2181,Linux122:2181,Linux123:2181 -Dspark.deploy.zookeeper.dir=/spark"
Restart Spark; for HA, also start a standby Master on a second node (e.g. run start-master.sh there).
Killing the active Master and watching the standby take over confirms that HA works.
View the information Spark stored in ZooKeeper
[root@Linux121 conf]# zkCli.sh
[zk: localhost:2181(CONNECTED) 4] ls /
[mysql-info, zookeeper, overseer, zkServer, yarn-leader-election, hadoop-ha, security.json, myKafka, order_server, zk-test0000000000, aliases.json, live_nodes, collections, overseer_elect, spark, rmstore, clusterstate.json, hbase]
[zk: localhost:2181(CONNECTED) 6] ls /spark
[leader_election, master_status]
II. Yarn mode
1. Preparation: stop the Standalone-mode services, then start YARN
[root@Linux121 shell]# stop-all.sh
[root@Linux121 shell]# sh zk.sh stop
[root@Linux121 shell]# stop-history-server.sh
[root@Linux123 conf]# start-yarn.sh
2. Special-case configuration (for virtual machines) when resources are tight
The following settings disable YARN's physical/virtual memory checks; distribute to every node.
[root@Linux121 hadoop]# vim yarn-site.xml
<property>
  <name>yarn.nodemanager.pmem-check-enabled</name>
  <value>false</value>
</property>
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>
3. Edit the configuration and distribute it to every node
Required setting
[root@Linux121 conf]# vim spark-env.sh
export HADOOP_CONF_DIR=/opt/lagou/servers/hadoop-2.9.2/etc/hadoop
Optimization settings
spark-defaults.conf
spark.yarn.historyServer.address Linux121:18080
spark.yarn.jars hdfs:///spark-yarn/jars/*.jar
Create the HDFS directory named in the configuration and upload the Spark jars (staging them in HDFS avoids re-uploading them on every submission)
[root@Linux121 conf]# hdfs dfs -mkdir -p /spark-yarn/jars/
/spark-2.4.5/jars/
[root@Linux121 jars]# hdfs dfs -put * /spark-yarn/jars/
Test the client deploy mode
[root@Linux121 jars]# spark-submit --master yarn --deploy-mode client --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.12-2.4.5.jar 2000
Test the cluster deploy mode
[root@Linux121 jars]# spark-submit --master yarn --deploy-mode cluster --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.12-2.4.5.jar 2000
This shows Spark's flexibility: it can run on its own Standalone scheduler or on YARN.
III. Integrating the history server with Standalone or Yarn mode
Add the configuration below and distribute it to every node.
[root@Linux121 conf]# vim spark-defaults.conf
spark.yarn.historyServer.address Linux121:18080
spark.history.ui.port 18080
Run a test job, then check its logs in the history server
spark-submit --master yarn --deploy-mode client --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.12-2.4.5.jar 2000
Local mode
[root@Linux121 ~]# spark-shell --master local[*]
(local[*] starts one worker thread per CPU core)
Note: local mode can start without HDFS, provided the event-log settings in spark-defaults.conf are commented out:
#spark.eventLog.enabled true
#spark.eventLog.dir hdfs://Linux121:9000/Spark-EventLog
A successful start looks like this:
[root@Linux121 conf]# spark-shell --master local[*]
22/02/17 22:13:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://Linux121:4040
Spark context available as 'sc' (master = local[*], app id = local-1645107251883).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.4.5
/_/
Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_301)
Type in expressions to have them evaluated.
Type :help for more information.
scala> 1+2
res0: Int = 3
scala>
The log also prints the corresponding web UI address:
http://Linux121:4040
Local mode runs everything on one machine, so any node — even a worker — can run it; jps shows a single JVM:
[root@Linux121 conf]# jps
3315 RunJar
12316 SparkSubmit //all the Spark components run inside this one JVM
12540 Jps
Run a local test job
scala> val lines = sc.textFile("file:///root/wc.txt")
lines: org.apache.spark.rdd.RDD[String] = file:///root/wc.txt MapPartitionsRDD[1] at textFile at <console>:24
scala> lines.flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).collect().foreach(println)
22/02/17 22:31:42 WARN DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
(lagou,3)
(hadoop,2)
(mapreduce,3)
(yarn,2)
(hdfs,1)
Pseudo-distributed mode (local-cluster[4,2,1024] starts 4 executor processes on one machine, each with 2 cores and 1024 MB of memory)
[root@Linux121 conf]# spark-shell --master local-cluster[4,2,1024]
22/02/17 22:53:18 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://Linux121:4040
Spark context available as 'sc' (master = local-cluster[4,2,1024], app id = app-20220217225338-0000).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.4.5
/_/
Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_301)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
[root@Linux121 ~]# jps
3315 RunJar
13363 SparkSubmit //acts as the resource manager; the Driver here manages the whole application
13527 CoarseGrainedExecutorBackend //a coarse-grained Executor, equivalent to a Worker
13544 CoarseGrainedExecutorBackend
13674 Jps
13531 CoarseGrainedExecutorBackend
13532 CoarseGrainedExecutorBackend
[root@Linux121 ~]#
Pseudo-distributed test
[root@Linux121 conf]# spark-submit --master local-cluster[4,2,1024] --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.12-2.4.5.jar 10
22/02/17 23:13:12 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 90.302653 s
Pi is roughly 3.1391871391871393
22/02/17 23:13:12 INFO SparkUI: Stopped Spark web UI at http://Linux121:4040
22/02/17 23:13:12 INFO StandaloneSchedulerBackend: Shutting down all executors
22/02/17 23:13:12 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down