Big Data Study Notes, Lesson 3: Real-Time Spark Computing on YARN
- 1. Overview
- 2. Running a MapReduce program on a single Hadoop node
- 3. Configuring the YARN cluster
- 4. Running a MapReduce program on the Hadoop YARN cluster
- 5. Downloading and installing Spark
- 6. Configuring Spark to compute PI on a single local node
- 7. Configuring Spark to compute PI on the cluster nodes
- 8. Summary
1. Overview
The tests in this article use the official Hadoop example program, located at the path below (the structure of MapReduce programs and the example source code are beyond the scope of this article):
/program/hadoop-3.3.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar
2. Running a MapReduce program on a single Hadoop node
If the amount of data to process is small, there is no need to run a MapReduce program on a cluster: starting the cluster incurs extra resource overhead, so the computation can actually end up slower.
1. First, open a terminal on hadoop01 and change the current directory:
cd /program/hadoop-3.3.0/bin
2. Use the following command to list the main functions of the official Hadoop example program, as shown in the figure below:
Among them is wordcount, which counts the number of occurrences of each word across a set of files; this is the function we will use for testing in this article (see the short demo below).
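As a quick illustration (a minimal sketch; the file name and output directory here are hypothetical and assume the single-node setup above), you could run wordcount on a tiny file and inspect the result:
echo "hello world hello yarn" > /tmp/wc-demo.txt
./hadoop jar ../share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar wordcount file:///tmp/wc-demo.txt demo-output
./hdfs dfs -cat demo-output/part-r-00000
The expected output would be each word with its count, one per line: hello 2, world 1, yarn 1.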
3. Run the MapReduce program with the following command to count the words in the Hadoop configuration files and write the result to the output directory:
./hadoop jar ../share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar wordcount file:///program/hadoop-3.3.0/etc/hadoop/* output
After it completes, the figure below shows that word counts have been produced for all files under file:///program/hadoop-3.3.0/etc/hadoop/.
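You can also inspect the result from the command line instead of the web UI (a sketch; relative paths resolve against the HDFS home directory of the submitting user, assumed here to be /user/root):
./hdfs dfs -ls output
./hdfs dfs -cat output/part-r-00000 | head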
4. If you rerun the command above, you must change the output directory; otherwise you will get a "directory already exists" exception, as shown below:
5. You can first delete the /user directory in the Hadoop web UI. If a message says you lack permission, change the permissions first; the directory permissions of the Hadoop file system can be changed as follows:
./hadoop fs -chmod -R 777 /
or
./hdfs dfs -chmod -R 777 /
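Alternatively (a sketch; the exact path depends on the user that submitted the job, assumed here to be root), you can remove just the old output directory from the command line and leave the rest of /user intact:
./hdfs dfs -rm -r /user/root/output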
3. Configuring the YARN cluster
If the amount of data to process is large, cluster computation is the better fit; this is usually the case when the computation time far exceeds the time spent on cluster initialization and other resource allocation and management. To enable the YARN cluster, follow the configuration steps below.
1. First, open a terminal on hadoop01 and change the current directory:
cd /program/hadoop-3.3.0/bin
2. Edit ../etc/hadoop/mapred-site.xml with vim to enable YARN cluster computation; the content is as follows:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapreduce.framework.name</name>
<!-- The default is "local", meaning computation runs as local multithreading; "yarn" enables cluster computation -->
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=/program/hadoop-3.3.0</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=/program/hadoop-3.3.0</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=/program/hadoop-3.3.0</value>
</property>
</configuration>
3. Edit ../etc/hadoop/yarn-site.xml with vim to configure the ResourceManager; the content is as follows:
<?xml version="1.0"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop01</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
The configuration above designates hadoop01 as the YARN ResourceManager.
4. Copy ../etc/hadoop/yarn-site.xml to the hadoop02 and hadoop03 nodes with the scp command, as shown in the figure below.
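The commands would look roughly like this (a sketch assuming the same installation path on all nodes and SSH access as root):
scp ../etc/hadoop/yarn-site.xml root@hadoop02:/program/hadoop-3.3.0/etc/hadoop/
scp ../etc/hadoop/yarn-site.xml root@hadoop03:/program/hadoop-3.3.0/etc/hadoop/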
5. Before starting the YARN cluster, edit ../sbin/start-yarn.sh and ../sbin/stop-yarn.sh with vim and add the following lines at the top of each file:
YARN_RESOURCEMANAGER_USER=root
HADOOP_SECURE_DN_USER=yarn
YARN_NODEMANAGER_USER=root
Otherwise, startup fails with the following errors:
ERROR: Attempting to operate on yarn resourcemanager as root
ERROR: but there is no YARN_RESOURCEMANAGER_USER defined. Aborting operation.
Starting nodemanagers
ERROR: Attempting to operate on yarn nodemanager as root
ERROR: but there is no YARN_NODEMANAGER_USER defined. Aborting operation.
The edit is shown in the figure below:
6. Start the YARN cluster with ../sbin/start-yarn.sh and check the running processes with jps, as shown in the figures below:
hadoop02:
hadoop03:
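Roughly speaking (the exact process list depends on which HDFS daemons each node also runs), jps on hadoop01 should now include a ResourceManager process, while hadoop02 and hadoop03 should each show a NodeManager.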
7. Viewing the cluster in the YARN web UI
The YARN cluster management UI is available at http://hadoop01:8088, as shown below:
The figure shows that the YARN cluster has 24 GB of memory and 24 cores in total.
4. Running a MapReduce program on the Hadoop YARN cluster
1. Use the YARN cluster to run the MapReduce word count over the configuration files, as follows:
[root@ecs-ae8a-0001 bin]# ./hadoop jar ../share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar wordcount file:///program/hadoop-3.3.0/etc/hadoop/* output2
The output during execution is as follows:
2021-01-11 10:21:37,218 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at hadoop01/192.168.0.177:8032
2021-01-11 10:21:37,473 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1610330660573_0002
2021-01-11 10:21:37,650 INFO input.FileInputFormat: Total input files to process : 32
2021-01-11 10:21:37,710 INFO mapreduce.JobSubmitter: number of splits:32
2021-01-11 10:21:37,868 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1610330660573_0002
2021-01-11 10:21:37,868 INFO mapreduce.JobSubmitter: Executing with tokens: []
2021-01-11 10:21:37,987 INFO conf.Configuration: resource-types.xml not found
2021-01-11 10:21:37,987 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2021-01-11 10:21:38,033 INFO impl.YarnClientImpl: Submitted application application_1610330660573_0002
2021-01-11 10:21:38,055 INFO mapreduce.Job: The url to track the job: http://hadoop01:8088/proxy/application_1610330660573_0002/
2021-01-11 10:21:38,055 INFO mapreduce.Job: Running job: job_1610330660573_0002
2021-01-11 10:21:42,158 INFO mapreduce.Job: Job job_1610330660573_0002 running in uber mode : false
2021-01-11 10:21:42,159 INFO mapreduce.Job: map 0% reduce 0%
2021-01-11 10:21:47,225 INFO mapreduce.Job: map 3% reduce 0%
2021-01-11 10:21:48,244 INFO mapreduce.Job: map 6% reduce 0%
2021-01-11 10:21:49,255 INFO mapreduce.Job: map 9% reduce 0%
2021-01-11 10:21:50,260 INFO mapreduce.Job: map 16% reduce 0%
2021-01-11 10:21:51,278 INFO mapreduce.Job: map 19% reduce 0%
2021-01-11 10:21:52,286 INFO mapreduce.Job: map 22% reduce 0%
2021-01-11 10:21:53,305 INFO mapreduce.Job: map 25% reduce 0%
2021-01-11 10:21:54,333 INFO mapreduce.Job: map 28% reduce 0%
2021-01-11 10:21:55,345 INFO mapreduce.Job: map 34% reduce 0%
2021-01-11 10:21:56,350 INFO mapreduce.Job: map 38% reduce 0%
2021-01-11 10:21:57,354 INFO mapreduce.Job: map 41% reduce 0%
2021-01-11 10:21:58,359 INFO mapreduce.Job: map 44% reduce 0%
2021-01-11 10:21:59,364 INFO mapreduce.Job: map 47% reduce 0%
2021-01-11 10:22:00,369 INFO mapreduce.Job: map 50% reduce 0%
2021-01-11 10:22:01,379 INFO mapreduce.Job: map 53% reduce 0%
2021-01-11 10:22:02,405 INFO mapreduce.Job: map 59% reduce 0%
2021-01-11 10:22:03,448 INFO mapreduce.Job: map 63% reduce 0%
2021-01-11 10:22:04,473 INFO mapreduce.Job: map 63% reduce 20%
2021-01-11 10:22:05,487 INFO mapreduce.Job: map 69% reduce 20%
2021-01-11 10:22:06,493 INFO mapreduce.Job: map 72% reduce 20%
2021-01-11 10:22:07,504 INFO mapreduce.Job: map 78% reduce 20%
2021-01-11 10:22:08,511 INFO mapreduce.Job: map 81% reduce 20%
2021-01-11 10:22:09,515 INFO mapreduce.Job: map 84% reduce 20%
2021-01-11 10:22:10,523 INFO mapreduce.Job: map 88% reduce 27%
2021-01-11 10:22:11,527 INFO mapreduce.Job: map 94% reduce 27%
2021-01-11 10:22:12,530 INFO mapreduce.Job: map 100% reduce 27%
2021-01-11 10:22:13,534 INFO mapreduce.Job: map 100% reduce 100%
2021-01-11 10:22:13,539 INFO mapreduce.Job: Job job_1610330660573_0002 completed successfully
2021-01-11 10:22:13,596 INFO mapreduce.Job: Counters: 55
File System Counters
FILE: Number of bytes read=214309
FILE: Number of bytes written=8925679
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=3860
HDFS: Number of bytes written=50478
HDFS: Number of read operations=69
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
HDFS: Number of bytes read erasure-coded=0
Job Counters
Killed map tasks=1
Launched map tasks=32
Launched reduce tasks=1
Rack-local map tasks=32
Total time spent by all maps in occupied slots (ms)=59346
Total time spent by all reduces in occupied slots (ms)=22721
Total time spent by all map tasks (ms)=59346
Total time spent by all reduce tasks (ms)=22721
Total vcore-milliseconds taken by all map tasks=59346
Total vcore-milliseconds taken by all reduce tasks=22721
Total megabyte-milliseconds taken by all map tasks=60770304
Total megabyte-milliseconds taken by all reduce tasks=23266304
Map-Reduce Framework
Map input records=2931
Map output records=12842
Map output bytes=158516
Map output materialized bytes=103077
Input split bytes=3860
Combine input records=12842
Combine output records=5739
Reduce input groups=2504
Reduce shuffle bytes=103077
Reduce input records=5739
Reduce output records=2504
Spilled Records=11478
Shuffled Maps =32
Failed Shuffles=0
Merged Map outputs=32
GC time elapsed (ms)=623
CPU time spent (ms)=13090
Physical memory (bytes) snapshot=8035414016
Virtual memory (bytes) snapshot=97197834240
Total committed heap usage (bytes)=6829375488
Peak Map Physical memory (bytes)=252547072
Peak Map Virtual memory (bytes)=2952425472
Peak Reduce Physical memory (bytes)=176435200
Peak Reduce Virtual memory (bytes)=2949914624
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=111418
File Output Format Counters
Bytes Written=50478
[root@ecs-ae8a-0001 bin]#
2. While the job is running, you can watch its progress and details in the YARN web UI, as shown below:
3. The saved result can be seen in the Hadoop file system, as shown in the figure below:
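The result can also be read from the command line (a sketch; the relative path resolves to /user/root/output2 when the job is submitted as root):
./hadoop fs -cat output2/part-r-00000 | head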
5. Downloading and installing Spark
1. First, open the Apache website at http://www.apache.org, as shown below:
2. Click the Projects link in the middle of the page to reveal a submenu, as shown below:
3. Choose Projects -> Project List to open the project list page, as shown below:
4. Find the projects starting with S, select Spark, and the Spark home page opens, as shown below:
5. Click the Download link in the top navigation bar, or the Download Spark link on the right, to reach the download page, as shown below:
6. On the mirror list page, copy a mirror URL; the one used here is:
https://mirrors.tuna.tsinghua.edu.cn/apache/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz
7. Open a terminal on hadoop01 and change the current directory to /opt/soft:
cd /opt/soft
8. Download Spark from the mirror to the hadoop01 node with wget, as follows:
wget https://mirrors.tuna.tsinghua.edu.cn/apache/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz
The download process is shown below:
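Optionally (an extra safety step, not part of the original walkthrough), you can verify the integrity of the downloaded archive by computing its SHA-512 digest and comparing it against the checksum file that Apache publishes alongside each release (assumed to be available under https://archive.apache.org/dist/spark/spark-3.0.1/):
sha512sum spark-3.0.1-bin-hadoop3.2.tgz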
9. Extract spark-3.0.1-bin-hadoop3.2.tgz:
tar -xzvf spark-3.0.1-bin-hadoop3.2.tgz
10. After extraction, a new spark-3.0.1-bin-hadoop3.2 directory appears under /opt/soft, as shown below:
11. Copy the spark-3.0.1-bin-hadoop3.2 directory to /program (which serves as the installation directory):
cp -rf spark-3.0.1-bin-hadoop3.2 /program/
Then change the current directory to /program and list its contents; you should see spark-3.0.1-bin-hadoop3.2. The process is shown in the figure below:
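As a quick sanity check (optional, not part of the original steps), you can confirm that Spark runs from its new location:
/program/spark-3.0.1-bin-hadoop3.2/bin/spark-submit --version
This should print the Spark version banner, including version 3.0.1.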
6. Configuring Spark to compute PI on a single local node
1. In the hadoop01 terminal, change the current directory to /program/spark-3.0.1-bin-hadoop3.2/bin:
cd /program/spark-3.0.1-bin-hadoop3.2/bin
2. Run the example jar locally with the ./spark-submit command to compute PI:
./spark-submit --master local --deploy-mode client --executor-memory 2G --executor-cores 2 --class org.apache.spark.examples.SparkPi ../examples/jars/spark-examples_2.12-3.0.1.jar 1000
The result is as follows:
You can increase the final argument (1000) to a larger value to obtain a more precise value of PI.
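For context (based on the SparkPi example source, so treat the details as an assumption about this particular version): SparkPi estimates PI with a Monte Carlo method. The final argument is the number of partitions (slices); each slice samples random points in the unit square, the points falling inside the unit circle are counted, and PI is estimated as 4 * (points inside) / (total points). A larger argument therefore means more samples and a more precise estimate.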
7. Configuring Spark to compute PI on the cluster nodes
1. Change the current directory to /program/spark-3.0.1-bin-hadoop3.2/conf:
cd /program/spark-3.0.1-bin-hadoop3.2/conf
The process is shown below:
2. Create spark-env.sh from spark-env.sh.template, as follows:
cp spark-env.sh.template spark-env.sh
The process is shown below:
3. Edit spark-env.sh with vim and add the HADOOP_CONF_DIR setting; this tells Spark where to find the Hadoop configuration files (in particular yarn-site.xml, from which spark-submit learns the ResourceManager address). The content is as follows:
export HADOOP_CONF_DIR=/program/hadoop-3.3.0/etc/hadoop
Specifically, as shown below:
4. After saving spark-env.sh, change the current directory to /program/spark-3.0.1-bin-hadoop3.2/bin:
cd /program/spark-3.0.1-bin-hadoop3.2/bin
5. Use ./spark-submit to compute the value of PI on the YARN cluster (the argument after --master changes from local to yarn):
./spark-submit --master yarn --deploy-mode client --executor-memory 2G --executor-cores 2 --class org.apache.spark.examples.SparkPi ../examples/jars/spark-examples_2.12-3.0.1.jar 1000
The result is shown below:
6. Rerun the command above, changing the final argument from 1000 to 10000, and check whether the precision of PI improves:
./spark-submit --master yarn --deploy-mode client --executor-memory 2G --executor-cores 2 --class org.apache.spark.examples.SparkPi ../examples/jars/spark-examples_2.12-3.0.1.jar 10000
The result is shown below:
As you can see, the precision of the resulting PI value has improved.
7. During execution, you can follow the progress and task details in both the YARN web UI and the Spark web UI, as shown below:
The YARN web UI:
The Spark web UI:
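You can also list running applications from the command line (a sketch, using the Hadoop installation path from this article):
/program/hadoop-3.3.0/bin/yarn application -list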
8. Summary
This completes the setup of a simple real-time Spark computing environment on a YARN cluster. I hope it serves as a useful reference for beginners.