Big Data Study Notes, Lesson 3: Real-Time Computing with Spark on Yarn

  • 1. Overview
  • 2. Running a MapReduce program on a single Hadoop node
  • 3. Configuring the Yarn cluster
  • 4. Running a MapReduce program on the Hadoop Yarn cluster
  • 5. Downloading and installing Spark
  • 6. Configuring Spark to compute PI on a single local node
  • 7. Configuring Spark to compute PI using the cluster
  • 8. Summary


1. Overview

The tests in this article use the official Hadoop example program, located at the path below (the internal structure of MapReduce programs and the example source code are outside the scope of this article):

/program/hadoop-3.3.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar

2. Running a MapReduce program on a single Hadoop node

If the amount of data to process is small, there is no need to run a MapReduce job on the cluster: spinning up the cluster adds resource overhead, so the computation can actually end up slower.
1. First, open a terminal on hadoop01 and change the current directory:

cd /program/hadoop-3.3.0/bin

2. Next, check the main functions offered by the official Hadoop example program.

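The command and its output were shown in a screenshot that is omitted here; running the example jar without any arguments makes it print the list of bundled programs:

# With no program name, the jar prints all valid program names and a short description of each
./hadoop jar ../share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar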


One of the programs listed is wordcount, which counts how many times each word occurs in a set of input files; that is the function we will use for testing in this article.

3. Run the MapReduce program with the following command to count the words in the Hadoop configuration files and write the result to the output directory:

./hadoop jar ../share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar wordcount file:///program/hadoop-3.3.0/etc/hadoop/* output

Once it finishes, the job has produced word counts for every word in every file under the file:///program/hadoop-3.3.0/etc/hadoop/ directory.

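If you want to inspect the result from the shell rather than the web UI, a minimal check follows (assuming the job ran as root, so the relative output path resolves to /user/root/output, and the default single reduce task wrote part-r-00000):

./hdfs dfs -ls output                        # _SUCCESS marker plus the part file(s)
./hdfs dfs -cat output/part-r-00000 | head   # first few "word<TAB>count" lines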


4. If you re-run the command above, you must change the output directory; if you do not, the job fails with an exception saying the directory already exists.



5. You can instead delete the /user directory in the Hadoop web UI first. If that fails with a permissions error, change the permissions on the Hadoop file system directories as follows (mode 777 is convenient on a test cluster, but far too permissive for production):

./hadoop fs -chmod -R 777 /
or
./hdfs dfs -chmod -R 777 /
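Alternatively, instead of deleting /user in the web UI, you can remove just the old output directory from the command line; a sketch, assuming the job ran as root:

./hdfs dfs -rm -r /user/root/output   # clears the way for the next run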

3. Configuring the Yarn cluster

If the amount of data to process is large, cluster computing is the better fit; this is usually the case when the computation time far exceeds the time spent on cluster initialization and other resource allocation and management. To enable the Yarn cluster, follow the configuration steps below.
1. First, open a terminal on hadoop01 and change the current directory:

cd /program/hadoop-3.3.0/bin

2. Edit ../etc/hadoop/mapred-site.xml with vim to enable Yarn cluster computing; the content is as follows:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
        <property>
                <name>mapreduce.framework.name</name>
                <!-- Defaults to local, which runs jobs as local multithreaded computation; yarn enables cluster computing -->
                <value>yarn</value>
        </property>
        <property>
                <name>yarn.app.mapreduce.am.env</name>
                <value>HADOOP_MAPRED_HOME=/program/hadoop-3.3.0</value>
        </property>
        <property>
                <name>mapreduce.map.env</name>
                <value>HADOOP_MAPRED_HOME=/program/hadoop-3.3.0</value>
        </property>
        <property>
                <name>mapreduce.reduce.env</name>
                <value>HADOOP_MAPRED_HOME=/program/hadoop-3.3.0</value>
        </property>
</configuration>

3. Edit ../etc/hadoop/yarn-site.xml with vim to configure the ResourceManager; the content is as follows:

<?xml version="1.0"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<configuration>

<!-- Site specific YARN configuration properties -->
        <property>
                <name>yarn.resourcemanager.hostname</name>
                <value>hadoop01</value>
        </property>
        <property>
                <name>yarn.nodemanager.aux-services</name>
                <value>mapreduce_shuffle</value>
        </property>
</configuration>

This configuration designates hadoop01 as the Yarn ResourceManager.

4. Copy ../etc/hadoop/yarn-site.xml to the hadoop02 and hadoop03 nodes with scp.

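The exact command was only shown in a screenshot; a sketch of the copy, assuming the same installation path and root SSH access on every node:

scp ../etc/hadoop/yarn-site.xml root@hadoop02:/program/hadoop-3.3.0/etc/hadoop/
scp ../etc/hadoop/yarn-site.xml root@hadoop03:/program/hadoop-3.3.0/etc/hadoop/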


5. Before starting the Yarn cluster, edit ../sbin/start-yarn.sh and ../sbin/stop-yarn.sh with vim and add the following lines at the top of each file:

YARN_RESOURCEMANAGER_USER=root
HADOOP_SECURE_DN_USER=yarn
YARN_NODEMANAGER_USER=root

Without them, startup fails with errors like these:

ERROR: Attempting to operate on yarn resourcemanager as root
ERROR: but there is no YARN_RESOURCEMANAGER_USER defined. Aborting operation.
Starting nodemanagers
ERROR: Attempting to operate on yarn nodemanager as root
ERROR: but there is no YARN_NODEMANAGER_USER defined. Aborting operation.



6. Start the Yarn cluster with ../sbin/start-yarn.sh, then use jps on hadoop01, hadoop02, and hadoop03 to confirm the processes are running.

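The per-node screenshots are omitted; a quick sketch of the check (process IDs will differ, and the exact process list depends on which daemons each node hosts):

../sbin/start-yarn.sh   # run from /program/hadoop-3.3.0/bin
jps                     # on hadoop01 expect ResourceManager, plus NodeManager if it is also a worker
ssh hadoop02 jps        # worker nodes should show NodeManager
ssh hadoop03 jps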


7. Inspect the cluster in the Yarn web UI

The Yarn cluster's graphical management interface is available at http://hadoop01:8088.



The UI shows that the Yarn cluster has 24 GB of memory and 24 vcores in total.
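If you prefer the command line to the web UI, the Yarn CLI reports the same cluster information; for example:

./yarn node -list        # registered NodeManagers and their state
./yarn node -list -all   # include nodes in every state, not just RUNNING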

4. Running a MapReduce program on the Hadoop Yarn cluster

1. Run the wordcount example on the Yarn cluster to count the words in the configuration files:

[root@ecs-ae8a-0001 bin]# ./hadoop jar ../share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar wordcount file:///program/hadoop-3.3.0/etc/hadoop/* output2

The execution produces output like the following:

2021-01-11 10:21:37,218 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at hadoop01/192.168.0.177:8032
2021-01-11 10:21:37,473 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1610330660573_0002
2021-01-11 10:21:37,650 INFO input.FileInputFormat: Total input files to process : 32
2021-01-11 10:21:37,710 INFO mapreduce.JobSubmitter: number of splits:32
2021-01-11 10:21:37,868 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1610330660573_0002
2021-01-11 10:21:37,868 INFO mapreduce.JobSubmitter: Executing with tokens: []
2021-01-11 10:21:37,987 INFO conf.Configuration: resource-types.xml not found
2021-01-11 10:21:37,987 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2021-01-11 10:21:38,033 INFO impl.YarnClientImpl: Submitted application application_1610330660573_0002
2021-01-11 10:21:38,055 INFO mapreduce.Job: The url to track the job: http://hadoop01:8088/proxy/application_1610330660573_0002/
2021-01-11 10:21:38,055 INFO mapreduce.Job: Running job: job_1610330660573_0002
2021-01-11 10:21:42,158 INFO mapreduce.Job: Job job_1610330660573_0002 running in uber mode : false
2021-01-11 10:21:42,159 INFO mapreduce.Job:  map 0% reduce 0%
2021-01-11 10:21:47,225 INFO mapreduce.Job:  map 3% reduce 0%
2021-01-11 10:21:48,244 INFO mapreduce.Job:  map 6% reduce 0%
2021-01-11 10:21:49,255 INFO mapreduce.Job:  map 9% reduce 0%
2021-01-11 10:21:50,260 INFO mapreduce.Job:  map 16% reduce 0%
2021-01-11 10:21:51,278 INFO mapreduce.Job:  map 19% reduce 0%
2021-01-11 10:21:52,286 INFO mapreduce.Job:  map 22% reduce 0%
2021-01-11 10:21:53,305 INFO mapreduce.Job:  map 25% reduce 0%
2021-01-11 10:21:54,333 INFO mapreduce.Job:  map 28% reduce 0%
2021-01-11 10:21:55,345 INFO mapreduce.Job:  map 34% reduce 0%
2021-01-11 10:21:56,350 INFO mapreduce.Job:  map 38% reduce 0%
2021-01-11 10:21:57,354 INFO mapreduce.Job:  map 41% reduce 0%
2021-01-11 10:21:58,359 INFO mapreduce.Job:  map 44% reduce 0%
2021-01-11 10:21:59,364 INFO mapreduce.Job:  map 47% reduce 0%
2021-01-11 10:22:00,369 INFO mapreduce.Job:  map 50% reduce 0%
2021-01-11 10:22:01,379 INFO mapreduce.Job:  map 53% reduce 0%
2021-01-11 10:22:02,405 INFO mapreduce.Job:  map 59% reduce 0%
2021-01-11 10:22:03,448 INFO mapreduce.Job:  map 63% reduce 0%
2021-01-11 10:22:04,473 INFO mapreduce.Job:  map 63% reduce 20%
2021-01-11 10:22:05,487 INFO mapreduce.Job:  map 69% reduce 20%
2021-01-11 10:22:06,493 INFO mapreduce.Job:  map 72% reduce 20%
2021-01-11 10:22:07,504 INFO mapreduce.Job:  map 78% reduce 20%
2021-01-11 10:22:08,511 INFO mapreduce.Job:  map 81% reduce 20%
2021-01-11 10:22:09,515 INFO mapreduce.Job:  map 84% reduce 20%
2021-01-11 10:22:10,523 INFO mapreduce.Job:  map 88% reduce 27%
2021-01-11 10:22:11,527 INFO mapreduce.Job:  map 94% reduce 27%
2021-01-11 10:22:12,530 INFO mapreduce.Job:  map 100% reduce 27%
2021-01-11 10:22:13,534 INFO mapreduce.Job:  map 100% reduce 100%
2021-01-11 10:22:13,539 INFO mapreduce.Job: Job job_1610330660573_0002 completed successfully
2021-01-11 10:22:13,596 INFO mapreduce.Job: Counters: 55
	File System Counters
		FILE: Number of bytes read=214309
		FILE: Number of bytes written=8925679
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=3860
		HDFS: Number of bytes written=50478
		HDFS: Number of read operations=69
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
		HDFS: Number of bytes read erasure-coded=0
	Job Counters 
		Killed map tasks=1
		Launched map tasks=32
		Launched reduce tasks=1
		Rack-local map tasks=32
		Total time spent by all maps in occupied slots (ms)=59346
		Total time spent by all reduces in occupied slots (ms)=22721
		Total time spent by all map tasks (ms)=59346
		Total time spent by all reduce tasks (ms)=22721
		Total vcore-milliseconds taken by all map tasks=59346
		Total vcore-milliseconds taken by all reduce tasks=22721
		Total megabyte-milliseconds taken by all map tasks=60770304
		Total megabyte-milliseconds taken by all reduce tasks=23266304
	Map-Reduce Framework
		Map input records=2931
		Map output records=12842
		Map output bytes=158516
		Map output materialized bytes=103077
		Input split bytes=3860
		Combine input records=12842
		Combine output records=5739
		Reduce input groups=2504
		Reduce shuffle bytes=103077
		Reduce input records=5739
		Reduce output records=2504
		Spilled Records=11478
		Shuffled Maps =32
		Failed Shuffles=0
		Merged Map outputs=32
		GC time elapsed (ms)=623
		CPU time spent (ms)=13090
		Physical memory (bytes) snapshot=8035414016
		Virtual memory (bytes) snapshot=97197834240
		Total committed heap usage (bytes)=6829375488
		Peak Map Physical memory (bytes)=252547072
		Peak Map Virtual memory (bytes)=2952425472
		Peak Reduce Physical memory (bytes)=176435200
		Peak Reduce Virtual memory (bytes)=2949914624
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=111418
	File Output Format Counters 
		Bytes Written=50478
[root@ecs-ae8a-0001 bin]#

2. While the job is running, you can follow its progress and details in the Yarn web UI.

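The same progress is also visible from the command line; a sketch using the application id reported in the log above:

./yarn application -list                                    # currently running applications
./yarn application -status application_1610330660573_0002   # detailed status for this job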


3. The result is saved in the Hadoop file system under the output2 directory, and you can browse it in the web UI just as before.

5. Downloading and installing Spark

1. First, open the Apache website at http://www.apache.org.



2. Click the Projects link in the middle of the page to reveal a submenu.



3. Choose Projects -> Project List from that menu to open the project list page.



4. On that page, find the projects starting with S and select Spark to open the Spark home page.



5. Click the Download link in the top navigation bar, or the Download Spark link on the right, to reach the download page.



6. On the mirror list page, copy a mirror address; the one used here is:

https://mirrors.tuna.tsinghua.edu.cn/apache/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz

7. Open a terminal on hadoop01 and change the current directory to /opt/soft:

cd /opt/soft

8. Download Spark onto the hadoop01 node from the mirror with wget:

wget https://mirrors.tuna.tsinghua.edu.cn/apache/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz



9. Unpack spark-3.0.1-bin-hadoop3.2.tgz:

tar -xzvf spark-3.0.1-bin-hadoop3.2.tgz

10. After unpacking, a new spark-3.0.1-bin-hadoop3.2 directory appears under /opt/soft.



11. Copy the spark-3.0.1-bin-hadoop3.2 directory to /program (our installation directory):

cp -rf spark-3.0.1-bin-hadoop3.2 /program/

Then change the current directory to /program and list its contents; spark-3.0.1-bin-hadoop3.2 should now be there.


6. Configuring Spark to compute PI on a single local node

1. In the hadoop01 terminal, change the current directory to /program/spark-3.0.1-bin-hadoop3.2/bin:

cd /program/spark-3.0.1-bin-hadoop3.2/bin

2. Use ./spark-submit to run the example jar locally and compute PI:

./spark-submit --master local --deploy-mode client --executor-memory 2G --executor-cores 2 --class org.apache.spark.examples.SparkPi ../examples/jars/spark-examples_2.12-3.0.1.jar 1000

The last lines of the output contain the result: a line printed by the SparkPi example beginning with Pi is roughly.
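spark-submit's INFO logging is verbose, so it can help to filter for that result line; a sketch:

./spark-submit --master local --deploy-mode client --executor-memory 2G --executor-cores 2 \
  --class org.apache.spark.examples.SparkPi \
  ../examples/jars/spark-examples_2.12-3.0.1.jar 1000 2>&1 | grep "Pi is roughly"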


You can obtain a more precise value of PI by increasing the final argument beyond 1000: it is the number of slices the example divides its Monte Carlo sampling into, and each slice contributes a fixed number of random samples, so a larger value means more samples and a better estimate.

7. Configuring Spark to compute PI using the cluster

1. Change the current directory to /program/spark-3.0.1-bin-hadoop3.2/conf:

cd /program/spark-3.0.1-bin-hadoop3.2/conf



2. Generate spark-env.sh from the bundled template spark-env.sh.template:

cp spark-env.sh.template spark-env.sh



3. Edit spark-env.sh with vim and add the HADOOP_CONF_DIR setting, which tells Spark where the Hadoop configuration files live (and therefore how to find the Yarn ResourceManager):

export HADOOP_CONF_DIR=/program/hadoop-3.3.0/etc/hadoop


4. Save spark-env.sh, then change the current directory to /program/spark-3.0.1-bin-hadoop3.2/bin:

cd /program/spark-3.0.1-bin-hadoop3.2/bin

5. Use ./spark-submit to compute PI on the Yarn cluster (the --master argument changes from local to yarn):

./spark-submit --master yarn --deploy-mode client --executor-memory 2G --executor-cores 2 --class org.apache.spark.examples.SparkPi ../examples/jars/spark-examples_2.12-3.0.1.jar 1000

The job now runs on the cluster and again ends with the Pi is roughly result line.


6. Re-run the command with the final argument raised from 1000 to 10000 and check whether the precision of PI improves:

./spark-submit --master yarn --deploy-mode client --executor-memory 2G --executor-cores 2 --class org.apache.spark.examples.SparkPi ../examples/jars/spark-examples_2.12-3.0.1.jar 10000

With ten times as many samples, the result shows that the precision of PI has indeed improved.

7. While the job runs, you can follow its progress and task details in both the Yarn web UI and the Spark web UI.

The Yarn UI is the same http://hadoop01:8088 interface used earlier; in client deploy mode the Spark driver also serves its own UI, by default on port 4040 of the machine running spark-submit (here http://hadoop01:4040).

8. Summary

With that, a simple Spark real-time computing environment on a Yarn cluster is up and running. I hope it serves as a useful reference for those just getting started.