1. Installing Hadoop
1.1 Pre-installation Notes
- Hadoop requires a JDK installed on the machine, with the JAVA_HOME environment variable configured
- Hadoop starts several different kinds of processes, so hostname-to-IP mappings must be configured
- Mind permissions: Hadoop produces a lot of data at runtime, and the user who starts the Hadoop processes must have write permission on the data directories
- Disable the firewall (a quick pre-flight sketch follows this list)
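A minimal pre-flight sketch of these checks, assuming a CentOS 7 style host (firewalld), the hadoop user, and the /opt/software layout used later in these notes; the IP address is only an example:
# Verify the JDK and JAVA_HOME
[hadoop@master ~]$ java -version
[hadoop@master ~]$ echo $JAVA_HOME
# Map hostnames to IPs (substitute your real addresses)
[root@master ~]# echo "192.168.1.101 master" >> /etc/hosts
# Give the hadoop user ownership of the install/data directory
[root@master ~]# chown -R hadoop:hadoop /opt/software
# Disable the firewall (CentOS 7 / firewalld)
[root@master ~]# systemctl stop firewalld
[root@master ~]# systemctl disable firewalld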
1.2 Hadoop Installation Steps
- Go to the directory holding the Hadoop tarball and unpack it with tar -zxvf
- Add Hadoop to the environment variables (see the sketch below)
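For example, assuming the tarball sits in /opt/package and installs under /opt/software (the layout the rest of these notes uses):
[hadoop@master package]$ tar -zxvf hadoop-2.7.2.tar.gz -C /opt/software
# Append to /etc/profile (or ~/.bashrc), then re-source it:
export HADOOP_HOME=/opt/software/hadoop-2.7.2
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
[hadoop@master package]$ source /etc/profile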
1.2.1 Hadoop Directory Layout
- bin: scripts for operating the Hadoop services (HDFS, YARN)
- etc: Hadoop's configuration directory, holding the configuration files
- lib: Hadoop's native libraries
- sbin: scripts for starting and stopping the Hadoop services
- share: Hadoop's dependency jars (including the bundled examples)
2. Hadoop Running Modes
- Hadoop has three running modes: local (standalone) mode, pseudo-distributed mode, and fully distributed mode
2.1 Local (Standalone) Mode
- The official WordCount example
- Create a wcinput directory under the target directory (here /opt) and change its owner and group to the hadoop user
[hadoop@master opt]$ ll
total 0
drwxr-xr-x. 2 hadoop hadoop 47 Nov 25 13:38 jar-package
drwxrwxrwx. 16 hadoop hadoop 241 Nov 25 10:33 mrinput
drwxr-xr-x. 2 hadoop hadoop 67 Nov 21 20:04 package
drwxr-xr-x. 2 hadoop hadoop 6 Sep 7 2017 rh
drwxr-xr-x. 2 hadoop hadoop 35 Nov 22 20:13 shell-script
drwxr-xr-x. 4 hadoop hadoop 46 Nov 24 19:38 software
drwxr-xr-x. 2 hadoop hadoop 6 Nov 30 15:19 wcinput
- Create a wc.input file under wcinput and put some words in it as test data
- Go to the share/hadoop/mapreduce directory under the Hadoop root, where the examples jar lives
- Run the program
[hadoop@master mapreduce]$ hadoop jar hadoop-mapreduce-examples-2.7.2.jar wordcount /opt/wcinput/ /opt/software/output
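In local mode the job writes to the local filesystem, so the result can be checked directly (assuming the output path used above):
[hadoop@master mapreduce]$ cat /opt/software/output/part-r-00000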
2.2 Pseudo-Distributed Mode
2.2.1 Starting HDFS and Running a MapReduce Program
- Plan: configure the cluster, start and test it, then run the WordCount example
- Steps
- Configure the cluster
Configure hadoop-env.sh and point JAVA_HOME at the JDK install
# The java implementation to use.
export JAVA_HOME=/opt/software/jdk1.8.0_121
Configure core-site.xml
<!-- Address of the HDFS NameNode -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://master:9000</value>
</property>
<!-- Directory for the files Hadoop generates at runtime -->
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/software/hadoop-2.7.2/data/tmp</value>
</property>
Configure hdfs-site.xml
<!-- Number of HDFS replicas -->
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
- Start the cluster
Format the NameNode (only before the very first start; in Hadoop 2.x, hdfs namenode -format is the preferred, non-deprecated form)
[hadoop@master hadoop]$ hadoop namenode -format
Start the NameNode
[hadoop@master hadoop]$ hadoop-daemon.sh start namenode
Start the DataNode
[hadoop@master hadoop]$ hadoop-daemon.sh start datanode
Check the running processes (jps should list both NameNode and DataNode)
[hadoop@master hadoop]$ jps
Note: reformatting the NameNode generates a new cluster ID. If old data is left behind, the NameNode's and DataNodes' cluster IDs no longer match and the cluster cannot find its previous data. So before reformatting, always delete the data directory and the logs first, and only then format the NameNode; a sketch follows.
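A sketch of a safe re-format, assuming the data/tmp directory configured above and the logs directory under the Hadoop root:
[hadoop@master hadoop-2.7.2]$ hadoop-daemon.sh stop datanode
[hadoop@master hadoop-2.7.2]$ hadoop-daemon.sh stop namenode
[hadoop@master hadoop-2.7.2]$ rm -rf data/ logs/
[hadoop@master hadoop-2.7.2]$ hdfs namenode -format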
- Operate the cluster
Create an input directory on the HDFS filesystem
[hadoop@master hadoop]$ hadoop fs -mkdir /input
[hadoop@master hadoop]$ hadoop fs -ls /
Found 8 items
-rw-r--r-- 3 hadoop supergroup 289927651 2020-11-23 11:14 /Yuri_s_Revenge_1.001.exe
drwxr-xr-x - hadoop supergroup 0 2020-11-25 13:40 /flowbean
drwxr-xr-x - hadoop supergroup 0 2020-11-30 15:50 /input
-rw-r--r-- 3 hadoop supergroup 603940 2020-11-23 10:43 /rename.pdf
drwxr-xr-x - hadoop supergroup 0 2020-11-24 21:03 /test.har
drwxrwx--- - hadoop supergroup 0 2020-11-22 21:31 /tmp
drwxrwxrwx - root root 0 2020-11-25 12:45 /wordcount
-rw-r--r-- 3 hadoop supergroup 10485760 2020-11-23 12:25 /算法笔记.胡凡.pdf
Upload a test file to the filesystem
[hadoop@master ~]$ hadoop fs -put data.txt /input
[hadoop@master ~]$ hadoop fs -ls /input
Found 1 items
-rw-r--r-- 2 hadoop supergroup 66 2020-11-30 15:52 /input/data.txt
Run the MapReduce program
[hadoop@master ~]$ hadoop jar /opt/software/hadoop-2.7.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount /input /output
20/11/30 15:54:52 INFO client.RMProxy: Connecting to ResourceManager at master/172.23.4.221:8032
20/11/30 15:54:52 INFO input.FileInputFormat: Total input paths to process : 1
20/11/30 15:54:52 INFO mapreduce.JobSubmitter: number of splits:1
20/11/30 15:54:53 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1606222715387_0007
20/11/30 15:54:53 INFO impl.YarnClientImpl: Submitted application application_1606222715387_0007
20/11/30 15:54:53 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1606222715387_0007/
20/11/30 15:54:53 INFO mapreduce.Job: Running job: job_1606222715387_0007
20/11/30 15:54:58 INFO mapreduce.Job: Job job_1606222715387_0007 running in uber mode : false
20/11/30 15:54:58 INFO mapreduce.Job: map 0% reduce 0%
20/11/30 15:55:02 INFO mapreduce.Job: map 100% reduce 0%
20/11/30 15:55:06 INFO mapreduce.Job: map 100% reduce 100%
20/11/30 15:55:06 INFO mapreduce.Job: Job job_1606222715387_0007 completed successfully
20/11/30 15:55:06 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=78
        FILE: Number of bytes written=235315
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=164
        HDFS: Number of bytes written=44
        HDFS: Number of read operations=6
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=1
        Launched reduce tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=1782
        Total time spent by all reduces in occupied slots (ms)=2126
        Total time spent by all map tasks (ms)=1782
        Total time spent by all reduce tasks (ms)=2126
        Total vcore-milliseconds taken by all map tasks=1782
        Total vcore-milliseconds taken by all reduce tasks=2126
        Total megabyte-milliseconds taken by all map tasks=1824768
        Total megabyte-milliseconds taken by all reduce tasks=2177024
    Map-Reduce Framework
        Map input records=4
        Map output records=15
        Map output bytes=127
        Map output materialized bytes=78
        Input split bytes=98
        Combine input records=15
        Combine output records=7
        Reduce input groups=7
        Reduce shuffle bytes=78
        Reduce input records=7
        Reduce output records=7
        Spilled Records=14
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=100
        CPU time spent (ms)=1170
        Physical memory (bytes) snapshot=436559872
        Virtual memory (bytes) snapshot=4255371264
        Total committed heap usage (bytes)=310902784
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=66
    File Output Format Counters
        Bytes Written=44
View the result
[hadoop@master ~]$ hadoop fs -cat /output/part-r-00000
am 1
fine 2
hello 3
hi 4
i 1
thanks 2
you 2
2.2.2 Running MapReduce on YARN
- Configure the cluster
Configure yarn-site.xml
<!-- How reducers fetch data -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<!-- Address of the YARN ResourceManager -->
<property>
<name>yarn.resourcemanager.hostname</name>
<value>master</value>
</property>
Configure mapred-site.xml (rename mapred-site.xml.template to mapred-site.xml first; a sketch follows this block)
<!-- Run MapReduce on YARN -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
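A sketch of the rename plus the daemon starts needed before jobs can be submitted to YARN (yarn-daemon.sh is the Hadoop 2.x per-daemon script in sbin; the prompt assumes the current directory is $HADOOP_HOME/etc/hadoop):
[hadoop@master hadoop]$ mv mapred-site.xml.template mapred-site.xml
[hadoop@master hadoop]$ yarn-daemon.sh start resourcemanager
[hadoop@master hadoop]$ yarn-daemon.sh start nodemanager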
- Operating the cluster (YARN web UI: http://master:8088/cluster)
Delete the old output directory from HDFS
[hadoop@master ~]$ hadoop fs -rm -R /output
20/11/30 16:03:43 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /output
Run the MapReduce program again (the console output matches the run in 2.2.1 above)
[hadoop@master ~]$ hadoop jar /opt/software/hadoop-2.7.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount /input /output
View the result
[hadoop@master ~]$ hadoop fs -cat /output/part-r-00000
am 1
fine 2
hello 3
hi 4
i 1
thanks 2
you 2
2.2.3 Configuring the JobHistory Server
- The history server is configured so that past job runs can be reviewed in the browser at http://master:19888/jobhistory
- Configure mapred-site.xml
<property>
<name>mapreduce.jobhistory.address</name>
<value>master:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>master:19888</value>
</property>
<!-- Log server URL, so logs aggregated from jobs run on YARN (including third-party frameworks) can be viewed -->
<property>
<name>yarn.log.server.url</name>
<value>http://master:19888/jobhistory/logs</value>
</property>
- Start the history server (jps should then list a JobHistoryServer process)
[hadoop@master ~]$ mr-jobhistory-daemon.sh start historyserver
2.2.4 Configuring Log Aggregation
- Log aggregation: after an application finishes, its logs are uploaded to HDFS and can be viewed at http://master:19888/jobhistory
- Benefit: job details are easy to inspect, which helps development and debugging
- Enabling log aggregation requires restarting the NodeManager, ResourceManager, and JobHistoryServer (a restart sketch follows the config below)
- 配置yarn-site.xml
<!-- Enable log aggregation -->
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<!-- Retain logs for 7 days -->
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>604800</value>
</property>
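A sketch of the restart sequence, using the Hadoop 2.x per-daemon scripts (yarn-daemon.sh and mr-jobhistory-daemon.sh):
[hadoop@master ~]$ yarn-daemon.sh stop resourcemanager
[hadoop@master ~]$ yarn-daemon.sh stop nodemanager
[hadoop@master ~]$ mr-jobhistory-daemon.sh stop historyserver
[hadoop@master ~]$ yarn-daemon.sh start resourcemanager
[hadoop@master ~]$ yarn-daemon.sh start nodemanager
[hadoop@master ~]$ mr-jobhistory-daemon.sh start historyserver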
2.2.5 Configuration Files
- Hadoop's configuration files come in two kinds: default files and site-specific (custom) files. Only when you want to override a setting do you edit a custom file and change the corresponding property value.
- Default configuration file locations

| Default file | Location (inside the Hadoop jars) |
| --- | --- |
| core-default.xml | hadoop-common-2.7.2.jar/core-default.xml |
| hdfs-default.xml | hadoop-hdfs-2.7.2.jar/hdfs-default.xml |
| yarn-default.xml | hadoop-yarn-common-2.7.2.jar/yarn-default.xml |
| mapred-default.xml | hadoop-mapreduce-client-core-2.7.2.jar/mapred-default.xml |
- The four custom files core-site.xml, hdfs-site.xml, yarn-site.xml, and mapred-site.xml live under $HADOOP_HOME/etc/hadoop; override properties there as the project requires. The defaults can be consulted by unpacking them from the jars, as sketched below.
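For example, to inspect core-default.xml with the JDK's jar tool (the jar path assumes the usual 2.7.2 layout under share/hadoop):
[hadoop@master ~]$ cd $HADOOP_HOME/share/hadoop/common
[hadoop@master common]$ jar xf hadoop-common-2.7.2.jar core-default.xml
[hadoop@master common]$ less core-default.xml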
2.3 Fully Distributed Mode
2.3.1 Cluster Configuration
- Core configuration file: core-site.xml
<configuration>
<!-- Filesystem schema (URI) Hadoop uses, i.e. the address of the HDFS NameNode -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://master:9000</value>
</property>
<!-- Directory for the files Hadoop generates at runtime -->
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/software/hadoop-2.7.2/data/tmp</value>
</property>
</configuration>
- HDFS configuration files
- Configure hadoop-env.sh: set JAVA_HOME
- Configure hdfs-site.xml
hadoop-env.sh
# The java implementation to use.
export JAVA_HOME=/opt/software/jdk1.8.0_121
hdfs-site.xml
<configuration>
<!-- Number of HDFS replicas -->
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
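<!-- How often (in seconds) the SecondaryNameNode checkpoints the namespace -->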
<property>
<name>dfs.namenode.checkpoint.period</name>
<value>360</value>
</property>
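<!-- Put the SecondaryNameNode on slave01 -->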
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>slave01:50090</value>
</property>
</configuration>
- Configure YARN: yarn-site.xml
<configuration>
<!-- Site specific YARN configuration properties -->
<!-- Address of the YARN ResourceManager -->
<property>
<name>yarn.resourcemanager.hostname</name>
<value>master</value>
</property>
<!-- How reducers fetch data -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<!-- Enable log aggregation -->
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<!-- Retain logs for 7 days -->
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>604800</value>
</property>
</configuration>
- MapReduce configuration: mapred-site.xml
<configuration>
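<!-- Run MapReduce on YARN -->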
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>master:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>master:19888</value>
</property>
<!-- Log server URL, so logs aggregated from jobs run on YARN (including third-party frameworks) can be viewed -->
<property>
<name>yarn.log.server.url</name>
<value>http://master:19888/jobhistory/logs</value>
</property>
</configuration>
- Note: distribute the finished configuration files to every node (see the sketch below)
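For example, with scp and this cluster's slave hostnames (a sketch; rsync works equally well):
[hadoop@master ~]$ for host in slave01 slave02 slave03; do scp -r $HADOOP_HOME/etc/hadoop/ hadoop@$host:$HADOOP_HOME/etc/; done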
2.3.2 Starting Daemons Node by Node
Same as in the pseudo-distributed mode above
2.3.3 Configuring Passwordless SSH Login
- Generate a key pair
[hadoop@master hadoop]$ ssh-keygen -t rsa
- Copy the public key to every machine that needs passwordless access (see the loop sketch below)
[hadoop@master hadoop]$ ssh-copy-id hadoop@slave01
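Repeat for every node, including master itself, since start-dfs.sh also logs into the local host over ssh; a loop sketch using this cluster's hostnames:
[hadoop@master hadoop]$ for host in master slave01 slave02 slave03; do ssh-copy-id hadoop@$host; done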
2.3.4 Starting the Whole Cluster
- Configure the slaves file ($HADOOP_HOME/etc/hadoop/slaves): one worker hostname per line, with no extra spaces or blank lines
[hadoop@master hadoop]$ cat slaves
slave01
slave02
slave03
- Start the cluster
# Start HDFS
[hadoop@master sbin]$ start-dfs.sh
# Start YARN
[hadoop@master sbin]$ start-yarn.sh
Note: if the NameNode and ResourceManager are on different machines, do not start YARN from the NameNode's host; run start-yarn.sh on the machine where the ResourceManager lives. A verification sketch follows.
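Once both scripts return, the daemons on every node can be checked in one pass (a sketch using this cluster's hostnames):
[hadoop@master sbin]$ for host in master slave01 slave02 slave03; do echo "== $host =="; ssh $host jps; done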
2.3.5 Starting and Stopping the Cluster
- Stop HDFS
[hadoop@master sbin]$ stop-dfs.sh
- Stop YARN
[hadoop@master sbin]$ stop-yarn.sh
2.3.6 Cluster Time Synchronization
- Approach: designate one machine as the time server, and have all other machines sync with it on a schedule.
- Configure the NTP server (a minimal sketch follows)
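A minimal sketch for a CentOS 7 style setup, assuming master is the time server and the cluster sits on a hypothetical 192.168.1.0/24 subnet; adjust addresses to your network:
# On master: allow the cluster subnet in /etc/ntp.conf, then enable ntpd
restrict 192.168.1.0 mask 255.255.255.0 nomodify notrap
[root@master ~]# systemctl start ntpd
[root@master ~]# systemctl enable ntpd
# On every other node: pull time from master every 10 minutes via cron
[root@slave01 ~]# crontab -e
*/10 * * * * /usr/sbin/ntpdate master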