1. Installing Hadoop

1.1 Pre-installation Notes

  • Hadoop requires a JDK already installed on the machine, with the JAVA_HOME variable configured
  • Hadoop starts several different kinds of processes, so hostname-to-IP mappings must be configured
  • Mind the permissions: Hadoop produces a lot of data at runtime, and the user who starts the Hadoop processes must have write access to the data directories
  • Turn off the firewall (a quick pre-flight sketch follows this list)
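
A minimal pre-flight sketch of these checks; the firewalld commands assume a CentOS 7 host like the one in this doc, and the IP mapping shown is an example taken from later output:

[hadoop@master ~]$ java -version                        # JDK must already be installed
[hadoop@master ~]$ echo $JAVA_HOME                      # must point at the JDK install directory
[hadoop@master ~]$ cat /etc/hosts                       # needs mappings such as "172.23.4.221 master"
[hadoop@master ~]$ sudo systemctl stop firewalld        # stop the firewall now
[hadoop@master ~]$ sudo systemctl disable firewalld     # keep it off across reboots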

1.2 Hadoop Installation Steps

  • Go to the directory holding the Hadoop tarball and extract it with tar -zxvf
  • Add Hadoop to the environment variables (see the sketch below)
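
A sketch of both steps, assuming the 2.7.2 tarball and the /opt/software layout used throughout this doc:

[hadoop@master ~]$ tar -zxvf hadoop-2.7.2.tar.gz -C /opt/software/

# Append to /etc/profile (or ~/.bashrc) and then `source` it:
export HADOOP_HOME=/opt/software/hadoop-2.7.2
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin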

1.2.1 Hadoop Directory Layout

  • bin: scripts for operating the Hadoop services (HDFS, YARN)
  • etc: the Hadoop configuration files
  • lib: Hadoop's native libraries
  • sbin: scripts for starting and stopping the Hadoop services
  • share: Hadoop's dependency jars

2. Hadoop Run Modes

  • Hadoop supports three run modes: local (standalone), pseudo-distributed, and fully distributed

2.1 Local (Standalone) Mode

  • The official WordCount example
  • Create a wcinput directory (under /opt here) and change its owner and group to the hadoop user
[hadoop@master opt]$ ll
total 0
drwxr-xr-x.  2 hadoop hadoop  47 Nov 25 13:38 jar-package
drwxrwxrwx. 16 hadoop hadoop 241 Nov 25 10:33 mrinput
drwxr-xr-x.  2 hadoop hadoop  67 Nov 21 20:04 package
drwxr-xr-x.  2 hadoop hadoop   6 Sep  7  2017 rh
drwxr-xr-x.  2 hadoop hadoop  35 Nov 22 20:13 shell-script
drwxr-xr-x.  4 hadoop hadoop  46 Nov 24 19:38 software
drwxr-xr-x.  2 hadoop hadoop   6 Nov 30 15:19 wcinput
  • Create a file named wc.input inside wcinput and type some words into it as test data
  • Go to share/hadoop/mapreduce under the Hadoop root directory
  • Run the program (an end-to-end sketch follows the command below)
[hadoop@master mapreduce]$ hadoop jar hadoop-mapreduce-examples-2.7.2.jar wordcount /opt/wcinput/ /opt/software/output
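
An end-to-end sketch of this run; the words written to wc.input are example content, and since local mode uses the local filesystem, both the input and output paths are ordinary directories:

[hadoop@master ~]$ mkdir -p /opt/wcinput
[hadoop@master ~]$ echo "hello hi hello you" > /opt/wcinput/wc.input
[hadoop@master ~]$ cd /opt/software/hadoop-2.7.2/share/hadoop/mapreduce
[hadoop@master mapreduce]$ hadoop jar hadoop-mapreduce-examples-2.7.2.jar wordcount /opt/wcinput/ /opt/software/output
[hadoop@master mapreduce]$ cat /opt/software/output/part-r-00000   # view the word counts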

2.2 Pseudo-distributed Mode

2.2.1 Starting HDFS and Running a MapReduce Program

  • Plan: configure the cluster, start and test it, then run the WordCount example
  • Steps: first, configure the cluster

Configure hadoop-env.sh: set JAVA_HOME to the JDK path

# The java implementation to use.
export JAVA_HOME=/opt/software/jdk1.8.0_121

Configure core-site.xml

<!-- Address of the HDFS NameNode -->
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
</property>

<!-- Directory for the files Hadoop generates at runtime -->
<property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/software/hadoop-2.7.2/data/tmp</value>
</property>

Configure hdfs-site.xml

<!-- Number of HDFS replicas (1, since pseudo-distributed mode has a single DataNode) -->
<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>
  • Start the cluster

Format the NameNode (only needed before the first start-up)

[hadoop@master hadoop]$ hadoop namenode -format

Start the NameNode

[hadoop@master hadoop]$ hadoop-daemon.sh start namenode

Start the DataNode

[hadoop@master hadoop]$ hadoop-daemon.sh start datanode

Verify the cluster (jps should list the NameNode and DataNode processes)

[hadoop@master hadoop]$ jps

Note: formatting the NameNode generates a new cluster ID. Re-formatting while old DataNode data is still around leaves the NameNode and DataNode with mismatched cluster IDs, and the cluster can no longer find its previous data. So before re-formatting the NameNode, always delete the data directory and the logs first (a sketch follows).
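
A sketch of that clean-up sequence, assuming the data directory set in core-site.xml above and the default logs directory under the Hadoop root:

[hadoop@master hadoop]$ hadoop-daemon.sh stop datanode
[hadoop@master hadoop]$ hadoop-daemon.sh stop namenode
[hadoop@master hadoop]$ rm -rf /opt/software/hadoop-2.7.2/data /opt/software/hadoop-2.7.2/logs
[hadoop@master hadoop]$ hadoop namenode -format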

  • Operate the cluster

Create an input folder on the HDFS file system

[hadoop@master hadoop]$ hadoop fs -mkdir /input
[hadoop@master hadoop]$ hadoop fs -ls /
Found 8 items
-rw-r--r--   3 hadoop supergroup  289927651 2020-11-23 11:14 /Yuri_s_Revenge_1.001.exe
drwxr-xr-x   - hadoop supergroup          0 2020-11-25 13:40 /flowbean
drwxr-xr-x   - hadoop supergroup          0 2020-11-30 15:50 /input
-rw-r--r--   3 hadoop supergroup     603940 2020-11-23 10:43 /rename.pdf
drwxr-xr-x   - hadoop supergroup          0 2020-11-24 21:03 /test.har
drwxrwx---   - hadoop supergroup          0 2020-11-22 21:31 /tmp
drwxrwxrwx   - root   root                0 2020-11-25 12:45 /wordcount
-rw-r--r--   3 hadoop supergroup   10485760 2020-11-23 12:25 /算法笔记.胡凡.pdf

Upload the test file to the file system

[hadoop@master ~]$ hadoop fs -put data.txt /input
[hadoop@master ~]$ hadoop fs -ls /input
Found 1 items
-rw-r--r--   2 hadoop supergroup         66 2020-11-30 15:52 /input/data.txt

Run the MapReduce program

[hadoop@master ~]$ hadoop jar /opt/software/hadoop-2.7.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount /input /output
20/11/30 15:54:52 INFO client.RMProxy: Connecting to ResourceManager at master/172.23.4.221:8032
20/11/30 15:54:52 INFO input.FileInputFormat: Total input paths to process : 1
20/11/30 15:54:52 INFO mapreduce.JobSubmitter: number of splits:1
20/11/30 15:54:53 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1606222715387_0007
20/11/30 15:54:53 INFO impl.YarnClientImpl: Submitted application application_1606222715387_0007
20/11/30 15:54:53 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1606222715387_0007/
20/11/30 15:54:53 INFO mapreduce.Job: Running job: job_1606222715387_0007
20/11/30 15:54:58 INFO mapreduce.Job: Job job_1606222715387_0007 running in uber mode : false
20/11/30 15:54:58 INFO mapreduce.Job:  map 0% reduce 0%
20/11/30 15:55:02 INFO mapreduce.Job:  map 100% reduce 0%
20/11/30 15:55:06 INFO mapreduce.Job:  map 100% reduce 100%
20/11/30 15:55:06 INFO mapreduce.Job: Job job_1606222715387_0007 completed successfully
20/11/30 15:55:06 INFO mapreduce.Job: Counters: 49
        File System Counters
                FILE: Number of bytes read=78
                FILE: Number of bytes written=235315
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=164
                HDFS: Number of bytes written=44
                HDFS: Number of read operations=6
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Job Counters 
                Launched map tasks=1
                Launched reduce tasks=1
                Data-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=1782
                Total time spent by all reduces in occupied slots (ms)=2126
                Total time spent by all map tasks (ms)=1782
                Total time spent by all reduce tasks (ms)=2126
                Total vcore-milliseconds taken by all map tasks=1782
                Total vcore-milliseconds taken by all reduce tasks=2126
                Total megabyte-milliseconds taken by all map tasks=1824768
                Total megabyte-milliseconds taken by all reduce tasks=2177024
        Map-Reduce Framework
                Map input records=4
                Map output records=15
                Map output bytes=127
                Map output materialized bytes=78
                Input split bytes=98
                Combine input records=15
                Combine output records=7
                Reduce input groups=7
                Reduce shuffle bytes=78
                Reduce input records=7
                Reduce output records=7
                Spilled Records=14
                Shuffled Maps =1
                Failed Shuffles=0
                Merged Map outputs=1
                GC time elapsed (ms)=100
                CPU time spent (ms)=1170
                Physical memory (bytes) snapshot=436559872
                Virtual memory (bytes) snapshot=4255371264
                Total committed heap usage (bytes)=310902784
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters 
                Bytes Read=66
        File Output Format Counters 
                Bytes Written=44

View the results

[hadoop@master ~]$ hadoop fs -cat /output/part-r-00000
am      1
fine    2
hello   3
hi      4
i       1
thanks  2
you     2

2.2.2 Running MapReduce Programs on YARN

  • Configure the cluster

Configure yarn-site.xml

<!-- How reducers fetch data -->
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>

<!-- Hostname of the YARN ResourceManager -->
<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master</value>
</property>

Configure mapred-site.xml (first rename mapred-site.xml.template to mapred-site.xml; a sketch follows this config)

<!-- Run MapReduce on YARN -->
<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>
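
A sketch of the rename plus the daemon start-up this section needs; yarn-daemon.sh is the Hadoop 2.x per-daemon script, and in pseudo-distributed mode both daemons run on this one host:

[hadoop@master ~]$ cd /opt/software/hadoop-2.7.2/etc/hadoop
[hadoop@master hadoop]$ mv mapred-site.xml.template mapred-site.xml
[hadoop@master hadoop]$ yarn-daemon.sh start resourcemanager
[hadoop@master hadoop]$ yarn-daemon.sh start nodemanager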
  • Operate the cluster (YARN web UI: http://master:8088/cluster)

Delete the output directory from the file system

[hadoop@master ~]$ hadoop fs -rm -R /output
20/11/30 16:03:43 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /output

Run the MapReduce program

[hadoop@master ~]$ hadoop jar /opt/software/hadoop-2.7.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount /input /output

View the results

[hadoop@master ~]$ hadoop fs -cat /output/part-r-00000
am      1
fine    2
hello   3
hi      4
i       1
thanks  2
you     2

2.2.3 Configuring the History Server

  • The history server lets you review past job runs in a browser at http://master:19888/jobhistory
  • Configure mapred-site.xml
<!-- RPC address of the JobHistory server -->
<property>
    <name>mapreduce.jobhistory.address</name>
    <value>master:10020</value>
</property>
<!-- Address of the JobHistory web UI -->
<property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>master:19888</value>
</property>
<!-- Log aggregation for third-party frameworks that run on YARN -->
<property>
    <name>yarn.log.server.url</name>
    <value>http://master:19888/jobhistory/logs</value>
</property>
  • Start the history server (a quick check follows the command below)
[hadoop@master ~]$ mr-jobhistory-daemon.sh start historyserver
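
A quick check that it came up (jps ships with the JDK):

[hadoop@master ~]$ jps | grep JobHistoryServer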

2.2.4 Configuring Log Aggregation

  • Log aggregation: once an application finishes, its run logs are uploaded to HDFS and can be viewed at http://master:19888/jobhistory
  • Benefit: job details become easy to inspect, which helps development and debugging
  • Enabling log aggregation requires restarting the NodeManager, the ResourceManager, and the JobHistoryServer (a restart sketch follows the config below)
  • Configure yarn-site.xml
<!-- Enable log aggregation -->
<property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
</property>

<!-- Keep aggregated logs for 7 days (604800 seconds) -->
<property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>604800</value>
</property>
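
A restart sketch for the three daemons mentioned above, using the same per-daemon scripts as the earlier sections:

[hadoop@master ~]$ yarn-daemon.sh stop nodemanager
[hadoop@master ~]$ yarn-daemon.sh stop resourcemanager
[hadoop@master ~]$ mr-jobhistory-daemon.sh stop historyserver
[hadoop@master ~]$ yarn-daemon.sh start resourcemanager
[hadoop@master ~]$ yarn-daemon.sh start nodemanager
[hadoop@master ~]$ mr-jobhistory-daemon.sh start historyserver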

2.2.5 Notes on the Configuration Files

  • Hadoop's configuration files come in two kinds: default configuration files and site-specific (custom) ones. Edit a custom file only when you want to override a particular property.
  • Default configuration file locations

Default file          Located in (inside the Hadoop jars)
core-default.xml      hadoop-common-2.7.2.jar/core-default.xml
hdfs-default.xml      hadoop-hdfs-2.7.2.jar/hdfs-default.xml
yarn-default.xml      hadoop-yarn-common-2.7.2.jar/yarn-default.xml
mapred-default.xml    hadoop-mapreduce-client-core-2.7.2.jar/mapred-default.xml

  • The four custom configuration files (core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml) live under $HADOOP_HOME/etc/hadoop; modify them there as your project requires.

2.3 Fully Distributed Mode

2.3.1 Cluster Configuration

  • Core configuration file: core-site.xml
<configuration>
        <!-- Default filesystem (URI): the address of the HDFS NameNode -->
        <property>
                <name>fs.defaultFS</name>
                <value>hdfs://master:9000</value>
        </property>

        <!-- Directory for the files Hadoop generates at runtime -->
        <property>
                <name>hadoop.tmp.dir</name>
                <value>/opt/software/hadoop-2.7.2/data/tmp</value>
        </property>

</configuration>
  • HDFS configuration files
  • hadoop-env.sh: set JAVA_HOME
  • hdfs-site.xml

hadoop-env.sh

# The java implementation to use.
export JAVA_HOME=/opt/software/jdk1.8.0_121

hdfs-site.xml

<configuration>
        <!-- Number of HDFS replicas -->
        <property>
                <name>dfs.replication</name>
                <value>2</value>
        </property>
        <!-- SecondaryNameNode checkpoint interval, in seconds -->
        <property>
                <name>dfs.namenode.checkpoint.period</name>
                <value>360</value>
        </property>
        <!-- Run the SecondaryNameNode web endpoint on slave01 -->
        <property>
                <name>dfs.namenode.secondary.http-address</name>
                <value>slave01:50090</value>
        </property>

</configuration>
  • YARN configuration: yarn-site.xml
<configuration>
<!-- Site specific YARN configuration properties -->
<!-- Hostname of the YARN ResourceManager -->
        <property>
                <name>yarn.resourcemanager.hostname</name>
                <value>master</value>
        </property>

        <!-- How reducers fetch data -->
        <property>
                <name>yarn.nodemanager.aux-services</name>
                <value>mapreduce_shuffle</value>
        </property>

        <!-- Enable log aggregation -->
        <property>
                <name>yarn.log-aggregation-enable</name>
                <value>true</value>
        </property>

        <!-- Keep aggregated logs for 7 days -->
        <property>
                <name>yarn.log-aggregation.retain-seconds</name>
                <value>604800</value>
        </property>

</configuration>
  • MapReduce configuration: mapred-site.xml
<configuration>
        <property>
                <name>mapreduce.framework.name</name>
                <value>yarn</value>
        </property>
        <property>
                <name>mapreduce.jobhistory.address</name>
                <value>master:10020</value>
        </property>
        <property>
                <name>mapreduce.jobhistory.webapp.address</name>
                <value>master:19888</value>
        </property>
        <!-- Log aggregation for third-party frameworks that run on YARN -->
        <property>
                <name>yarn.log.server.url</name>
                <value>http://master:19888/jobhistory/logs</value>
        </property>
</configuration>
  • Note: distribute the finished configuration files to every node (a sketch follows)
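
One way to distribute them, as a sketch; scp is used here (rsync would also work), and the hostnames are the workers listed in the slaves file below:

[hadoop@master ~]$ for host in slave01 slave02 slave03; do
>   scp -r /opt/software/hadoop-2.7.2/etc/hadoop hadoop@$host:/opt/software/hadoop-2.7.2/etc/
> done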

2.3.2 Starting Daemons Node by Node

Same as the pseudo-distributed setup above: start each daemon on its own node with hadoop-daemon.sh / yarn-daemon.sh.

2.3.3 Configuring Passwordless SSH Login

  • Generate a key pair
[hadoop@master hadoop]$ ssh-keygen -t rsa
  • Copy the public key to every machine that needs passwordless login (a loop covering the whole cluster follows the command below)
[hadoop@master hadoop]$ ssh-copy-id hadoop@slave01
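
To cover the whole cluster in one go (including master itself, which start-dfs.sh also logs in to), a loop sketch:

[hadoop@master hadoop]$ for host in master slave01 slave02 slave03; do ssh-copy-id hadoop@$host; done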

2.3.4 Starting the Cluster as a Whole

  • Configure slaves (one worker hostname per line in $HADOOP_HOME/etc/hadoop/slaves; avoid trailing spaces and blank lines)
[hadoop@master hadoop]$ cat slaves 
slave01
slave02
slave03
  • Start the cluster
# Start HDFS
[hadoop@master sbin]$ start-dfs.sh

# Start YARN
[hadoop@master sbin]$ start-yarn.sh

Note: if the NameNode and the ResourceManager run on different machines, do not start YARN from the NameNode's machine; start YARN on the machine where the ResourceManager lives (a jps check across all nodes follows).
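
A quick verification sketch after start-up, listing the Java processes on every node over SSH (assumes jps is on each node's non-interactive PATH):

[hadoop@master sbin]$ for host in master slave01 slave02 slave03; do echo "== $host =="; ssh hadoop@$host jps; done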

2.3.5 Cluster Start/Stop Commands

  • Stop HDFS
[hadoop@master sbin]$ stop-dfs.sh
  • Stop YARN
[hadoop@master sbin]$ stop-yarn.sh

2.3.6 Cluster Time Synchronization

  • Approach: designate one machine as the time server and have every other machine sync with it on a schedule.
  • Configure the NTP server (a sketch follows)
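
A sketch of an ntpd-based setup on CentOS 7, assuming the ntp package is installed; the subnet is an example, so adjust it to your network:

# On master, add to /etc/ntp.conf:
restrict 172.23.4.0 mask 255.255.255.0 nomodify notrap   # let the cluster subnet sync from this host
server 127.127.1.0                                       # fall back to the local clock
fudge  127.127.1.0 stratum 10

# Then start the service on master:
[hadoop@master ~]$ sudo systemctl start ntpd && sudo systemctl enable ntpd

# On each slave, sync every 10 minutes from root's crontab:
*/10 * * * * /usr/sbin/ntpdate master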