Good morning, everyone! Things got busy after the May Day holiday and I haven't posted in almost half a month — it was starting to feel like I'd abandoned the blog — so I figured I'd write something up to share with you all.
  Today I'm bringing you an installation guide for the latest Hadoop 3.3.2. This cluster build is different from my previous ones: I recently learned script-based distribution, and from now on all my cluster setups will use it.
  Without further ado, let's get building.
1、First, prepare 3 servers (VMs are fine). Note: Hadoop is written in Java, so install the JDK beforehand (an earlier post covers JDK installation).

2、Check the JDK version to confirm:

[root@VM-4-12-centos jdk1.8.0_261]# java -version
java version "1.8.0_261"
Java(TM) SE Runtime Environment (build 1.8.0_261-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.261-b12, mixed mode)

3、The official download is on the Hadoop website; for a faster download, you can grab the package from a big-vendor mirror, e.g.:
https://mirrors.cloud.tencent.com/apache/hadoop/common/hadoop-3.3.2/

4、Create a hadoop folder to hold the Hadoop package:

[root@VM-4-12-centos /]# cd opt/ && mkdir hadoop
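
As an alternative to uploading the package from your own machine (next step), you could pull it straight down on the server (a sketch — assuming wget is installed and the mirror path above is still live):

[root@VM-4-12-centos /]# cd /opt/hadoop && wget https://mirrors.cloud.tencent.com/apache/hadoop/common/hadoop-3.3.2/hadoop-3.3.2.tar.gz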

5、Enter the folder and upload the tarball (it's a .tar.gz, not a jar):

[root@VM-4-12-centos /]# cd opt/hadoop/
[root@VM-4-12-centos hadoop]# ll
total 623696
-rw-r--r-- 1 root root 638660563 May 11 09:19 hadoop-3.3.2.tar.gz

6、Extract it; this takes a while, so hang tight:

[root@VM-4-12-centos hadoop]# tar -zxvf hadoop-3.3.2.tar.gz
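
Optionally, verify the tarball's integrity (a sketch — Apache publishes a .sha512 checksum file alongside each release; compare its value against this output):

[root@VM-4-12-centos hadoop]# sha512sum hadoop-3.3.2.tar.gz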

7、Enter the extracted directory. Since this is Java software, it's best to set up the environment variables:

[root@VM-4-12-centos hadoop]# cd hadoop-3.3.2/
[root@VM-4-12-centos hadoop-3.3.2]# pwd
/opt/hadoop/hadoop-3.3.2
[root@VM-4-12-centos hadoop-3.3.2]# vi /etc/profile
# /etc/profile

# System wide environment and startup programs, for login setup
# Functions and aliases go in /etc/bashrc

export JAVA_HOME=/opt/jdk/jdk1.8.0_333
export PATH=$PATH:$JAVA_HOME/bin

export HADOOP_HOME=/opt/hadoop/hadoop-3.3.2
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin

8、Reload the configuration so it takes effect right away:

[root@VM-4-12-centos hadoop-3.3.2]# source /etc/profile

9、Type hadoop to check whether the configuration worked. If you see the usage message below, the single-node setup is OK:

[root@VM-4-12-centos hadoop-3.3.2]# hadoop
Usage: hadoop [OPTIONS] SUBCOMMAND [SUBCOMMAND OPTIONS]
 or    hadoop [OPTIONS] CLASSNAME [CLASSNAME OPTIONS]
  where CLASSNAME is a user-provided Java class

  OPTIONS is none or any of:

buildpaths                       attempt to add class files from build tree
--config dir                     Hadoop config directory
--debug                          turn on shell script debug mode
--help                           usage information
hostnames list[,of,host,names]   hosts to use in slave mode
hosts filename                   list of hosts to use in slave mode
loglevel level                   set the log4j level for this command
workers                          turn on worker mode

10、We now have a single server set up, but cluster deployment means getting it onto the other machines. So here comes the main event of this post: cluster distribution. I'll cover three ways to copy across the cluster: the traditional secure copy, scp; the remote sync tool, rsync; and finally the currently popular distribution script, xsync.

11、Add host mappings (the scp distribution steps use hostnames, and IP addresses are long, so a mapping saves typing — though if you don't mind the hassle you can skip it). Adjust the other two servers as well:

[root@VM-4-12-centos hadoop]# vi /etc/hosts

127.0.0.1 VM-4-12-centos VM-4-12-centos
127.0.0.1 localhost.localdomain localhost
127.0.0.1 localhost4.localdomain4 localhost4

::1 VM-4-12-centos VM-4-12-centos
::1 localhost.localdomain localhost
::1 localhost6.localdomain6 localhost6

<your private IP 1>   tencent01
<your private IP 2>   tencent02
<your private IP 3>   tencent03

[root@VM-4-2-centos /]# vi /etc/hosts
127.0.0.1 VM-4-2-centos VM-4-2-centos
127.0.0.1 localhost.localdomain localhost
127.0.0.1 localhost4.localdomain4 localhost4

::1 VM-4-2-centos VM-4-2-centos
::1 localhost.localdomain localhost
::1 localhost6.localdomain6 localhost6

<your private IP 1>   tencent01
<your private IP 2>   tencent02
<your private IP 3>   tencent03

[root@VM-12-13-centos /]# vi /etc/hosts
127.0.0.1 VM-12-13-centos VM-12-13-centos
127.0.0.1 localhost.localdomain localhost
127.0.0.1 localhost4.localdomain4 localhost4

::1 VM-12-13-centos VM-12-13-centos
::1 localhost.localdomain localhost
::1 localhost6.localdomain6 localhost6

<your private IP 1>   tencent01
<your private IP 2>   tencent02
<your private IP 3>   tencent03

Note: changes to /etc/hosts take effect immediately, so nothing needs to be reloaded. It's worth verifying the mappings with a quick ping, as shown below.
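
A quick check (a sketch — assuming the private IPs you entered are correct and reachable):

[root@VM-4-12-centos hadoop]# ping -c 1 tencent02
[root@VM-4-12-centos hadoop]# ping -c 1 tencent03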

12、Then confirm that the other two servers have the /opt/hadoop folder:

[root@VM-4-2-centos hadoop]# pwd
/opt/hadoop
[root@VM-12-13-centos hadoop]# pwd
/opt/hadoop

13、If you want the fastest setup, go straight to the third method, the distribution script, in step 17. Otherwise, copy the files with scp as below; during the copy you'll need to type yes once, then enter the other server's login password.

[root@VM-4-12-centos hadoop-3.3.2]# cd ..
[root@VM-4-12-centos hadoop]# ll
total 623700
drwxr-xr-x 10  501 dialout      4096 Feb 22 04:42 hadoop-3.3.2
-rw-r--r--  1 root root    638660563 May 11 09:19 hadoop-3.3.2.tar.gz
[root@VM-4-12-centos hadoop]# scp -r hadoop-3.3.2 root@tencent02:/opt/hadoop

14、The traditional scp copy is slow, so be patient. Now for the second method, rsync. Its advantage over a plain copy: it doesn't re-transfer data that hasn't changed. A sketch of the equivalent command is below.
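
For example, mirroring the scp command above (a sketch — rsync must be installed on both machines):

[root@VM-4-12-centos hadoop]# rsync -av hadoop-3.3.2 root@tencent02:/opt/hadoop

-a preserves permissions and timestamps and recurses into directories; -v prints progress. On a second run, unchanged files are skipped.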


15、While the first server is busy copying to the second, the third hasn't been synced yet. To save time, we can have the third server pull the data from the first instead.

First switch to the third server:

[root@VM-12-13-centos hadoop]# pwd
/opt/hadoop
[root@VM-12-13-centos hadoop]# scp -r root@tencent01:/opt/hadoop/hadoop-3.3.2 ./

16、Once the copies finish, remember to configure the environment variables on the other two servers. The steps are above, so I won't repeat them here.

17、At last: xsync script distribution. In real-world work this is the most practical method, because it distributes so quickly.
Switch to the first server and create a folder:

[root@VM-4-12-centos hadoop]# mkdir xsync_bin
[root@VM-4-12-centos hadoop]# cd xsync_bin/
[root@VM-4-12-centos xsync_bin]# pwd
/opt/hadoop/xsync_bin

18、Then add it to the global environment variables:

[root@VM-4-12-centos xsync_bin]# vi /etc/profile
#/etc/profile

# System wide environment and startup programs, for login setup
# Functions and aliases go in /etc/bashrc

export JAVA_HOME=/opt/jdk/jdk1.8.0_333
export PATH=$PATH:$JAVA_HOME/bin

export HADOOP_HOME=/opt/hadoop/hadoop-3.3.2
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin

export XSYNC_HOME=/opt/hadoop/xsync_bin
export PATH=$PATH:$XSYNC_HOME

19、Reload the configuration and take a look (the duplicate entries in PATH come from sourcing /etc/profile repeatedly; they're harmless):

[root@VM-4-12-centos xsync_bin]# source /etc/profile
[root@VM-4-12-centos xsync_bin]# echo $PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/opt/jdk/jdk1.8.0_333/bin:/root/bin:/opt/jdk/jdk1.8.0_333/bin:/opt/hadoop/hadoop-3.3.2/bin:/opt/hadoop/hadoop-3.3.2/sbin:/opt/jdk/jdk1.8.0_333/bin:/opt/hadoop/hadoop-3.3.2/bin:/opt/hadoop/hadoop-3.3.2/sbin:/opt/jdk/jdk1.8.0_333/bin:/opt/hadoop/hadoop-3.3.2/bin:/opt/hadoop/hadoop-3.3.2/sbin:/opt/hadoop/xsync_bin

20、In this folder, write the xsync script. It takes some shell background to read, but don't worry if you can't — here's the gist: $# is the number of arguments, and if none were given the script prints a message and exits; otherwise it loops over the second and third servers and, for every argument, resolves the absolute path of its parent directory, creates that directory on the remote host, and rsyncs the file or folder over.

[root@VM-4-12-centos xsync_bin]# vim xsync
#!/bin/bash

#1. Check the argument count
if [ $# -lt 1 ]
then
  echo "Not Enough Arguments!"
  exit
fi

#2. Loop over every machine in the cluster
for host in tencent02 tencent03
do
  echo "====================  $host  ===================="
  #3. Loop over every file/directory given and send each one
  for file in "$@"
  do
    #4. Check that the file exists
    if [ -e "$file" ]
    then
      #5. Get the parent directory (absolute path, symlinks resolved)
      pdir=$(cd -P "$(dirname "$file")"; pwd)
      #6. Get the file's base name
      fname=$(basename "$file")
      ssh "$host" "mkdir -p $pdir"
      rsync -av "$pdir/$fname" "$host:$pdir"
    else
      echo "$file does not exist!"
    fi
  done
done

21、Give the script execute permission:

[root@VM-4-12-centos xsync_bin]# chmod 777 xsync 
[root@VM-4-12-centos xsync_bin]# ll
total 4
-rwxrwxrwx 1 root root 675 May 11 12:28 xsync

22、Before syncing, let's set up RSA keys once so the script doesn't keep stopping to ask for passwords. We configure passwordless login by copying our public key to the other servers; at login the client proves it holds the matching private key, so no password is needed.

Run the command and just keep pressing Enter; it generates an RSA key pair:

[root@VM-4-12-centos hadoop-3.3.2]# cd /root/
[root@VM-4-12-centos hadoop-3.3.2]# ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa): 
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.

23、List everything in the home directory, including hidden files:

[root@VM-4-12-centos ~]# ls -al
total 52
dr-xr-x---.  5 root root 4096 May 12 08:27 .
dr-xr-xr-x. 19 root root 4096 May 12 08:32 ..
-rw-------   1 root root 1358 May 12 08:31 .bash_history
-rw-r--r--.  1 root root   18 May 11  2019 .bash_logout
-rw-r--r--.  1 root root  176 May 11  2019 .bash_profile
-rw-r--r--.  1 root root  176 May 11  2019 .bashrc
drwx------   2 root root 4096 Mar 10 19:25 .cache
-rw-r--r--.  1 root root  100 May 11  2019 .cshrc
drwxr-xr-x   2 root root 4096 Jun 10  2021 .pip
-rw-r--r--   1 root root   73 May 11 20:32 .pydistutils.cfg
drwx------   2 root root 4096 May 12 08:31 .ssh
-rw-r--r--.  1 root root  129 May 11  2019 .tcshrc
-rw-------   1 root root  910 May 12 08:27 .viminfo

24、Go into the .ssh folder and you'll see two rsa files. The first, id_rsa, is the private key — a long blob you can view with cat. The second, id_rsa.pub, is the public key, which is what we send to the other two servers.

[root@VM-4-2-centos ~]# cd ./.ssh/
[root@VM-4-2-centos .ssh]# ll

(the listing shows the two key files: id_rsa and id_rsa.pub)

25、Send it. The first send asks for the password; you can also set up key login for this machine itself. The other two servers need to repeat steps 22 through 25 as well:

[root@VM-4-12-centos .ssh]# 
[root@VM-4-12-centos .ssh]# ssh-copy-id tencent02
[root@VM-4-12-centos .ssh]# ssh-copy-id tencent03
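
To confirm passwordless login works, each of these should print the remote hostname without asking for a password:

[root@VM-4-12-centos .ssh]# ssh tencent02 hostname
[root@VM-4-12-centos .ssh]# ssh tencent03 hostname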

26、Before syncing Hadoop, sync the JDK first. Remember to set the environment variables (both JDK and Hadoop) on the other two servers afterwards — see steps 18 and 19:

[root@VM-4-12-centos .ssh]# cd /opt/jdk/
[root@VM-4-12-centos jdk]# /opt/hadoop/xsync_bin/xsync ./jdk1.8.0_333/
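
You can confirm the JDK landed intact by running the synced binary over ssh (a sketch):

[root@VM-4-12-centos jdk]# ssh tencent02 "/opt/jdk/jdk1.8.0_333/bin/java -version"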

27、Now sync Hadoop. The package is big, so this takes a while — and if passwordless login isn't set up on a machine, you'll be asked for its password repeatedly. Be patient:

[root@VM-4-12-centos jdk]# cd ../hadoop/
[root@VM-4-12-centos hadoop]# ll
total 623704
drwxr-xr-x 10  501 dialout      4096 Feb 22 04:42 hadoop-3.3.2
-rw-r--r--  1 root root    638660563 May 11 09:19 hadoop-3.3.2.tar.gz
drwxr-xr-x  2 root root         4096 May 11 12:28 xsync_bin
[root@VM-4-12-centos hadoop]# ./xsync_bin/xsync ./hadoop-3.3.2

28、Switch to the other two servers — success!

(both servers now show the hadoop-3.3.2 directory under /opt/hadoop)

29、With all three servers provisioned, it's time to configure the cluster.
Switch to the first server and open the core config files. Hadoop's configuration is the most unforgiving I've ever dealt with — be careful, these files can't be off by even a hair:

[root@VM-4-12-centos hadoop]# cd hadoop-3.3.2/etc/hadoop/
[root@VM-4-12-centos hadoop]# pwd
/opt/hadoop/hadoop-3.3.2/etc/hadoop
[root@VM-4-12-centos hadoop]# vi core-site.xml
<configuration>
<!-- NameNode address -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://tencent01:8020</value>
    </property>

    <!-- Hadoop data storage directory -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/hadoop/hadoop-3.3.2/data</value>
    </property>

</configuration>
[root@VM-4-12-centos hadoop]# vi hdfs-site.xml
<configuration>
    <!-- NameNode web UI address -->
    <property>
        <name>dfs.namenode.http-address</name>
        <value>tencent01:9870</value>
    </property>
    <!-- SecondaryNameNode web UI address -->
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>tencent03:9868</value>
    </property>
</configuration>
[root@VM-4-12-centos hadoop]# vi yarn-site.xml
<configuration>

<!-- Site specific YARN configuration properties -->

   <!-- Use the mapreduce_shuffle auxiliary service -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>

    <!-- ResourceManager host -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>tencent02</value>
    </property>

    <!-- Environment variables inherited by containers -->
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
    </property>

</configuration>
[root@VM-4-12-centos hadoop]# vi mapred-site.xml
<configuration>
    <!-- Run MapReduce on YARN -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
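
Before distributing, you can sanity-check the XML you just edited. Hadoop 3 has a conftest subcommand that validates configuration files (a sketch — I'm assuming the -conffile flag from the Hadoop commands guide; at minimum this catches malformed XML):

[root@VM-4-12-centos hadoop]# hadoop conftest -conffile /opt/hadoop/hadoop-3.3.2/etc/hadoop/core-site.xml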

30、Distribute the configs to the other servers with the script:

[root@VM-4-12-centos hadoop]# /opt/hadoop/xsync_bin/xsync /opt/hadoop/hadoop-3.3.2/etc/hadoop/

31、Configure the workers file (note the filename really is workers). It defaults to localhost; change it to the hosts we need:

[root@VM-4-12-centos hadoop]# pwd
/opt/hadoop/hadoop-3.3.2/etc/hadoop
[root@VM-4-12-centos hadoop]# vi workers
tencent01
tencent02
tencent03
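
One known gotcha: workers must not contain trailing spaces or blank lines, or the start scripts will try to ssh to a garbled hostname. cat -A makes any stray whitespace visible (each line should end with just $):

[root@VM-4-12-centos hadoop]# cat -A workers
tencent01$
tencent02$
tencent03$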

32、Distribute the modified config with the script:

[root@VM-4-12-centos hadoop]# /opt/hadoop/xsync_bin/xsync /opt/hadoop/hadoop-3.3.2/etc/

33、Initialize HDFS by formatting the NameNode (do this only once; re-formatting later creates a new cluster ID, and the DataNodes will refuse to join unless you wipe their data directories first):

[root@VM-4-12-centos hadoop-3.3.2]# cd /opt/hadoop/hadoop-3.3.2
[root@VM-4-12-centos hadoop-3.3.2]# hdfs namenode -format

34、We configured several ports in those XML files, so open them in your cloud provider's security group: on the first server open 8020 (NameNode RPC) and 9870 (NameNode web UI), on the second open 8088 (ResourceManager web UI), and on the third open 9868 (SecondaryNameNode web UI).

35、Add root as the Hadoop startup user. By default Hadoop refuses to start as root, which is seriously annoying:

[root@VM-4-12-centos .ssh]# cd /opt/hadoop/hadoop-3.3.2/etc/hadoop
[root@VM-4-12-centos hadoop]# vi hadoop-env.sh
# The java implementation to use. By default, this environment
# variable is REQUIRED on ALL platforms except OS X!
export JAVA_HOME=/opt/jdk/jdk1.8.0_333

export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_JOURNALNODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root

# Location of Hadoop.  By default, Hadoop will attempt to determine
# this location based upon its execution path.

36、Distribute with the script again:

[root@VM-4-12-centos hadoop]# /opt/hadoop/xsync_bin/xsync /opt/hadoop/hadoop-3.3.2/etc/

37、At last, the exciting moment: format first (skip the format if you already ran it in step 33), then start:

[root@VM-4-12-centos hadoop]# cd ../../
[root@tencent01 hadoop-3.3.2]# hdfs namenode -format
[root@VM-4-12-centos hadoop-3.3.2]# ./sbin/start-dfs.sh 
Starting namenodes on [tencent01]
Last login: Wed May 11 16:18:51 CST 2022 on pts/0
Starting datanodes
Last login: Wed May 11 16:21:19 CST 2022 on pts/0
tencent03: WARNING: /opt/hadoop/hadoop-3.3.2/logs does not exist. Creating.
tencent02: WARNING: /opt/hadoop/hadoop-3.3.2/logs does not exist. Creating.
Starting secondary namenodes [tencent03]
Last login: Wed May 11 16:21:21 CST 2022 on pts/0

38、Check:

[root@tencent01 hadoop-3.3.2]# jps
28476 NameNode
28942 Jps
28638 DataNode
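
Besides jps, you can also poke the NameNode web UI (a sketch — run from the server itself; from your own browser use the public IP instead, since we opened port 9870 in step 34):

[root@tencent01 hadoop-3.3.2]# curl -s -o /dev/null -w "%{http_code}\n" http://tencent01:9870
200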

39、When starting YARN on the second server, you also need to tweak the start and stop scripts to run as root, since they don't run as root by default. Add these lines near the top of both files:

[root@VM-4-2-centos sbin]# vi start-yarn.sh
[root@VM-4-2-centos sbin]# vi stop-yarn.sh
YARN_RESOURCEMANAGER_USER=root
HADOOP_SECURE_DN_USER=yarn
YARN_NODEMANAGER_USER=root

40、Start YARN and check:

[root@tencent02 hadoop-3.3.2]# sbin/start-yarn.sh 
Starting resourcemanager
Last login: Thu May 12 15:54:18 CST 2022 from 183.221.94.151 on pts/0
Last failed login: Thu May 12 16:24:49 CST 2022 from 45.134.26.180 on ssh:notty
There were 2 failed login attempts since the last successful login.
Starting nodemanagers
Last login: Thu May 12 17:48:45 CST 2022 on pts/0
[root@tencent02 hadoop-3.3.2]# jps
20498 Jps
19634 DataNode
20121 NodeManager
19951 ResourceManager

41、Check the third server:

[root@tencent03 hadoop-3.3.2]# jps
19553 NodeManager
19733 Jps
19238 DataNode
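
As a final smoke test, write something into HDFS and read it back (a sketch — run from any node; the /demo path is just an example name):

[root@tencent01 hadoop-3.3.2]# hdfs dfs -mkdir /demo
[root@tencent01 hadoop-3.3.2]# echo "hello hadoop" | hdfs dfs -put - /demo/hello.txt
[root@tencent01 hadoop-3.3.2]# hdfs dfs -cat /demo/hello.txt
hello hadoop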

42、Finally, let me quietly say: to all you handsome guys and lovely ladies in front of the screen who made it this far, please give your boy the triple combo — a like, a follow, and a favorite. Your support is my biggest motivation to keep posting. Thanks!