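1. Introduction to Hadoop

1.1 Official website

Hadoop official website: hadoop.apache.org
Other Apache projects generally follow the same XXX.apache.org pattern, e.g. spark.apache.org and kafka.apache.org.
Learn to read the official documentation, especially to look up configuration parameters.
In the broad sense, "Hadoop" refers to the ecosystem built around the Apache Hadoop software, including but not limited to Hive, Sqoop, Flume, Spark, Flink and HBase. In the narrow sense, it refers to the Apache Hadoop software itself, which is open source.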

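1.2 Usage of each version

Apache Hadoop software:

| Hadoop version | Usage                     | Corresponding CDH version |
|----------------|---------------------------|---------------------------|
| 1.x            | Rarely used               |                           |
| 2.x            | Mainstream in enterprises | CDH 5.x series            |
| 3.x            | Being trialled            | CDH 6.x series            |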

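1.3 Choosing the software version

This article uses the CDH build of Hadoop, specifically the hadoop-2.6.0-cdh5.16.2.tar.gz package. The corresponding guide on the official site is https://hadoop.apache.org/docs/r2.10.0/hadoop-project-dist/hadoop-common/SingleCluster.html (note that this links to the r2.10.0 documentation, not 2.6.0).
The advantage of choosing CDH is that you do not have to worry about version compatibility. For example, to install HBase, just take the package under http://archive.cloudera.com/cdh5/cdh/5/hbase-1.2.0-cdh5.16.2 and for Flume the package under http://archive.cloudera.com/cdh5/cdh/5/flume-ng-1.6.0-cdh5.16.2 . As long as every component stays on cdh5.16.2, compatibility between them is already taken care of.
If you later hit a bug that can only be fixed by upgrading, look at the changes.log file of the later releases, check whether that bug has been fixed, and then pick the release to upgrade to. For example: http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.6.0-cdh5.16.2-changes.log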
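
The naming pattern in these URLs can be captured in a small helper. This is only a sketch: the function name `cdh_url` is made up for illustration, and it assumes every artifact is named `<component>-<version>-cdh<cdh_version><suffix>` under the archive directory quoted above. Whether the archive is still reachable is not checked.

```shell
# Build a URL under the CDH 5 archive directory used in this article.
# Assumption: artifacts follow <component>-<version>-cdh<cdh_version><suffix>.
cdh_url() {
    local component="$1"
    local version="$2"
    local cdh_version="$3"
    local suffix="${4:-}"   # optional, e.g. "-changes.log"
    printf 'http://archive.cloudera.com/cdh5/cdh/5/%s-%s-cdh%s%s\n' \
        "$component" "$version" "$cdh_version" "$suffix"
}
```

For example, `cdh_url hadoop 2.6.0 5.16.2 -changes.log` prints the changes.log URL shown above. Keeping the CDH version in one argument makes it harder to mix components from different releases by accident.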

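1.4 The Hadoop framework

| Name      | Role                        | Notes                                                                                                                                              |
|-----------|-----------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------|
| HDFS      | Storage                     |                                                                                                                                                    |
| MapReduce | Computation                 | Hard to develop, verbose, hard to maintain and slow, so MR is rarely written directly any more; people use HQL, Spark or Flink instead              |
| YARN      | Resource and job scheduling | Main resources: memory and vcores                                                                                                                  |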

2. HDFS Deployment

Now you are ready to start your Hadoop cluster in one of the three supported modes:

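| Mode                    | Name                       | Usage                           |
|-------------------------|----------------------------|---------------------------------|
| Local (Standalone) Mode | Local mode                 | Not used                        |
| Pseudo-Distributed Mode | Pseudo-distributed mode    | Learning and testing, 1 machine |
| Fully-Distributed Mode  | Distributed (cluster) mode | Production                      |

This article installs the pseudo-distributed mode.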

2.0 Change the hostname

[root@JD ~]# hostnamectl set-hostname rzdata001
[root@JD ~]# reboot   # reboot so the new hostname takes effect

2.1 Create the user and directories

[root@rzdata001 ~]# useradd ruoze
[root@rzdata001 ~]# su - ruoze
[ruoze@rzdata001 ~]$ mkdir app software sourcecode log tmp data lib
[ruoze@rzdata001 ~]$ ll
total 0
drwxrwxr-x 2 ruoze ruoze 6 Nov 27 21:32 app            # extracted software and symlinks
drwxrwxr-x 2 ruoze ruoze 6 Nov 27 21:32 data           # data
drwxrwxr-x 2 ruoze ruoze 6 Nov 27 21:32 lib            # third-party jars
drwxrwxr-x 2 ruoze ruoze 6 Nov 27 21:32 log            # log directory
drwxrwxr-x 2 ruoze ruoze 6 Nov 27 21:32 software       # archives (tarballs)
drwxrwxr-x 2 ruoze ruoze 6 Nov 27 21:32 sourcecode     # source code builds
drwxrwxr-x 2 ruoze ruoze 6 Nov 27 21:32 tmp            # temporary directory (cf. the system /tmp)
[ruoze@rzdata001 ~]$
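
The directory layout above can also be created by a repeatable helper. A minimal sketch, assuming it is run as the `ruoze` user; the function name `make_layout` is hypothetical.

```shell
# Create the per-user working directories under a base directory.
# mkdir -p makes the call safe to repeat on an existing layout.
make_layout() {
    local base="$1"
    local dir
    for dir in app software sourcecode log tmp data lib; do
        mkdir -p "$base/$dir" || return 1
    done
}
```

Calling `make_layout "$HOME"` gives the same seven directories as the `mkdir` command above, and a second call leaves them untouched.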

2.2 Upload the archive

[ruoze@rzdata001 ~]$ cd  software

Upload the archive with rz.

2.3 Extract

[ruoze@rzdata001 software]$ tar -zxvf hadoop-2.6.0-cdh5.16.2.tar.gz
[ruoze@rzdata001 software]$ ll
total 424180
drwxr-xr-x 14 ruoze ruoze      4096 Jun  3 19:11 hadoop-2.6.0-cdh5.16.2
-rw-r--r--  1 ruoze ruoze 434354462 Nov 28 12:47 hadoop-2.6.0-cdh5.16.2.tar.gz
[ruoze@rzdata001 software]$ 
[ruoze@rzdata001 software]$ 
[ruoze@rzdata001 software]$ mv hadoop-2.6.0-cdh5.16.2 ../app/
[ruoze@rzdata001 software]$ cd ../app/
[ruoze@rzdata001 app]$ ll
total 4
drwxr-xr-x 14 ruoze ruoze 4096 Jun  3 19:11 hadoop-2.6.0-cdh5.16.2
[ruoze@rzdata001 app]$ ln -s hadoop-2.6.0-cdh5.16.2 hadoop 
[ruoze@rzdata001 app]$ ll
total 4
lrwxrwxrwx  1 ruoze ruoze   22 Nov 28 21:24 hadoop -> hadoop-2.6.0-cdh5.16.2
drwxr-xr-x 14 ruoze ruoze 4096 Jun  3 19:11 hadoop-2.6.0-cdh5.16.2
[ruoze@rzdata001 app]$
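
The extract-and-symlink steps can be folded into one function. This is a sketch rather than the exact procedure above: it extracts straight into the app directory instead of extracting and then moving, the name `install_tarball` is made up, and it assumes the archive's top-level directory equals the file name without `.tar.gz` (true for the listing above).

```shell
# Unpack <name>.tar.gz into app_dir and point app_dir/<link_name> at it.
# Assumption: the top-level directory inside the archive is <name>.
install_tarball() {
    local tarball="$1"
    local app_dir="$2"
    local link_name="$3"
    local name
    name=$(basename "$tarball" .tar.gz)
    mkdir -p "$app_dir" || return 1
    tar -xzf "$tarball" -C "$app_dir" || return 1
    # Relative target, like `ln -s hadoop-2.6.0-cdh5.16.2 hadoop` run inside app/.
    # -f and -n replace an existing link instead of creating one inside it.
    ln -sfn "$name" "$app_dir/$link_name"
}
```

For example: `install_tarball ~/software/hadoop-2.6.0-cdh5.16.2.tar.gz ~/app hadoop`. The point of the symlink is that a later upgrade only needs the link to be re-pointed, while paths such as `~/app/hadoop` in configuration stay the same.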

2.4 Environment requirements

2.4.1 Java must be installed (already installed here)
[ruoze@rzdata001 app]$ which java
/usr/java/jdk1.8.0_121/bin/java
[ruoze@rzdata001 app]$
2.4.2 ssh must be installed (already installed here)

2.5 Set JAVA_HOME explicitly

[ruoze@rzdata001 app]$ cd hadoop/etc/hadoop
[ruoze@rzdata001 hadoop]$ pwd
/home/ruoze/app/hadoop/etc/hadoop
[ruoze@rzdata001 hadoop]$ ll
total 156
-rw-r--r-- 1 ruoze ruoze  4436 Jun  3 19:04 capacity-scheduler.xml
-rw-r--r-- 1 ruoze ruoze  1335 Jun  3 19:04 configuration.xsl
-rw-r--r-- 1 ruoze ruoze   318 Jun  3 19:04 container-executor.cfg
-rw-r--r-- 1 ruoze ruoze   774 Jun  3 19:04 core-site.xml
-rw-r--r-- 1 ruoze ruoze  3670 Jun  3 19:04 hadoop-env.cmd
-rw-r--r-- 1 ruoze ruoze  4224 Jun  3 19:04 hadoop-env.sh
-rw-r--r-- 1 ruoze ruoze  2598 Jun  3 19:04 hadoop-metrics2.properties
-rw-r--r-- 1 ruoze ruoze  2490 Jun  3 19:04 hadoop-metrics.properties
-rw-r--r-- 1 ruoze ruoze  9683 Jun  3 19:04 hadoop-policy.xml
-rw-r--r-- 1 ruoze ruoze   775 Jun  3 19:04 hdfs-site.xml
-rw-r--r-- 1 ruoze ruoze  2230 Jun  3 19:04 httpfs-env.sh
-rw-r--r-- 1 ruoze ruoze  1657 Jun  3 19:04 httpfs-log4j.properties
-rw-r--r-- 1 ruoze ruoze    21 Jun  3 19:04 httpfs-signature.secret
-rw-r--r-- 1 ruoze ruoze   620 Jun  3 19:04 httpfs-site.xml
-rw-r--r-- 1 ruoze ruoze  3523 Jun  3 19:04 kms-acls.xml
-rw-r--r-- 1 ruoze ruoze  3139 Jun  3 19:04 kms-env.sh
-rw-r--r-- 1 ruoze ruoze  1788 Jun  3 19:04 kms-log4j.properties
-rw-r--r-- 1 ruoze ruoze  5933 Jun  3 19:04 kms-site.xml
-rw-r--r-- 1 ruoze ruoze 12601 Jun  3 19:04 log4j.properties
-rw-r--r-- 1 ruoze ruoze   938 Jun  3 19:04 mapred-env.cmd
-rw-r--r-- 1 ruoze ruoze  1383 Jun  3 19:04 mapred-env.sh
-rw-r--r-- 1 ruoze ruoze  4113 Jun  3 19:04 mapred-queues.xml.template
-rw-r--r-- 1 ruoze ruoze   758 Jun  3 19:04 mapred-site.xml.template
-rw-r--r-- 1 ruoze ruoze    10 Jun  3 19:04 slaves
-rw-r--r-- 1 ruoze ruoze  2316 Jun  3 19:04 ssl-client.xml.example
-rw-r--r-- 1 ruoze ruoze  2697 Jun  3 19:04 ssl-server.xml.example
-rw-r--r-- 1 ruoze ruoze  2237 Jun  3 19:04 yarn-env.cmd
-rw-r--r-- 1 ruoze ruoze  4567 Jun  3 19:04 yarn-env.sh
-rw-r--r-- 1 ruoze ruoze   690 Jun  3 19:04 yarn-site.xml
[ruoze@rzdata001 hadoop]$ 
[ruoze@rzdata001 hadoop]$ vim hadoop-env.sh 
export JAVA_HOME=/usr/java/jdk1.8.0_121        # known issue: must be set by hand in this file
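
If you prefer not to edit the file by hand, the same change can be scripted. A minimal sketch, assuming GNU sed (`sed -i`) and that the stock hadoop-env.sh contains a line starting with `export JAVA_HOME=`; the function name `set_java_home` is hypothetical.

```shell
# Set JAVA_HOME in a hadoop-env.sh style file.
# Rewrites an existing "export JAVA_HOME=..." line, or appends one if none exists.
# Assumption: the JDK path contains no "|", "&" or "\" characters.
set_java_home() {
    local env_file="$1"
    local java_home="$2"
    if grep -q '^export JAVA_HOME=' "$env_file"; then
        sed -i "s|^export JAVA_HOME=.*|export JAVA_HOME=${java_home}|" "$env_file"
    else
        printf 'export JAVA_HOME=%s\n' "$java_home" >> "$env_file"
    fi
}
```

For example: `set_java_home ~/app/hadoop/etc/hadoop/hadoop-env.sh /usr/java/jdk1.8.0_121`.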

2.6 Configuration files

[ruoze@rzdata001 hadoop]$ cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

192.168.0.3 rzdata001
[ruoze@rzdata001 hadoop]$
[ruoze@rzdata001 hadoop]$ vim core-site.xml

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://rzdata001:9000</value>
    </property>
</configuration>

[ruoze@rzdata001 hadoop]$
[ruoze@rzdata001 hadoop]$ vim hdfs-site.xml

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

[ruoze@rzdata001 hadoop]$
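
The two files above can also be generated from the hostname, which helps when the same setup is repeated on another machine. A sketch only: `write_hdfs_conf` is a made-up name, port 9000 and replication factor 1 are taken from the files above, and the function overwrites both files (including any header comments the stock files carry).

```shell
# Write minimal core-site.xml and hdfs-site.xml for a pseudo-distributed node.
# WARNING: overwrites existing files in conf_dir.
write_hdfs_conf() {
    local conf_dir="$1"
    local host="$2"
    cat > "$conf_dir/core-site.xml" <<EOF || return 1
<?xml version="1.0"?>
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://${host}:9000</value>
    </property>
</configuration>
EOF
    cat > "$conf_dir/hdfs-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
EOF
}
```

For example: `write_hdfs_conf ~/app/hadoop/etc/hadoop rzdata001`. The hostname must resolve through /etc/hosts, as in the listing above.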

2.7 Passwordless ssh trust

[ruoze@rzdata001 hadoop]$ cd 
[ruoze@rzdata001 ~]$ pwd
/home/ruoze
[ruoze@rzdata001 ~]$ ls -la
total 24
drwx------  10 ruoze ruoze 4096 Nov 28 21:34 .
drwxr-xr-x.  6 root  root    71 Nov 24 19:36 ..
drwxrwxr-x   3 ruoze ruoze   48 Nov 28 21:24 app
-rw-------   1 ruoze ruoze  396 Nov 28 14:22 .bash_history
-rw-r--r--   1 ruoze ruoze   18 Nov 17 16:35 .bash_logout
-rw-r--r--   1 ruoze ruoze  193 Nov 17 16:35 .bash_profile
-rw-r--r--   1 ruoze ruoze  231 Nov 17 16:35 .bashrc
drwxrwxr-x   2 ruoze ruoze    6 Nov 27 21:34 data
drwxrwxr-x   2 ruoze ruoze    6 Nov 27 21:34 lib
drwxrwxr-x   2 ruoze ruoze    6 Nov 27 21:34 log
drwxrwxr-x   2 ruoze ruoze   42 Nov 28 21:23 software
drwxrwxr-x   2 ruoze ruoze    6 Nov 27 21:34 sourcecode
drwx------   2 ruoze ruoze   24 Nov 28 21:26 .ssh
drwxrwxr-x   2 ruoze ruoze    6 Nov 27 21:34 tmp
-rw-------   1 ruoze ruoze 2231 Nov 28 21:34 .viminfo
[ruoze@rzdata001 ~]$ cd .ssh
[ruoze@rzdata001 .ssh]$ ll
total 4
-rw-r--r-- 1 ruoze ruoze 171 Nov 28 21:26 known_hosts
[ruoze@rzdata001 .ssh]$ cd ..
[ruoze@rzdata001 ~]$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/ruoze/.ssh/id_rsa): 
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /home/ruoze/.ssh/id_rsa.
Your public key has been saved in /home/ruoze/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:mjU3RrWZ8YMEp64oeIMSSGJEqyYBreXFslzerByuFfY ruoze@rzdata001
The key's randomart image is:
+---[RSA 2048]----+
|o+ .      ..=    |
|o = +      = B   |
|oO * o    o = o  |
|B.+ = o  o     . |
|=. + =  S =      |
|o. o= E= = .     |
|. oo+ + .        |
| ... o           |
|                 |
+----[SHA256]-----+
[ruoze@rzdata001 ~]$ ll
[ruoze@rzdata001 ~]$ cd .ssh/
[ruoze@rzdata001 .ssh]$ ll
total 12
-rw------- 1 ruoze ruoze 1675 Nov 28 21:36 id_rsa
-rw-r--r-- 1 ruoze ruoze  397 Nov 28 21:36 id_rsa.pub
-rw-r--r-- 1 ruoze ruoze  171 Nov 28 21:26 known_hosts
[ruoze@rzdata001 .ssh]$ cat ./id_rsa.pub  >> authorized_keys
[ruoze@rzdata001 .ssh]$ ll
total 16
-rw-rw-r-- 1 ruoze ruoze  397 Nov 28 21:38 authorized_keys
-rw------- 1 ruoze ruoze 1675 Nov 28 21:36 id_rsa
-rw-r--r-- 1 ruoze ruoze  397 Nov 28 21:36 id_rsa.pub
-rw-r--r-- 1 ruoze ruoze  171 Nov 28 21:26 known_hosts
[ruoze@rzdata001 .ssh]$ chmod 0600 authorized_keys 
[ruoze@rzdata001 .ssh]$ ll
total 16
-rw------- 1 ruoze ruoze  397 Nov 28 21:38 authorized_keys
-rw------- 1 ruoze ruoze 1675 Nov 28 21:36 id_rsa
-rw-r--r-- 1 ruoze ruoze  397 Nov 28 21:36 id_rsa.pub
-rw-r--r-- 1 ruoze ruoze  171 Nov 28 21:26 known_hosts
[ruoze@rzdata001 .ssh]$
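
The append-and-chmod part of the steps above is easy to get subtly wrong (duplicate entries, or permissions that sshd rejects), so here is a small idempotent sketch. The function name `authorize_key` is hypothetical, and it assumes the public key file holds a single line.

```shell
# Append a public key to <ssh_dir>/authorized_keys unless it is already there,
# then apply the permissions sshd expects (700 on the directory, 600 on the file).
authorize_key() {
    local pub_file="$1"
    local ssh_dir="$2"
    local auth="$ssh_dir/authorized_keys"
    mkdir -p "$ssh_dir" || return 1
    chmod 700 "$ssh_dir" || return 1
    touch "$auth" || return 1
    # -F: fixed string, -x: whole-line match, -f: read the pattern from the key file.
    grep -qxF -f "$pub_file" "$auth" || cat "$pub_file" >> "$auth"
    chmod 600 "$auth"
}
```

A key pair can be generated without prompts with `ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa`, followed by `authorize_key ~/.ssh/id_rsa.pub ~/.ssh`. Afterwards `ssh rzdata001 date` should print the date without asking for a password.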

2.8 Hadoop environment variables

[ruoze@rzdata001 .ssh]$ cd
[ruoze@rzdata001 ~]$ vim .bashrc

# env
export HADOOP_HOME=/home/ruoze/app/hadoop
export PATH=${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin:$PATH

[ruoze@rzdata001 ~]$
[ruoze@rzdata001 ~]$ source .bashrc
[ruoze@rzdata001 ~]$ 
[ruoze@rzdata001 ~]$ which hadoop
~/app/hadoop/bin/hadoop
[ruoze@rzdata001 ~]$
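
The same two lines can be appended by a script that is safe to run more than once. A minimal sketch; `add_hadoop_env` is a made-up name and the check only looks for an existing `export HADOOP_HOME=` line.

```shell
# Append the HADOOP_HOME and PATH exports to an rc file, once.
add_hadoop_env() {
    local rc_file="$1"
    local hadoop_home="$2"
    # Nothing to do if HADOOP_HOME is already exported in this file.
    grep -q '^export HADOOP_HOME=' "$rc_file" 2>/dev/null && return 0
    {
        printf '\n# env\n'
        printf 'export HADOOP_HOME=%s\n' "$hadoop_home"
        # Single quotes keep ${HADOOP_HOME} and $PATH literal in the rc file.
        printf '%s\n' 'export PATH=${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin:$PATH'
    } >> "$rc_file"
}
```

For example: `add_hadoop_env ~/.bashrc /home/ruoze/app/hadoop`, then `source ~/.bashrc`. Pointing HADOOP_HOME at the symlink rather than the versioned directory keeps this line valid across upgrades.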

2.9 Format the NameNode

[ruoze@rzdata001 ~]$ hdfs namenode -format
·····
19/11/28 21:44:25 INFO common.Storage: Storage directory /tmp/hadoop-ruoze/dfs/name has been successfully formatted.
······
[ruoze@rzdata001 ~]$
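
Formatting is meant to be done once. Running it again on a node that already holds data gives the NameNode a new cluster ID, and existing DataNode directories will typically no longer match it. A small guard can help; this sketch assumes a formatted name directory contains `current/VERSION`, and the function name `is_formatted` is hypothetical.

```shell
# Return 0 if the given NameNode storage directory already looks formatted.
# Assumption: formatting creates <name_dir>/current/VERSION.
is_formatted() {
    [ -f "$1/current/VERSION" ]
}
```

With the storage directory from the log line above, the guarded call would be `is_formatted /tmp/hadoop-ruoze/dfs/name || hdfs namenode -format`. Note that this directory lives under /tmp, which the operating system may clean up, so it is only suitable for a learning setup.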

2.10 First start

[ruoze@rzdata001 ~]$ start-dfs.sh 
19/11/28 21:45:49 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [rzdata001]
The authenticity of host 'rzdata001 (192.168.0.3)' can't be established.
ECDSA key fingerprint is SHA256:OLqoaMxlGFbCq4sC9pYgF+FdbcXHbEbtSrnMiGGFbVw.
ECDSA key fingerprint is MD5:d3:5b:4a:ef:8e:00:41:a0:5e:80:ef:75:76:8a:a3:49.
Are you sure you want to continue connecting (yes/no)? yes
rzdata001: Warning: Permanently added 'rzdata001,192.168.0.3' (ECDSA) to the list of known hosts.
rzdata001: starting namenode, logging to /home/ruoze/app/hadoop-2.6.0-cdh5.16.2/logs/hadoop-ruoze-namenode-rzdata001.out
localhost: starting datanode, logging to /home/ruoze/app/hadoop-2.6.0-cdh5.16.2/logs/hadoop-ruoze-datanode-rzdata001.out
Starting secondary namenodes [0.0.0.0]
The authenticity of host '0.0.0.0 (0.0.0.0)' can't be established.
ECDSA key fingerprint is SHA256:OLqoaMxlGFbCq4sC9pYgF+FdbcXHbEbtSrnMiGGFbVw.
ECDSA key fingerprint is MD5:d3:5b:4a:ef:8e:00:41:a0:5e:80:ef:75:76:8a:a3:49.
Are you sure you want to continue connecting (yes/no)? yes
0.0.0.0: Warning: Permanently added '0.0.0.0' (ECDSA) to the list of known hosts.
0.0.0.0: starting secondarynamenode, logging to /home/ruoze/app/hadoop-2.6.0-cdh5.16.2/logs/hadoop-ruoze-secondarynamenode-rzdata001.out
19/11/28 21:46:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[ruoze@rzdata001 ~]$ jps
5905 NameNode
6230 SecondaryNameNode
6030 DataNode
6350 Jps
[ruoze@rzdata001 ~]$
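
Reading the `jps` output by eye works for three processes, but a scripted check is handy after every restart. A minimal sketch that reads `jps` output on stdin; the function name `missing_daemons` is made up, and the expected list is just the three HDFS daemons shown above.

```shell
# Read `jps` output on stdin and print each expected HDFS daemon that is absent.
# The exit status is 0 only when all three daemons are present.
missing_daemons() {
    local out
    local name
    local rc=0
    out=$(cat)
    for name in NameNode DataNode SecondaryNameNode; do
        # jps prints "<pid> <main class>"; compare the class name as a whole
        # field so that SecondaryNameNode does not count as NameNode.
        if ! printf '%s\n' "$out" | awk -v n="$name" '$2 == n { ok = 1 } END { exit !ok }'; then
            echo "$name"
            rc=1
        fi
    done
    return $rc
}
```

Typical use is `jps | missing_daemons && echo "HDFS is up"`. If a daemon is listed as missing, its log under the logs directory printed by start-dfs.sh is the first place to look.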

2.11 Make the DN and SNN start on rzdata001

[ruoze@rzdata001 ~]$ netstat -nltp | grep 5905
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
tcp        0      0 192.168.0.3:9000        0.0.0.0:*               LISTEN      5905/java           
tcp        0      0 0.0.0.0:50070           0.0.0.0:*               LISTEN      5905/java           
[ruoze@rzdata001 ~]$ netstat -nltp | grep 6230
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
tcp        0      0 0.0.0.0:50090           0.0.0.0:*               LISTEN      6230/java           
[ruoze@rzdata001 ~]$ netstat -nltp | grep 6030
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
tcp        0      0 0.0.0.0:50010           0.0.0.0:*               LISTEN      6030/java           
tcp        0      0 0.0.0.0:50075           0.0.0.0:*               LISTEN      6030/java           
tcp        0      0 127.0.0.1:45570         0.0.0.0:*               LISTEN      6030/java           
tcp        0      0 0.0.0.0:50020           0.0.0.0:*               LISTEN      6030/java           
[ruoze@rzdata001 ~]$
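
To pick the wildcard binds out of output like the above without scanning it by eye, the `netstat -nltp` lines can be filtered with awk. A sketch under the assumption that the columns are laid out as shown (local address in field 4, state in field 6, PID/program in field 7); the function name `wildcard_ports` is hypothetical.

```shell
# Read `netstat -nltp` output on stdin and print the ports that the given PID
# listens on with the wildcard address 0.0.0.0.
wildcard_ports() {
    awk -v pid="$1" '
        $6 == "LISTEN" && index($7, pid "/") == 1 && $4 ~ /^0\.0\.0\.0:/ {
            sub(/^0\.0\.0\.0:/, "", $4)   # keep only the port number
            print $4
        }'
}
```

For example, `netstat -nltp 2>/dev/null | wildcard_ports 6030` lists the DataNode ports that are bound to 0.0.0.0 in the output above.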

Notice that the DN and SNN are listening on 0.0.0.0. Now change them to start on rzdata001:

[ruoze@rzdata001 hadoop]$ pwd
/home/ruoze/app/hadoop/etc/hadoop
[ruoze@rzdata001 hadoop]$ vim hdfs-site.xml 
Add the following inside <configuration>:
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>rzdata001:50090</value>
    </property>
    <property>
        <name>dfs.namenode.secondary.https-address</name>
        <value>rzdata001:50091</value>
    </property>
    <property>
        <name>dfs.datanode.address</name>
        <value>rzdata001:50010</value>
    </property>
    <property>
        <name>dfs.datanode.http.address</name>
        <value>rzdata001:50075</value>
    </property>
    <property>
        <name>dfs.datanode.ipc.address</name>
        <value>rzdata001:50020</value>
    </property>
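
After editing, it is worth reading the values back before restarting. The sketch below extracts one property from a site file with awk. It assumes the one-tag-per-line layout used in this article (the `<value>` line follows the `<name>` line) and is not a general XML parser; the function name `get_prop` is made up.

```shell
# Print the <value> that follows <name>KEY</name> in a Hadoop site file.
# Assumption: <name> and <value> are on separate lines, as in the files above.
get_prop() {
    awk -v key="$2" '
        index($0, "<name>" key "</name>") { found = 1; next }
        found && match($0, /<value>[^<]*<\/value>/) {
            # Strip the 7-character opening tag and the 8-character closing tag.
            print substr($0, RSTART + 7, RLENGTH - 15)
            exit
        }' "$1"
}
```

For example: `get_prop ~/app/hadoop/etc/hadoop/hdfs-site.xml dfs.datanode.address`. With Hadoop on the PATH, `hdfs getconf -confKey dfs.datanode.address` reports the effective value instead. The new addresses only take effect after the daemons are restarted with `stop-dfs.sh` and `start-dfs.sh`.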

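2.12 Where to find the parameter files on the official site?

For this version, open: https://hadoop.apache.org/docs/r2.10.0/hadoop-project-dist/hadoop-common/SingleCluster.html
At the bottom left of the page, under the Configuration menu:
> Configuration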
hdfs-default.xml
hdfs-rbf-default.xml
mapred-default.xml
yarn-default.xml
Deprecated Properties

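2.13 Key concepts

| Concept            | Name                | Nickname              | Role                                                        | Master/slave                                                   |
|--------------------|---------------------|-----------------------|-------------------------------------------------------------|----------------------------------------------------------------|
| namenode           | Name node           | The boss              | Read and write requests go through it first                 | Master node                                                    |
| datanode           | Data node           | The worker            | Stores and retrieves data                                   | Slave node                                                     |
| secondary namenode | Secondary name node | The second-in-command | Checkpoints the namenode metadata about once an hour (h+1)  | Backup node for the master (checkpoint only, not a hot standby) |

Most big-data components use a master/slave architecture, but in HBase read and write requests do not go through the boss, i.e. the master process.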