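Article preview:
- 1. Pre-deployment preparation
  - 1. Create the user and directories, and set up the host mapping
  - 2. Deploy the JDK
  - 3. Set the hadoop user's personal environment variables
- 2. Deployment
  - 1. Unpack and create a symlink
  - 2. Deployment modes
  - 3. Modify the hadoop-env.sh file
  - 4. Configure passwordless SSH
  - 5. Configure the NameNode process to start on the host hadoop
  - 6. Configure the Secondary NameNode process to start on the host hadoop
  - 7. Configure the DataNode process to start on the host hadoop
  - 8. Deploy YARN
    - 1. Create the mapred-site.xml file from its template
    - 2. Modify the mapred-site.xml file
    - 3. Modify the yarn-site.xml file
  - 9. Format the NameNode
  - 10. Start and stop
  - 11. Test
    - 1. Upload a file
    - 2. Run the wordcount example
  - 12. Pitfalls to watch for
    - 1. Modify the core-site.xml file to change the data storage path
    - 2. Migrate the files that already exist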
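1. Pre-deployment preparation
1. Create the user and directories, and set up the host mapping
# As root: create the user and add the host mapping (/etc/hosts is writable only by root)
useradd hadoop
echo "192.168.63.130 hadoop" >> /etc/hosts
# Then switch to the hadoop user and create the working directories in its home
su - hadoop
mkdir sourcecode software app log lib data tmp shell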
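2. Deploy the JDK
Before deploying the JDK, check which JDK versions this Hadoop release supports on the following page:
https://cwiki.apache.org/confluence/display/HADOOP2/HadoopJavaVersions
Create the Java directory: mkdir /usr/java
Unpack the JDK into this directory.
Set the JDK environment variables.
Verify that the JDK was deployed successfully: which java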
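A minimal sketch of the JDK environment variables, assuming the JDK unpacked to /usr/java/jdk1.8.0_45 (a hypothetical version directory; use whatever directory the archive actually created). To make the settings permanent, the same two export lines would typically go into /etc/profile.

```shell
# Assumed install directory; replace it with the directory the JDK unpacked to.
export JAVA_HOME=/usr/java/jdk1.8.0_45
# Put the JDK's bin directory first so that "which java" resolves to it.
export PATH="$JAVA_HOME/bin:$PATH"
```

After this, which java should print a path under $JAVA_HOME/bin, provided the JDK really is in that directory.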
3. Set the hadoop user's personal environment variables
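One possible form of these variables, as a sketch: it assumes Hadoop will be reachable at /home/hadoop/app/hadoop (the path used later in this article) and that the lines are added to the hadoop user's ~/.bashrc. The variable name HADOOP_HOME is the conventional one, not something the text specifies.

```shell
# Assumed location of the Hadoop installation (the symlink under ~/app).
export HADOOP_HOME=/home/hadoop/app/hadoop
# Make the hadoop/hdfs/yarn commands (bin) and the start/stop scripts (sbin) available.
export PATH="$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH"
```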
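2. Deployment
1. Unpack and create a symlink
Unpack Hadoop into the app directory and create a symlink to it. The symlink has two benefits. First, version switching: scripts and applications are configured against the hadoop link, so switching versions is transparent to them. Second, symlinks help in other situations too, such as moving from a small disk to a large one: when the original small disk runs out of space, a symlink pointing to the large disk solves the capacity problem.
After unpacking, list the directory with ll. Only bin, sbin, and etc matter here: bin holds the executable commands, sbin holds the start and stop scripts, and etc holds the configuration files.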
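The version-switching benefit can be illustrated with a small sketch. It runs in a scratch directory rather than ~/app, and hadoop-next is a hypothetical name for a newer release; on the real machine the first link would be created with something like ln -s ~/app/hadoop-2.6.0-cdh5.16.2 ~/app/hadoop after unpacking the tarball.

```shell
demo=$(mktemp -d)                                   # stand-in for ~/app
mkdir "$demo/hadoop-2.6.0-cdh5.16.2" "$demo/hadoop-next"

# Initial deployment: everything else refers to $demo/hadoop only.
ln -s "$demo/hadoop-2.6.0-cdh5.16.2" "$demo/hadoop"
first_target=$(readlink "$demo/hadoop")

# Version switch: repoint the link; -n replaces the link itself
# instead of creating a new link inside the directory it points to.
ln -sfn "$demo/hadoop-next" "$demo/hadoop"
second_target=$(readlink "$demo/hadoop")

echo "before: $first_target"
echo "after:  $second_target"
```

Scripts that reference the link path keep working unchanged after the switch, which is the point of the indirection.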
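2. Deployment modes
Local (Standalone) Mode: one machine; more precisely, all components run in a single Java process.
Pseudo-Distributed Mode: one machine; the components run in multiple Java processes.
Fully-Distributed Mode (cluster): multiple machines; the components run in multiple Java processes.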
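3. Modify the hadoop-env.sh file
Set JAVA_HOME explicitly. Hadoop has a bug where it fails to read the JAVA_HOME configured in the outer environment variables, so the value must be written into hadoop-env.sh directly. The file is in the ~/app/hadoop/etc/hadoop/ directory.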
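A sketch of the edit, under two assumptions: the stock file contains the line export JAVA_HOME=${JAVA_HOME}, and the JDK lives in /usr/java/jdk1.8.0_45 (a hypothetical directory). The example works on a scratch copy so that it does not touch a real configuration; on the real machine the target would be ~/app/hadoop/etc/hadoop/hadoop-env.sh.

```shell
env_file=$(mktemp)                                  # stand-in for hadoop-env.sh
echo 'export JAVA_HOME=${JAVA_HOME}' > "$env_file"  # assumed stock line

# Replace the indirect reference with an absolute path (assumed JDK directory).
sed 's|^export JAVA_HOME=.*|export JAVA_HOME=/usr/java/jdk1.8.0_45|' \
  "$env_file" > "$env_file.new" && mv "$env_file.new" "$env_file"

grep '^export JAVA_HOME=' "$env_file"
```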
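4. Configure passwordless SSH
The batch start and stop scripts for HDFS and YARN connect over ssh, so without passwordless login every call prompts for a password. After configuring key-based login, verify it with ssh hadoop@hadoop date.
The first time you verify ssh after configuring passwordless login, you must type yes to trust the host; after that you are no longer asked.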
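The text does not show the commands, so the following is only a sketch of one common way to set up key-based login for the current user, assuming OpenSSH is installed and that an RSA key without a passphrase is acceptable. It is meant to be run as the hadoop user.

```shell
mkdir -p "$HOME/.ssh" && chmod 700 "$HOME/.ssh"

# Generate a key pair with an empty passphrase unless one already exists.
[ -f "$HOME/.ssh/id_rsa" ] || ssh-keygen -q -t rsa -N '' -f "$HOME/.ssh/id_rsa"

# Authorize the public key for logins to this same account.
cat "$HOME/.ssh/id_rsa.pub" >> "$HOME/.ssh/authorized_keys"

# sshd ignores authorized_keys if its permissions are too open.
chmod 600 "$HOME/.ssh/authorized_keys"
```

Afterwards, ssh hadoop@hadoop date should print the date without asking for a password.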
5. Configure the NameNode process to start on the host hadoop
Modify the core-site.xml file, located in /home/hadoop/app/hadoop/etc/hadoop.
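A minimal sketch of what this setting commonly looks like. fs.defaultFS is the standard property for the NameNode address; the port 9000 is an assumption, since the text does not state it. The file is written to a scratch directory here instead of the real configuration directory.

```shell
conf_dir=$(mktemp -d)   # stand-in for /home/hadoop/app/hadoop/etc/hadoop
cat > "$conf_dir/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <!-- hostname "hadoop" comes from /etc/hosts; port 9000 is assumed -->
    <value>hdfs://hadoop:9000</value>
  </property>
</configuration>
EOF
```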
6. Configure the Secondary NameNode process to start on the host hadoop
Modify the /home/hadoop/app/hadoop/etc/hadoop/hdfs-site.xml file.
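As a sketch, binding the Secondary NameNode to the hostname usually means setting its HTTP and HTTPS addresses. The ports 50090 and 50091 are the Hadoop 2.x defaults and are assumed to be unchanged here; the file is written to a scratch directory.

```shell
conf_dir=$(mktemp -d)   # stand-in for /home/hadoop/app/hadoop/etc/hadoop
cat > "$conf_dir/hdfs-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>hadoop:50090</value>
  </property>
  <property>
    <name>dfs.namenode.secondary.https-address</name>
    <value>hadoop:50091</value>
  </property>
</configuration>
EOF
```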
7. Configure the DataNode process to start on the host hadoop
Modify the /home/hadoop/app/hadoop/etc/hadoop/slaves file.
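The slaves file lists one worker hostname per line. For this single-machine setup it would presumably contain only the hostname hadoop; the sketch below writes it to a scratch directory.

```shell
conf_dir=$(mktemp -d)   # stand-in for /home/hadoop/app/hadoop/etc/hadoop
# One DataNode host per line; here the only host is "hadoop".
echo "hadoop" > "$conf_dir/slaves"
```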
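8. Deploy YARN
1. Create the mapred-site.xml file from its template
If the directory contains only mapred-site.xml.template and no mapred-site.xml, copy the template and name the copy mapred-site.xml: cp mapred-site.xml.template mapred-site.xml
2. Modify the mapred-site.xml file
Add the configuration shown in the figure below.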
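For reference, the setting normally needed at this step tells MapReduce to run on YARN. This is a sketch of that standard property, written to a scratch directory; whether the original figure contained further properties is not known.

```shell
conf_dir=$(mktemp -d)   # stand-in for /home/hadoop/app/hadoop/etc/hadoop
cat > "$conf_dir/mapred-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <!-- run MapReduce jobs on YARN instead of the local runner -->
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
EOF
```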
3. Modify the yarn-site.xml file
Add the configuration shown in the figure below.
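A sketch of the usual minimum for yarn-site.xml: enabling the shuffle auxiliary service that MapReduce jobs need on each NodeManager. The file is written to a scratch directory, and the original figure may have included additional properties.

```shell
conf_dir=$(mktemp -d)   # stand-in for /home/hadoop/app/hadoop/etc/hadoop
cat > "$conf_dir/yarn-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <!-- shuffle service required by MapReduce on YARN -->
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
EOF
```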
9. Format the NameNode
[hadoop@hadoop bin]$ /home/hadoop/app/hadoop/bin/hdfs namenode -format
20/12/02 04:24:11 INFO util.GSet: Computing capacity for map NameNodeRetryCache
20/12/02 04:24:11 INFO util.GSet: VM type = 64-bit
20/12/02 04:24:11 INFO util.GSet: 0.029999999329447746% max memory 966.7 MB = 297.0 KB
20/12/02 04:24:11 INFO util.GSet: capacity = 2^15 = 32768 entries
20/12/02 04:24:11 INFO namenode.FSNamesystem: ACLs enabled? false
20/12/02 04:24:11 INFO namenode.FSNamesystem: XAttrs enabled? true
20/12/02 04:24:11 INFO namenode.FSNamesystem: Maximum size of an xattr: 16384
20/12/02 04:24:11 INFO namenode.FSImage: Allocated new BlockPoolId: BP-1081082838-192.168.63.130-1606901051699
20/12/02 04:24:11 INFO common.Storage: Storage directory /tmp/hadoop-hadoop/dfs/name has been successfully formatted.
20/12/02 04:24:11 INFO namenode.FSImageFormatProtobuf: Saving image file /tmp/hadoop-hadoop/dfs/name/current/fsimage.ckpt_0000000000000000000 using no compression
20/12/02 04:24:11 INFO namenode.FSImageFormatProtobuf: Image file /tmp/hadoop-hadoop/dfs/name/current/fsimage.ckpt_0000000000000000000 of size 323 bytes saved in 0 seconds .
20/12/02 04:24:11 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
20/12/02 04:24:11 INFO util.ExitUtil: Exiting with status 0
20/12/02 04:24:11 INFO namenode.NameNode: SHUTDOWN_MSG:
The line "Storage directory /tmp/hadoop-hadoop/dfs/name has been successfully formatted." in the output indicates that formatting succeeded.
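10. Start and stop
Start:
If jps lists all 5 daemon processes, the deployment succeeded. Also note the order in which the Hadoop components start and stop.
HDFS components start in the order: NameNode, DataNode, Secondary NameNode
YARN components start in the order: ResourceManager, NodeManager
HDFS components stop in the order: NameNode, DataNode, Secondary NameNode
YARN components stop in the order: NodeManager, ResourceManager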
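The jps check can be scripted. The sketch below assumes the five processes are NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager, and that the daemons are started with the start-dfs.sh and start-yarn.sh scripts from sbin. The helper name missing_daemons and the sample PIDs are hypothetical.

```shell
# Assumed daemon names for a pseudo-distributed HDFS + YARN setup.
expected="NameNode DataNode SecondaryNameNode ResourceManager NodeManager"

# Reads jps-style output ("<pid> <name>" per line) on stdin and prints
# the expected daemons that are missing (an empty line if none are).
missing_daemons() {
  out=$(cat)
  missing=""
  for d in $expected; do
    # -w matches whole words, so "NameNode" does not match "SecondaryNameNode".
    printf '%s\n' "$out" | grep -qw "$d" || missing="$missing $d"
  done
  echo "${missing# }"
}

# On the real machine:
#   start-dfs.sh && start-yarn.sh
#   jps | missing_daemons
#   stop-yarn.sh && stop-dfs.sh
sample='2101 NameNode
2203 DataNode
2388 SecondaryNameNode
2544 ResourceManager
2650 NodeManager
2977 Jps'
result=$(printf '%s\n' "$sample" | missing_daemons)
```

An empty result means every expected daemon was found in the listing.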
11. Test
Create a test file
1. Upload a file
[hadoop@hadoop data]$ mkdir input   # creates the local directory /home/hadoop/data/input
Edit a file named 1.log in that directory, then upload 1.log to HDFS.
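The upload step might look like the following sketch. The HDFS path /wordcount/input is taken from the wordcount command in the next step; the file content is a hypothetical sample, since the original 1.log is not shown. The local file is created in a scratch directory, and the HDFS commands run only if the hdfs client is on the PATH.

```shell
input_dir=$(mktemp -d)   # stand-in for /home/hadoop/data/input

# Hypothetical sample content for 1.log.
printf 'hello hadoop\nhello yarn\nhello hdfs\n' > "$input_dir/1.log"

# Upload only when the hdfs client is available (requires a running HDFS).
if command -v hdfs >/dev/null 2>&1; then
  hdfs dfs -mkdir -p /wordcount/input
  hdfs dfs -put -f "$input_dir/1.log" /wordcount/input/
  hdfs dfs -ls /wordcount/input
fi
```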
2. Run the wordcount example
The command is:
[hadoop@hadoop input]$ hadoop jar \
> ~/app/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0-cdh5.16.2.jar \
> wordcount /wordcount/input/ /wordcount/output/
The run produces the following output:
20/12/03 00:15:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/12/03 00:15:11 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
20/12/03 00:15:12 INFO input.FileInputFormat: Total input paths to process : 1
20/12/03 00:15:12 INFO mapreduce.JobSubmitter: number of splits:1
20/12/03 00:15:12 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1606972097533_0001
20/12/03 00:15:13 INFO impl.YarnClientImpl: Submitted application application_1606972097533_0001
20/12/03 00:15:13 INFO mapreduce.Job: The url to track the job: http://hadoop:8088/proxy/application_1606972097533_0001/
20/12/03 00:15:13 INFO mapreduce.Job: Running job: job_1606972097533_0001
20/12/03 00:15:26 INFO mapreduce.Job: Job job_1606972097533_0001 running in uber mode : false
20/12/03 00:15:26 INFO mapreduce.Job: map 0% reduce 0%
20/12/03 00:15:35 INFO mapreduce.Job: map 100% reduce 0%
20/12/03 00:15:42 INFO mapreduce.Job: map 100% reduce 100%
20/12/03 00:15:42 INFO mapreduce.Job: Job job_1606972097533_0001 completed successfully
20/12/03 00:15:43 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=87
FILE: Number of bytes written=286155
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=150
HDFS: Number of bytes written=53
HDFS: Number of read operations=6
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=6844
Total time spent by all reduces in occupied slots (ms)=3981
Total time spent by all map tasks (ms)=6844
Total time spent by all reduce tasks (ms)=3981
Total vcore-milliseconds taken by all map tasks=6844
Total vcore-milliseconds taken by all reduce tasks=3981
Total megabyte-milliseconds taken by all map tasks=7008256
Total megabyte-milliseconds taken by all reduce tasks=4076544
Map-Reduce Framework
Map input records=6
Map output records=10
Map output bytes=85
Map output materialized bytes=87
Input split bytes=105
Combine input records=10
Combine output records=7
Reduce input groups=7
Reduce shuffle bytes=87
Reduce input records=7
Reduce output records=7
Spilled Records=14
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=289
CPU time spent (ms)=2710
Physical memory (bytes) snapshot=372260864
Virtual memory (bytes) snapshot=5457395712
Total committed heap usage (bytes)=352456704
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=45
File Output Format Counters
Bytes Written=53
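Open http://hadoop:8088 in a browser; a status of SUCCEEDED means the wordcount example ran successfully.
You can also view the output of the wordcount example to confirm that the word counts were produced.
12. Pitfalls to watch for
Storing data under /tmp/hadoop-hadoop is a poor choice. (The default storage location is /tmp/hadoop-${user.name}; see the official documentation: https://hadoop.apache.org/docs/r2.10.1/hadoop-project-dist/hadoop-common/core-default.xml)
Files and directories under /tmp that have not been accessed for 30 days are deleted according to the system's cleanup rules, so in production do not keep data under /tmp.
1. Modify the core-site.xml file to change the data storage path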
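A sketch of the property to add, assuming the new location is /home/hadoop/tmp (the tmp directory created in the hadoop user's home during preparation). In a real core-site.xml this property would sit next to the ones already present; here it is written alone to a scratch directory.

```shell
conf_dir=$(mktemp -d)   # stand-in for /home/hadoop/app/hadoop/etc/hadoop
cat > "$conf_dir/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- keep the existing properties of the real file; only this one is new -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/tmp</value>
  </property>
</configuration>
EOF
```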
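2. Migrate the files that already exist
In the official default configuration, the following three parameters all reference hadoop.tmp.dir. Once hadoop.tmp.dir is changed to /home/hadoop/tmp, the storage locations of the NameNode, the DataNode, and the checkpoint (Secondary NameNode) change accordingly.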
dfs.namenode.name.dir --> file://${hadoop.tmp.dir}/dfs/name
dfs.datanode.data.dir --> file://${hadoop.tmp.dir}/dfs/data
dfs.namenode.checkpoint.dir --> file://${hadoop.tmp.dir}/dfs/namesecondary
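The migration itself can be sketched as a move of the old directory's contents into the new one, done while the daemons are stopped. The example uses scratch directories as stand-ins for /tmp/hadoop-hadoop and /home/hadoop/tmp, so the paths and the fake directory tree are illustrative only.

```shell
old="$(mktemp -d)/hadoop-hadoop"   # stand-in for /tmp/hadoop-hadoop
new=$(mktemp -d)                   # stand-in for /home/hadoop/tmp

# Fake the layout produced by the three default parameters.
mkdir -p "$old/dfs/name" "$old/dfs/data" "$old/dfs/namesecondary"

# On the real machine, stop YARN and HDFS before moving anything.
mv "$old"/* "$new"/

ls "$new/dfs"
```

After the move, the daemons can be started again with hadoop.tmp.dir pointing at the new location.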
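In production, however, the NameNode and DataNode storage directories are usually fixed and do not depend on the ${hadoop.tmp.dir} parameter. For example, the NameNode directory might be /data01/dfs/nn and the DataNode directories /data01/dfs/dn,/data02/dfs/dn,/data03/dfs/dn. Configuring several disks provides more storage space and more efficient I/O, because reads and writes proceed in parallel and are faster than on a single disk.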