1. Configure environment variables
2. Configure passwordless SSH
In the ~/.ssh directory:
ssh-keygen -t rsa
ssh-copy-id hadoop102
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys        append the public key (the pair generated above is RSA, so id_rsa.pub, not id_dsa.pub)
ssh hadoop102        verify that login now works without a password
hdfs namenode -format        format the NameNode (first startup only)
3. Configure the cluster
Configure the workers file (hadoop/etc/hadoop/workers), one hostname per line
Running the start scripts as root aborts with errors like:
but there is no HDFS_NAMENODE_USER defined. Aborting operation.
but there is no HDFS_SECONDARYNAMENODE_USER defined. Aborting operation.
Fix: define the user variables in hadoop/etc/hadoop/hadoop-env.sh:
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
hadoop fs -put ..... /        upload a file to the HDFS root
cat blk_1073741826 >> tmp.tar.gz        on a DataNode, concatenating a file's block files in order (found under dfs.datanode.data.dir) rebuilds the original file
start-dfs.sh/stop-dfs.sh
start-yarn.sh/stop-yarn.sh
hdfs --daemon start/stop namenode/datanode/secondarynamenode
yarn --daemon start/stop resourcemanager/nodemanager
Start a single daemon on one node: hdfs --daemon start datanode
The two cluster scripts are start-dfs.sh/stop-dfs.sh and start-yarn.sh/stop-yarn.sh (above)
Common port numbers
HDFS
NameNode internal RPC port: 8020 / 9000 / 9820
NameNode web UI: 9870  http://hadoop101:9870
YARN web UI (view running jobs): 8088  http://hadoop102:8088
JobHistory server: 19888
Main configuration files:
mapred-site.xml
core-site.xml
hdfs-site.xml
yarn-site.xml
workers
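As a sketch, a minimal core-site.xml mainly sets the default filesystem; the hostname, port, and data directory below are assumptions, not values from these notes:
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop102:8020</value>  <!-- assumed NameNode host:port -->
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/module/hadoop-3.1.3/data</value>  <!-- assumed data directory -->
    </property>
</configuration>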
HDFS
HDFS shell operations
hadoop fs -mkdir /sanguo        create a directory under the root
hadoop fs -moveFromLocal ./shuguo.txt /sanguo        upload and delete the local shuguo.txt
hadoop fs -copyFromLocal ./shuguo.txt /sanguo        upload, keeping the local copy
hadoop fs -put ./shuguo.txt /sanguo        upload (-put is used more in production)
hadoop fs -appendToFile liubei.txt /sanguo/shuguo.txt        append a local file to an HDFS file
Download operations
hadoop fs -copyToLocal /sanguo/shuguo.txt ./
hadoop fs -get /sanguo/shuguo.txt ./        same as the above
Direct HDFS operations
hadoop fs -ls /sanguo
hadoop fs -cat /jinguo/shuguo.txt        view file contents
hadoop fs -cp /sanguo/shuguo.txt /jinguo        copy within HDFS
hadoop fs -mv /sanguo/weiguo.txt /jinguo        move within HDFS
hadoop fs -du -s -h /jinguo        total size of the directory
hadoop fs -du -h /jinguo        size of every file in the directory
hadoop fs -setrep 10 /jinguo/weiguo.txt        set the replication factor (actual replicas are capped by the number of DataNodes)
Full option list (as printed by hadoop fs -help):
hadoop fs [generic options]
[-appendToFile <localsrc> ... <dst>]        e.g. -appendToFile /localpath/a.txt /hdfs/b.txt  append a local file to an HDFS file
[-cat [-ignoreCrc] <src> ...]        e.g. -cat /sanguo/shuguo.txt  print file contents
[-checksum <src> ...]
[-chgrp [-R] GROUP PATH...]
[-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]        e.g. -chmod -R 777 /path  (-R = recursive)
[-chown [-R] [OWNER][:[GROUP]] PATH...]
[-copyFromLocal [-f] [-p] [-l] [-d] [-t <thread count>] <localsrc> ... <dst>]
[-copyToLocal [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
[-count [-q] [-h] [-v] [-t [<storage type>]] [-u] [-x] [-e] <path> ...]
[-cp [-f] [-p | -p[topax]] [-d] <src> ... <dst>]
[-createSnapshot <snapshotDir> [<snapshotName>]]
[-deleteSnapshot <snapshotDir> <snapshotName>]
[-df [-h] [<path> ...]]
[-du [-s] [-h] [-v] [-x] <path> ...]
[-expunge]
[-find <path> ... <expression> ...]
[-get [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
[-getfacl [-R] <path>]
[-getfattr [-R] {-n name | -d} [-e en] <path>]
[-getmerge [-nl] [-skip-empty-file] <src> <localdst>]
[-head <file>]
[-help [cmd ...]]
[-ls [-C] [-d] [-h] [-q] [-R] [-t] [-S] [-r] [-u] [-e] [<path> ...]]
[-mkdir [-p] <path> ...]
[-moveFromLocal <localsrc> ... <dst>]
[-moveToLocal <src> <localdst>]
[-mv <src> ... <dst>]
[-put [-f] [-p] [-l] [-d] <localsrc> ... <dst>]
[-renameSnapshot <snapshotDir> <oldName> <newName>]
[-rm [-f] [-r|-R] [-skipTrash] [-safely] <src> ...] -rm -f /path
[-rmdir [--ignore-fail-on-non-empty] <dir> ...]
[-setfacl [-R] [{-b|-k} {-m|-x <acl_spec>} <path>]|[--set <acl_spec> <path>]]
[-setfattr {-n name [-v value] | -x name} <path>]
[-setrep [-R] [-w] <rep> <path> ...]
[-stat [format] <path> ...]
[-tail [-f] [-s <sleep interval>] <file>]
[-test -[defsz] <path>]
[-text [-ignoreCrc] <src> ...]
[-touch [-a] [-m] [-t TIMESTAMP ] [-c] <path> ...]
[-touchz <path> ...]
[-truncate [-w] <length> <path> ...]
[-usage [cmd ...]]
HDFS API operations (Java)
Sample code: https://gitee.com/hujf2017/hadoop.git
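A minimal sketch of the Java FileSystem API (the NameNode address hdfs://hadoop102:8020 and the root user are assumptions consistent with the setup above, not taken from the repo):
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientDemo {
    public static void main(String[] args) throws Exception {
        // connect to the NameNode; host, port, and user are assumptions
        FileSystem fs = FileSystem.get(
                new URI("hdfs://hadoop102:8020"), new Configuration(), "root");
        fs.mkdirs(new Path("/sanguo"));                                    // like hadoop fs -mkdir
        fs.copyFromLocalFile(new Path("shuguo.txt"), new Path("/sanguo")); // like hadoop fs -put
        fs.close();
    }
}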
HDFS read/write flow
Read flow: the client asks the NameNode for block locations, then streams each block directly from the nearest DataNode
Write flow: the client asks the NameNode for target DataNodes, then writes packets along a DataNode pipeline, which replicates as it receives
Rack awareness: by default the first replica goes on the writer's node, the second on a node in a different rack, and the third on another node of that second rack
NameNode and SecondaryNameNode
Working mechanism
Where does the NameNode store its metadata?
1. Disk only: too slow for frequent access
2. Memory only: fast but lost on a crash
3. Actual design: an fsimage snapshot on disk plus an append-only Edits log (HDFS modifies files poorly, so changes are appended rather than rewritten)
Every metadata change is appended to Edits
On startup, fsimage and Edits are loaded into memory and the Edits log is replayed
Because merging fsimage and Edits on the NameNode itself is too slow, the SecondaryNameNode (2NN) was introduced: it periodically pulls both files and performs the merge (checkpoint)
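How often the 2NN checkpoints is configurable in hdfs-site.xml; the values below are the documented defaults (checkpoint every hour, or earlier once 1,000,000 transactions accumulate):
<property>
    <name>dfs.namenode.checkpoint.period</name>
    <value>3600</value>  <!-- seconds -->
</property>
<property>
    <name>dfs.namenode.checkpoint.txns</name>
    <value>1000000</value>
</property>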
Viewing the fsimage and Edits files
1. cd into the directory that contains them (dfs/name/current under the NameNode data dir)
hdfs oiv -p XML -i <fsimage file> -o <output path>
hdfs oiv -p XML -i fsimage_0000000000000000382 -o /opt/module/fsimage.xml
hdfs oev -p XML -i edits_inprogress_0000000000000000383 -o /opt/module/log.xml        Edits files need oev (offline edits viewer), not oiv
MapReduce
MapReduce is a programming framework: it solves the low-level communication and multithreading for you
Advantages:
1. Easy to program: implement the framework's interfaces and care only about business logic
2. Good scalability: add servers dynamically when compute capacity runs short
3. High fault tolerance: a task that fails mid-computation is transferred to another node automatically
4. Suited to massive data (TB/PB) computed by thousands of servers together
Disadvantages:
1. Not good at real-time, millisecond-level queries: MySQL is better there
2. Not good at stream processing: use Spark Streaming or Flink
3. Not good at DAG computation (many chained jobs): Spark, which iterates in memory, is better
sz hadoop-mapreduce-examples-3.1.3.jar        download the file to the local machine (lrzsz)
WordCount in two steps (a Mapper sketch follows the command below):
1. extends Mapper: the map method is called once per input line, receiving the line's byte offset and content; iterate over the content inside it
2. Package the job and run it; the driver must be given by its fully qualified class name:
hadoop jar wc.jar com.hujf.mapreduce.wordcount.driver.WordCountDriver /input /output4
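A minimal Mapper sketch for step 1 (the class name and whitespace tokenization are illustrative, not taken from the repo above):
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final Text word = new Text();
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key = byte offset of this line, value = the line's content
        for (String w : value.toString().split("\\s+")) {
            if (!w.isEmpty()) {
                word.set(w);
                context.write(word, ONE); // emit (word, 1); the reducer sums the 1s
            }
        }
    }
}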
CombineTextInputFormat
Virtual-storage split mechanism: many small files are packed into shared splits (up to the max split size) instead of one split per file
job.setInputFormatClass(CombineTextInputFormat.class);
// set the maximum split size (4194304 bytes = 4 MB)
CombineTextInputFormat.setMaxInputSplitSize(job, 4194304);
Shuffle
Mechanism: the data processing after the map method and before the reduce method is called Shuffle
Map output enters a circular in-memory buffer (index on one side, data on the other); at 80% full it spills to disk while new output is written in the reverse direction, so the map never has to wait
Sorting during spill: quicksort, applied to the index rather than the data itself, in lexicographic key order
After the map:
Partitioning: a Partitioner decides which reduce task each key-value pair goes to
Ordering: keys implement WritableComparable so they can be compared
Pre-aggregation before reduce: a Combiner runs on each MapTask for local aggregation; safe for sums, but wrong for averages (avg(avg(1,2), avg(3)) = 2.25 while avg(1,2,3) = 2); a Combiner sketch follows below
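A minimal sum Combiner sketch (class name illustrative); a sum combines safely because partial sums still add up, which is exactly what an average lacks:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable sum = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable v : values) {
            total += v.get(); // local per-MapTask aggregation
        }
        sum.set(total);
        context.write(key, sum);
    }
}
Registered in the driver with job.setCombinerClass(WordCountCombiner.class);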
YARN commands (a few common examples below)
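Standard yarn CLI examples (the application ID is a placeholder):
yarn application -list        list running applications
yarn application -kill <application_id>        kill an application
yarn logs -applicationId <application_id>        fetch aggregated logs for a finished job
yarn node -list        list NodeManagers and their status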
YARN parameters
YARN production tuning
mapred --daemon stop historyserver        stop the job history server (the daemon name is historyserver, lowercase)
Setting up an HBase cluster
1. Edit hbase-site.xml (a minimal sketch follows below)
2. Edit conf/regionservers: list the region server hostnames, one per line
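A minimal hbase-site.xml sketch for a fully distributed cluster; the hostnames and HDFS path are assumptions consistent with the hosts used above:
<configuration>
    <property>
        <name>hbase.rootdir</name>
        <value>hdfs://hadoop102:8020/hbase</value>  <!-- assumed NameNode address -->
    </property>
    <property>
        <name>hbase.cluster.distributed</name>
        <value>true</value>
    </property>
    <property>
        <name>hbase.zookeeper.quorum</name>
        <value>hadoop102,hadoop103,hadoop104</value>  <!-- assumed ZooKeeper hosts -->
    </property>
</configuration>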