Background

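Apache Hadoop 3.0.0 was officially released on December 13, 2017, with built-in support for Alibaba Cloud OSS (Object Storage Service) as a Hadoop-compatible file system. The Hadoop 2.9.x series and later releases also support OSS. Earlier Apache Hadoop releases, however, have no official OSS support. This article describes how to use a support package so that Hadoop 2.7.2 can read from and write to OSS.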

How to use

Perform the following steps on every Hadoop node.

Download the support package

http://gosspublic.alicdn.com/hadoop-spark/hadoop-oss-2.7.2.tar.gz

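Extract the support package. It contains the following file:

[root@apache hadoop-oss-2.7.2]# ls -lh
total 3.1M
-rw-r--r-- 1 root root 3.1M Feb 28 17:01 hadoop-aliyun-2.7.2.jar

The package was built from the Hadoop 2.7.2 source with the Apache Hadoop OSS support patch applied. Support packages for other minor versions will be provided later.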
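The download and extraction can be scripted roughly as follows. This is only a sketch: the function name fetch_support_package and the use of curl are assumptions made for illustration; only the URL comes from this article.

```shell
# URL of the support package, as given in this article.
PKG_URL="http://gosspublic.alicdn.com/hadoop-spark/hadoop-oss-2.7.2.tar.gz"

# Download the archive into directory $1 (unless it is already there)
# and unpack it in place.
fetch_support_package() {
  local dest="$1"
  local archive="$dest/$(basename "$PKG_URL")"
  mkdir -p "$dest" || return 1
  # Skip the download when the archive is already present.
  [ -f "$archive" ] || curl -fSL -o "$archive" "$PKG_URL" || return 1
  tar -xzf "$archive" -C "$dest"
}

# Example (hypothetical target directory):
#   fetch_support_package /opt/oss-support
```

Because the download is skipped when the archive already exists, the function can be re-run on a node without fetching the file again.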

Deploy

First, copy hadoop-aliyun-2.7.2.jar into the $HADOOP_HOME/share/hadoop/tools/lib/ directory.

Then edit $HADOOP_HOME/libexec/hadoop-config.sh and add the following line at line 327:

CLASSPATH=$CLASSPATH:$TOOL_PATH

This change puts $HADOOP_HOME/share/hadoop/tools/lib/ on the Hadoop CLASSPATH. For reference, here is the diff of the file before and after the change (hadoop-config.sh.bak is the file before the change):

[root@apache hadoop-2.7.2]# diff -C 3 libexec/hadoop-config.sh.bak libexec/hadoop-config.sh
*** libexec/hadoop-config.sh.bak 2019-03-01 10:35:59.629136885 +0800
--- libexec/hadoop-config.sh 2019-02-28 16:33:39.661707800 +0800
***************
*** 325,330 ****
--- 325,332 ----
    CLASSPATH=${CLASSPATH}:$HADOOP_MAPRED_HOME/$MAPRED_DIR'/*'
  fi
  
+ CLASSPATH=$CLASSPATH:$TOOL_PATH
+
  # Add the user-specified CLASSPATH via HADOOP_CLASSPATH
  # Add it first or last depending on if user has
  # set env-var HADOOP_USER_CLASSPATH_FIRST
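The deployment steps can be wrapped in two small helpers, one to copy the jar and one to confirm that tools/lib ended up on the classpath. This is a minimal sketch: the function names are hypothetical, and it assumes HADOOP_HOME is set on the node. The hadoop classpath command used in the usage comment prints the classpath that Hadoop will actually use.

```shell
# Copy the support jar into tools/lib under the given Hadoop home.
# Arguments: path to the jar, Hadoop home (defaults to $HADOOP_HOME).
install_oss_jar() {
  local jar="$1"
  local hadoop_home="${2:-$HADOOP_HOME}"
  local lib_dir="$hadoop_home/share/hadoop/tools/lib"
  if [ ! -d "$lib_dir" ]; then
    echo "missing directory: $lib_dir" >&2
    return 1
  fi
  cp "$jar" "$lib_dir/"
}

# Succeed if the colon-separated classpath in $1 has a tools/lib entry.
classpath_has_tools_lib() {
  printf '%s\n' "$1" | tr ':' '\n' | grep -q 'share/hadoop/tools/lib'
}

# On a node, after editing hadoop-config.sh:
#   install_oss_jar hadoop-aliyun-2.7.2.jar
#   classpath_has_tools_lib "$(hadoop classpath)" && echo "tools/lib is on the classpath"
```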

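Add the OSS configuration

Edit core-site.xml and add the following properties:

Property                          Value                                                Description
fs.oss.endpoint                   e.g. oss-cn-zhangjiakou-internal.aliyuncs.com        Endpoint to connect to
fs.oss.accessKeyId                -                                                    Access key ID
fs.oss.accessKeySecret            -                                                    Access key secret
fs.oss.impl                       org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem  Hadoop OSS file system implementation class; currently fixed to this value
fs.oss.buffer.dir                 /tmp/oss                                             Directory for temporary files
fs.oss.connection.secure.enabled  false                                                Whether to use HTTPS; set as needed. Enabling HTTPS reduces performance
fs.oss.connection.maximum         2048                                                 Number of connections to OSS; set as needed

Explanations of these parameters can be found in the hadoop-aliyun documentation linked at the end of this article.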
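To avoid typing the XML by hand, the property elements can be generated with a short script. This is a sketch under stated assumptions: the function names oss_property and oss_properties are hypothetical, the fixed values are the ones from the table, and values are written without XML escaping.

```shell
# Print one <property> element for core-site.xml.
# Values are inserted as-is, so they must not contain &, < or >.
oss_property() {
  printf '  <property>\n    <name>%s</name>\n    <value>%s</value>\n  </property>\n' "$1" "$2"
}

# Print the seven OSS properties listed in the table.
# Arguments: endpoint, access key ID, access key secret.
oss_properties() {
  oss_property fs.oss.endpoint "$1"
  oss_property fs.oss.accessKeyId "$2"
  oss_property fs.oss.accessKeySecret "$3"
  oss_property fs.oss.impl org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem
  oss_property fs.oss.buffer.dir /tmp/oss
  oss_property fs.oss.connection.secure.enabled false
  oss_property fs.oss.connection.maximum 2048
}

# Example:
#   oss_properties oss-cn-zhangjiakou-internal.aliyuncs.com YourAccessKeyId YourAccessKeySecret
```

The output is meant to be pasted inside the configuration element of core-site.xml; the script does not modify the file itself.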

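Restart the cluster and verify OSS reads and writes

After adding the configuration, restart the cluster. Once it is back up, test access to OSS:

# Test a write
hadoop fs -mkdir oss://{your-bucket-name}/hadoop-test
# Test a read
hadoop fs -ls oss://{your-bucket-name}/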
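A slightly stronger check writes a small file to OSS, reads it back, and compares the contents. The sketch below is illustrative only: the function name oss_round_trip and the object name round-trip.txt are hypothetical, and it relies on the standard hadoop fs subcommands -mkdir, -put, -cat and -rm.

```shell
# Write a small file to OSS, read it back, and compare the bytes.
# Argument: the bucket name. Returns 0 only if the contents match.
oss_round_trip() {
  local bucket="$1"
  local remote="oss://$bucket/hadoop-test/round-trip.txt"
  local tmp
  tmp=$(mktemp) || return 1
  echo "hello oss $$" > "$tmp"
  # cmp reads the downloaded object from stdin and compares it with the local file.
  hadoop fs -mkdir -p "oss://$bucket/hadoop-test" &&
    hadoop fs -put -f "$tmp" "$remote" &&
    hadoop fs -cat "$remote" | cmp -s - "$tmp"
  local status=$?
  # Clean up the test object and the local file regardless of the result.
  hadoop fs -rm "$remote" >/dev/null 2>&1
  rm -f "$tmp"
  return $status
}

# Example:
#   oss_round_trip your-bucket-name && echo "OSS read/write OK"
```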

Run teragen

[root@apache hadoop-2.7.2]# hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar teragen -Dmapred.map.tasks=100 10995116 oss://{your-bucket-name}/1G-input
19/02/28 16:38:59 INFO client.RMProxy: Connecting to ResourceManager at apache/192.168.0.176:8032
19/02/28 16:39:01 INFO terasort.TeraSort: Generating 10995116 using 100
19/02/28 16:39:01 INFO mapreduce.JobSubmitter: number of splits:100
19/02/28 16:39:01 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
19/02/28 16:39:01 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1551343125387_0001
19/02/28 16:39:02 INFO impl.YarnClientImpl: Submitted application application_1551343125387_0001
19/02/28 16:39:02 INFO mapreduce.Job: The url to track the job: http://apache:8088/proxy/application_1551343125387_0001/
19/02/28 16:39:02 INFO mapreduce.Job: Running job: job_1551343125387_0001
19/02/28 16:39:09 INFO mapreduce.Job: Job job_1551343125387_0001 running in uber mode : false
19/02/28 16:39:09 INFO mapreduce.Job: map 0% reduce 0%
19/02/28 16:39:18 INFO mapreduce.Job: map 1% reduce 0%
19/02/28 16:39:19 INFO mapreduce.Job: map 2% reduce 0%
19/02/28 16:39:21 INFO mapreduce.Job: map 4% reduce 0%
19/02/28 16:39:25 INFO mapreduce.Job: map 5% reduce 0%
19/02/28 16:39:28 INFO mapreduce.Job: map 6% reduce 0%
......
19/02/28 16:42:36 INFO mapreduce.Job: map 94% reduce 0%
19/02/28 16:42:38 INFO mapreduce.Job: map 95% reduce 0%
19/02/28 16:42:41 INFO mapreduce.Job: map 96% reduce 0%
19/02/28 16:42:44 INFO mapreduce.Job: map 97% reduce 0%
19/02/28 16:42:45 INFO mapreduce.Job: map 98% reduce 0%
19/02/28 16:42:46 INFO mapreduce.Job: map 99% reduce 0%
19/02/28 16:42:48 INFO mapreduce.Job: map 100% reduce 0%
19/02/28 16:43:11 INFO mapreduce.Job: Job job_1551343125387_0001 completed successfully
19/02/28 16:43:12 INFO mapreduce.Job: Counters: 37
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=11931190
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=8497
        HDFS: Number of bytes written=0
        HDFS: Number of read operations=100
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=0
        OSS: Number of bytes read=0
        OSS: Number of bytes written=1099511600
        OSS: Number of read operations=1100
        OSS: Number of large read operations=0
        OSS: Number of write operations=500
......

Run distcp

Copy data from OSS to HDFS

[root@apache hadoop-2.7.2]# hadoop distcp oss://{your-bucket-name}/data hdfs:/data/input
19/03/05 09:43:59 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=false, deleteMissing=false, ignoreFailures=false, maxMaps=20, sslConfigurationFile='null', copyStrategy='uniformsize', sourceFileListing=null, sourcePaths=[oss://{your-bucket-name}/data], targetPath=hdfs:/data/input, targetPathExists=false, preserveRawXattrs=false}
19/03/05 09:43:59 INFO client.RMProxy: Connecting to ResourceManager at apache/192.168.0.176:8032
19/03/05 09:44:00 INFO Configuration.deprecation: io.sort.mb is deprecated. Instead, use mapreduce.task.io.sort.mb
19/03/05 09:44:00 INFO Configuration.deprecation: io.sort.factor is deprecated. Instead, use mapreduce.task.io.sort.factor
19/03/05 09:44:01 INFO client.RMProxy: Connecting to ResourceManager at apache/192.168.0.176:8032
19/03/05 09:44:01 INFO mapreduce.JobSubmitter: number of splits:24
19/03/05 09:44:01 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1551343125387_0008
19/03/05 09:44:01 INFO impl.YarnClientImpl: Submitted application application_1551343125387_0008
19/03/05 09:44:01 INFO mapreduce.Job: The url to track the job: http://apache:8088/proxy/application_1551343125387_0008/
19/03/05 09:44:01 INFO tools.DistCp: DistCp job-id: job_1551343125387_0008
19/03/05 09:44:01 INFO mapreduce.Job: Running job: job_1551343125387_0008
19/03/05 09:44:07 INFO mapreduce.Job: Job job_1551343125387_0008 running in uber mode : false
19/03/05 09:44:07 INFO mapreduce.Job: map 0% reduce 0%
19/03/05 09:44:16 INFO mapreduce.Job: map 4% reduce 0%
19/03/05 09:44:19 INFO mapreduce.Job: map 8% reduce 0%
......
19/03/05 09:45:11 INFO mapreduce.Job: map 96% reduce 0%
19/03/05 09:45:12 INFO mapreduce.Job: map 100% reduce 0%
19/03/05 09:45:13 INFO mapreduce.Job: Job job_1551343125387_0008 completed successfully
19/03/05 09:45:13 INFO mapreduce.Job: Counters: 38
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=2932262
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=24152
        HDFS: Number of bytes written=1099511600
        HDFS: Number of read operations=898
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=251
        OSS: Number of bytes read=1099511600
        OSS: Number of bytes written=0
        OSS: Number of read operations=2404
        OSS: Number of large read operations=0
        OSS: Number of write operations=0
......
[root@apache hadoop-2.7.2]# hadoop fs -ls hdfs:/data
Found 1 items
drwxr-xr-x - root supergroup 0 2019-03-05 09:45 hdfs:///data/input

Copy data from HDFS to OSS

[root@apache hadoop-2.7.2]# hadoop distcp hdfs:/data/input oss://{your-bucket-name}/data/output
19/03/05 09:48:06 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=false, deleteMissing=false, ignoreFailures=false, maxMaps=20, sslConfigurationFile='null', copyStrategy='uniformsize', sourceFileListing=null, sourcePaths=[hdfs:/data/input], targetPath=oss://{your-bucket-name}/data/output, targetPathExists=false, preserveRawXattrs=false}
19/03/05 09:48:06 INFO client.RMProxy: Connecting to ResourceManager at apache/192.168.0.176:8032
19/03/05 09:48:06 INFO Configuration.deprecation: io.sort.mb is deprecated. Instead, use mapreduce.task.io.sort.mb
19/03/05 09:48:06 INFO Configuration.deprecation: io.sort.factor is deprecated. Instead, use mapreduce.task.io.sort.factor
19/03/05 09:48:07 INFO client.RMProxy: Connecting to ResourceManager at apache/192.168.0.176:8032
19/03/05 09:48:07 INFO mapreduce.JobSubmitter: number of splits:24
19/03/05 09:48:08 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1551343125387_0009
19/03/05 09:48:08 INFO impl.YarnClientImpl: Submitted application application_1551343125387_0009
19/03/05 09:48:08 INFO mapreduce.Job: The url to track the job: http://apache:8088/proxy/application_1551343125387_0009/
19/03/05 09:48:08 INFO tools.DistCp: DistCp job-id: job_1551343125387_0009
19/03/05 09:48:08 INFO mapreduce.Job: Running job: job_1551343125387_0009
19/03/05 09:48:14 INFO mapreduce.Job: Job job_1551343125387_0009 running in uber mode : false
19/03/05 09:48:14 INFO mapreduce.Job: map 0% reduce 0%
19/03/05 09:48:24 INFO mapreduce.Job: map 4% reduce 0%
19/03/05 09:48:27 INFO mapreduce.Job: map 8% reduce 0%
......
19/03/05 09:49:18 INFO mapreduce.Job: map 92% reduce 0%
19/03/05 09:49:20 INFO mapreduce.Job: map 96% reduce 0%
19/03/05 09:49:21 INFO mapreduce.Job: map 100% reduce 0%
19/03/05 09:49:22 INFO mapreduce.Job: Job job_1551343125387_0009 completed successfully
19/03/05 09:49:22 INFO mapreduce.Job: Counters: 38
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=2932910
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=1099535478
        HDFS: Number of bytes written=0
        HDFS: Number of read operations=548
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=48
        OSS: Number of bytes read=0
        OSS: Number of bytes written=1099511600
        OSS: Number of read operations=1262
        OSS: Number of large read operations=0
        OSS: Number of write operations=405
......
[root@apache hadoop-2.7.2]# hadoop fs -ls oss://{your-bucket-name}/data/output
Found 101 items
-rw-rw-rw- 1 root root 0 2019-03-05 09:48 oss://{your-bucket-name}/data/output/_SUCCESS
-rw-rw-rw- 1 root root 10995200 2019-03-05 09:48 oss://{your-bucket-name}/data/output/part-m-00000
-rw-rw-rw- 1 root root 10995100 2019-03-05 09:48 oss://{your-bucket-name}/data/output/part-m-00001
......
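After a distcp run, one quick sanity check is to compare the total size of the source and destination paths. The sketch below assumes that the first field printed by hadoop fs -du -s is the size in bytes; the function names path_bytes and same_size are hypothetical. Equal sizes do not prove that the contents are identical, so treat this as a coarse check only.

```shell
# Print the total size in bytes of a path, taken from the first
# field of the "hadoop fs -du -s" output.
path_bytes() {
  hadoop fs -du -s "$1" | awk 'NR == 1 { print $1 }'
}

# Succeed if source and destination hold the same number of bytes.
# Arguments: source path, destination path.
same_size() {
  local src_bytes dst_bytes
  src_bytes=$(path_bytes "$1")
  dst_bytes=$(path_bytes "$2")
  [ -n "$src_bytes" ] && [ "$src_bytes" = "$dst_bytes" ]
}

# Example:
#   same_size hdfs:/data/input "oss://your-bucket-name/data/output" && echo "sizes match"
```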

References

https://yq.aliyun.com/articles/292792?spm=a2c4e.11155435.0.0.7ccba82fbDwfhK

https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aliyun/src/site/markdown/tools/hadoop-aliyun/index.md

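Background

Apache Hadoop 3.0.0 was officially released on December 13, 2017, with built-in support for Alibaba Cloud OSS as a Hadoop-compatible file system.

OSS is the first, and at the time of writing the only, cloud storage system from a Chinese cloud vendor to be supported by an official Hadoop release. Following Docker's support for Alibaba Cloud storage, this is another, more significant milestone. It is the result of cooperation among Alibaba Cloud, the community, Intel, and other partners, and it reflects recognition by a mainstream open-source community of China's technology ecosystem and of the progress of China's cloud computing industry.

This means that users worldwide can connect to Alibaba Cloud OSS seamlessly when using the open-source Hadoop software. Offline (batch), interactive, data warehouse, deep learning, and other programs in the Hadoop ecosystem can read and write OSS objects freely without code changes.

With only a little configuration, users can use OSS in Hadoop applications. The following example shows how to use OSS with Hadoop 3.0.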

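How to use

Set up a Hadoop cluster

First, set up a Hadoop cluster using the steps below; see the official documentation for details. Skip this part if you already have a Hadoop cluster.

Configure hostnames

Set the hostname of each machine. Any names will do; for a small cluster, names such as master, slave01, slave02, and so on work well. Afterwards, run the hostname command to check that the change took effect.

Edit /etc/hosts

Edit the /etc/hosts file on every machine. On each node, open the file with vim /etc/hosts
and append entries like the following at the end, replacing the IP addresses with the LAN IP addresses of your environment: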

192.168.1.1    master
192.168.1.2    slave01
192.168.1.3    slave02
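When the same entries have to be added on many nodes, an idempotent helper avoids duplicate lines if the step is repeated. This is a minimal sketch; the function name add_host_entry is hypothetical, and writing to /etc/hosts requires root privileges.

```shell
# Append "ip name" to a hosts file unless a non-comment line
# already maps that hostname.
# Arguments: hosts file, IP address, hostname.
add_host_entry() {
  local file="$1" ip="$2" name="$3"
  # awk exits 0 only when the name appears as a hostname field.
  if awk -v n="$name" '
      $1 !~ /^#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
      END { exit !found }' "$file" 2>/dev/null; then
    return 0
  fi
  printf '%s    %s\n' "$ip" "$name" >> "$file"
}

# Example (run as root on each node):
#   add_host_entry /etc/hosts 192.168.1.1 master
#   add_host_entry /etc/hosts 192.168.1.2 slave01
#   add_host_entry /etc/hosts 192.168.1.3 slave02
```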

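Passwordless SSH login

The machines in a Hadoop cluster need to reach each other over SSH directly, that is, without a password. To set this up, distribute each machine's public key (~/.ssh/id_rsa.pub) into the ~/.ssh/authorized_keys file on the other machines.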
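The key distribution can be done with the standard OpenSSH tools ssh-keygen and ssh-copy-id. The sketch below is one possible way to do it, not the only one: the function name setup_passwordless_ssh is hypothetical, the key is created without a passphrase, and ssh-copy-id will still prompt for each host's password the first time.

```shell
# Create an RSA key pair if none exists, then install the public
# key in ~/.ssh/authorized_keys on each host given as an argument.
setup_passwordless_ssh() {
  local key="$HOME/.ssh/id_rsa"
  mkdir -p "$HOME/.ssh" && chmod 700 "$HOME/.ssh" || return 1
  # Generate the key pair only once; -N "" means an empty passphrase.
  [ -f "$key" ] || ssh-keygen -t rsa -N "" -f "$key" -q || return 1
  local host
  for host in "$@"; do
    ssh-copy-id -i "$key.pub" "$host" || return 1
  done
}

# Example (run on every node, listing the other nodes):
#   setup_passwordless_ssh slave01 slave02
```

Afterwards, ssh slave01 hostname should print the remote hostname without asking for a password.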

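Install Java (all nodes)

Download the JDK (jdk1.8.0_15 in this example) and extract it to /usr/lib/ (the installation directory used in this example).
Set the environment variables by opening the file with vim ~/.bashrc and adding: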

export JAVA_HOME=/usr/lib/jdk1.8.0_15/
export JRE_HOME=$JAVA_HOME/jre
export CLASSPATH=.:$JAVA_HOME/lib:$JRE_HOME/lib
export PATH=$PATH:$JAVA_HOME/bin

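Apply the configuration: source ~/.bashrc
Verify that the configuration took effect:
run java -version and check that it reports the expected version.

Cluster installation and configuration

See Hadoop Cluster Setup for installing and configuring the Hadoop cluster.

Alibaba Cloud OSS support

Once the Hadoop cluster is up, only a few configuration changes are needed for Hadoop to read and write Alibaba Cloud OSS.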

Edit core-site.xml

Add the following to $HADOOP_HOME/etc/hadoop/core-site.xml,
inside the configuration element.
Set the values of fs.oss.endpoint, fs.oss.accessKeyId, and fs.oss.accessKeySecret to the endpoint of your own OSS bucket and your own access key (AK).

<property>
  <name>fs.oss.endpoint</name>
  <value>YourEndpoint</value>
  <description>Aliyun OSS endpoint to connect to.</description>
</property>
<property>
  <name>fs.oss.accessKeyId</name>
  <value>YourAccessKeyId</value>
  <description>Aliyun access key ID</description>
</property>
<property>
  <name>fs.oss.accessKeySecret</name>
  <value>YourAccessKeySecret</value>
  <description>Aliyun access key secret</description>
</property>
<property>
  <name>fs.oss.impl</name>
  <value>org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem</value>
</property>
<property>
  <name>fs.oss.buffer.dir</name>
  <value>/tmp/oss</value>
</property>

Edit hadoop-env.sh

Open the file: vim $HADOOP_HOME/etc/hadoop/hadoop-env.sh
Add the following at an appropriate place:

export HADOOP_OPTIONAL_TOOLS="hadoop-aliyun"

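After making the changes, restart the Hadoop cluster.

Verify that Hadoop can read and write OSS

With the settings above in place, Hadoop can read and write OSS and benefit from its massive capacity, elasticity, and automatic scaling. To test whether Hadoop can read and write files on OSS, run the following:

# Test a write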
$HADOOP_HOME/bin/hadoop fs -mkdir oss://{your-bucket-name}/hadoop-test
# Test a read
$HADOOP_HOME/bin/hadoop fs -ls oss://{your-bucket-name}/