1. Environment

CentOS 7, JDK 1.8, Hadoop 2.7.7

The Java environment variable JAVA_HOME has already been configured.
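
For reference, this usually means /etc/profile (or an equivalent) already contains lines like the following; the JDK path is only an example and depends on where the JDK is actually installed:

export JAVA_HOME=/usr/java/jdk1.8.0_181

export PATH=$PATH:$JAVA_HOME/bin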

2. Cluster Setup

First extract and configure Hadoop on the master node, then copy everything to the other nodes.

On the master node, extract the archive:

cd /opt

tar -zvxf hadoop-2.7.7.tar.gz

ln -s hadoop-2.7.7/ hadoop

Configure the environment variables:

vim /etc/profile

Append:

HADOOP_HOME=/opt/hadoop

PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

export HADOOP_HOME PATH
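
Reload the profile so the new variables take effect in the current shell, and optionally verify that the hadoop command is found:

source /etc/profile

hadoop version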

With extraction and the environment variables done, move on to the Hadoop configuration files.

Judging from the official documentation, Hadoop's configuration is somewhat more involved than that of other distributed systems, but in practice many settings do not need to be set individually; the defaults are fine.

The goal of this article is to get the cluster running with the smallest possible configuration, so it does not analyze the settings in detail and changes as few of them as possible.

The files that mainly need editing are etc/hadoop/core-site.xml, etc/hadoop/hdfs-site.xml, etc/hadoop/yarn-site.xml, and etc/hadoop/mapred-site.xml. They can be compared against the default configuration files core-default.xml, hdfs-default.xml, yarn-default.xml, and mapred-default.xml, and every item is explained in the official Hadoop configuration documentation.

cd /opt/hadoop/etc/hadoop

Edit the environment configuration file: vim hadoop-env.sh

According to the official documentation, at a minimum one Java environment variable, JAVA_HOME, must be set here.


In addition, in most cases you should specify the PID directory and the log directory yourself.


So this article only changes these three settings. Strictly speaking, since the environment variable is already set and the PID and log directories have defaults, none of them would have to be configured at all.
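
For reference, a minimal hadoop-env.sh edit might look like the following; the JDK path and the two directories are placeholders, not values taken from this setup:

export JAVA_HOME=/usr/java/jdk1.8.0_181   # required: point this at your actual JDK
export HADOOP_PID_DIR=/data/hadoop/pid    # optional: defaults to /tmp
export HADOOP_LOG_DIR=/data/hadoop/logs   # optional: defaults to $HADOOP_HOME/logs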

Edit the core configuration file: vim core-site.xml. The file ships without any properties, so add the following two settings as described in the official documentation.


<configuration>
  <property>
   <!-- The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.-->
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
  <property>
    <!--Size of read/write buffer used in SequenceFiles. byte -->
    <name>io.file.buffer.size</name>
    <value>131072</value>
  </property>
</configuration>
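
Note that master (and later slave1 and slave2) has to resolve to the right IP on every node. If DNS is not set up, this is usually handled via /etc/hosts on all three machines; the IPs below are only placeholders for illustration:

192.168.0.10 master
192.168.0.11 slave1
192.168.0.12 slave2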

Edit the HDFS configuration file: vim hdfs-site.xml.

According to the official default configuration file (hdfs-default.xml), dfs.namenode.name.dir defaults to file://${hadoop.tmp.dir}/dfs/name and dfs.datanode.data.dir defaults to file://${hadoop.tmp.dir}/dfs/data. Since both are derived from hadoop.tmp.dir, append that setting to core-site.xml instead:

  <property>
    <!-- A base for other temporary directories. -->
    <name>hadoop.tmp.dir</name>
    <value>/data/hadoop</value>
  </property>
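
If the directory behind hadoop.tmp.dir does not exist yet, it is safest to create it (writable by the user running Hadoop) on every node beforehand:

mkdir -p /data/hadoop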

Then continue with hdfs-site.xml:

<configuration>
  <property>
    <!-- HDFS blocksize of 256MB for large file-systems.  default 128MB-->
    <name>dfs.blocksize</name>
    <value>268435456</value>
  </property>
   <property>
    <!-- More NameNode server threads to handle RPCs from large number of DataNodes. default 10-->
    <name>dfs.namenode.handler.count</name>
    <value>100</value>
  </property>
</configuration>

dfs.namenode.name.dir and dfs.datanode.data.dir are not configured separately.

The two settings above could also be left out; the values suggested by the official documentation target fairly large clusters.

Edit the YARN configuration file: vim yarn-site.xml

<configuration>
	<!-- Configurations for ResourceManager and NodeManager -->
	<property>
		<!-- Enable ACLs? Defaults to false. -->
		<name>yarn.acl.enable</name>
		<value>false</value>
	</property>

	<property>
		<!-- ACL to set admins on the cluster. ACLs are of for comma-separated-users space comma-separated-groups. Defaults to special value of * which means anyone. Special value of just space means no one has access. -->
		<name>yarn.admin.acl</name>
		<value>*</value>
	</property>

	<property>
		<!-- Configuration to enable or disable log aggregation -->
		<name>yarn.log-aggregation-enable</name>
		<value>false</value>
	</property>

	<!-- Configurations for ResourceManager -->

	<property>
		<!-- Single hostname that can be set in place of setting all yarn.resourcemanager*address resources. Results in default ports for ResourceManager components. -->
		<name>yarn.resourcemanager.hostname</name>
		<value>master</value>
	</property>

	<property>
		<!-- CapacityScheduler (recommended), FairScheduler (also recommended), or FifoScheduler -->
		<name>yarn.resourcemanager.scheduler.class</name>
		<value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
	</property>

	<property>
		<!--The minimum allocation for every container request at the RM, in MBs. Memory requests lower than this will throw a InvalidResourceRequestException.-->
		<name>yarn.scheduler.minimum-allocation-mb</name>
		<value>1024</value>
	</property>

	<property>
		<!--The maximum allocation for every container request at the RM, in MBs. Memory requests higher than this will throw a InvalidResourceRequestException.-->
		<name>yarn.scheduler.maximum-allocation-mb</name>
		<value>8192</value>
	</property>

	<!--Configurations for NodeManager-->
	
	<property>
		<!-- Defines total available resources on the NodeManager to be made available to running containers -->
		<name>yarn.nodemanager.resource.memory-mb</name>
		<value>8192</value>
	</property>
	<property>
		<!--Ratio between virtual memory to physical memory when setting memory limits for containers. Container allocations are expressed in terms of physical memory, and virtual memory usage is allowed to exceed this allocation by this ratio.-->
		<name>yarn.nodemanager.vmem-pmem-ratio</name>
		<value>2.1</value>
	</property>
	
	<property>
		<!-- Where to store container logs. An application's localized log directory will be found in ${yarn.nodemanager.log-dirs}/application_${appid}. Individual containers' log directories will be below this, in directories named container_{$contid}. Each container directory will contain the files stderr, stdin, and syslog generated by that container.-->
		<name>yarn.nodemanager.log-dirs</name>
		<value>/data/hadoop/log</value>
	</property>
	
	<property>
		<!--HDFS directory where the application logs are moved on application completion. Need to set appropriate permissions. Only applicable if log-aggregation is enabled.  -->
		<name>yarn.nodemanager.remote-app-log-dir</name>
		<value>/data/hadoop/log</value>
	</property>
</configuration>

Most of this is just the defaults; the main change is setting the ResourceManager host to master, and the rest only changes the log paths.
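
One setting not shown above that is normally required as soon as you actually run MapReduce jobs on YARN is the shuffle auxiliary service; if jobs later fail during the shuffle phase, add the following to yarn-site.xml as well:

	<property>
		<!-- Shuffle service that needs to be set for MapReduce applications. -->
		<name>yarn.nodemanager.aux-services</name>
		<value>mapreduce_shuffle</value>
	</property>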

Edit the MapReduce configuration file:

vim mapred-site.xml
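
In a freshly extracted 2.7.7 distribution this file does not exist yet; it ships as a template, so create it first:

cp mapred-site.xml.template mapred-site.xml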

<configuration>
	<!-- Configurations for MapReduce Applications -->
	<property>
		<!-- Execution framework set to Hadoop YARN. -->
		<name>mapreduce.framework.name</name>
		<value>yarn</value>
	</property>

	<property>
		<!-- The amount of memory to request from the scheduler for each map task. -->
		<name>mapreduce.map.memory.mb</name>
		<value>1536</value>
	</property>

	<property>
		<!-- Larger heap-size for child jvms of maps. -->
		<name>mapreduce.map.java.opts</name>
		<value>-Xmx1024M</value>
	</property>
	
	<property>
		<!-- Larger resource limit for reduces. -->
		<name>mapreduce.reduce.memory.mb</name>
		<value>3072</value>
	</property>
	
	<property>
		<!-- Larger heap-size for child jvms of reduces. -->
		<name>mapreduce.reduce.java.opts</name>
		<value>-Xmx2560M</value>
	</property>
	
	<property>
		<!-- The total amount of buffer memory to use while sorting files, in megabytes. By default, gives each merge stream 1MB, which should minimize seeks.-->
		<name>mapreduce.task.io.sort.mb</name>
		<value>512</value>
	</property>
	
	<property>
		<!-- The number of streams to merge at once while sorting files. This determines the number of open file handles.-->
		<name>mapreduce.task.io.sort.factor</name>
		<value>100</value>
	</property>
	
	<property>
		<!--The default number of parallel transfers run by reduce during the copy(shuffle) phase.-->
		<name>mapreduce.reduce.shuffle.parallelcopies</name>
		<value>50</value>
	</property>

	<!--Configurations for MapReduce JobHistory Server-->
	<property>
		<!--MapReduce JobHistory Server IPC host:port-->
		<name>mapreduce.jobhistory.address</name>
		<value>master:10020</value>
	</property>
	
	<property>
		<!--MapReduce JobHistory Server Web UI host:port-->
		<name>mapreduce.jobhistory.webapp.address</name>
		<value>master:19888</value>
	</property>
	
	<property>
		<!--Directory where history files are written by MapReduce jobs.-->
		<name>mapreduce.jobhistory.intermediate-done-dir</name>
		<value>/data/hadoop/mr/tmp</value>
	</property>
	
	<property>
		<!--Directory where history files are managed by the MR JobHistory Server.-->
		<name>mapreduce.jobhistory.done-dir</name>
		<value>/data/hadoop/mr</value>
	</property>
</configuration>

The key setting here is mapreduce.framework.name=yarn, i.e. MapReduce is managed by YARN (the default is local). Besides that, the JobHistory server host is set to master on its default ports, and its history file directories are moved under /data/hadoop/mr. The remaining settings merely raise memory and similar parameters.

Next, according to the official documentation, you can provide your own script and configuration to periodically check NodeManager health; that is skipped here.

Edit the slaves file: remove the existing localhost entry and add slave1 and slave2, one per line, as shown below.
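
The slaves file (etc/hadoop/slaves) then simply contains:

slave1
slave2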

Rack awareness and log4j logging configuration are skipped for now.

Once the configuration is finished, copy the Hadoop installation directory, configuration included, to the same path on slave1 and slave2, and add the same environment variables on the slave nodes.
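
For example, assuming the slaves are reachable as root and use the same /opt layout, something along these lines works:

scp -r /opt/hadoop-2.7.7 root@slave1:/opt/
ssh root@slave1 "ln -s /opt/hadoop-2.7.7/ /opt/hadoop"
scp -r /opt/hadoop-2.7.7 root@slave2:/opt/
ssh root@slave2 "ln -s /opt/hadoop-2.7.7/ /opt/hadoop"

and then append the same HADOOP_HOME and PATH lines to /etc/profile on each slave.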

With that done, start the cluster.
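
If this is the very first start, the NameNode has to be formatted once on master before any daemon is launched:

hdfs namenode -format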

For convenience, rather than starting each daemon individually, run start-dfs.sh to start HDFS, start-yarn.sh to start YARN, and mr-jobhistory-daemon.sh --config $HADOOP_HOME/etc/hadoop start historyserver to start the MapReduce JobHistory server.

After starting, check the logs on each node; if no errors show up, the cluster should be up.
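
As a quick sanity check, running jps on every node should list the expected daemons; with the layout used here that is roughly NameNode, SecondaryNameNode, ResourceManager and JobHistoryServer on master, and DataNode plus NodeManager on slave1 and slave2.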

You can then open the web UIs in a browser. With this configuration they sit at the default addresses: the NameNode UI at http://master:50070, the ResourceManager UI at http://master:8088, and the JobHistory server UI at http://master:19888.

If the pages load normally, the cluster setup is complete.

(Pitfall: I initially mistook the port in fs.defaultFS for the NameNode web UI port and set it to the default web UI port 50070, which made startup fail with a port-already-in-use error because the web UI had already bound 50070. Checking the official default configuration file showed that this setting actually specifies the port the HDFS filesystem itself, i.e. the NameNode RPC service, listens on.)