The ResourceManager (RM) is responsible for tracking the resources in a cluster, and scheduling applications (e.g., MapReduce jobs).

        Prior to Hadoop 2.4, the ResourceManager was the single point of failure in a YARN cluster. The High Availability feature adds redundancy in the form of an Active/Standby ResourceManager pair to remove this otherwise single point of failure.

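        At any point in time, only one ResourceManager is Active; the remaining one or more are in the Standby state. The state can be switched manually through the CLI or automatically through the integrated failover controller. Automatic failover requires ZooKeeper. The rest of this post walks through the configuration for YARN HA in automatic failover mode.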

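    1. First, edit yarn-site.xml. The newly added properties (shown in blue in the original post) are the ones between the "add start 20161012" and "add end 20161012" comments.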

[hadoop@hadoop01 hadoop]$ vi yarn-site.xml

    
<?xml version="1.0"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<configuration>
<!-- Site specific YARN configuration properties -->
<!--add start 20160627 -->
  <property>
      <description>The address of the applications manager interface in the RM.</description>
      <name>yarn.resourcemanager.address</name>
      <value>hadoop01:8032</value>
  </property>
  <property>
      <description>The address of the scheduler interface.</description>
      <name>yarn.resourcemanager.scheduler.address</name>
      <value>hadoop01:8030</value>
  </property>
  <property>
      <name>yarn.resourcemanager.resource-tracker.address</name>
      <value>hadoop01:8031</value>
  </property>
  <property>
      <description>The address of the RM admin interface.</description>
      <name>yarn.resourcemanager.admin.address</name>
      <value>hadoop01:8033</value>
  </property>
  <property>
      <description>The http address of the RM web application.</description>
      <name>yarn.resourcemanager.webapp.address</name>
      <value>hadoop01:8088</value>
  </property>
  <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
  </property>
  <property>
     <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
      <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <!-- add end 20160627 -->

  <!-- add start 20161012 -->
  <property>
     <name>yarn.resourcemanager.ha.enabled</name>
     <value>true</value>
  </property>
  <property>
     <name>yarn.resourcemanager.cluster-id</name>
     <value>rmCluster</value>
  </property>
  <property>
     <name>yarn.resourcemanager.ha.rm-ids</name>
     <value>rm1,rm2</value>
  </property>
  <property>
     <name>yarn.resourcemanager.hostname.rm1</name>
     <value>hadoop01</value>
  </property>
  <property>
     <name>yarn.resourcemanager.hostname.rm2</name>
     <value>hadoop02</value>
  </property>
  <property>
     <name>yarn.resourcemanager.webapp.address.rm1</name>
     <value>hadoop01:8088</value>
  </property>
  <property>
     <name>yarn.resourcemanager.webapp.address.rm2</name>
     <value>hadoop02:8088</value>
  </property>
  <property>
     <name>yarn.resourcemanager.zk-address</name>
     <value>hadoop01:2181,hadoop02:2181,hadoop03:2181</value>
  </property>
  <!-- add end 20161012 -->

</configuration>

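Before restarting anything, it can help to sanity-check the HA properties above. The following is a minimal sketch, not part of the original setup: the function names `load_properties` and `check_rm_ha` and the inline `SAMPLE` are hypothetical, and the checks only cover the properties used in this post (HA enabled, a cluster ID, at least two RM IDs with a hostname each, and a ZooKeeper address).

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a trimmed copy of the HA block from yarn-site.xml.
SAMPLE = """<configuration>
  <property><name>yarn.resourcemanager.ha.enabled</name><value>true</value></property>
  <property><name>yarn.resourcemanager.cluster-id</name><value>rmCluster</value></property>
  <property><name>yarn.resourcemanager.ha.rm-ids</name><value>rm1,rm2</value></property>
  <property><name>yarn.resourcemanager.hostname.rm1</name><value>hadoop01</value></property>
  <property><name>yarn.resourcemanager.hostname.rm2</name><value>hadoop02</value></property>
  <property><name>yarn.resourcemanager.zk-address</name><value>hadoop01:2181,hadoop02:2181,hadoop03:2181</value></property>
</configuration>"""


def load_properties(xml_text):
    """Return {name: value} for every <property> in a Hadoop-style config."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


def check_rm_ha(props):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if props.get("yarn.resourcemanager.ha.enabled", "").lower() != "true":
        problems.append("yarn.resourcemanager.ha.enabled is not true")
    if not props.get("yarn.resourcemanager.cluster-id"):
        problems.append("yarn.resourcemanager.cluster-id is missing")
    raw_ids = props.get("yarn.resourcemanager.ha.rm-ids", "")
    rm_ids = [part.strip() for part in raw_ids.split(",") if part.strip()]
    if len(rm_ids) < 2:
        problems.append("yarn.resourcemanager.ha.rm-ids should list at least two IDs")
    for rm_id in rm_ids:
        if not props.get("yarn.resourcemanager.hostname." + rm_id):
            problems.append("no hostname configured for " + rm_id)
    if not props.get("yarn.resourcemanager.zk-address"):
        problems.append("yarn.resourcemanager.zk-address is missing")
    return problems


print(check_rm_ha(load_properties(SAMPLE)))
```

To run it against the real file, read yarn-site.xml into a string and pass that to `load_properties` instead of `SAMPLE`.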
    2. On the hadoop01 server, start the Hadoop cluster ("..." stands for an abbreviated path). The output shows that start-all.sh starts only one ResourceManager.

[hadoop@hadoop01 hadoop]$ start-all.sh


This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
16/07/04 12:17:09 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [hadoop01 hadoop02]
hadoop02: starting namenode, logging to /.../hadoop-hadoop-namenode-hadoop02.out
hadoop01: starting namenode, logging to /.../hadoop-hadoop-namenode-hadoop01.out
hadoop02: starting datanode, logging to /.../hadoop-hadoop-datanode-hadoop02.out
hadoop01: starting datanode, logging to /.../hadoop-hadoop-datanode-hadoop01.out
hadoop03: starting datanode, logging to /.../hadoop-hadoop-datanode-hadoop03.out
Starting journal nodes [hadoop01 hadoop02 hadoop03]
hadoop02: starting journalnode, logging to /.../hadoop-hadoop-journalnode-hadoop02.out
hadoop01: starting journalnode, logging to /.../hadoop-hadoop-journalnode-hadoop01.out
hadoop03: starting journalnode, logging to /.../hadoop-hadoop-journalnode-hadoop03.out
Starting ZK Failover Controllers on NN hosts [hadoop01 hadoop02]
hadoop02: starting zkfc, logging to /.../hadoop-hadoop-zkfc-hadoop02.out
hadoop01: starting zkfc, logging to /.../hadoop-hadoop-zkfc-hadoop01.out
starting yarn daemons
starting resourcemanager, logging to /.../yarn-hadoop-resourcemanager-hadoop01.out
hadoop01: starting nodemanager, logging to /.../yarn-hadoop-nodemanager-hadoop01.out
hadoop02: starting nodemanager, logging to /.../yarn-hadoop-nodemanager-hadoop02.out
hadoop03: starting nodemanager, logging to /.../yarn-hadoop-nodemanager-hadoop03.out
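The observation that only one ResourceManager was started can also be read off the output mechanically. Here is a small sketch that assumes the line format seen above ("host: starting daemon, logging to ..." for remote hosts, and no host prefix for daemons started locally). The helper name `started_daemons` and the `local_host` parameter are made up for illustration.

```python
import re
from collections import defaultdict

# Matches "hadoop02: starting namenode, logging to ..." as well as
# "starting resourcemanager, logging to ..." (no host prefix: started locally).
START_LINE = re.compile(r"^(?:(?P<host>\S+): )?starting (?P<daemon>\w+), logging to")


def started_daemons(output, local_host="local"):
    """Map each daemon name to the list of hosts it was started on."""
    hosts_by_daemon = defaultdict(list)
    for line in output.splitlines():
        match = START_LINE.match(line.strip())
        if match:
            host = match.group("host") or local_host
            hosts_by_daemon[match.group("daemon")].append(host)
    return dict(hosts_by_daemon)


# The YARN part of the output shown above, with the elided paths kept as-is.
SAMPLE_OUTPUT = """\
starting yarn daemons
starting resourcemanager, logging to /.../yarn-hadoop-resourcemanager-hadoop01.out
hadoop01: starting nodemanager, logging to /.../yarn-hadoop-nodemanager-hadoop01.out
hadoop02: starting nodemanager, logging to /.../yarn-hadoop-nodemanager-hadoop02.out
hadoop03: starting nodemanager, logging to /.../yarn-hadoop-nodemanager-hadoop03.out
"""

daemons = started_daemons(SAMPLE_OUTPUT, local_host="hadoop01")
print(daemons["resourcemanager"])
```

For this sample the resourcemanager entry lists hadoop01 only, which is why the second ResourceManager has to be started separately in step 4.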

    3. Check the running Hadoop processes. The hadoop01 machine has the following processes.

[hadoop@hadoop01 hadoop]$ jps
5239 NodeManager
4839 JournalNode
5288 Jps
4632 DataNode
5032 DFSZKFailoverController
4521 NameNode
5116 ResourceManager
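If you check several machines, comparing the jps listing against an expected set by eye gets tedious. The sketch below parses jps-style output ("PID Name" per line) and reports which expected daemons are missing. The expected set is simply the list shown above for hadoop01, and the helper names are hypothetical.

```python
def parse_jps(output):
    """Map each process name to the list of PIDs reported by jps."""
    processes = {}
    for line in output.splitlines():
        parts = line.split(None, 1)
        # Skip blank lines and entries where jps printed a PID without a name.
        if len(parts) == 2 and parts[0].isdigit():
            processes.setdefault(parts[1].strip(), []).append(int(parts[0]))
    return processes


def missing_daemons(output, expected):
    """Return the expected daemon names that do not appear in the jps output."""
    return sorted(set(expected) - set(parse_jps(output)))


# Daemons listed for hadoop01 in this post (Jps itself is not a daemon).
EXPECTED_ON_HADOOP01 = {
    "NameNode", "DataNode", "JournalNode",
    "DFSZKFailoverController", "ResourceManager", "NodeManager",
}

JPS_OUTPUT = """\
5239 NodeManager
4839 JournalNode
5288 Jps
4632 DataNode
5032 DFSZKFailoverController
4521 NameNode
5116 ResourceManager
"""

print(missing_daemons(JPS_OUTPUT, EXPECTED_ON_HADOOP01))
```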

    4. Start the ResourceManager on the hadoop02 machine.

[hadoop@hadoop02 ~]$ yarn-daemon.sh start resourcemanager
starting resourcemanager, logging to /home/hadoop/hadoop-2.7.2//logs/yarn-hadoop-resourcemanager-hadoop02.out

    5. Check the state of each of the two ResourceManagers.

[hadoop@hadoop02 ~]$ yarn rmadmin -getServiceState rm1
active
[hadoop@hadoop02 ~]$ yarn rmadmin -getServiceState rm2
standby
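The same check can be scripted. This is a rough sketch that shells out to the `yarn rmadmin -getServiceState` command used above and verifies that exactly one ResourceManager reports active. It assumes `yarn` is on the PATH and that the state is the last line written to standard output. The function names and the injectable `run` parameter are assumptions added here so the logic can be exercised without a cluster.

```python
import subprocess


def get_rm_state(rm_id, run=subprocess.run):
    """Return the state printed by `yarn rmadmin -getServiceState <rm_id>`."""
    result = run(
        ["yarn", "rmadmin", "-getServiceState", rm_id],
        capture_output=True, text=True, check=True,
    )
    lines = result.stdout.strip().splitlines()
    # Assumption: the state ("active" or "standby") is the last stdout line.
    return lines[-1].strip() if lines else ""


def check_single_active(rm_ids, run=subprocess.run):
    """Return (states, ok) where ok is True if exactly one RM is active."""
    states = {rm_id: get_rm_state(rm_id, run) for rm_id in rm_ids}
    active = [rm_id for rm_id, state in states.items() if state == "active"]
    return states, len(active) == 1
```

On a cluster node, `check_single_active(["rm1", "rm2"])` would run the two commands from this step and tell you whether the pair is in a healthy active/standby split.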

    6. Check the ResourceManager state in the web UI.

The figure below shows that the ResourceManager on hadoop01 is in the active state.

[Figure: ResourceManager web UI on hadoop01, active state]

The figure below shows that the ResourceManager on hadoop02 is in the standby state.

[Figure: ResourceManager web UI on hadoop02, standby state]


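    7. On the hadoop02 server, attempt a manual failover test. Because automatic failover is enabled, the command is refused unless the --forcemanual flag is specified.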

[hadoop@hadoop02 ~]$ yarn rmadmin -transitionToStandby rm1
Automatic failover is enabled for org.apache.hadoop.yarn.client.RMHAServiceTarget@5d11346a
Refusing to manually manage HA state, since it may cause
a split-brain scenario or other incorrect state.
If you are very sure you know what you are doing, please
specify the --forcemanual flag.
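If you script this step, it is worth detecting the refusal instead of assuming the transition happened. The sketch below only builds the command line and recognizes the refusal text shown above; it does not run anything. The placement of `--forcemanual` before the service ID follows my reading of the rmadmin usage text and should be confirmed with `yarn rmadmin -help` on your version. The helper names are hypothetical.

```python
# Text taken from the refusal message printed by rmadmin above.
REFUSAL_MARKER = "Refusing to manually manage HA state"


def transition_to_standby_cmd(rm_id, force_manual=False):
    """Build the argument list for `yarn rmadmin -transitionToStandby`."""
    cmd = ["yarn", "rmadmin", "-transitionToStandby"]
    if force_manual:
        # Assumption: the flag goes before the service ID; verify locally.
        cmd.append("--forcemanual")
    cmd.append(rm_id)
    return cmd


def was_refused(output):
    """Return True if the rmadmin output contains the refusal message."""
    return REFUSAL_MARKER in output


print(transition_to_standby_cmd("rm1"))
```

As the message itself warns, forcing a manual transition while automatic failover is enabled can lead to a split-brain scenario, so the `force_manual=True` variant is something to use only when you are sure of the cluster state.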


    8. On the hadoop02 server, check the ResourceManager state again.

[hadoop@hadoop02 ~]$ yarn rmadmin -getServiceState rm2
active

    9. Check the ResourceManager state through the web pages.

The figure below shows that the ResourceManager on the hadoop01 server can no longer be reached.

[Figure: ResourceManager web UI on hadoop01, page unreachable]

The figure below shows that the state on the hadoop02 server is active.

[Figure: ResourceManager web UI on hadoop02, active state]

    10. In addition, when we access a ResourceManager that is in the standby state, the page is automatically redirected to the ResourceManager that is in the active state.

Assuming a standby RM is up and running, the Standby automatically redirects all web requests to the Active, except for the “About” page.
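To see the redirect without a browser, you can request a page from the standby and look at where it points. The following is a sketch under assumptions: how the standby signals the redirect may differ between Hadoop versions, so it checks both an HTTP 3xx response with a Location header and a Refresh header on a normal response. The function name `redirect_target` and any URLs you pass to it are illustrative only.

```python
import urllib.error
import urllib.request


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following redirects so we can inspect them."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # the 3xx response then surfaces as an HTTPError


def redirect_target(url, timeout=5):
    """Return the URL the server redirects to, or None if it serves the page."""
    opener = urllib.request.build_opener(_NoRedirect)
    try:
        with opener.open(url, timeout=timeout) as response:
            refresh = response.headers.get("Refresh")
    except urllib.error.HTTPError as err:
        if err.code in (301, 302, 303, 307, 308):
            return err.headers.get("Location")
        raise
    # Some servers redirect with a header such as "Refresh: 3; url=http://...".
    if refresh and "url=" in refresh.lower():
        start = refresh.lower().index("url=") + len("url=")
        return refresh[start:].strip()
    return None
```

With the hosts from this post, calling `redirect_target("http://hadoop01:8088/cluster")` while rm1 is in standby would be expected to return an address on hadoop02, and None once rm1 is active again.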