The ResourceManager (RM) is responsible for tracking the resources in a cluster, and scheduling applications (e.g., MapReduce jobs).
Prior to Hadoop 2.4, the ResourceManager was the single point of failure in a YARN cluster. The High Availability feature adds redundancy in the form of an Active/Standby ResourceManager pair to remove this otherwise single point of failure.
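At any point in time, only one ResourceManager is Active; the remaining one or more are in Standby state. A state transition can be triggered manually through the CLI or performed by the integrated failover controller. Automatic failover requires ZooKeeper. The steps below walk through the configuration for YARN HA in automatic failover mode.
1. First, edit the yarn-site.xml file. The newly added content (shown in blue in the original post) is the block between the "add start 20161012" and "add end 20161012" comments.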
[hadoop@hadoop01 hadoop]$ vi yarn-site.xml
<?xml version="1.0"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<configuration>
  <!-- Site specific YARN configuration properties -->

  <!--add start 20160627 -->
  <property>
    <description>The address of the applications manager interface in the RM.</description>
    <name>yarn.resourcemanager.address</name>
    <value>hadoop01:8032</value>
  </property>
  <property>
    <description>The address of the scheduler interface.</description>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>hadoop01:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>hadoop01:8031</value>
  </property>
  <property>
    <description>The address of the RM admin interface.</description>
    <name>yarn.resourcemanager.admin.address</name>
    <value>hadoop01:8033</value>
  </property>
  <property>
    <description>The http address of the RM web application.</description>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>hadoop01:8088</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <!-- add end 20160627 -->

  <!-- add start 20161012 -->
  <property>
    <name>yarn.resourcemanager.ha.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.cluster-id</name>
    <value>rmCluster</value>
  </property>
  <property>
    <name>yarn.resourcemanager.ha.rm-ids</name>
    <value>rm1,rm2</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname.rm1</name>
    <value>hadoop01</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname.rm2</name>
    <value>hadoop02</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address.rm1</name>
    <value>hadoop01:8088</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address.rm2</name>
    <value>hadoop02:8088</value>
  </property>
  <property>
    <name>yarn.resourcemanager.zk-address</name>
    <value>hadoop01:2181,hadoop02:2181,hadoop03:2181</value>
  </property>
  <!-- add end 20161012 -->
</configuration>
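Before restarting the cluster, it can help to sanity-check the HA settings. The following is a minimal sketch, not part of the original procedure: the helper names `load_properties` and `check_rm_ha` are hypothetical, and the checks only cover the properties used in this article (other Hadoop versions may use different property names, for example for the ZooKeeper quorum).

```python
import xml.etree.ElementTree as ET


def load_properties(xml_text):
    """Return a {name: value} dict from Hadoop-style <configuration> XML."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def check_rm_ha(props):
    """Return a list of problems in the RM HA settings (empty if none found)."""
    problems = []
    if props.get("yarn.resourcemanager.ha.enabled", "").lower() != "true":
        problems.append("yarn.resourcemanager.ha.enabled is not true")
    raw_ids = props.get("yarn.resourcemanager.ha.rm-ids", "")
    rm_ids = [i.strip() for i in raw_ids.split(",") if i.strip()]
    if len(rm_ids) < 2:
        problems.append("yarn.resourcemanager.ha.rm-ids should list at least two IDs")
    # Every logical RM ID needs a matching hostname entry.
    for rm_id in rm_ids:
        key = "yarn.resourcemanager.hostname." + rm_id
        if not props.get(key):
            problems.append("missing " + key)
    if not props.get("yarn.resourcemanager.zk-address"):
        problems.append("yarn.resourcemanager.zk-address is not set")
    return problems


# The HA-related properties from this article, used here as sample input.
SAMPLE = """
<configuration>
  <property><name>yarn.resourcemanager.ha.enabled</name><value>true</value></property>
  <property><name>yarn.resourcemanager.cluster-id</name><value>rmCluster</value></property>
  <property><name>yarn.resourcemanager.ha.rm-ids</name><value>rm1,rm2</value></property>
  <property><name>yarn.resourcemanager.hostname.rm1</name><value>hadoop01</value></property>
  <property><name>yarn.resourcemanager.hostname.rm2</name><value>hadoop02</value></property>
  <property><name>yarn.resourcemanager.zk-address</name><value>hadoop01:2181,hadoop02:2181,hadoop03:2181</value></property>
</configuration>
"""

if __name__ == "__main__":
    for problem in check_rm_ha(load_properties(SAMPLE)):
        print(problem)
```

To check a real file, read its text and pass it to `load_properties` instead of `SAMPLE`.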
2. On the hadoop01 server, start the Hadoop cluster ("..." stands for an abbreviated path). The output shows that start-all.sh starts only one ResourceManager.
[hadoop@hadoop01 hadoop]$ start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
16/07/04 12:17:09 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [hadoop01 hadoop02]
hadoop02: starting namenode, logging to /.../hadoop-hadoop-namenode-hadoop02.out
hadoop01: starting namenode, logging to /.../hadoop-hadoop-namenode-hadoop01.out
hadoop02: starting datanode, logging to /.../hadoop-hadoop-datanode-hadoop02.out
hadoop01: starting datanode, logging to /.../hadoop-hadoop-datanode-hadoop01.out
hadoop03: starting datanode, logging to /.../hadoop-hadoop-datanode-hadoop03.out
Starting journal nodes [hadoop01 hadoop02 hadoop03]
hadoop02: starting journalnode, logging to /.../hadoop-hadoop-journalnode-hadoop02.out
hadoop01: starting journalnode, logging to /.../hadoop-hadoop-journalnode-hadoop01.out
hadoop03: starting journalnode, logging to /.../hadoop-hadoop-journalnode-hadoop03.out
Starting ZK Failover Controllers on NN hosts [hadoop01 hadoop02]
hadoop02: starting zkfc, logging to /.../hadoop-hadoop-zkfc-hadoop02.out
hadoop01: starting zkfc, logging to /.../hadoop-hadoop-zkfc-hadoop01.out
starting yarn daemons
starting resourcemanager, logging to /.../yarn-hadoop-resourcemanager-hadoop01.out
hadoop01: starting nodemanager, logging to /.../yarn-hadoop-nodemanager-hadoop01.out
hadoop02: starting nodemanager, logging to /.../yarn-hadoop-nodemanager-hadoop02.out
hadoop03: starting nodemanager, logging to /.../yarn-hadoop-nodemanager-hadoop03.out
3. Check the processes started for the Hadoop cluster. The hadoop01 machine runs the following processes.
[hadoop@hadoop01 hadoop]$ jps
5239 NodeManager
4839 JournalNode
5288 Jps
4632 DataNode
5032 DFSZKFailoverController
4521 NameNode
5116 ResourceManager
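The jps check can also be scripted. The sketch below is illustrative only: `parse_jps` and `missing_daemons` are hypothetical helper names, and the expected daemon list simply mirrors the hadoop01 output shown in this step.

```python
def parse_jps(output):
    """Map each daemon name in `jps` output to the list of its PIDs."""
    daemons = {}
    for line in output.splitlines():
        parts = line.split()
        # A regular jps line has exactly two fields: "<pid> <main class>".
        if len(parts) == 2 and parts[0].isdigit():
            daemons.setdefault(parts[1], []).append(int(parts[0]))
    return daemons


def missing_daemons(output, expected):
    """Return the expected daemon names that do not appear in the jps output."""
    running = parse_jps(output)
    return [name for name in expected if name not in running]


# jps output copied from this step (hadoop01).
SAMPLE_JPS = """\
5239 NodeManager
4839 JournalNode
5288 Jps
4632 DataNode
5032 DFSZKFailoverController
4521 NameNode
5116 ResourceManager
"""

# Daemons expected on hadoop01 in this article's layout.
EXPECTED_ON_HADOOP01 = [
    "NameNode", "DataNode", "JournalNode",
    "DFSZKFailoverController", "ResourceManager", "NodeManager",
]

if __name__ == "__main__":
    print(missing_daemons(SAMPLE_JPS, EXPECTED_ON_HADOOP01))
```

Running the same check against hadoop02 before step 4 would be expected to report ResourceManager as missing, since start-all.sh started only one ResourceManager.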
4. Start the ResourceManager on the hadoop02 machine.
[hadoop@hadoop02 ~]$ yarn-daemon.sh start resourcemanager
starting resourcemanager, logging to /home/hadoop/hadoop-2.7.2//logs/yarn-hadoop-resourcemanager-hadoop02.out
5. Check the state of each of the two ResourceManagers.
[hadoop@hadoop02 ~]$ yarn rmadmin -getServiceState rm1
active
[hadoop@hadoop02 ~]$ yarn rmadmin -getServiceState rm2
standby
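The two state checks can be wrapped in a small script. This is a sketch under assumptions: it expects the `yarn` command to be on the PATH, treats the last non-empty line of standard output as the state, and uses the hypothetical helper names `get_service_state` and `summarize_states`.

```python
import subprocess


def get_service_state(rm_id, run=subprocess.run):
    """Run `yarn rmadmin -getServiceState <rm_id>` and return the reported state."""
    result = run(
        ["yarn", "rmadmin", "-getServiceState", rm_id],
        capture_output=True, text=True, check=True,
    )
    # Assumption: the state is the last non-empty line printed on stdout.
    lines = [ln.strip() for ln in result.stdout.splitlines() if ln.strip()]
    return lines[-1] if lines else ""


def summarize_states(rm_ids, get_state=get_service_state):
    """Return ({rm_id: state}, [ids currently reporting 'active'])."""
    states = {rm_id: get_state(rm_id) for rm_id in rm_ids}
    active = [rm_id for rm_id, state in states.items() if state == "active"]
    return states, active


if __name__ == "__main__":
    states, active = summarize_states(["rm1", "rm2"])
    print(states)
    if len(active) != 1:
        print("expected exactly one active ResourceManager, found:", active)
```

In a healthy HA pair exactly one ID should report `active`, so the script flags any other count.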
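6. Check the ResourceManager state through the web UI.
The figure below shows that the ResourceManager on hadoop01 is in the active state.
The figure below shows that the ResourceManager on hadoop02 is in the standby state.
7. On the hadoop02 server, run a manual failover test. Because automatic failover is enabled, the command is refused unless the --forcemanual flag is specified.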
[hadoop@hadoop02 ~]$ yarn rmadmin -transitionToStandby rm1
Automatic failover is enabled for org.apache.hadoop.yarn.client.RMHAServiceTarget@5d11346a
Refusing to manually manage HA state, since it may cause a split-brain scenario or other incorrect state.
If you are very sure you know what you are doing, please specify the --forcemanual flag.
8. On the hadoop02 server, check the ResourceManager state again.
[hadoop@hadoop02 ~]$ yarn rmadmin -getServiceState rm2
active
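A failover does not complete instantly, so a single state check may be taken too early. The following sketch polls until a given ResourceManager reports `active` or a timeout expires. The function name `wait_until_active` and the default timings are illustrative assumptions; `get_state` is any callable that returns the state string for an RM ID, for example a wrapper around `yarn rmadmin -getServiceState`.

```python
import time


def wait_until_active(rm_id, get_state, timeout_s=60.0, interval_s=2.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll get_state(rm_id) until it returns 'active'; False on timeout."""
    deadline = clock() + timeout_s
    while True:
        if get_state(rm_id) == "active":
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


if __name__ == "__main__":
    # Stand-in for a real state query: standby twice, then active.
    answers = iter(["standby", "standby", "active"])
    became_active = wait_until_active(
        "rm2", lambda rm_id: next(answers), interval_s=0.1
    )
    print("rm2 active:", became_active)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real delays.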
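9. Check the ResourceManager state through the web pages.
The figure below shows that the ResourceManager on the hadoop01 server can no longer be accessed.
The figure below shows that the state on the hadoop02 server is active.
10. In addition, when we access a ResourceManager that is in the standby state, the system automatically redirects the page to the ResourceManager in the active state.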
Assuming a standby RM is up and running, the Standby automatically redirects all web requests to the Active, except for the “About” page.
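The HA state can also be read programmatically from the ResourceManager REST API, whose cluster information resource (`/ws/v1/cluster/info`) includes an `haState` field. The sketch below rests on some assumptions: the web addresses are the ones configured in this article, the helper names are hypothetical, and whether a standby answers this endpoint itself or redirects to the Active should be verified on your Hadoop version. Since `urllib` follows redirects, a redirected request would report the Active's state.

```python
import json
import urllib.request


def ha_state_from_json(body):
    """Extract clusterInfo.haState from a /ws/v1/cluster/info JSON body."""
    info = json.loads(body).get("clusterInfo", {})
    return info.get("haState")


def fetch_ha_state(webapp_address, timeout_s=5.0):
    """Query one RM web address (host:port) and return its reported haState."""
    url = "http://" + webapp_address + "/ws/v1/cluster/info"
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return ha_state_from_json(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Web addresses taken from yarn.resourcemanager.webapp.address.rm1 / rm2.
    for address in ["hadoop01:8088", "hadoop02:8088"]:
        try:
            print(address, fetch_ha_state(address))
        except OSError as error:
            # A stopped RM refuses the connection, as in step 9.
            print(address, "unreachable:", error)
```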