YARN issues

Commonly used commands
1. yarn rmadmin -getServiceState rm1    # check whether an RM is active or standby

2. Manually switch active/standby
yarn rmadmin -transitionToStandby rm2 --forcemanual    # switch rm2 from active to standby
yarn rmadmin -transitionToActive rm1 --forcemanual     # switch rm1 from standby to active
yarn rmadmin -getServiceState rm1

1. YARN appears hung and the log keeps printing the following message: log aggregation have not finished yet

Max number of completed apps kept in state store met: maxCompletedAppsInStateStore = 10000, but not removing app application_1618886060273_3657 from state store as log aggregation have not finished yet

This is a YARN bug: https://issues.apache.org/jira/browse/YARN-4946

Apply the fix to the standby RM first, then the active RM. The replacement steps on each RM node are:
(1) Back up the old jar with mv: mv $HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-server-resourcemanager-$version.jar <backup path>
(2) Copy in the new jar: download a hadoop-yarn-server-resourcemanager-$version.jar from a release above 3.2.0
(3) Restart the RM
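
A minimal shell sketch of the swap on one RM node, assuming $HADOOP_HOME is set and that the version string, backup directory and restart command are placeholders to adapt to your distribution (on EMR the RM is usually restarted from the console):

## back up the old ResourceManager jar
mkdir -p /opt/backup && mv $HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-server-resourcemanager-$version.jar /opt/backup/
## drop in the patched jar downloaded from a release above 3.2.0
cp ./hadoop-yarn-server-resourcemanager-<new_version>.jar $HADOOP_HOME/share/hadoop/yarn/
## restart the ResourceManager (Hadoop 3.x daemon syntax; older versions use yarn-daemon.sh)
yarn --daemon stop resourcemanager
yarn --daemon start resourcemanager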

2. YARN node labels

First, configure yarn-site.xml:

yarn.node-labels.enabled=true
yarn.node-labels.fs-store.root-dir=hdfs://namenode:port/path/node-labels/

Then run the following on the master node:

## Add labels
yarn rmadmin -addToClusterNodeLabels "label_1(exclusive=true/false),label_2(exclusive=true/false)"
## exclusive defaults to true
## List labels
yarn cluster --list-node-labels
## Remove node labels
yarn rmadmin -removeFromClusterNodeLabels "<label>[,<label>,...]"
## Add/update node-to-label mappings
yarn rmadmin -replaceLabelsOnNode "node1[:port]=label1 node2=label2" [-failOnUnknownNodes]
## node1's address must match what the Nodes page of the YARN web UI shows

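A concrete run-through, using the labels test1/test2 referenced by the queue configuration below and hypothetical worker hostnames:

yarn rmadmin -addToClusterNodeLabels "test1(exclusive=true),test2(exclusive=true)"
yarn rmadmin -replaceLabelsOnNode "emr-worker-1=test1 emr-worker-2=test2"
yarn cluster --list-node-labels
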
Queue configuration (capacity-scheduler.xml)

<configuration>
  <property>
    <name>yarn.scheduler.capacity.maximum-applications</name>
    <value>10000</value>
    <description>Maximum number of applications that can be pending and running.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
    <value>0.25</value>
    <description>Maximum percent of resources in the cluster which can be used to run application masters i.e. controls number of concurrent running applications.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.resource-calculator</name>
    <value>org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator</value>
    <description>The ResourceCalculator implementation to be used to compare Resources in the scheduler.The default i.e. DefaultResourceCalculator only uses Memory while DominantResourceCalculator uses dominant-resource to compare multi-dimensional resources such as Memory, CPU etc.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>default,a1,a2,a3</value>
    <description>The queues at this level (root is the root queue).</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.a1.accessible-node-labels</name>
    <value>test1</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.a2.accessible-node-labels</name>
    <value>test2</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.a3.accessible-node-labels</name>
    <value>test2</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.a1.accessible-node-labels.test1.capacity</name>
    <value>100</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.a2.accessible-node-labels.test2.capacity</name>
    <value>30</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.a3.accessible-node-labels.test2.capacity</name>
    <value>70</value>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.default.capacity</name>
    <value>50</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.a1.capacity</name>
    <value>20</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.a2.capacity</name>
    <value>20</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.a3.capacity</name>
    <value>10</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.a1.maximum-capacity</name>
    <value>100</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.a2.maximum-capacity</name>
    <value>100</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.node-locality-delay</name>
    <value>-1</value>
    <description>Number of missed scheduling opportunities after which the CapacityScheduler attempts to schedule rack-local containers. Typically this should be set to number of nodes in the cluster.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.queue-mappings</name>
    <value></value>
    <description>A list of mappings that will be used to assign jobs to queues. The syntax for this list is [u|g]:[name]:[queue_name][,next mapping]* Typically this list will be used to map users to queues,for example, u:%user:%user maps all users to queues with the same name as the user.</description>
  </property>
  <property>
    <name>yarn.scheduler.capacity.queue-mappings-override.enable</name>
    <value>false</value>
    <description>If a queue mapping is present, will it override the value specified by the user? This can be used by administrators to place jobs in queues that are different than the one specified by the user. The default is false.</description>
  </property>

</configuration>

Restart the YARN RMs.
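
After the restart, one way to check that queues and labels line up is to run a small test job in a labeled queue. A sketch assuming the bundled MapReduce examples jar and the queue/label names configured above:

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi \
  -Dmapreduce.job.queuename=a1 \
  -Dmapreduce.job.node-label-expression=test1 \
  5 100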

3. YARN exceptions caused by insufficient disk space

Expanding the disk resolves it. The error looked like:
2020-09-17 13:12:31,660 ERROR org.mortbay.log: /ws/v1/cluster/apps
javax.ws.rs.WebApplicationException: javax.xml.bind.MarshalException

  • with linked exception:
    [javax.xml.stream.XMLStreamException: org.mortbay.jetty.EofException]
    at com.sun.jersey.core.provider.jaxb.AbstractRootElementProvider.writeTo(AbstractRootElementProvider.java:159)
    at com.sun.jersey.spi.container.ContainerResponse.write(ContainerResponse.java:306)
    at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1437)
    at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349)
    at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339)
    at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416)
    at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537)
    at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:886)
    at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834)
    at org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:142)
    at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795)
    at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
    at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
    at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
    at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113)
    at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
    at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
    at org.mortbay.jetty.Server.handle(Server.java:326)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
    at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
    Caused by: javax.xml.bind.MarshalException
  • with linked exception:
    [javax.xml.stream.XMLStreamException: org.mortbay.jetty.EofException]
    at com.sun.xml.bind.v2.runtime.MarshallerImpl.write(MarshallerImpl.java:327)
    at com.sun.xml.bind.v2.runtime.MarshallerImpl.marshal(MarshallerImpl.java:177)
    at com.sun.jersey.json.impl.BaseJSONMarshaller.marshallToJSON(BaseJSONMarshaller.java:103)
    at com.sun.jersey.json.impl.provider.entity.JSONRootElementProvider.writeTo(JSONRootElementProvider.java:136)
    at com.sun.jersey.core.provider.jaxb.AbstractRootElementProvider.writeTo(AbstractRootElementProvider.java:157)
    … 43 more
    Caused by: javax.xml.stream.XMLStreamException: org.mortbay.jetty.EofException
    at com.sun.jersey.json.impl.writer.Stax2JacksonWriter.writeStartElement(Stax2JacksonWriter.java:204)
    at com.sun.xml.bind.v2.runtime.output.XMLStreamWriterOutput.beginStartTag(XMLStreamWriterOutput.java:118)
    at com.sun.xml.bind.v2.runtime.output.XmlOutputAbstractImpl.beginStartTag(XmlOutputAbstractImpl.java:102)
    at org.mortbay.jetty.AbstractGenerator$Output.blockForOutput(AbstractGenerator.java:551)
    at org.mortbay.jetty.HttpConnection$Output.flush(HttpConnection.java:1012)
    at org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:580)
    at com.sun.jersey.spi.container.ContainerResponse$CommittingOutputStream.write(ContainerResponse.java:134)
    at sun.nio.cs.StreamEncoder.writeBytes(StreamEncoder.java:221)
    at sun.nio.cs.StreamEncoder.implWrite(StreamEncoder.java:282)
    at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:125)
    at java.io.OutputStreamWriter.write(OutputStreamWriter.java:207)
    at org.codehaus.jackson.impl.WriterBasedGenerator._flushBuffer(WriterBasedGenerator.java:1812)
    at org.codehaus.jackson.impl.WriterBasedGenerator._writeString(WriterBasedGenerator.java:987)
    at org.codehaus.jackson.impl.WriterBasedGenerator._writeFieldName(WriterBasedGenerator.java:328)
    at org.codehaus.jackson.impl.WriterBasedGenerator.writeFieldName(WriterBasedGenerator.java:197)
    at com.sun.jersey.json.impl.writer.JacksonStringMergingGenerator.writeFieldName(JacksonStringMergingGenerator.java:140)
    at com.sun.jersey.json.impl.writer.Stax2JacksonWriter.writeStartElement(Stax2JacksonWriter.java:183)
    … 63 more
    Caused by: java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
    at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
    at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)

4. Job submission reports that the queue does not exist

The user has a queue root.etl.streaming, but submitting with queue=root.etl.streaming fails with a queue-not-found error, and queue=etl.streaming fails the same way.
Solution: the queue parameter only takes the leaf queue name, so submitting with queue=streaming succeeds.
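
For example, with Spark on YARN (the job itself is just a placeholder; the point is the --queue value, which is the leaf name only, not root.etl.streaming):

spark-submit --master yarn --deploy-mode cluster \
  --queue streaming \
  --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/examples/jars/spark-examples_*.jar 100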

5. Submitted applications stay in ACCEPTED state

1) The diagnostics of the ACCEPTED application point at the ApplicationMaster resource limit:

yarn.scheduler.capacity.maximum-am-resource-percent defaults to 0.25 and can be increased appropriately.

2) Another case is that the target queue is simply full. For example:
the queue has 34000 in total and 32768 is already in use, so newly submitted applications stay in ACCEPTED.

Increase the queue's capacity.
3) The user has two queues. The default queue is allocated 75% of the cluster with a maximum capacity of 100%, yet applications sit in ACCEPTED once usage reaches 75%.

The cause is that yarn.scheduler.capacity.root.default.user-limit-factor defaults to 1, meaning a single user can use at most the queue's configured capacity (75%). Setting user-limit-factor to 2 allows the same user to keep submitting into the idle capacity; alternatively, submitting with a different user can also use the idle capacity.
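
A minimal capacity-scheduler.xml sketch covering cases 1) and 3); the values are examples to tune per cluster:

<property>
  <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
  <value>0.5</value>   <!-- default 0.25; raise it if AMs are what is blocked in ACCEPTED -->
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.user-limit-factor</name>
  <value>2</value>   <!-- default 1; lets a single user go past the queue's configured capacity, up to maximum-capacity -->
</property>

Then reload the scheduler configuration with yarn rmadmin -refreshQueues (or restart the RMs, as above).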

6. Queue submit ACLs are configured but do not take effect

The ACL must also be set on the root queue: queue ACLs are checked up the hierarchy, and root's acl_submit_applications defaults to * (everyone), which effectively grants access to every child queue.
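
A sketch, assuming yarn.acl.enable=true in yarn-site.xml and a child queue a1 with placeholder user/group names:

<property>
  <name>yarn.scheduler.capacity.root.acl_submit_applications</name>
  <value> </value>   <!-- a single space means "nobody", so root no longer grants everyone access -->
</property>
<property>
  <name>yarn.scheduler.capacity.root.a1.acl_submit_applications</name>
  <value>user1,user2 group1</value>   <!-- format: "users groups" -->
</property>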

7. Adding a decommissioned YARN node back

1. Edit /etc/ecm/hadoop-conf/yarn.exclude and remove the decommissioned node's address
2. On the header node, run yarn rmadmin -refreshNodes
3. Start the NodeManager from the EMR console
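
A sketch of steps 1 and 2 (the exclude-file path is the ECM default mentioned above; the hostname is a placeholder):

## remove the node from the exclude list
sed -i '/emr-worker-3/d' /etc/ecm/hadoop-conf/yarn.exclude
## make the active RM re-read the include/exclude lists
yarn rmadmin -refreshNodes
## verify the node state after starting the NM from the console
yarn node -list -all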

8. curl http://hadoop:8088/cluster returns a permission error

In newer versions hadoop.http.authentication.simple.anonymous.allowed is false, so requests to the YARN web UI must append ?user.name=xxxxxx, otherwise they are rejected.
Alternatively, set hadoop.http.authentication.simple.anonymous.allowed back to true and restart YARN.
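
For example (the user name is a placeholder):

curl "http://hadoop:8088/cluster?user.name=hadoop"

Or re-enable anonymous access in core-site.xml and restart YARN:

<property>
  <name>hadoop.http.authentication.simple.anonymous.allowed</name>
  <value>true</value>
</property>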

9. Both RMs are standby after an EMR cluster restart

Problem: after the user restarted the RMs, both of them stayed in standby state.

First, as the hadoop user, ran yarn rmadmin -transitionToActive rm1 --forcemanual to force rm1 to active, but it did not succeed.
The error pointed at ZooKeeper, yet zkCli.sh worked fine. Public references suggest increasing the ZooKeeper packet limit:
"-Djute.maxbuffer=10000000"
Restarted ZooKeeper node by node, followers first.
Still failing, so the option was also added to YARN's environment: in /var/lib/ecm-agent/cache/ecm/service/YARN/x.x.x.x.x/package/templates, edit yarn-env.sh and set YARN_OPTS="-Djute.maxbuffer=10000000".
Restarted YARN; still failing.

Ran yarn resourcemanager -format-state-store to format the RM state store.

Still failing, so the stuck application's state had to be deleted manually from ZooKeeper:
deleteall /rmstore/ZKRMStateRoot/RMAppRoot/application_1595251161356_11443
After the deletion, the RMs came up successfully.
Summary: an application triggered the ResourceManager's ZooKeeper exception; after fixing ZooKeeper, the restarted RM got stuck on the problematic application application_1595251161356_11443. In this situation the ZK state store can be cleared for a quick recovery, but since formatting it also failed with an exception, the problematic application's znode was removed by hand.
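
A sketch of the workaround, assuming the default state-store path /rmstore and a ZooKeeper 3.5+ CLI (older versions use rmr instead of deleteall); the buffer size is the value used above:

## zkEnv.sh (ZooKeeper side) and yarn-env.sh (RM side)
export JVMFLAGS="-Djute.maxbuffer=10000000"
export YARN_OPTS="-Djute.maxbuffer=10000000"

## remove the stuck application's state from the RM state store
zkCli.sh -server localhost:2181
deleteall /rmstore/ZKRMStateRoot/RMAppRoot/application_1595251161356_11443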

10. Spark on YARN: the vcore count shown for a job does not match the submitted parameters

Change yarn.scheduler.capacity.resource-calculator from org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator to org.apache.hadoop.yarn.util.resource.DominantResourceCalculator. DefaultResourceCalculator only accounts for memory, so every container is displayed with 1 vcore regardless of what was requested.
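
The corresponding capacity-scheduler.xml entry (then restart the RMs, as in the queue configuration above):

<property>
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>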

11. Viewing logs in YARN fails with: User [dr.who] is not authorized to view the logs for container_1623238721588_0301_02_000001 in log file

The cause is that ACLs are enabled, and dr.who (the default web UI user) has no permission to view that application's logs.
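
Two ways around it, assuming simple authentication; the application id comes from the container id in the error, and the user is a placeholder:

## fetch the aggregated logs with the CLI as the user who submitted the job
yarn logs -applicationId application_1623238721588_0301
## or identify yourself to the web UI, as in issue 8
curl "http://hadoop:8088/cluster/app/application_1623238721588_0301?user.name=<job_owner>"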