A few days ago, while setting up a Spark cluster and running WordCount, I hit the following problem:
The cluster itself started fine, but after I submitted the job, the driver got stuck at the point where the TaskScheduler tries to allocate resources to the TaskSet — no usable executor could be found.
The driver could not make any progress, eventually timed out with an exception, and I had to shut it down.
Checking the executor logs, I found the executor was blocked at this line in CoarseGrainedExecutorBackend: SparkHadoopUtil.get.runAsSparkUser. Seeing that, I knew many other components were likely involved and that I could not realistically trace it all the way down myself [limited ability], so I searched for related posts — without finding a fix.
Problem log:
root@slave02:/usr/local/ProgramFiles/spark-1.6.2-bin-hadoop2.6# cat work/app-20170408062155-0015/0/stderr
17/04/08 06:22:20 INFO executor.CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
17/04/08 06:22:24 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/04/08 06:22:28 INFO spark.SecurityManager: Changing view acls to: root
17/04/08 06:22:28 INFO spark.SecurityManager: Changing modify acls to: root
17/04/08 06:22:28 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
17/04/08 06:23:06 INFO spark.SecurityManager: Changing view acls to: root
17/04/08 06:23:06 INFO spark.SecurityManager: Changing modify acls to: root
17/04/08 06:23:06 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
17/04/08 06:23:24 INFO slf4j.Slf4jLogger: Slf4jLogger started
17/04/08 06:23:29 INFO Remoting: Starting remoting
Exception in thread "main" 17/04/08 06:23:46 INFO remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
17/04/08 06:23:47 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1643)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:68)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:151)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:253)
at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10000 milliseconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at akka.remote.Remoting.start(Remoting.scala:179)
at akka.remote.RemoteActorRefProvider.init(RemoteActorRefProvider.scala:184)
at akka.actor.ActorSystemImpl.liftedTree2$1(ActorSystem.scala:620)
at akka.actor.ActorSystemImpl._start$lzycompute(ActorSystem.scala:617)
at akka.actor.ActorSystemImpl._start(ActorSystem.scala:617)
at akka.actor.ActorSystemImpl.start(ActorSystem.scala:634)
at akka.actor.ActorSystem$.apply(ActorSystem.scala:142)
at akka.actor.ActorSystem$.apply(ActorSystem.scala:119)
at org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:121)
at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:53)
at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:52)
at org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:2024)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:2015)
at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:55)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:266)
at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:217)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:186)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:69)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:68)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
... 4 more
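Reading the trace bottom-up: runAsSparkUser wraps the executor startup in UserGroupInformation.doAs (which is why the checked TimeoutException surfaces as an UndeclaredThrowableException); inside, Spark creates the executor's Akka ActorSystem, and akka.remote.Remoting.start blocks on Await.result until its 10-second remoting startup deadline expires. On standalone clusters, one commonly reported culprit for this particular timeout is slow or inconsistent hostname resolution on the worker, since Akka remoting binds and advertises by host name. A few generic sanity checks (these commands and the host/port are my own illustration, not from the original logs; 7077 is the standalone master's default RPC port — substitute your own):

```shell
# Print the worker's hostname as Akka will see it.
hostname

# Forward lookup should return promptly with the IP the master can reach;
# if it hangs or fails, fix /etc/hosts or set SPARK_LOCAL_IP / SPARK_LOCAL_HOSTNAME.
getent hosts "$(hostname)" || echo "hostname does not resolve -- check /etc/hosts"

# Confirm the standalone master is reachable from the worker on its RPC port.
nc -vz -w 3 master 7077 || echo "master RPC port unreachable -- check network/firewall"
```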
Later I filed a JIRA ticket, got told off, and still had no fix, so I went and asked on Stack Overflow.
The apache-spark tag there is not very active, and again nothing solved it. Then this afternoon I happened to update my environment and the problem went away — but I still have not found the root cause. [If I ever track it down, I will come back and add it here.]
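Since the root cause was never pinned down, the usual first workaround people suggest for these startup timeouts is raising Spark's network/RPC timeouts. Note the 10-second limit in the trace above is Akka's internal remoting startup deadline, which these properties do not directly control, but slower environments often stop hitting it once the surrounding timeouts are relaxed. A sketch for Spark 1.6's conf/spark-defaults.conf (the values are illustrative, not the ones from my cluster):

```properties
# conf/spark-defaults.conf -- example values only
spark.network.timeout    300s
spark.rpc.askTimeout     120s
spark.akka.timeout       120s
```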
For full details of the problem, see these two posts:
http://stackoverflow.com/questions/43315420/executorbackend-blocked-at-usergroupinformation-doas
https://issues.apache.org/jira/browse/SPARK-20266
References:
http://stackoverflow.com/questions/27357273/how-can-i-run-spark-job-programmatically
http://stackoverflow.com/questions/27039954/intermittent-timeout-exception-using-spark
https://issues.streamsets.com/browse/SDC-4249