A few days ago, while setting up a Spark cluster and running WordCount on it, I ran into the following problem. The cluster started up fine, but after I submitted a job, the driver got to the point where the TaskScheduler needed to allocate resources for a TaskSet and could not find any usable executor. The driver's tasks never ran, it eventually timed out with an exception, and I had to shut the driver down. When I finally checked the executor logs, I found the executor was blocked at this line in CoarseGrainedExecutorBackend: SparkHadoopUtil.get.runAsSparkUser. Seeing that, I knew a lot of other machinery was involved and that I wouldn't be able to trace it down myself (limited ability on my part). I searched through a number of posts, but none of them solved the problem.



Problem log

root@slave02:/usr/local/ProgramFiles/spark-1.6.2-bin-hadoop2.6# cat work/app-20170408062155-0015/0/stderr 
17/04/08 06:22:20 INFO executor.CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT] 
17/04/08 06:22:24 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 
17/04/08 06:22:28 INFO spark.SecurityManager: Changing view acls to: root 
17/04/08 06:22:28 INFO spark.SecurityManager: Changing modify acls to: root 
17/04/08 06:22:28 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root) 
17/04/08 06:23:06 INFO spark.SecurityManager: Changing view acls to: root 
17/04/08 06:23:06 INFO spark.SecurityManager: Changing modify acls to: root 
17/04/08 06:23:06 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root) 
17/04/08 06:23:24 INFO slf4j.Slf4jLogger: Slf4jLogger started 
17/04/08 06:23:29 INFO Remoting: Starting remoting 
Exception in thread "main" 17/04/08 06:23:46 INFO remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon. 
17/04/08 06:23:47 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports. 
java.lang.reflect.UndeclaredThrowableException 
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1643) 
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:68) 
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:151) 
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:253) 
at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala) 
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10000 milliseconds] 
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) 
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) 
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) 
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) 
at scala.concurrent.Await$.result(package.scala:107) 
at akka.remote.Remoting.start(Remoting.scala:179) 
at akka.remote.RemoteActorRefProvider.init(RemoteActorRefProvider.scala:184) 
at akka.actor.ActorSystemImpl.liftedTree2$1(ActorSystem.scala:620) 
at akka.actor.ActorSystemImpl._start$lzycompute(ActorSystem.scala:617) 
at akka.actor.ActorSystemImpl._start(ActorSystem.scala:617) 
at akka.actor.ActorSystemImpl.start(ActorSystem.scala:634) 
at akka.actor.ActorSystem$.apply(ActorSystem.scala:142) 
at akka.actor.ActorSystem$.apply(ActorSystem.scala:119) 
at org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:121) 
at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:53) 
at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:52) 
at org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:2024) 
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) 
at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:2015) 
at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:55) 
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:266) 
at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:217) 
at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:186) 
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:69) 
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:68) 
at java.security.AccessController.doPrivileged(Native Method) 
at javax.security.auth.Subject.doAs(Subject.java:422) 
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) 
... 4 more
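Reading the stack trace bottom-up: the executor is trying to create its Akka-based ActorSystem inside runAsSparkUser, and Remoting.start gives up after 10 seconds with a TimeoutException. What puzzled me at first is why the top-level exception is UndeclaredThrowableException rather than the TimeoutException itself. As far as I can tell from the Hadoop 2.x source, UserGroupInformation.doAs catches the PrivilegedActionException raised by the underlying JDK doAs, rethrows IOException/RuntimeException/Error/InterruptedException causes directly, and wraps any other checked cause in UndeclaredThrowableException. A minimal JDK-only sketch of that wrapping (the class and method names here are mine, purely for illustration):

```java
import java.security.PrivilegedActionException;
import java.security.PrivilegedExceptionAction;
import java.util.concurrent.TimeoutException;
import javax.security.auth.Subject;

public class DoAsDemo {
    // Run an action under Subject.doAs (the JDK primitive that Hadoop's
    // UserGroupInformation.doAs builds on) and report how a checked,
    // non-IOException failure surfaces to the caller.
    static String causeName() {
        try {
            Subject.doAs(new Subject(), (PrivilegedExceptionAction<Void>) () -> {
                // Stand-in for the real failure: the executor's ActorSystem
                // startup timing out after 10 seconds.
                throw new TimeoutException("Futures timed out after [10000 milliseconds]");
            });
            return "no exception";
        } catch (PrivilegedActionException e) {
            // The original checked exception is preserved as the cause.
            // Hadoop's UserGroupInformation.doAs unwraps this and, for causes
            // that are not IOException/RuntimeException/Error/InterruptedException,
            // rethrows them wrapped in UndeclaredThrowableException -- which is
            // the top-level exception in the executor log above.
            return e.getCause().getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(causeName()); // prints "TimeoutException"
    }
}
```

So the UndeclaredThrowableException is just packaging; the real question is why ActorSystem startup took more than 10 seconds on the worker.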





Later I went and filed a JIRA ticket, got told off for it, and still had no solution, so after that I posted the question on Stack Overflow.

The apache-spark tag over there is not very active, and the question went unanswered too. Then this afternoon, more or less by accident, I updated the environment and the problem went away, though I still have not found the real root cause. [If I ever track it down, I'll come back and fill this in.]
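For anyone hitting the same symptom before finding their own root cause: the immediate trigger in the log is a 10-second startup timeout inside Akka remoting, and I'm not certain any spark.* setting maps directly onto that particular timeout. The Spark 1.6 knobs below are the usual things to widen while experimenting; in my case they did not turn out to be the fix, so treat this spark-defaults.conf fragment as a workaround sketch only:

```
# spark-defaults.conf -- timeout-related settings (Spark 1.6 standalone).
# These widen RPC/heartbeat windows; they are experiments, not a root-cause fix.
spark.network.timeout             300s
spark.akka.timeout                200s
spark.executor.heartbeatInterval  30s
```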

For the full details of the problem, see these two posts:

http://stackoverflow.com/questions/43315420/executorbackend-blocked-at-usergroupinformation-doas 

https://issues.apache.org/jira/browse/SPARK-20266  

References

http://stackoverflow.com/questions/27357273/how-can-i-run-spark-job-programmatically

http://stackoverflow.com/questions/27039954/intermittent-timeout-exception-using-spark

http://apache-spark-user-list.1001560.n3.nabble.com/Submitting-Spark-job-on-Unix-cluster-from-dev-environment-Windows-td16989.html

https://issues.streamsets.com/browse/SDC-4249

https://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Unable-to-create-SparkContext-to-Spark-1-3-Standalone-service-in/td-p/29176