Spark on YARN: submitting and running jobs remotely from IntelliJ IDEA (running, not debugging)

1. Things to watch out for

1.1 On a cluster built on CentOS, containers may be killed with an "is running beyond virtual memory limits" error

Current usage: xx MB of xxGB physical memory used; xx GB of xx GB virtual memory used.

Fix:

# Add the following property to yarn-site.xml on the NodeManagers, then restart YARN
        <property>
                <name>yarn.nodemanager.vmem-check-enabled</name>
                <value>false</value>
        </property>

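1.2 When IDEA on a Linux host connects to a cluster running in Docker, the host and the containers can ping each other, but the host firewall can still stop the cluster from reaching the host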

19/01/21 16:44:16 INFO Client: Application report for application_1548058747747_0006 (state: ACCEPTED)

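If the program keeps printing this line and never leaves the ACCEPTED state, the cluster cannot connect back to the driver, which in client mode runs inside IDEA on the host. Fix: turn off the firewall on the host.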
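If turning the firewall off completely is not an option, a possible alternative is to pin the ports the cluster connects back to and open only those. The sketch below only collects the relevant Spark properties in a plain Map. spark.driver.port and spark.driver.blockManager.port are real Spark properties, but the port numbers are hypothetical, and this variant was not tested on the setup described in this post.

```scala
object DriverPorts {
  // Hypothetical port numbers; pick any free ports on your host.
  val driverPort = 40000
  val blockManagerPort = 40001

  // Properties that fix the driver-side ports the cluster connects back to.
  val settings: Map[String, String] = Map(
    "spark.driver.port" -> driverPort.toString,
    "spark.driver.blockManager.port" -> blockManagerPort.toString
  )
}
```

These could be applied to a SparkConf with conf.setAll(DriverPorts.settings), after which only the two chosen ports would need to be allowed through the firewall.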

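1.3 The host cannot find the cluster and keeps trying to connect to 0.0.0.0:8032 (this setting is important)

This happens because the resource directory has not been marked as a resources root, so yarn-site.xml is not on the classpath and the client falls back to the default ResourceManager address 0.0.0.0:8032. Fix:
Right-click the resource directory and choose Mark Directory as >> Resources Root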
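A quick way to confirm that the marking took effect is to ask the class loader for the config files before creating the SparkContext. This is a minimal sketch; the object name CheckClusterConfig is made up for illustration, and the file names are the three cluster config files used in this project.

```scala
object CheckClusterConfig {
  // The cluster config files this project keeps in its resource directory.
  val configFiles = Seq("core-site.xml", "hdfs-site.xml", "yarn-site.xml")

  // Returns the files that are NOT visible on the classpath.
  def missing(loader: ClassLoader = getClass.getClassLoader): Seq[String] =
    configFiles.filter(name => loader.getResource(name) == null)

  def main(args: Array[String]): Unit = {
    val notFound = missing()
    if (notFound.isEmpty) println("All cluster config files are on the classpath")
    else println(s"Not on the classpath: ${notFound.mkString(", ")}")
  }
}
```

If yarn-site.xml shows up as missing, the 0.0.0.0:8032 symptom is to be expected.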

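2. Final file layout (the src part)

Create a new project in IDEA and build it with sbt. Any sbt version works; choose Scala 2.11.8. My cluster has no separate Scala installation, so it uses the Scala bundled with spark-2.3.1-bin-hadoop2.7, which is version 2.11.8. The src directory looks like this:

# Right-click resource and choose Mark Directory as >> Resources Root, or set it in Project Structure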
src
├── main
│   ├── resource
│   │   ├── core-site.xml
│   │   ├── hdfs-site.xml
│   │   └── yarn-site.xml
│   └── scala
│       ├── SparkPI.scala
│       └── WordCount.scala
└── test
    └── scala
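Note that sbt's default resource directory is src/main/resources, so the resource directory above is only on the classpath because it is marked by hand in IDEA. As an alternative to the manual marking, the directory could be registered in build.sbt. This is a sketch using the `in Compile` syntax, which works on sbt 0.13 and 1.x; newer sbt versions prefer `Compile / unmanagedResourceDirectories`. It was not part of the original setup.

```scala
// build.sbt: treat src/main/resource as an additional resource directory
unmanagedResourceDirectories in Compile += baseDirectory.value / "src" / "main" / "resource"
```

Renaming the directory to resources would have the same effect without any extra setting.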

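2.1 Example: submitting WordCount

This code alone will not run. Two more things are needed: 1) make the Spark jars available (section 3.1 or 3.2), 2) package the project with sbt (section 3.3).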

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object WordCount {

  def main(args: Array[String]): Unit = {
    System.setProperty("HADOOP_USER_NAME", "root")
    System.setProperty("user.name", "root")

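    val conf = new SparkConf().setAppName("WordCount").setMaster("yarn")
      .set("spark.submit.deployMode", "client") // "deploy-mode" is not a Spark property; this is the key for client mode
      .set("spark.yarn.jars", "hdfs:/user/root/jars/*")  // Spark jars on HDFS; you upload these yourself (see 3.1)
      .setJars(List("/home/lee/IdeaProjects/test/target/scala-2.11/test_2.11-0.1.jar")) // the jar produced by sbt package
      .setIfMissing("spark.driver.host", "192.168.1.9") // set this to your own host IP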

    val sc = new SparkContext(conf)

    val rdd = sc.textFile("hdfs:/input/README.txt")
    val count = rdd.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_+_)
    count.collect().foreach(println)
  }
}
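To see what the flatMap / map / reduceByKey chain computes without a cluster, the same steps can be run on an ordinary Scala collection. This is a sketch only: LocalWordCount is a made-up name, and groupBy plus sum stands in for reduceByKey, which plain collections do not have.

```scala
object LocalWordCount {
  // Same steps as the RDD pipeline, applied to an in-memory Seq of lines.
  def count(lines: Seq[String]): Map[String, Int] =
    lines.flatMap(_.split(" "))   // split every line on single spaces
      .map((_, 1))                // pair each token with a count of 1
      .groupBy(_._1)              // gather the pairs per token
      .map { case (word, pairs) => word -> pairs.map(_._2).sum }

  def main(args: Array[String]): Unit =
    count(Seq("to be  or", "not to be")).foreach(println)
}
```

Because the split is on a single space, two consecutive spaces produce an empty token. That is where the (,18) entry in the output in section 4 comes from.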

2.2 Dependencies

# Add the following to build.sbt
// https://mvnrepository.com/artifact/org.apache.spark/spark-yarn
libraryDependencies += "org.apache.spark" %% "spark-yarn" % "2.3.1"
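For orientation, a complete build.sbt could look roughly like the sketch below. The name and version are not given in this post; they are inferred from the jar name test_2.11-0.1.jar passed to setJars, so treat them as assumptions and adjust them to your project.

```scala
// build.sbt (sketch): name and version inferred from test_2.11-0.1.jar
name := "test"
version := "0.1"
scalaVersion := "2.11.8"

// spark-yarn pulls in spark-core transitively
libraryDependencies += "org.apache.spark" %% "spark-yarn" % "2.3.1"
```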

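3. Steps

3.1 Set up the jars

Note the .set("spark.yarn.jars", "hdfs:/user/root/jars/*") call in the WordCount conf. Because the Spark jars are not added locally, the job uses jars stored on HDFS, and these have to be uploaded from inside the cluster.

# In a Docker environment, you can use the following commands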
docker exec spark-master /opt/module/hadoop/bin/hdfs dfs -mkdir /input
docker exec spark-master /opt/module/hadoop/bin/hdfs dfs -mkdir /user
docker exec spark-master /opt/module/hadoop/bin/hdfs dfs -mkdir /user/root
docker exec spark-master /opt/module/hadoop/bin/hdfs dfs -mkdir /user/root/jars
docker exec spark-master /opt/module/hadoop/bin/hdfs dfs -put /opt/module/spark/jars/* /user/root/jars
docker exec spark-master /opt/module/hadoop/bin/hdfs dfs -put /opt/module/hadoop/README.txt /input
# /opt/module/hadoop/ is your own Hadoop directory
# /opt/module/spark/ is your own Spark directory
# On the cluster itself, if the environment is fully set up, you can simply run
hdfs dfs -mkdir /input
hdfs dfs -mkdir /user
hdfs dfs -mkdir /user/root
hdfs dfs -mkdir /user/root/jars
hdfs dfs -put your_spark_path/jars/* /user/root/jars
hdfs dfs -put /opt/module/hadoop/README.txt /input

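If you would rather not keep the jars under /user/root, you can pick another directory, as long as you change the spark.yarn.jars path in WordCount to match.

3.2 Use local jars instead (choose either this or 3.1)

If you do not want to upload the Spark jars to the cluster, you can copy them into the project instead.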

ls /opt/module/spark
bin  conf  data  examples  jars  kubernetes  LICENSE  licenses  logs  NOTICE  python  R  README.md  RELEASE  sbin  work  yarn

Yes, that is the jars folder under SPARK_HOME. Copy it into the project; your_project/jars should end up containing the following:

activation-1.1.1.jar                         hadoop-yarn-client-2.7.3.jar               metrics-graphite-3.1.5.jar 
......
zstd-jni-1.3.2-2.jar
hadoop-yarn-api-2.7.3.jar                    metrics-core-3.1.5.jar

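Go to File >> Project Structure >> Modules, open the Dependencies tab below the Name field, click the + button at the top right of that tab, choose "1. JARs or directories", and select your_project/jars in the dialog that pops up.

3.3 Package

Open the sbt shell at the bottom of the IDEA window.
Run clean first, then package. If you package the project some other way, update the jar path passed to setJars in the conf accordingly.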
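The path passed to setJars follows sbt's default naming for the packaged jar: target/scala-<Scala binary version>/<name>_<Scala binary version>-<version>.jar. The helper below is a small sketch that rebuilds that path and checks whether the file exists before you submit. JarPath and its parameter names are made up for illustration, and the sketch ignores the name normalization sbt applies to project names containing spaces or upper-case letters.

```scala
import java.nio.file.{Files, Paths}

object JarPath {
  // Default location of the jar produced by `package` for a simple project name.
  def expected(projectDir: String, name: String, version: String, scalaBinary: String): String =
    s"$projectDir/target/scala-$scalaBinary/${name}_$scalaBinary-$version.jar"

  def main(args: Array[String]): Unit = {
    // Values taken from the setJars path used in WordCount.
    val jar = expected("/home/lee/IdeaProjects/test", "test", "0.1", "2.11")
    val status = if (Files.exists(Paths.get(jar))) "found" else "missing, run package first"
    println(s"$jar: $status")
  }
}
```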

4. Run

19/01/21 16:44:41 INFO YarnScheduler: Removed TaskSet 1.0, whose tasks have all completed, from pool 
19/01/21 16:44:41 INFO DAGScheduler: ResultStage 1 (collect at WordCount.scala:20) finished in 0.827 s
19/01/21 16:44:41 INFO DAGScheduler: Job 0 finished: collect at WordCount.scala:20, took 6.945556 s
(under,1)
(this,3)
(distribution,2)
(Technology,1)
(country,1)
(is,1)
(Jetty,1)
(currently,1)
(permitted.,1)
(check,1)
(have,1)
(Security,1)
(U.S.,1)
(with,1)
(BIS,1)
(This,1)
(mortbay.org.,1)
((ECCN),1)
(using,2)
(security,1)
(Department,1)
(export,1)
(reside,1)
(any,1)
(algorithms.,1)
(from,1)
(re-export,2)
(has,1)
(SSL,1)
(Industry,1)
(Administration,1)
(details,1)
(provides,1)
(http://hadoop.apache.org/core/,1)
(country's,1)
(Unrestricted,1)
(740.13),1)
(policies,1)
(country,,1)
(concerning,1)
(uses,1)
(Apache,1)
(possession,,2)
(information,2)
(our,2)
(as,1)
(,18)
(Bureau,1)
(wiki,,1)
(please,2)
(form,1)
(information.,1)
(ENC,1)
(Export,2)
(included,1)
(asymmetric,1)
(Commodity,1)
(Software,2)
(For,1)
(it,1)
(The,4)
(about,1)
(visit,1)
(website,1)
(<http://www.wassenaar.org/>,1)
(performing,1)
(Section,1)
(on,2)
((see,1)
(http://wiki.apache.org/hadoop/,1)
(classified,1)
(following,1)
(in,1)
(object,1)
(cryptographic,3)
(which,2)
(See,1)
(encryption,3)
(Number,1)
(and/or,1)
(software,2)
(for,3)
((BIS),,1)
(makes,1)
(at:,2)
(manner,1)
(Core,1)
(latest,1)
(your,1)
(may,1)
(the,8)
(Exception,1)
(includes,2)
(restrictions,1)
(import,,2)
(project,1)
(you,1)
(use,,2)
(another,1)
(if,1)
(or,2)
(Commerce,,1)
(source,1)
(software.,2)
(laws,,1)
(BEFORE,1)
(Hadoop,,1)
(License,1)
(written,1)
(code,1)
(Regulations,,1)
(software,,2)
(more,2)
(software:,1)
(see,1)
(regulations,1)
(of,5)
(libraries,1)
(by,1)
(exception,1)
(Control,1)
(code.,1)
(eligible,1)
(both,1)
(to,2)
(Foundation,1)
(Government,1)
(functions,1)
(and,6)
(5D002.C.1,,1)
((TSU),1)
(Hadoop,1)
19/01/21 16:44:42 INFO SparkContext: Invoking stop() from shutdown hook
19/01/21 16:44:42 INFO SparkUI: Stopped Spark web UI at http://192.168.1.9:4040
19/01/21 16:44:42 INFO YarnClientSchedulerBackend: Interrupting monitor thread
19/01/21 16:44:42 INFO YarnClientSchedulerBackend: Shutting down all executors
19/01/21 16:44:42 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
19/01/21 16:44:42 INFO SchedulerExtensionServices: Stopping SchedulerExtensionServices
(serviceOption=None,
 services=List(),
 started=false)
19/01/21 16:44:42 INFO YarnClientSchedulerBackend: Stopped
19/01/21 16:44:42 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
19/01/21 16:44:42 INFO MemoryStore: MemoryStore cleared
19/01/21 16:44:42 INFO BlockManager: BlockManager stopped
19/01/21 16:44:42 INFO BlockManagerMaster: BlockManagerMaster stopped
19/01/21 16:44:42 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
19/01/21 16:44:42 INFO SparkContext: Successfully stopped SparkContext
19/01/21 16:44:42 INFO ShutdownHookManager: Shutdown hook called
19/01/21 16:44:42 INFO ShutdownHookManager: Deleting directory /tmp/spark-88c6c289-4d49-4035-96d7-19ba6410ef8a