```bash
# Run on a YARN cluster
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 20G \
  --num-executors 50 \
  /path/to/examples.jar \
  1000
```

(Use `--deploy-mode client` for client mode; a trailing `#` comment on a `\`-continued line would break the command.)


Some of the commonly used options are listed below (a combined example follows the list):


- `--class`: the entry point for your application (e.g. `org.apache.spark.examples.SparkPi`)
- `--master`: the master URL for the cluster (e.g. `spark://23.195.26.187:7077`)
- `--deploy-mode`: whether to launch the driver on a cluster node (`cluster`) or locally as an external client (`client`, the default)
- `--conf`: an arbitrary Spark configuration property as a `key=value` pair
- `application-jar`: path to a bundled jar including your application and all its dependencies. The URL must be globally visible inside your cluster, for instance an `hdfs://path`, or a `file://path` that is present on all nodes.
- `application-arguments`: arguments passed to the `main` method of your main class, if any
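For instance, a submission combining these options might look like the following sketch; the class name, jar path, configuration key, and arguments are placeholders rather than values from the original example:

```bash
# A hypothetical invocation combining the options above
# (class name, jar path, and arguments are placeholders):
./bin/spark-submit \
  --class com.example.MyApp \
  --master spark://23.195.26.187:7077 \
  --deploy-mode client \
  --conf spark.eventLog.enabled=true \
  hdfs:///apps/my-app.jar \
  arg1 arg2
```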



`yarn`

Connect to a YARN cluster in `client` or `cluster` mode depending on the value of `--deploy-mode`. The cluster configuration will be picked up from `HADOOP_CONF_DIR` or `YARN_CONF_DIR`.

`yarn-client`

Equivalent to `yarn` with `--deploy-mode client`, which is preferred over the `yarn-client` alias.

`yarn-cluster`

Equivalent to `yarn` with `--deploy-mode cluster`, which is preferred over the `yarn-cluster` alias.
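As an illustrative sketch (the jar path is a placeholder), the two forms below are equivalent, with the first being the preferred spelling:

```bash
# Preferred form:
./bin/spark-submit --master yarn --deploy-mode client /path/to/app.jar
# Deprecated alias with the same meaning:
./bin/spark-submit --master yarn-client /path/to/app.jar
```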


In general, configuration values explicitly set on a `SparkConf` take the highest precedence, then flags passed to `spark-submit`, then values in the defaults file.

This means you can even omit `--master` from your `spark-submit` invocation, and `--deploy-mode` likewise has a default value; any setting not supplied is read from `conf/spark-defaults.conf`.
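A minimal sketch of this precedence, using hypothetical memory settings:

```bash
# Default picked up only when nothing else sets it (example value):
echo "spark.executor.memory  2g" >> conf/spark-defaults.conf

# A spark-submit flag overrides the defaults file, so 4g wins here;
# a value set on SparkConf inside the application would override both:
./bin/spark-submit --executor-memory 4g --class com.example.MyApp /path/to/app.jar
```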



Advanced Dependency Management

When using `spark-submit`, the application jar, along with any jars included with the `--jars` option, will be automatically transferred to the cluster. Spark supports the following URL schemes to allow different strategies for distributing jars (see the sketch after this list):

- `file:` - absolute paths and `file:/` URIs are served by the driver's HTTP file server, and every executor pulls the file from the driver's HTTP server.
- `hdfs:`, `http:`, `https:`, `ftp:` - these pull down files and JARs from the URI as expected.
- `local:` - a URI starting with `local:/` is expected to exist as a local file on each worker node. This means that no network IO will be incurred, and it works well for large files/JARs that are pushed to each worker, or shared via NFS, GlusterFS, etc.
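As a hedged sketch of mixing these schemes in a single submission (all paths and the class name are placeholders):

```bash
# local: is resolved on each worker node (no network IO); hdfs: is
# fetched from HDFS; the bare application jar path is transferred
# to the cluster automatically.
./bin/spark-submit \
  --master yarn \
  --jars local:/opt/libs/big-lib.jar,hdfs:///libs/shared.jar \
  --class com.example.MyApp \
  /path/to/app.jar
```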

Note that JARs and files are copied to the working directory for each SparkContext on the executor nodes. This can use up a significant amount of space over time and will need to be cleaned up. With YARN, cleanup is handled automatically, and with Spark standalone, automatic cleanup can be configured with the `spark.worker.cleanup.appDataTtl` property.

Users may also include any other dependencies by supplying a comma-delimited list of Maven coordinates with `--packages`. All transitive dependencies will be handled when using this command. Additional repositories (or resolvers in SBT) can be added in a comma-delimited fashion with the flag `--repositories`. These commands can be used with `pyspark`, `spark-shell`, and `spark-submit` to include Spark Packages. For Python, the equivalent `--py-files` option can be used to distribute `.egg`, `.zip` and `.py` libraries to executors.
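For example, a dependency can be pulled in by its Maven coordinates as sketched below; the `spark-csv` coordinate is a published package from this Spark era, while the extra repository URL and class name are placeholders:

```bash
# --packages resolves the coordinate and all transitive dependencies;
# --repositories adds extra resolvers (this URL is hypothetical):
./bin/spark-submit \
  --packages com.databricks:spark-csv_2.10:1.5.0 \
  --repositories https://repo.example.com/maven \
  --class com.example.MyApp \
  /path/to/app.jar
```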

More Information

Once you have deployed your application, the cluster mode overview describes the components involved in distributed execution, and how to monitor and debug applications.