```bash
# Run on a YARN cluster
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 20G \
  --num-executors 50 \
  /path/to/examples.jar \
  1000
```

(Use `--deploy-mode client` for client mode; a trailing `#` comment on a `\`-continued line would break the command.)


Some of the commonly used options are listed below (a combined example follows the list):


- `--class`: the entry point for your application (e.g. `org.apache.spark.examples.SparkPi`)
- `--master`: the master URL for the cluster (e.g. `spark://23.195.26.187:7077`)
- `--deploy-mode`: whether to launch the driver on a cluster node (`cluster`) or locally as an external client (`client`, the default)
- `--conf`: an arbitrary Spark configuration property as a `key=value` pair
- `application-jar`: path to a bundled jar including your application and all its dependencies. The URL must be globally visible inside your cluster, for instance an `hdfs://path`, or a `file://path` that is present on all nodes.
- `application-arguments`: arguments passed to the `main` method of your main class, if any
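For instance, a submission combining these options might look like the following sketch; the class name, jar path, configuration key, and arguments are placeholders rather than values from the original example:

```bash
# A hypothetical invocation combining the options above
# (class name, jar path, and arguments are placeholders):
./bin/spark-submit \
  --class com.example.MyApp \
  --master spark://23.195.26.187:7077 \
  --deploy-mode client \
  --conf spark.eventLog.enabled=true \
  hdfs:///apps/my-app.jar \
  arg1 arg2
```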



`yarn`

Connect to a YARN cluster in `client` or `cluster` mode depending on the value of `--deploy-mode`. The cluster configuration will be picked up from `HADOOP_CONF_DIR` or `YARN_CONF_DIR`.

`yarn-client`

Equivalent to `yarn` with `--deploy-mode client`, which is preferred over the `yarn-client` alias.

`yarn-cluster`

Equivalent to `yarn` with `--deploy-mode cluster`, which is preferred over the `yarn-cluster` alias.
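As an illustrative sketch (the jar path is a placeholder), the two forms below are equivalent, with the first being the preferred spelling:

```bash
# Preferred form:
./bin/spark-submit --master yarn --deploy-mode client /path/to/app.jar
# Deprecated alias with the same meaning:
./bin/spark-submit --master yarn-client /path/to/app.jar
```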


In general, configuration values explicitly set on a `SparkConf` take the highest precedence, then flags passed to `spark-submit`, then values in the defaults file.

This means you can even omit `--master` from your `spark-submit` invocation, and `--deploy-mode` likewise has a default value; any setting not supplied is read from `conf/spark-defaults.conf`.
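A minimal sketch of this precedence, using hypothetical memory settings:

```bash
# Default picked up only when nothing else sets it (example value):
echo "spark.executor.memory  2g" >> conf/spark-defaults.conf

# A spark-submit flag overrides the defaults file, so 4g wins here;
# a value set on SparkConf inside the application would override both:
./bin/spark-submit --executor-memory 4g --class com.example.MyApp /path/to/app.jar
```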



Advanced Dependency Management

When using `spark-submit`, the application jar, along with any jars included with the `--jars` option, will be automatically transferred to the cluster. Spark supports the following URL schemes to allow different strategies for distributing jars (see the sketch after this list):

- `file:` - absolute paths and `file:/` URIs are served by the driver's HTTP file server, and every executor pulls the file from the driver's HTTP server.
- `hdfs:`, `http:`, `https:`, `ftp:` - these pull down files and JARs from the URI as expected.
- `local:` - a URI starting with `local:/` is expected to exist as a local file on each worker node. This means that no network IO will be incurred, and it works well for large files/JARs that are pushed to each worker, or shared via NFS, GlusterFS, etc.
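As a hedged sketch of mixing these schemes in a single submission (all paths and the class name are placeholders):

```bash
# local: is resolved on each worker node (no network IO); hdfs: is
# fetched from HDFS; the bare application jar path is transferred
# to the cluster automatically.
./bin/spark-submit \
  --master yarn \
  --jars local:/opt/libs/big-lib.jar,hdfs:///libs/shared.jar \
  --class com.example.MyApp \
  /path/to/app.jar
```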

Note that JARs and files are copied to the working directory for each SparkContext on the executor nodes. This can use up a significant amount of space over time and will need to be cleaned up. With YARN, cleanup is handled automatically, and with Spark standalone, automatic cleanup can be configured with the `spark.worker.cleanup.appDataTtl` property.

Users may also include any other dependencies by supplying a comma-delimited list of Maven coordinates with `--packages`. All transitive dependencies will be handled when using this command. Additional repositories (or resolvers in SBT) can be added in a comma-delimited fashion with the flag `--repositories`. These commands can be used with `pyspark`, `spark-shell`, and `spark-submit` to include Spark Packages. For Python, the equivalent `--py-files` option can be used to distribute `.egg`, `.zip` and `.py` libraries to executors.
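For example, a dependency can be pulled in by its Maven coordinates as sketched below; the `spark-csv` coordinate is a published package from this Spark era, while the extra repository URL and class name are placeholders:

```bash
# --packages resolves the coordinate and all transitive dependencies;
# --repositories adds extra resolvers (this URL is hypothetical):
./bin/spark-submit \
  --packages com.databricks:spark-csv_2.10:1.5.0 \
  --repositories https://repo.example.com/maven \
  --class com.example.MyApp \
  /path/to/app.jar
```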

More Information

Once you have deployed your application, the cluster mode overview describes the components involved in distributed execution, and how to monitor and debug applications.