http://spark.apache.org/docs/latest/quick-start.html
Quick Start
- Security
- Interactive Analysis with the Spark Shell
- Basics
- More on Dataset Operations
- Caching
- Self-Contained Applications
- Where to Go from Here
This tutorial provides a quick introduction to using Spark. We will first introduce the API through Spark’s interactive shell (in Python or Scala), then show how to write applications in Java, Scala, and Python.
To follow along with this guide, first download a packaged release of Spark from the Spark website. Since we won’t be using HDFS, you can download a package for any version of Hadoop.
Note that, before Spark 2.0, the main programming interface of Spark was the Resilient Distributed Dataset (RDD). Since Spark 2.0, RDDs have been replaced by Dataset, which is strongly typed like an RDD but has richer optimizations under the hood. The RDD interface is still supported, and the RDD programming guide gives a more detailed reference. However, we highly recommend switching to Dataset, which has better performance than RDD. See the SQL programming guide for more information about Dataset.
Security
Security in Spark is OFF by default. This could mean you are vulnerable to attack by default. Please see Spark Security before running Spark.
Interactive Analysis with the Spark Shell
Basics
Spark’s shell provides a simple way to learn the API, as well as a powerful tool to analyze data interactively. It is available in either Scala (which runs on the Java VM and is thus a good way to use existing Java libraries) or Python. Start it by running the following in the Spark directory:
./bin/spark-shell
Spark’s primary abstraction is a distributed collection of items called a Dataset. Datasets can be created from Hadoop InputFormats (such as HDFS files) or by transforming other Datasets. Let’s make a new Dataset from the text of the README file in the Spark source directory:
scala> val textFile = spark.read.textFile("README.md")
textFile: org.apache.spark.sql.Dataset[String] = [value: string]
You can get values from a Dataset directly by calling actions, or transform the Dataset to get a new one. For more details, please read the API doc.
scala> textFile.count() // Number of items in this Dataset
res0: Long = 126 // May be different from yours as README.md will change over time, similar to other outputs
scala> textFile.first() // First item in this Dataset
res1: String = # Apache Spark
Now let’s transform this Dataset into a new one. We call filter to return a new Dataset with a subset of the items in the file.
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: org.apache.spark.sql.Dataset[String] = [value: string]
We can chain together transformations and actions:
scala> textFile.filter(line => line.contains("Spark")).count() // How many lines contain "Spark"?
res3: Long = 15
More on Dataset Operations
Dataset actions and transformations can be used for more complex computations. Let’s say we want to find the largest number of words on any line:
scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
res4: Int = 15
This first maps each line to an integer value, creating a new Dataset. reduce is then called on that Dataset to find the largest word count. The arguments to map and reduce are Scala function literals (closures), and can use any language feature or Scala/Java library. For example, we can easily call functions declared elsewhere. We’ll use the Math.max() function to make this code easier to understand:
scala> import java.lang.Math
import java.lang.Math
scala> textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))
res5: Int = 15
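As a minimal sketch of calling a function declared elsewhere, the snippet below passes a named helper to map. The helper name wordsIn, the application name, and the three sample lines are assumptions made for illustration; they stand in for README.md so the result is predictable. It is meant to be pasted into spark-shell, where getOrCreate() should simply reuse the running session.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration; in spark-shell, getOrCreate() reuses the existing one.
val spark = SparkSession.builder.appName("NamedFunctionSketch").master("local[*]").getOrCreate()
import spark.implicits._ // encoders for String and Int, plus toDS()

// Hypothetical helper, declared outside the closure that uses it.
def wordsIn(line: String): Int = line.split(" ").length

// Stand-in data (an assumption) so the sketch does not depend on README.md.
val lines = Seq("Apache Spark", "a fast and general engine", "for big data").toDS()

// map calls the helper for each line; reduce keeps the larger of each pair.
val maxWords: Int = lines.map(line => wordsIn(line)).reduce((a, b) => Math.max(a, b))
println(maxWords)
```

With these three sample lines the word counts are 2, 5, and 3, so the reduce step yields 5.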
One common data flow pattern is MapReduce, as popularized by Hadoop. Spark can implement MapReduce flows easily:
scala> val wordCounts = textFile.flatMap(line => line.split(" ")).groupByKey(identity).count()
wordCounts: org.apache.spark.sql.Dataset[(String, Long)] = [value: string, count(1): bigint]
Here, we call flatMap to transform a Dataset of lines into a Dataset of words, and then combine groupByKey and count to compute the per-word counts in the file as a Dataset of (String, Long) pairs. To collect the word counts in our shell, we can call collect:
scala> wordCounts.collect()
res6: Array[(String, Long)] = Array((means,1), (under,2), (this,3), (Because,1), (Python,2), (agree,1), (cluster.,1), ...)
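The same word-count flow can be tried on a tiny in-memory Dataset, which makes the result easy to check by hand. This is only a sketch: the two sample lines and the application name are assumptions, not part of the README example, and the final toMap is an ordinary Scala collection call added for convenient lookup.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration; in spark-shell, getOrCreate() reuses the existing one.
val spark = SparkSession.builder.appName("WordCountSketch").master("local[*]").getOrCreate()
import spark.implicits._

// Stand-in data (an assumption) instead of README.md.
val lines = Seq("to be or", "not to be").toDS()

// flatMap: lines -> words; groupByKey(identity): group equal words; count: size of each group.
val wordCounts = lines.flatMap(line => line.split(" ")).groupByKey(identity).count()

// collect() brings the (word, count) pairs to the driver as Array[(String, Long)].
val counts: Map[String, Long] = wordCounts.collect().toMap
println(counts)
```

Here "to" and "be" each appear twice, and "or" and "not" once. Note that collect copies every result to the driver, so it suits small outputs like this one.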
Caching
Spark also supports pulling data sets into a cluster-wide in-memory cache. This is very useful when data is accessed repeatedly, such as when querying a small “hot” dataset or when running an iterative algorithm like PageRank. As a simple example, let’s mark our linesWithSpark dataset to be cached:
scala> linesWithSpark.cache()
res7: linesWithSpark.type = [value: string]
scala> linesWithSpark.count()
res8: Long = 15
scala> linesWithSpark.count()
res9: Long = 15
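A small sketch of the cache lifecycle follows, using a three-line in-memory Dataset as a stand-in (the sample data, the application name, and the name sparkLines are assumptions). It relies on Dataset.storageLevel and Dataset.unpersist, which should be available in the Spark 2.x releases this guide targets.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration; in spark-shell, getOrCreate() reuses the existing one.
val spark = SparkSession.builder.appName("CacheSketch").master("local[*]").getOrCreate()
import spark.implicits._

// Stand-in data (an assumption): two of the three lines contain "Spark".
val sparkLines = Seq("Apache Spark", "no match here", "Spark SQL").toDS()
  .filter(line => line.contains("Spark"))

sparkLines.cache()                                // marks the Dataset for caching; nothing runs yet
val first = sparkLines.count()                    // first action computes the result and fills the cache
val second = sparkLines.count()                   // later actions can be served from the cache
val inMemory = sparkLines.storageLevel.useMemory  // whether the current storage level uses memory
sparkLines.unpersist()                            // release the cached data when it is no longer needed
```

On a Dataset this small the two counts take about the same time; the benefit shows up when the cached data is expensive to recompute.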
It may seem silly to use Spark to explore and cache a 100-line text file. The interesting part is that these same functions can be used on very large data sets, even when they are striped across tens or hundreds of nodes. You can also do this interactively by connecting bin/spark-shell to a cluster, as described in the RDD programming guide.
Self-Contained Applications
Suppose we wish to write a self-contained application using the Spark API. We will walk through a simple application in Scala (with sbt), Java (with Maven), and Python (pip).
We’ll create a very simple Spark application in Scala, so simple, in fact, that it’s named SimpleApp.scala:
/* SimpleApp.scala */
import org.apache.spark.sql.SparkSession

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
    val spark = SparkSession.builder.appName("Simple Application").getOrCreate()
    // Cache the file contents because they are scanned twice below
    val logData = spark.read.textFile(logFile).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println(s"Lines with a: $numAs, Lines with b: $numBs")
    spark.stop()
  }
}
Note that applications should define a main() method instead of extending scala.App. Subclasses of scala.App may not work correctly.
This program just counts the number of lines containing ‘a’ and the number containing ‘b’ in the Spark README. Note that you’ll need to replace YOUR_SPARK_HOME with the location where Spark is installed. Unlike the earlier examples with the Spark shell, which initializes its own SparkSession, we initialize a SparkSession as part of the program.
We call SparkSession.builder to obtain a builder, then set the application name, and finally call getOrCreate to get the SparkSession instance.
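The builder accepts further settings between appName and getOrCreate. The sketch below is one possible variant rather than part of SimpleApp: the object name BuilderSketch, the local[2] master, and the choice of spark.ui.enabled as the example setting are all assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession

object BuilderSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("Simple Application")
      .master("local[2]")                   // assumption: run locally with two threads
      .config("spark.ui.enabled", "false")  // assumption: any key/value setting goes here
      .getOrCreate()                        // returns an existing session if one is active

    try {
      // Report the name and master the session actually ended up with.
      println(s"${spark.sparkContext.appName} on ${spark.sparkContext.master}")
    } finally {
      spark.stop() // stop the session even if the body throws
    }
  }
}
```

Hard-coding the master is convenient for local experiments. Settings made in code take precedence over spark-submit flags, so applications launched with spark-submit usually leave master out and let the --master option decide.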
Our application depends on the Spark API, so we’ll also include an sbt configuration file, build.sbt, which declares Spark as a dependency:
name := "Simple Project"
version := "1.0"
scalaVersion := "2.12.8"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.3"
For sbt to work correctly, we’ll need to lay out SimpleApp.scala and build.sbt according to the typical directory structure. Once that is in place, we can create a JAR package containing the application’s code, then use the spark-submit script to run our program.
# Your directory layout should look like this
$ find .
.
./build.sbt
./src
./src/main
./src/main/scala
./src/main/scala/SimpleApp.scala
# Package a jar containing your application
$ sbt package
...
[info] Packaging {..}/{..}/target/scala-2.12/simple-project_2.12-1.0.jar
# Use spark-submit to run your application
$ YOUR_SPARK_HOME/bin/spark-submit \
--class "SimpleApp" \
--master local[4] \
target/scala-2.12/simple-project_2.12-1.0.jar
...
Lines with a: 46, Lines with b: 23
Where to Go from Here
Congratulations on running your first Spark application!
- For an in-depth overview of the API, start with the RDD programming guide and the SQL programming guide, or see the “Programming Guides” menu for other components.
- For running applications on a cluster, head to the deployment overview.
- Finally, Spark includes several samples in the examples directory (Scala, Java, Python, R). You can run them as follows:
# For Scala and Java, use run-example:
./bin/run-example SparkPi
# For Python examples, use spark-submit directly:
./bin/spark-submit examples/src/main/python/pi.py
# For R examples, use spark-submit directly:
./bin/spark-submit examples/src/main/r/dataframe.R