This article shows how to implement a word count (Word Count) program with Spark and Java.
Creating the project
Create a Maven project with the following pom.xml:
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.github.ralgond</groupId>
    <artifactId>spark-java-api</artifactId>
    <version>0.0.1-SNAPSHOT</version>

    <dependencies>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.12</artifactId>
            <version>3.1.1</version>
            <scope>provided</scope>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.0</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>
Writing the Java class WordCount
Create a package com.github.ralgond.sparkjavaapi, and in it create a class named WordCount with the following content:
package com.github.ralgond.sparkjavaapi;

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class WordCount {
    public static void main(String[] args) {
        String fileName = args[0];

        SparkConf conf = new SparkConf().setAppName("WordCount Application");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Read the input file as an RDD of lines
        JavaRDD<String> lines = sc.textFile(fileName);

        // Split each line on whitespace, producing an RDD of words
        JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split("\\s+")).iterator());

        // Pair each word with an initial count of 1
        JavaPairRDD<String, Integer> pairs = words.mapToPair(word -> new Tuple2<>(word, 1));

        // Sum the counts for each distinct word
        JavaPairRDD<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b);

        System.out.println(counts.collect());

        sc.close();
    }
}
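To see what the flatMap → mapToPair → reduceByKey pipeline computes, the same logic can be sketched in plain Java streams without Spark. This is only an illustration of the per-word counting that Spark distributes across a cluster; the class name and sample lines below are made up for the example:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class LocalWordCount {
    public static void main(String[] args) {
        // Stand-in for the lines an RDD would read from a text file
        List<String> lines = Arrays.asList("to be or not to be", "to be");

        // flatMap: split each line into words;
        // mapToPair + reduceByKey: group identical words and count occurrences
        Map<String, Long> counts = lines.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

        System.out.println(counts);
    }
}
```

The difference in the Spark version is that the lines, the intermediate word pairs, and the final counts all live in partitioned RDDs, so the splitting and summing run in parallel across executors.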
Compiling and running
Build the jar spark-java-api-0.0.1-SNAPSHOT.jar with mvn clean package.
Then, from the Spark installation directory, run the following command:
bin\spark-submit --class com.github.ralgond.sparkjavaapi.WordCount {..}\spark-java-api-0.0.1-SNAPSHOT.jar README.md
The word-count result is then printed to the console.
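For a quick test on a single machine, the job can also be submitted in Spark's local mode. The `--master local[*]` flag is a standard spark-submit option that runs the job on all local cores; the Unix-style jar path below is an assumed placeholder for wherever Maven wrote the artifact:

```shell
# Run the job locally on all cores instead of submitting to a cluster
bin/spark-submit \
  --class com.github.ralgond.sparkjavaapi.WordCount \
  --master "local[*]" \
  /path/to/spark-java-api-0.0.1-SNAPSHOT.jar README.md
```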