- Version and configuration notes
- WordCount example in shell mode
- First Spark experiment (Scala)
- 3.1 Example 1: WordCount results printed to the console
- 3.2 Example 2: WordCount results saved to a file
1. Version and configuration notes
Install the Spark and Hadoop environments yourself; the setup used for this experiment is listed below for reference. The Spark series starts here!
1. Make sure your Spark and Scala versions match.
2. This experiment's environment:
spark version 2.1.2-SNAPSHOT
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131)
3. Spark configuration file:
$SPARK_HOME/conf/spark-env.sh:
export SPARK_HOME=/usr/local/spark-branch-2.1
export HADOOP_HOME=/usr/hadoop-env/hadoop
export JAVA_HOME=/usr/hadoop-env/jdk8
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_LIBRARY_PATH=.:$JAVA_HOME/lib:$JAVA_HOME/jre/lib:$HADOOP_HOME/lib/native
export SPARK_MASTER_IP=master-1a
export SPARK_LOCAL_DIRS=/opt/spark
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_CORES=2
#export SPARK_WORKER_INSTANCES=1
export SPARK_WORKER_MEMORY=4g
export SPARK_LOG_DIR=/var/lib/spark/logs
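To confirm the version pairing on your own installation, both versions can be read straight from the REPL (a minimal sketch; sc is created automatically by spark-shell):
println(s"Spark version: ${sc.version}")                     // e.g. 2.1.2-SNAPSHOT
println(s"Scala version: ${util.Properties.versionString}")  // e.g. version 2.11.8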
2. WordCount example in shell mode
1. Environment: a working Hadoop + Spark setup
2. Source file: HelloSpark.txt (located at /home/whbing/HelloSpark.txt; it must be uploaded to HDFS)
hello spark
hello world
hello whut
3. Upload the file to HDFS
hadoop fs -ls / (list the files under the HDFS root)
hadoop fs -mkdir /whbing (create the target directory on HDFS)
hadoop fs -put /home/whbing/HelloSpark.txt /whbing (upload HelloSpark.txt into the /whbing directory on HDFS)
hadoop fs -text /whbing/HelloSpark.txt (view the file contents)
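If you prefer to verify the upload programmatically rather than with hadoop fs, a minimal sketch using the Hadoop FileSystem API from spark-shell (same path as above):
import org.apache.hadoop.fs.{FileSystem, Path}
val fs = FileSystem.get(sc.hadoopConfiguration)
println(fs.exists(new Path("/whbing/HelloSpark.txt"))) // true once the upload succeeded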
4. First test the wordcount job in spark-shell.
cd $SPARK_HOME
./bin/spark-shell (enters the Scala REPL)
5. Enter the Scala statements:
scala> val file=sc.textFile("hdfs:///whbing/HelloSpark.txt")
file: org.apache.spark.rdd.RDD[String] = hdfs:///whbing/HelloSpark.txt MapPartitionsRDD[1] at textFile at <console>:24
scala> val rdd = file.flatMap(line => line.split(" ")).map(word => (word,1)).reduceByKey(_+_)
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[4] at reduceByKey at <console>:26
scala> rdd.collect()
res0: Array[(String, Int)] = Array((whut,1), (hello,3), (world,1), (spark,1))
scala> rdd.foreach(println)
(spark,1)
(whut,1)
(hello,3)
(world,1)
The shell-mode test passes.
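Note that foreach prints in partition order, so the tuples may appear in any order. If you want a stable ordering, a small variant run in the same session sorts by count first:
rdd.sortBy(_._2, ascending = false).collect().foreach(println)
// (hello,3) comes first, then the words with count 1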
6. Start the cluster
Note: this cluster is already up (it starts automatically).
Start the master: ./sbin/start-master.sh
Start a worker (takes arguments, see below): ./bin/spark-class
Submit a job (takes arguments, see below): ./bin/spark-submit
Starting a worker takes two arguments: the worker class name and the master URL.
The class name is org.apache.spark.deploy.worker.Worker, and the master URL can be read from the master web UI.
Start the worker: ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://master-1a:7077
After starting the master and the worker, jps shows the Master and Worker processes.
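To double-check that spark-shell is actually attached to the cluster rather than running locally, start it with --master spark://master-1a:7077 and inspect the context (a minimal sketch):
println(sc.master)              // spark://master-1a:7077 when attached to the cluster
println(sc.defaultParallelism)  // reflects the worker cores available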
3. First Spark experiment (Scala)
1. (In the IDEA editor) New – Scala – sbt
2. build.sbt:
name := "spark-scale-sbt-test1"
version := "0.1"
scalaVersion := "2.11.8"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.1.2"
)
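The %% operator makes sbt append the Scala binary version to the artifact name, which is why scalaVersion above must match the cluster's Scala 2.11. The line is equivalent to spelling the suffix out by hand (shown only for illustration; keep the %% form in practice):
libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.1.2"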
3. Under the src/main/scala directory, create a new Scala Class
Name:WordCount
Kind:Object
3.1 Example 1: WordCount results printed to the console
WordCount2.scala:
import org.apache.spark.{SparkConf, SparkContext}
/**
 * WordCount in Scala, cluster run mode; the result is printed.
 */
object WordCount2 {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("first scala spark wordcount!")
    // setMaster can be given for local runs; omit it on the cluster,
    // because the master is supplied at submit time
    val sc = new SparkContext(conf)
    /**
     * The input file lives in HDFS (browse it via the master's port 50070).
     * There are three ways to write the path:
     *   1. sc.textFile("/whbing/HelloSpark.txt")
     *   2. with the hdfs scheme: sc.textFile("hdfs:///whbing/HelloSpark.txt")
     *   3. fully qualified: sc.textFile("hdfs://master-1a:9000/whbing/HelloSpark.txt")
     */
    val input = sc.textFile("/whbing/HelloSpark.txt")
    // split each line into words
    val lines = input.flatMap(line => line.split(" "))
    val count = lines.map(word => (word, 1)).reduceByKey(_ + _)
    count.collect().foreach(wordNumberPair => println(wordNumberPair._1 + ":" + wordNumberPair._2))
    sc.stop()
  }
}
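For debugging without the cluster, the master can instead be set in code, as the comment above mentions (a sketch; remove setMaster again before building the jar for spark-submit):
val conf = new SparkConf()
  .setAppName("first scala spark wordcount!")
  .setMaster("local[2]") // run in-process with two worker threads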
4. Package
File
–Project Structure
—-Artifacts
–add JAR
–from modules with dependencies
Choose the corresponding module and MainClass
For the .MF (manifest) directory, choose …src/main/java
5. Build
Build
–Build Artifacts
–Build
On success, a jar file appears under the out folder.
6. Upload the jar to the cluster (note: the jar file can sit in any directory; the input file read by the code must be on HDFS)
7. Submit the job
./bin/spark-submit --class WordCount2 --master spark://master-1a:7077 /home/whbing/spark-scale-sbt-test1.jar
8. Results:
Find the word counts among the many INFO log lines.
With that, the WordCount experiment is complete.
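If the INFO noise makes the output hard to spot, one option is to raise the driver's log level right after creating the context (a sketch using SparkContext.setLogLevel, available in Spark 2.x):
val sc = new SparkContext(conf)
sc.setLogLevel("WARN") // only WARN and above reach the console, so the counts stand out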
Tip: to make this easier to run, put the command into a script (remember to make it executable):
submit.sh
cd $SPARK_HOME
./bin/spark-submit --class WordCount2 --master spark://master-1a:7077 /home/whbing/spark-scale-sbt-test1.jar
Then submit by simply running ./submit.sh
3.2 Example 2: WordCount results saved to a file
WordCount.scala
import org.apache.spark.{SparkConf, SparkContext}
/**
 * WordCount in Scala, cluster run mode; the result is saved to a file.
 */
object WordCount {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("first scala spark wordcount!")
    // setMaster can be given for local runs; omit it on the cluster,
    // because the master is supplied at submit time
    val sc = new SparkContext(conf)
    /**
     * The input file lives in HDFS (browse it via the master's port 50070).
     * There are three ways to write the path:
     *   1. sc.textFile("/whbing/HelloSpark.txt")
     *   2. with the hdfs scheme: sc.textFile("hdfs:///whbing/HelloSpark.txt")
     *   3. fully qualified: sc.textFile("hdfs://master-1a:9000/whbing/HelloSpark.txt")
     */
    val input = sc.textFile("/whbing/HelloSpark.txt")
    // split each line into words
    val lines = input.flatMap(line => line.split(" "))
    val count = lines.map(word => (word, 1)).reduceByKey { case (x, y) => x + y }
    // the output location is also on HDFS
    count.saveAsTextFile("/home/whbing/HelloSparkResult")
    sc.stop()
  }
}
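saveAsTextFile writes one part-NNNNN file per partition under the target directory, and it fails if that directory already exists (delete it with hadoop fs -rm -r before re-running). To check the result quickly from spark-shell (a sketch):
sc.textFile("/home/whbing/HelloSparkResult").collect().foreach(println)
// prints the saved tuples, e.g. (hello,3), one per line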
Script:
cd $SPARK_HOME
#1. Run the Scala WordCount program that prints the result
#./bin/spark-submit --class WordCount2 --master spark://master-1a:7077 /home/whbing/spark-scale-sbt-test1.jar
#2. Run the Scala WordCount program that saves the result
./bin/spark-submit --class WordCount --master spark://master-1a:7077 /home/whbing/spark-scale-sbt-test1.jar
View the output through the HDFS web UI: http://master-1a:50070/
Done!