Credits:

https://www.bilibili.com/video/BV1Xz4y1m7cv?p=41

Requirement:

Consume data in real time from a TCP socket source and run a word count (WordCount) on each batch of data. The flow is shown below:

[Figure: Spark Streaming WordCount flowchart]

Preparation
1. Install the nc command on node01
nc is short for netcat. It was originally a tool for configuring routers; here we use it to send data to a port.

yum install -y nc

Code implementation:

package streaming

import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Author itcast
 * DESC Use Spark Streaming to receive data from node1:9999 and run WordCount on it
 */
object WordCount {
  def main(args: Array[String]): Unit = {
    // TODO 0. Set up the environment
    val conf: SparkConf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc: SparkContext = new SparkContext(conf)
    sc.setLogLevel("WARN")
    // the time interval at which streaming data will be divided into batches
    val ssc: StreamingContext = new StreamingContext(sc, Seconds(5))
    // TODO 1. Load the data
    val lines: ReceiverInputDStream[String] = ssc.socketTextStream("node1", 9999)
    // TODO 2. Process the data
    val resultDS: DStream[(String, Int)] = lines.flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)

    // TODO 3. Output the result
    resultDS.print()
    // TODO 4. Start and await termination
    ssc.start()
    ssc.awaitTermination() // note: a streaming application keeps running after start() until it is stopped
    // TODO 5. Release resources
    ssc.stop(stopSparkContext = true, stopGracefully = true) // graceful shutdown
  }
}
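The transformation chain in step 2 can be illustrated without Spark at all. The sketch below (a pure-Scala stand-in I wrote for illustration; the `wordCount` helper is not part of the original code) applies the same flatMap → map → reduce-by-key steps to one in-memory batch of lines, which is what Spark does to every 5-second batch:

```scala
object BatchWordCountSketch {
  // mirrors: lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
  def wordCount(batch: Seq[String]): Map[String, Int] =
    batch
      .flatMap(_.split(" "))              // split every line into words
      .map((_, 1))                        // pair each word with a count of 1
      .foldLeft(Map.empty[String, Int]) { // plain-collections stand-in for reduceByKey(_ + _)
        case (acc, (w, n)) => acc.updated(w, acc.getOrElse(w, 0) + n)
      }

  def main(args: Array[String]): Unit =
    // counts: hello -> 2, spark -> 1, streaming -> 1
    println(wordCount(Seq("hello spark", "hello streaming")))
}
```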

Enter the following on the virtual machine:

nc -lk 9999

Then type the data to stream; each line (terminated by Enter) is sent as one record.

A batch is computed every 5 seconds; if nothing was sent in that window, the batch is empty.
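The batching behavior above can be simulated in plain Scala. In this sketch (the object and helper names are mine, for illustration only), three successive batches are counted independently, including an empty one; without stateful operators such as `updateStateByKey`, no counts carry over from one batch to the next:

```scala
object BatchSimulation {
  // per-batch word count, same logic as the DStream pipeline
  def wordCount(batch: Seq[String]): Map[String, Int] =
    batch.flatMap(_.split(" ")).groupBy(identity).map { case (w, ws) => w -> ws.size }

  def main(args: Array[String]): Unit = {
    val batches = Seq(
      Seq("hello spark"),   // batch 1: one line typed into nc
      Seq.empty[String],    // batch 2: nothing sent -> empty result, print() shows no records
      Seq("hello", "hello") // batch 3: two lines; counted from scratch, not added to batch 1
    )
    batches.foreach(b => println(wordCount(b)))
  }
}
```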

[Figure: WordCount output screenshot]