一, Introduction:
1, Overview:
Spark Streaming is similar to Apache Storm and is used to process streaming data. According to the official documentation, Spark Streaming offers high throughput and strong fault tolerance. It supports many input sources, such as Kafka, Flume, Twitter, ZeroMQ, and plain TCP sockets. Once ingested, the data can be processed with Spark's high-level primitives such as map, reduce, join, and window, and the results can be written to many destinations, such as HDFS or a database. Spark Streaming also integrates seamlessly with MLlib (machine learning) and GraphX.
2, Architecture diagram: input - transform - output
3, Discretized streams: data is collected at fixed intervals, and each interval's data forms an RDD
Very much like Spark's core RDD concept, Spark Streaming uses a discretized stream, called a DStream, as its abstraction. A DStream is the sequence of data received over time. Internally, the data received in each time interval is stored as an RDD, and the DStream is the sequence of these RDDs (hence the name "discretized").
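Because a DStream is nothing more than this sequence of RDDs, the per-batch RDD can be reached directly with foreachRDD. The minimal sketch below only counts the records in each batch; the node01:9999 socket source is borrowed from the word-count example later in this document, and the object name and 5-second interval are illustrative assumptions.
package day06_sparkStreaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object Demo00_rddPerBatch {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("rddPerBatch")
    // every 5-second interval of received data becomes one RDD inside the DStream
    val ssc: StreamingContext = new StreamingContext(conf, Seconds(5))
    val lines = ssc.socketTextStream("node01", 9999)
    // foreachRDD hands over the RDD that backs each batch
    lines.foreachRDD { rdd =>
      println(s"this batch is backed by an RDD with ${rdd.count()} lines")
    }
    ssc.start()
    ssc.awaitTermination()
  }
}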
4, Input - transform - output:
A DStream can be created from many input sources, such as Flume, Kafka, or HDFS. The resulting DStream supports two kinds of operations: transformations, which produce a new DStream, and output operations, which write data to an external system. DStreams provide many operations similar to those available on RDDs, plus new time-related operations such as sliding windows.
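To illustrate both kinds of operations, the sketch below chains ordinary transformations, one windowed transformation, and two output operations. The host, port, window and slide durations, and the HDFS output path are illustrative assumptions, not part of the original example.
package day06_sparkStreaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object Demo00_dstreamOps {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("dstreamOps")
    val ssc: StreamingContext = new StreamingContext(conf, Seconds(5))
    val lines = ssc.socketTextStream("node01", 9999)
    // transformations: each call returns a new DStream
    val pairs = lines.flatMap(_.split(" ")).map((_, 1))
    // time-related transformation: counts over a 30-second window, sliding every 10 seconds
    val windowed = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
    // output operations: push the results to the console and to an external system
    windowed.print()
    windowed.saveAsTextFiles("hdfs://node01:8020/out/wc") // hypothetical HDFS path
    ssc.start()
    ssc.awaitTermination()
  }
}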
5, Comparison of Spark Streaming and Storm
二, Abstractions:
1, DStream: the discretized stream
- unlike Storm
- not a continuous flow of water
- but rather one drop of water at a time
2, How to use it:
the batch interval: you choose it yourself (see the sketch at the end of this section)
3, Timeline vs. business line:
4, Spark Streaming overall architecture:
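The batch interval is simply the second argument of the StreamingContext constructor, as the minimal sketch below shows. The 5-second value and the socket source are assumptions carried over from the word-count example that follows; Milliseconds(...) and Minutes(...) are alternative Duration helpers.
package day06_sparkStreaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object Demo00_batchInterval {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("batchInterval")
    // you pick the interval: every 5 seconds the received data is cut into one RDD
    // (Milliseconds(...) or Minutes(...) could be used here instead of Seconds(...))
    val ssc: StreamingContext = new StreamingContext(conf, Seconds(5))
    val lines = ssc.socketTextStream("node01", 9999)
    // prints how many records arrived in each interval
    lines.count().print()
    ssc.start()
    ssc.awaitTermination()
  }
}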
三, Getting-started example:
1, Word count: the idea
Implement word count with Spark Streaming.
2, Install nc:
yum install -y nc.x86_64
3, Plan:
- Use nc to send data to a given port (start it as a listener first, e.g. nc -lk 9999).
- Have Spark Streaming monitor that port.
- Process the data it receives.
4, Log control: put the following log4j.properties on the classpath (for example in src/main/resources) so that only errors reach the console:
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Set everything to be logged to the console
log4j.rootCategory=error, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Set the default spark-shell log level to WARN. When running the spark-shell, the
# log level for this class is used to overwrite the root logger's log level, so that
# the user can have different defaults for the shell and regular Spark apps.
log4j.logger.org.apache.spark.repl.Main=WARN
# Settings to quiet third party logs that are too verbose
log4j.logger.org.spark_project.jetty=WARN
log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR
# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
5, Code:
package day06_sparkStreaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object Demo01_wc {
  def main(args: Array[String]): Unit = {
    // Spark configuration
    val ssWc: SparkConf = new SparkConf().setMaster("local[*]").setAppName("ssWc")
    // StreamingContext with a 5-second batch interval
    val ssc: StreamingContext = new StreamingContext(ssWc, Seconds(5))
    // monitor the port: yields the data line by line (essentially no different from textFile)
    val lineDStream: ReceiverInputDStream[String] = ssc.socketTextStream("node01", 9999)
    // split each line into words
    val wordsDStream: DStream[String] = lineDStream.flatMap(_.split(" "))
    // map each word to a (word, 1) tuple
    val yuanDStream: DStream[(String, Int)] = wordsDStream.map((_, 1))
    // aggregate and print the counts for the current batch
    yuanDStream.reduceByKey(_ + _).print()
    ssc.start()
    ssc.awaitTermination()
  }
}
6, Test:
- Start nc on node01 as a listener (e.g. nc -lk 9999).
- Start the program in IDEA so it monitors the port.
- Type some words into the nc session on node01.
7, Expected result: the data is collected once every five seconds, and each batch is counted and printed on its own.
四, Custom receiver:
1, Received data type: String; default storage level: memory only:
package day06_sparkStreaming

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class CustomerReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY) {
  override def onStart(): Unit = ???
  override def onStop(): Unit = ???
}
2, Full code:
package day06_sparkStreaming

import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import java.nio.charset.StandardCharsets

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class CustomerReceiver(host: String, port: Int) extends Receiver[String](StorageLevel.MEMORY_ONLY) {

  // called when the receiver starts (the core entry point)
  override def onStart(): Unit = {
    // start a thread to receive the data
    new Thread("receiver") {
      override def run(): Unit = {
        // receive data and hand it over to the framework
        receive()
      }
    }.start()
  }

  // called when the receiver stops
  override def onStop(): Unit = {}

  // receive the data
  def receive(): Unit = {
    var socket: Socket = null
    var userInput: String = null
    try {
      // connect to the server
      socket = new Socket(host, port)
      // wrap the input stream in a reader
      val reader = new BufferedReader(new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8))
      // read the first line
      userInput = reader.readLine()
      // keep reading until stopped or the stream ends
      while (!isStopped && userInput != null) {
        // store the line so Spark can process it
        store(userInput)
        // read the next line
        userInput = reader.readLine()
      }
      reader.close()
      socket.close()
      // the connection was closed: try to reconnect
      restart("Trying to connect again")
    } catch {
      case e: java.net.ConnectException =>
        // restart if we could not connect to the server
        restart("Error connecting to " + host + ":" + port, e)
      case t: Throwable =>
        // restart if there is any other error
        restart("Error receiving data", t)
    }
  }
}
3, Using the custom receiver:
package day06_sparkStreaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object Demo01_wc {
  def main(args: Array[String]): Unit = {
    // Spark configuration
    val ssWc: SparkConf = new SparkConf().setMaster("local[*]").setAppName("ssWc")
    // StreamingContext with a 5-second batch interval
    val ssc: StreamingContext = new StreamingContext(ssWc, Seconds(5))
    // monitor the port with the custom receiver instead of socketTextStream
    // val lineDStream: ReceiverInputDStream[String] = ssc.socketTextStream("node01",9999)
    val lineDStream: ReceiverInputDStream[String] = ssc.receiverStream(new CustomerReceiver("node01", 9999))
    // split each line into words
    val wordsDStream: DStream[String] = lineDStream.flatMap(_.split(" "))
    // map each word to a (word, 1) tuple
    val yuanDStream: DStream[(String, Int)] = wordsDStream.map((_, 1))
    // aggregate and print the counts for the current batch
    yuanDStream.reduceByKey(_ + _).print()
    ssc.start()
    ssc.awaitTermination()
  }
}
4, Test:
- Start nc on node01.
- Start the program in IDEA so it monitors the port.
- Send data from node01.
- Result: the word counts show up in IDEA.