Table of Contents

  • WordCount
  • Creating a DStream from a Custom Data Source
  • Creating a DStream from a Kafka Data Source
  • Version Selection
  • Kafka 0-10 Direct Mode
  • Kafka 0-8 Receiver Mode
  • Kafka 0-8 Direct Mode

Project Dependency

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>2.1.1</version>
</dependency>

WordCount

import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}


object demo {
  def main(args: Array[String]): Unit = {

    // Create the configuration object. Note: a Streaming program must not use master "local"
    // (a single thread); it needs at least 2 threads (one to receive data, one to process it)
    val conf: SparkConf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    // Create the Spark Streaming context with a 3s batch interval
    val ssc = new StreamingContext(conf, Seconds(3))
    // Read lines of data from a socket
    val socketDS: ReceiverInputDStream[String] = ssc.socketTextStream(hostname = "hadoop102", port = 9999)
    // Process the data
    val resDS: DStream[(String, Int)] = socketDS.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    // Print the results
    resDS.print()
    // Start the receiver
    ssc.start()
    // Wait for the computation to terminate
    ssc.awaitTermination()

  }
}

Send data with NetCat:

nc -lk 9999

Creating a DStream from a Custom Data Source

To collect data from a custom source, extend the Receiver class and implement the onStart and onStop methods.

Requirement: implement a custom data source that monitors a given port and reads the data sent to it.

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.receiver.Receiver
import org.apache.spark.streaming.{Seconds, StreamingContext}

import java.io.{BufferedReader, InputStreamReader}
import java.net.{ConnectException, Socket}
import java.nio.charset.StandardCharsets


object demo {
  def main(args: Array[String]): Unit = {

    // Create the configuration object. Note: a Streaming program must not use master "local"
    // (a single thread); it needs at least 2 threads
    val conf: SparkConf = new SparkConf().setAppName("CustomReceiver").setMaster("local[*]")
    // Create the Spark Streaming context with a 3s batch interval
    val ssc = new StreamingContext(conf, Seconds(3))

    // Read data from the port through the custom receiver
    val myDS: ReceiverInputDStream[String] = ssc.receiverStream(new MyReceiver("hadoop102", 9999))

    // Process the data
    val resDS: DStream[(String, Int)] = myDS.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    // Print the results
    resDS.print()
    // Start the receiver
    ssc.start()
    // Wait for the computation to terminate
    ssc.awaitTermination()

  }
}

// Receiver[T]: the type parameter T is the type of the data being read
class MyReceiver(host: String, port: Int) extends Receiver[String](StorageLevel.MEMORY_ONLY) {
  private var socket: Socket = _

  // The actual data-receiving logic
  def receive(): Unit = {
    try {
      // Open the connection
      socket = new Socket(host, port)
      // Wrap the socket's byte stream: socket.getInputStream -> InputStreamReader -> BufferedReader
      val reader: BufferedReader = new BufferedReader(new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))
      // Read line by line until the stream ends or the receiver is stopped
      var line: String = reader.readLine()
      while (!isStopped() && line != null) {
        store(line)
        line = reader.readLine()
      }
    } catch {
      case e: ConnectException =>
        restart(s"Error connecting to $host:$port", e)
    } finally {
      onStop()
    }
  }

  override def onStart(): Unit = {
    new Thread("Socket Receiver") {
      setDaemon(true) // run as a daemon thread

      override def run(): Unit = {
        receive()
      }
    }.start()
  }

  override def onStop(): Unit = {
    synchronized {
      if (socket != null) {
        socket.close()
        socket = null
      }
    }
  }
}

Creating a DStream from a Kafka Data Source

Version Selection

ReceiverAPI: a dedicated Executor is needed to receive the data and forward it to the other Executors for computation. The problem: the receiving Executor and the computing Executors may run at different speeds; in particular, if the receiving Executor outpaces the computing Executors, the computing nodes can run out of memory. This approach was offered in early versions and is rarely used nowadays.

DirectAPI: the computing Executors actively consume the data from Kafka and control the consumption rate themselves.
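
For rate control in Direct mode, Spark provides standard configuration switches. The snippet below is a minimal sketch, not part of the original example; the object name and the numeric values are assumptions to be tuned per workload.

import org.apache.spark.SparkConf

object DirectRateControlSketch {
  // Rate-control settings for the Direct API (values are illustrative assumptions)
  val conf: SparkConf = new SparkConf()
    .setAppName("DirectRateControlSketch")
    .setMaster("local[*]")
    // Let Spark adapt the ingestion rate to the actual processing speed
    .set("spark.streaming.backpressure.enabled", "true")
    // Upper bound on records per second read from each Kafka partition in Direct mode
    .set("spark.streaming.kafka.maxRatePerPartition", "1000")
}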

Kafka 0-10 Direct Mode

The Spark version used here is 2.1.1.

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
    <version>2.1.1</version>
</dependency>

import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object demo {
  def main(args: Array[String]): Unit = {
    // 1. Create the SparkConf
    val conf: SparkConf = new SparkConf().setAppName("x").setMaster("local[*]")
    // 2. Create the StreamingContext with a 3s batch interval
    val ssc = new StreamingContext(conf, Seconds(3))
    // 3. Assemble the Kafka consumer parameters
    val kafkaParams: Map[String, Object] = Map[String, Object](
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "hadoop102:9092,hadoop103:9092,hadoop104:9092",
      ConsumerConfig.GROUP_ID_CONFIG -> "groupid_s02",
      ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer",
      ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer"
    )
    // 4. Create the DStream by consuming from Kafka
    val kafkaDStream: InputDStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream(
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Set("topic_t02"), kafkaParams)
    )
    // 5. WordCount on the record values
    kafkaDStream.map(_.value()).flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
    // 6. Start the job
    ssc.start()
    // 7. Wait for the job to terminate
    ssc.awaitTermination()
  }
}
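
With the 0-10 Direct API you can also commit offsets back to Kafka yourself once a batch has been processed, instead of relying on the consumer's auto-commit. The fragment below is a sketch, not part of the original example: it assumes the kafkaDStream created above (with enable.auto.commit left disabled) and would be placed in main before ssc.start().

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

// Sketch: commit offsets back to Kafka after each batch has been handled
kafkaDStream.foreachRDD { rdd =>
  // Capture this batch's offset ranges before any transformation breaks the lineage
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process rdd and write the results out here ...
  // After the output has succeeded, commit the offsets asynchronously
  kafkaDStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}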

Kafka 0-8 Receiver Mode

The Spark version used here is 2.1.1.

  1. A dedicated Executor is needed to receive the data and forward it to the other Executors for computation. The problem: the receiving Executor and the computing Executors may run at different speeds; in particular, if the receiving Executor outpaces the computing Executors, the computing nodes can run out of memory.
  2. Offsets are maintained in ZooKeeper, so if data keeps being produced after the program stops, consumption resumes from where it left off once the program is restarted.

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
    <version>2.1.1</version>
</dependency>
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * @Desc : Connect to the Kafka data source through the Receiver API and read data
 *         (requires the spark-streaming-kafka-0-8 dependency above)
 */
object SparkStreaming04_ReceiverAPI {
  def main(args: Array[String]): Unit = {
    // Create the configuration object
    // Note: a SparkStreaming program needs at least 2 threads, so the master cannot be "local"
    val conf: SparkConf = new SparkConf().setAppName("SparkStreaming04_ReceiverAPI").setMaster("local[*]")
    // Create the SparkStreaming context with a 3s batch interval
    val ssc: StreamingContext = new StreamingContext(conf, Seconds(3))
    // Connect to Kafka and create the DStream
    val kafkaDStream: ReceiverInputDStream[(String, String)] = KafkaUtils.createStream(
      ssc,
      "hadoop201:2181,hadoop202:2181,hadoop203:2181", // ZooKeeper quorum
      "test0105",                                     // consumer group id
      Map("bigdata-0105" -> 2)                        // topic -> number of receiver threads
    )
    // Each Kafka message is a (key, value) pair; we only need the value
    val lineDS: DStream[String] = kafkaDStream.map(_._2)
    // Flatten each line into words
    val flatMapDS: DStream[String] = lineDS.flatMap(_.split(" "))
    // Map each word to (word, 1)
    val mapDS: DStream[(String, Int)] = flatMapDS.map((_, 1))
    // Aggregate by key
    val reduceDS: DStream[(String, Int)] = mapDS.reduceByKey(_ + _)
    // Print the results
    reduceDS.print()
    // Start the job and wait for termination
    ssc.start()
    ssc.awaitTermination()
  }
}

Kafka 0-8 Direct Mode

With this API, the computing Executors actively consume the data from Kafka and control the consumption rate themselves.

Import the dependency

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
    <version>2.1.1</version>
</dependency>

Automatic offset maintenance

1) Offsets are maintained in the checkpoint
2) The way the StreamingContext is obtained changes (recover it from the checkpoint if one exists, otherwise create it)
3) The checkpoint directory accumulates many small files, which makes it very inefficient

import kafka.serializer.StringDecoder
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * @Desc : Connect to the Kafka data source through the Direct API
 *         Drawbacks:
 *         1. Too many small files in the checkpoint directory
 *         2. The checkpoint records timestamps up to the moment the program stops; on restart,
 *            every batch interval between the stop time and the current time is executed once
 */

object SparkStreaming05_DirectAPI_Auto {
  def main(args: Array[String]): Unit = {
    // Obtain the StreamingContext from the checkpoint directory if it exists,
    // otherwise create it with the supplied function
    val ssc: StreamingContext = StreamingContext.getActiveOrCreate("/cp", () => getStreamingContext())
    // Start the job
    ssc.start()
    // Wait for termination
    ssc.awaitTermination()
  }

  def getStreamingContext(): StreamingContext = {
    // Create the configuration object
    // Note: a SparkStreaming program needs at least 2 threads, so the master cannot be "local"
    val conf: SparkConf = new SparkConf().setAppName("x").setMaster("local[*]")
    // Create the SparkStreaming context with a 3s batch interval
    val ssc = new StreamingContext(conf, Seconds(3))
    // Set the checkpoint directory
    ssc.checkpoint("/cp")
    // Kafka parameters
    val kafkaParams: Map[String, String] = Map[String, String](
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "hadoop102:9092,hadoop103:9092,hadoop104:9092",
      ConsumerConfig.GROUP_ID_CONFIG -> "groupid_s01"
    )
    // Create the DStream with the Direct API
    val kafkaDstream: InputDStream[(String, String)] = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc,
      kafkaParams,
      Set("topic_t01")
    )
    // Each Kafka message is a (key, value) pair; we only need the value
    val lineDS: DStream[String] = kafkaDstream.map(_._2)
    // Flatten each line into words
    val flatMapDS: DStream[String] = lineDS.flatMap(_.split(" "))
    // Map each word to (word, 1)
    val mapDS: DStream[(String, Int)] = flatMapDS.map((_, 1))
    // Aggregate by key
    val reduceDS: DStream[(String, Int)] = mapDS.reduceByKey(_ + _)
    // Print the results
    reduceDS.print()
    ssc
  }
}

Manual offset maintenance

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}
import org.apache.spark.streaming.{Seconds, StreamingContext}


object demo {
  def main(args: Array[String]): Unit = {
    // Create the configuration object
    // Note: SparkStreaming needs at least two threads to run
    val conf: SparkConf = new SparkConf().setAppName("x").setMaster("local[*]")
    // Create the SparkStreaming context with a 3s batch interval
    val ssc = new StreamingContext(conf, Seconds(3))
    // Kafka parameters
    val kafkaParams: Map[String, String] = Map[String, String](
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "hadoop102:9092,hadoop103:9092,hadoop104:9092",
      ConsumerConfig.GROUP_ID_CONFIG -> "groupid_g01"
    )
    // The offsets at which to resume consumption
    // In a real project, to guarantee exactly-once semantics, the offsets are usually saved
    // to a transactional store such as MySQL after the data has been processed
    val fromOffsets: Map[TopicAndPartition, Long] = Map[TopicAndPartition, Long](
      TopicAndPartition("topic_t01", 0) -> 1L,
      TopicAndPartition("topic_t01", 1) -> 2L
    )
    // Create the DStream, starting from the specified offsets
    val kafkaDstream: InputDStream[String] = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, String](
      ssc,
      kafkaParams,
      fromOffsets,
      (m: MessageAndMetadata[String, String]) => m.message())
    // Empty array that will hold the offset ranges of the current batch
    var offsetRanges = Array.empty[OffsetRange]

    // Capture and report the offsets consumed in the current batch
    kafkaDstream.transform { rdd =>
      offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd
    }.foreachRDD { rdd =>
      for (o <- offsetRanges) {
        println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
      }
    }
    // Start the job
    ssc.start()
    // Wait for termination
    ssc.awaitTermination()
  }
}
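
The comments above mention persisting offsets to a transactional store such as MySQL. The sketch below shows one possible shape of that idea; the table kafka_offsets(topic, kafka_partition, until_offset), the connection details, and the helper name are hypothetical, not part of the original example.

import java.sql.DriverManager
import org.apache.spark.streaming.kafka.OffsetRange

object OffsetStore {
  // Hypothetical helper: save one batch's offsets to MySQL in a single transaction.
  // Assumed table: kafka_offsets(topic VARCHAR, kafka_partition INT, until_offset BIGINT,
  //                              PRIMARY KEY (topic, kafka_partition))
  def saveOffsets(offsetRanges: Array[OffsetRange]): Unit = {
    // Assumed connection details
    val conn = DriverManager.getConnection("jdbc:mysql://hadoop102:3306/spark", "user", "password")
    try {
      conn.setAutoCommit(false)
      val stmt = conn.prepareStatement(
        "REPLACE INTO kafka_offsets (topic, kafka_partition, until_offset) VALUES (?, ?, ?)")
      for (o <- offsetRanges) {
        stmt.setString(1, o.topic)
        stmt.setInt(2, o.partition)
        // The next run should resume from untilOffset of the current batch
        stmt.setLong(3, o.untilOffset)
        stmt.addBatch()
      }
      stmt.executeBatch()
      // Ideally the processed results are written in the same transaction as the offsets
      conn.commit()
    } finally {
      conn.close()
    }
  }
}

Such a helper would be called inside foreachRDD after the batch output succeeds, and on startup fromOffsets would be rebuilt by querying the same table.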