Table of Contents
- Preface
- I. Linking Dependencies
- II. Common Writing
- a. Main class
- b. Helper class (a wrapper around KafkaProducer)
- III. OOP Style (improved extensibility)
- a. Traits
- b. Implementing Classes & Traits
- c. Executor Class
- d. Test
Preface
This post demonstrates reading data from Kafka, transforming it, and writing it back to Kafka, first in a plain (common) style and then in an OOP style.
I. Linking Dependencies
- pom.xml dependencies
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<maven.compiler.source>1.8</maven.compiler.source>
<maven.compiler.target>1.8</maven.compiler.target>
<spark_version>2.1.0</spark_version>
<kafka_version>2.0.0</kafka_version>
</properties>
<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.11</version>
<scope>test</scope>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>${spark_version}</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>${spark_version}</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>${spark_version}</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.kafka/kafka -->
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka_2.11</artifactId>
<version>${kafka_version}</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.kafka/kafka-clients -->
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
<version>${kafka_version}</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-10 -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
<!-- the streaming-kafka connector version tracks the Spark version -->
<version>${spark_version}</version>
</dependency>
</dependencies>
II. Common Writing
- KafkaUtils.createDirectStream: periodically queries Kafka for the latest offset of each partition of the subscribed topic, then processes the data within that offset range in each batch; Spark reads that range by calling Kafka's simple consumer API (the low-level API).
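As a small complement (not from the original post), the sketch below shows how the offset range covered by each batch can be inspected and, if desired, committed back to Kafka. The object and method names here are made up for illustration; HasOffsetRanges and CanCommitOffsets come from spark-streaming-kafka-0-10, and `stream` is assumed to be a direct stream like the one built in the main class below.
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}
object OffsetSketch {
  // print the offset range each batch covers, then commit it back to Kafka
  def logAndCommit(stream: InputDStream[ConsumerRecord[String, String]]): Unit = {
    stream.foreachRDD { rdd =>
      // one OffsetRange per topic-partition read in this batch
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      offsetRanges.foreach(o => println(s"${o.topic}-${o.partition}: ${o.fromOffset} -> ${o.untilOffset}"))
      // commit only after this batch's output has been handled
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }
  }
}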
a. Main class
- The data is pulled from Kafka into an InputDStream; after processing, it is sent back to Kafka through a KafkaProducer wrapper class that has been broadcast to every Executor;
package cn.wsj.mysparkstreaming
import cn.wsj.mysparkstreaming.services.common.KafkaSink
import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord}
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig}
import org.apache.kafka.common.serialization.{StringDeserializer, StringSerializer}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}
object ReadKafkaTopic {
def main(args: Array[String]): Unit = {
// create the SparkConf
val conf: SparkConf = new SparkConf()
.setAppName(this.getClass.getName)
.setMaster("local[4]")
.set("spark.serializer","org.apache.spark.serializer.KryoSerializer")
.set("spark.streaming.kafka.consumer.poll.ms","10000")
// create the StreamingContext
val ssc: StreamingContext = new StreamingContext(conf, Seconds(1))
ssc.checkpoint("e:/ck")
// Kafka consumer parameters
val kafkaParams = Map(
ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "192.168.237.160:9092",
ConsumerConfig.GROUP_ID_CONFIG -> "attend",
ConsumerConfig.MAX_POLL_RECORDS_CONFIG->"500",
ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
ConsumerConfig.AUTO_OFFSET_RESET_CONFIG -> "earliest"
)
// build the DStream
val ku: InputDStream[ConsumerRecord[String,String]] =
KafkaUtils.createDirectStream(ssc, LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String,String](Set("event_attendees_rows"), kafkaParams))
// Kafka producer parameters
val producerParams = Map(
ProducerConfig.BOOTSTRAP_SERVERS_CONFIG->"192.168.237.160:9092",
ProducerConfig.ACKS_CONFIG->"-1",
ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG->classOf[StringSerializer],
ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG->classOf[StringSerializer]
)
// broadcast a KafkaSink so every Executor can reuse the same producer wrapper
val ks = ssc.sparkContext.broadcast(KafkaSink[String,String](producerParams))
// transform the data and write it back to Kafka
ku.flatMap(line=>{
// if a record ends with delimiters, split(",") drops the trailing empty strings, while split(",", -1) keeps them as "": e.g. "a,b,,".split(",") has length 2 but "a,b,,".split(",", -1) has length 4
val info = line.value().split(",",-1)
val yes = info(1).split(" ").map(x=>(info(0),x,"yes"))
val maybe = info(2).split(" ").map(x=>(info(0),x,"maybe"))
val invited = info(3).split(" ").map(x=>(info(0),x,"invited"))
val no = info(4).split(" ").map(x=>(info(0),x,"no"))
yes++maybe++invited++no
}).foreachRDD(rdd=>{
rdd.foreachPartition(iter=>{
// send through the broadcast KafkaSink: the first argument is the topic, the second is the value
iter.filter(_._2 != "").foreach(msg => {
ks.value.send("event_attendees_ss", msg.productIterator.mkString(","))
})
})
})
ssc.start()
ssc.awaitTermination()
}
}
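To make the flatMap above concrete, here is a tiny standalone sketch (the input record and the SplitDemo name are made up for illustration) that applies the same split-and-tag logic to a single event_attendees_rows line:
object SplitDemo extends App {
  // hypothetical record: event_id,yes,maybe,invited,no (ids inside a field are space-separated)
  val line = "123,111 222,333,,555"
  val info = line.split(",", -1)
  val yes     = info(1).split(" ").map(x => (info(0), x, "yes"))
  val maybe   = info(2).split(" ").map(x => (info(0), x, "maybe"))
  val invited = info(3).split(" ").map(x => (info(0), x, "invited"))
  val no      = info(4).split(" ").map(x => (info(0), x, "no"))
  (yes ++ maybe ++ invited ++ no)
    .filter(_._2 != "")                      // same empty-id filter as in foreachPartition
    .foreach(t => println(t.productIterator.mkString(",")))
  // prints: 123,111,yes  123,222,yes  123,333,maybe  123,555,no
}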
b. Helper class (a wrapper around KafkaProducer)
- Every Executor needs a KafkaProducer to send messages to Kafka; here a producer wrapper is created once on the Driver and broadcast to the Executors. Because the wrapped producer is a lazy val, the actual KafkaProducer is only instantiated on first use, once per Executor JVM;
- KafkaProducer itself cannot be serialized and shipped, so we write a class that extends Serializable and wraps it, making it transferable via the broadcast
package cn.wsj.mysparkstreaming.services.common
import java.util.Properties
import scala.collection.JavaConversions._
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
// extends Serializable so the wrapper can be serialized and broadcast
class KafkaSink[K,V](config:Map[String,Object]) extends Serializable {
lazy val producer = new KafkaProducer[K,V](config)
sys.addShutdownHook(
// make sure the producer flushes any messages still buffered to Kafka before the executor JVM shuts down
producer.close()
)
// send with an explicit key
def send(topic:String,key:K,value:V) ={
producer.send(new ProducerRecord[K,V](topic,key,value))
}
// send the value only (no key)
def send(topic:String,value:V) ={
producer.send(new ProducerRecord[K,V](topic,value))
}
}
object KafkaSink{
def apply[K,V](config: Map[String, Object]): KafkaSink[K,V] = new KafkaSink(config)
def apply[K,V](config:Properties):KafkaSink[K,V] = new KafkaSink(config.toMap)
}
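A quick usage sketch of the companion object (not from the original post; the KafkaSinkUsageSketch name and the literal settings are illustrative): the second apply overload accepts a java.util.Properties, and the keyed send overload can be used when a message key matters.
import java.util.Properties
import cn.wsj.mysparkstreaming.services.common.KafkaSink

object KafkaSinkUsageSketch extends App {
  // configure through Properties and build the sink via the second apply overload
  val props = new Properties()
  props.put("bootstrap.servers", "192.168.237.160:9092")
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  val sink = KafkaSink[String, String](props)
  // keyed send overload (the main class above only uses the value-only overload)
  sink.send("event_attendees_ss", "123", "123,111,yes")
}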
III. OOP Style (improved extensibility)
The main motivation for this style is that, later on, either end of the pipeline can be swapped for a different data source and the read/write flow can still be implemented with little effort.
a. Traits
Three traits are added here, for reading, writing, and transforming respectively (the transform trait is the interface for data processing; as requirements grow, many transform traits can be written and dynamically mixed into an object to supply the corresponding transform method);
- ReadTrait: the parameters are a Properties (or Map) and a TABLENAME (TOPICNAME); the return type is naturally a DStream[T]
package cn.wsj.mysparkstreaming.services
import java.util.Properties
import org.apache.spark.streaming.dstream.DStream
trait ReadTrait[T] {
import scala.collection.JavaConversions._
def reader(prop:Map[String,Object],tableName:String):DStream[T]
def reader(prop:Properties,tableName:String):DStream[T]=reader(prop.toMap[String,Object],tableName)
}
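To illustrate the "swap the data source" point above, a hypothetical alternative reader (SocketReader is not part of the original post) could implement the same trait against a plain socket text stream:
package cn.wsj.mysparkstreaming.services.impl
import cn.wsj.mysparkstreaming.services.ReadTrait
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

// hypothetical alternative source: a socket text stream instead of Kafka
class SocketReader(ssc: StreamingContext) extends ReadTrait[String] {
  // tableName is unused here; the "table" is the socket itself, located via "host"/"port" entries in prop
  override def reader(prop: Map[String, Object], tableName: String): DStream[String] =
    ssc.socketTextStream(prop("host").toString, prop("port").toString.toInt)
}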
- WriteTrait: the parameters are a Properties (or Map), a TABLENAME (TOPICNAME) and the DStream to be written
package cn.wsj.mysparkstreaming.services
import java.util.Properties
import org.apache.spark.streaming.dstream.DStream
trait WriteTrait[T] {
import scala.collection.JavaConversions._
def write(prop:Map[String,Object],tableName:String,ds:DStream[T]):Unit
def write(prop:Properties,tableName:String,ds:DStream[T]):Unit=write(prop.toMap[String,Object],tableName,ds)
}
- TransformTrait: responsible for transforming the data that was read; the parameter is the input stream and the output is again a DStream. The concrete element types depend on the data sources being read and written, so both are expressed as type parameters;
package cn.wsj.mysparkstreaming.services
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
trait TransformTrait[T,V] {
def transform(in:InputDStream[T]):DStream[V]
}
b. Implementing Classes & Traits
- KafkaReader: extends ReadTrait and is responsible for creating the InputDStream
package cn.wsj.mysparkstreaming.services.impl
import cn.wsj.mysparkstreaming.services.ReadTrait
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
class KafkaReader(ssc:StreamingContext) extends ReadTrait[ConsumerRecord[String,String]]{
override def reader(prop: Map[String, Object], tableName: String): InputDStream[ConsumerRecord[String,String]] = {
KafkaUtils.createDirectStream(ssc, LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String](Set(tableName), prop))
}
}
object KafkaReader{
def apply(ssc: StreamingContext): KafkaReader = new KafkaReader(ssc)
}
- KafkaWriter: extends WriteTrait and writes the data to Kafka through the KafkaProducer wrapper class; the wrapper is the same one used in the common writing above;
package cn.wsj.mysparkstreaming.services.impl
import cn.wsj.mysparkstreaming.services.common.KafkaSink
import cn.wsj.mysparkstreaming.services.WriteTrait
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream
class KafkaWriter[T](ssc:StreamingContext) extends WriteTrait[String]{
override def write(prop: Map[String, Object], tableName: String, ds: DStream[String]): Unit = {
val bc = ssc.sparkContext.broadcast(KafkaSink[String,String](prop))
ds.foreachRDD(rdd=>rdd.foreachPartition(iter=>{
iter.foreach(record => bc.value.send(tableName, record))
}))
}
}
object KafkaWriter{
def apply[T](ssc: StreamingContext): KafkaWriter[T] = new KafkaWriter(ssc)
}
- User_Friends_Trait: extends TransformTrait; this is where the data is reshaped according to the business requirements;
package cn.wsj.mysparkstreaming.services.userimpl
import cn.wsj.mysparkstreaming.services.TransformTrait
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
trait User_Friends_Trait extends TransformTrait[ConsumerRecord[String,String],String]{
override def transform(in: InputDStream[ConsumerRecord[String, String]]): DStream[String] = {
// drop records whose value ends with "," (i.e. the friends field is empty)
in.filter(ln => {
val reg = ",$".r
val iter = reg.findAllMatchIn(ln.value())
!iter.hasNext
}).flatMap(line => {
// split "user,friend1 friend2 ..." into one (user, friend) pair per friend
val uf = line.value().split(",")
uf(1).split(" ").map(x => (uf(0), x))
}).map(_.productIterator.mkString(","))
}
}
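The Test class below also imports an Event_Attendees_Trait that is not shown in this post. As a sketch of what it might look like (an assumption, reusing the split-and-tag logic from the common writing section):
package cn.wsj.mysparkstreaming.services.userimpl
import cn.wsj.mysparkstreaming.services.TransformTrait
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
trait Event_Attendees_Trait extends TransformTrait[ConsumerRecord[String, String], String] {
  override def transform(in: InputDStream[ConsumerRecord[String, String]]): DStream[String] = {
    in.flatMap(line => {
      // event,yes,maybe,invited,no -> one (event, friend, status) triple per friend
      val info = line.value().split(",", -1)
      val yes     = info(1).split(" ").map(x => (info(0), x, "yes"))
      val maybe   = info(2).split(" ").map(x => (info(0), x, "maybe"))
      val invited = info(3).split(" ").map(x => (info(0), x, "invited"))
      val no      = info(4).split(" ").map(x => (info(0), x, "no"))
      yes ++ maybe ++ invited ++ no
    }).filter(_._2 != "").map(_.productIterator.mkString(","))
  }
}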
c. Executor Class
- KTKExecutor: ties the traits and classes above together; for a test you only need to instantiate a KTKExecutor (mixing in a transform trait) and call its worker method to complete the whole read-transform-write flow;
package cn.wsj.mysparkstreaming.services
import cn.wsj.mysparkstreaming.services.impl.{KafkaReader, KafkaWriter}
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.InputDStream
class KTKExecutor(readConf:Map[String,Object],writeConf:Map[String,Object]) {
// self-type annotation: whatever instantiates KTKExecutor must also mix in a matching TransformTrait, and "tran" is an alias for this instance typed with that trait
tran:TransformTrait[ConsumerRecord[String,String],String]=>
def worker(intopic:String,outtopic:String) ={
// create the StreamingContext
val conf: SparkConf = new SparkConf()
.setAppName(this.getClass.getName)
.setMaster("local[4]")
.set("spark.serializer","org.apache.spark.serializer.KryoSerializer")
.set("spark.streaming.kafka.consumer.poll.ms","10000")
val ssc: StreamingContext = new StreamingContext(conf, Seconds(1))
// set the checkpoint directory
ssc.checkpoint("e:/ck")
// read from Kafka
val kr = new KafkaReader(ssc).reader(readConf,intopic)
// transform the data with the mixed-in trait
val ds = tran.transform(kr)
// write the result back to the output Kafka topic
KafkaWriter(ssc).write(writeConf,outtopic,ds)
ssc.start()
ssc.awaitTermination()
}
}
d. Test
- Simply set up the parameters, instantiate the executor with the desired transform trait mixed in, and call the worker method, and the job is done!
package cn.wsj.mysparkstreaming
import cn.wsj.mysparkstreaming.services.KTKExecutor
import cn.wsj.mysparkstreaming.services.userimpl.{Event_Attendees_Trait, User_Friends_Trait}
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.clients.producer.ProducerConfig
import org.apache.kafka.common.serialization.{StringDeserializer, StringSerializer}
object Test {
def main(args: Array[String]): Unit = {
val inParams=Map(
ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "192.168.237.160:9092",
ConsumerConfig.GROUP_ID_CONFIG -> "uf",
ConsumerConfig.MAX_POLL_RECORDS_CONFIG->"500",
ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
ConsumerConfig.AUTO_OFFSET_RESET_CONFIG -> "earliest"
)
val outParams=Map(
ProducerConfig.BOOTSTRAP_SERVERS_CONFIG->"192.168.237.160:9092",
ProducerConfig.ACKS_CONFIG->"-1",
ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG->classOf[StringSerializer],
ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG->classOf[StringSerializer]
)
(new KTKExecutor(inParams,outParams) with User_Friends_Trait)
.worker("user_friends","user_friends_row")
}
}
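To run the event_attendees flow from section two through the same executor, only the mixed-in trait and the topic names change. Inside a main method like the one above (reusing the same parameter maps, possibly with a different group.id), the call would look like the sketch below; it assumes the Event_Attendees_Trait sketched earlier, and since worker blocks on awaitTermination, each pipeline is best run as its own job:
(new KTKExecutor(inParams, outParams) with Event_Attendees_Trait)
  .worker("event_attendees_rows", "event_attendees_ss")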
PS: If anything above is wrong or could be written better, please leave your valuable comments or suggestions in the comment section. And if this post helped you, a like would be greatly appreciated!
Original author: wsjslient