Data Sinks
Data sinks consume DataStreams and forward them to files, sockets, external systems, or print them. Flink comes with a variety of built-in output formats that are encapsulated behind operations on DataStreams:
- writeAsText() / TextOutputFormat - writes elements line-wise as Strings. The Strings are obtained by calling the toString() method of each element.
- writeAsCsv(…) / CsvOutputFormat - writes tuples as comma-separated value files. Row and field delimiters are configurable. The value for each field comes from the toString() method of the objects.
- print() / printToErr() - prints the toString() value of each element on the standard out / standard error stream. Optionally, a prefix (msg) can be provided which is prepended to the output; this helps distinguish between different calls to print(). If the parallelism is greater than 1, the output is also prepended with the identifier of the task that produced it.
- writeUsingOutputFormat() / FileOutputFormat - method and base class for custom file outputs. Supports custom object-to-bytes conversion.
- writeToSocket - writes elements to a socket according to a SerializationSchema (a minimal sketch follows the notes below).
- addSink - invokes a custom sink function. Flink comes bundled with connectors to other systems (such as Apache Kafka) that are implemented as sink functions.
Note that the write*() methods on DataStream are mainly intended for debugging purposes. They do not participate in Flink's checkpointing, which means these functions usually have at-least-once semantics. How data is flushed to the target system depends on the implementation of the OutputFormat, so not all elements sent to the OutputFormat show up immediately in the target system, and in case of failure those records may be lost.
For reliable, exactly-once delivery of a stream into a file system, use flink-connector-filesystem. Custom implementations passed to .addSink(…) can also participate in Flink's checkpointing for exactly-once semantics.
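The file, print, Redis, and Kafka sinks are all demonstrated below; writeToSocket is not, so here is a minimal sketch of it (the host, ports, and the uppercase transformation are placeholder assumptions; any TCP listener, e.g. `nc -lk 9998`, can receive the output):
package com.baizhi.jsy.sink
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
object FlinkWriteToSocketSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val text = env.socketTextStream("Centos", 9999)
    // SimpleStringSchema is a SerializationSchema[String], so the stream must carry Strings
    text.map(_.toUpperCase)
      .writeToSocket("Centos", 9998, new SimpleStringSchema())
    env.execute("WriteToSocket Sketch")
  }
}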
package com.baizhi.jsy.sink
import org.apache.flink.api.java.io.TextOutputFormat
import org.apache.flink.core.fs.Path
import org.apache.flink.streaming.api.scala._
object FlinkWordCountFileSink {
  def main(args: Array[String]): Unit = {
    //1. Create the stream execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    //2. Create the DataStream
    val text = env.socketTextStream("Centos", 9999)
    //3. Apply the DataStream transformation operators
    val counts = text.flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .keyBy(0)
      .sum(1)
    //4. Write the result to a local file
    counts.writeUsingOutputFormat(new TextOutputFormat[(String, Int)](new Path("file:///D:/桌面文件/flink-result")))
    //5. Execute the streaming job
    env.execute("Window Stream WordCount")
  }
}
The result file is created automatically on the desktop.
Note: if you switch the output path to HDFS, you need to generate a fairly large amount of data before you can see any test results, because writes to HDFS go through a fairly large buffer. Also, the write* file sinks above cannot participate in Flink's checkpointing; in production, a checkpoint-aware sink is normally used to write to external systems instead, i.e. the StreamingFileSink shown below or the legacy BucketingSink from flink-connector-filesystem.
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-filesystem_2.11</artifactId>
<version>1.10.0</version>
</dependency>
**Automatically creates the output directory and stores the part files**
package com.baizhi.jsy.sink
import org.apache.flink.api.common.serialization.SimpleStringEncoder
import org.apache.flink.core.fs.Path
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink
import org.apache.flink.streaming.api.scala._
object FlinkWordCountBucketingSink {
  def main(args: Array[String]): Unit = {
    //1. Create the stream execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(4) // each parallel subtask writes its own part file per bucket
    //2. Create the DataStream
    val text = env.readTextFile("hdfs://Centos:9000/demo/words")
    val bucketingSink = StreamingFileSink.forRowFormat(new Path("hdfs://Centos:9000/bucketing-result"),
      new SimpleStringEncoder[(String, Int)]("UTF-8")).build()
    //3. Apply the DataStream transformation operators
    val counts = text.flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .keyBy(0)
      .sum(1)
    //4. Write the result to the sink
    counts.addSink(bucketingSink)
    //5. Execute the streaming job
    env.execute("Window Stream WordCount")
  }
}
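Note that StreamingFileSink finalizes part files only when a checkpoint completes; without checkpointing the files stay in the in-progress/pending state. A minimal sketch of enabling it (the 5-second interval is only illustrative):
// Enable checkpointing so the StreamingFileSink can commit finished part files
env.enableCheckpointing(5000)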
Bucketing by date (newer API):
package com.baizhi.jsy.sink
import org.apache.flink.api.common.serialization.SimpleStringEncoder
import org.apache.flink.core.fs.Path
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.DateTimeBucketAssigner
import org.apache.flink.streaming.api.scala._
object FlinkWordCountBucketingSink {
  def main(args: Array[String]): Unit = {
    //1. Create the stream execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(4)
    //2. Create the DataStream
    val text = env.readTextFile("hdfs://Centos:9000/demo/words")
    val bucketingSink = StreamingFileSink.forRowFormat(new Path("hdfs://Centos:9000/bucketing-result"),
      new SimpleStringEncoder[(String, Int)]("UTF-8"))
      .withBucketAssigner(new DateTimeBucketAssigner[(String, Int)]("yyyy-MM-dd")) // derive the bucket (sub-directory) dynamically from the date
      .build()
    //3. Apply the DataStream transformation operators
    val counts = text.flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .keyBy(0)
      .sum(1)
    //4. Write the result to the sink
    counts.addSink(bucketingSink)
    //5. Execute the streaming job
    env.execute("Window Stream WordCount")
  }
}
Legacy API (BucketingSink):
package com.baizhi.jsy.sink
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.fs.bucketing.{BucketingSink, DateTimeBucketer}
object FlinkWordCountBucketingSink1 {
  def main(args: Array[String]): Unit = {
    //1. Create the stream execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(4)
    //2. Create the DataStream
    val text = env.readTextFile("hdfs://Centos:9000/demo/words")
    val bucketingSink = new BucketingSink[(String, Int)]("hdfs://Centos:9000/bucket-result")
    bucketingSink.setBucketer(new DateTimeBucketer[(String, Int)]("yyyy-MM-dd"))
    bucketingSink.setBatchSize(1024)
    //3. Apply the DataStream transformation operators
    val counts = text.flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .keyBy(0)
      .sum(1)
    //4. Write the result to the sink
    counts.addSink(bucketingSink)
    //5. Execute the streaming job
    env.execute("Window Stream WordCount")
  }
}
print() / printToErr()
Prints the toString() value of each element on the standard out / standard error stream. Optionally, a prefix (msg) can be provided which is prepended to the output. This can help to distinguish between different calls to print. If the parallelism is greater than 1, the output will also be prepended with the identifier of the task which produced the output.
package com.baizhi.jsy.sink
import org.apache.flink.streaming.api.scala._
object FlinkWordCountPrint {
  def main(args: Array[String]): Unit = {
    //1. Create the stream execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    //2. Create the DataStream
    val text = env.readTextFile("hdfs://Centos:9000/demo/words")
    //3. Apply the DataStream transformation operators
    val counts = text.flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .keyBy(0)
      .sum(1)
    //4. Print the result to standard out
    counts.print()
    //5. Execute the streaming job
    env.execute("Window Stream WordCount")
  }
}
Setting the parallelism and prefix of the print sink:
package com.baizhi.jsy.sink
import org.apache.flink.streaming.api.scala._
object FlinkWordCountPrint {
  def main(args: Array[String]): Unit = {
    //1. Create the stream execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    //2. Create the DataStream
    val text = env.readTextFile("hdfs://Centos:9000/demo/words")
    //3. Apply the DataStream transformation operators
    val counts = text.flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .keyBy(0)
      .sum(1)
    //4. Print the result with the prefix "测试" using 4 parallel print tasks
    counts.print("测试").setParallelism(4)
    //5. Execute the streaming job
    env.execute("Window Stream WordCount")
  }
}
RedisSink
Reference: https://bahir.apache.org/docs/flink/current/flink-streaming-redis/
<dependency>
<groupId>org.apache.bahir</groupId>
<artifactId>flink-connector-redis_2.11</artifactId>
<version>1.0</version>
</dependency>
Start Redis:
[root@localhost redis-4.0.10]# ./src/redis-server
[root@localhost ~]# ps -aux|grep redis
[root@localhost redis-4.0.10]# ./src/redis-cli -h 192.168.17.19 -p 6379
package com.baizhi.jsy.sink
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.redis.RedisSink
import org.apache.flink.streaming.connectors.redis.common.config.FlinkJedisPoolConfig
object FlinkWordCountRedisSink {
  def main(args: Array[String]): Unit = {
    //1. Create the stream execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    //2. Create the DataStream
    val text = env.readTextFile("hdfs://Centos:9000/demo/words")
    val flinkJedisConf = new FlinkJedisPoolConfig.Builder().setHost("192.168.17.19").setPort(6379).build()
    //3. Apply the DataStream transformation operators
    val counts = text.flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .keyBy(0)
      .sum(1)
    //4. Write the result to Redis
    counts.addSink(new RedisSink(flinkJedisConf, new UserDefinedRedisMapper()))
    //5. Execute the streaming job
    env.execute("Window Stream WordCount")
  }
}
package com.baizhi.jsy.sink
import org.apache.flink.streaming.connectors.redis.common.mapper.{RedisCommand, RedisCommandDescription, RedisMapper}
class UserDefinedRedisMapper extends RedisMapper[(String, Int)] {
  // Store the results with HSET in a Redis hash named "WordCount"
  override def getCommandDescription: RedisCommandDescription = {
    new RedisCommandDescription(RedisCommand.HSET, "WordCount")
  }
  // The word becomes the hash field
  override def getKeyFromData(t: (String, Int)): String = {
    t._1
  }
  // The count becomes the hash value (as a String)
  override def getValueFromData(t: (String, Int)): String = {
    t._2 + ""
  }
}
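To verify the result, the hash can be inspected from redis-cli (assuming the host and port configured above):
[root@localhost redis-4.0.10]# ./src/redis-cli -h 192.168.17.19 -p 6379
192.168.17.19:6379> HGETALL WordCount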
Writing data to Kafka
Start Kafka, then run a console consumer to observe the results:
[root@Centos kafka_2.11-2.2.0]# ./bin/kafka-console-consumer.sh --bootstrap-server Centos:9092 --topic topic01 --group g1 --property print.key=true --property print.value=true --property key.separator=,
New approach (KafkaSerializationSchema):
package com.baizhi.jsy.sink
import java.util.Properties
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.Semantic
import org.apache.kafka.clients.producer.ProducerConfig
object FlinkWordCountKafkaSink {
  def main(args: Array[String]): Unit = {
    //1. Create the stream execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(4)
    //2. Create the DataStream
    val text = env.readTextFile("hdfs://Centos:9000/demo/words")
    val props = new Properties()
    props.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "Centos:9092")
    props.setProperty(ProducerConfig.BATCH_SIZE_CONFIG, "100")
    props.setProperty(ProducerConfig.LINGER_MS_CONFIG, "500")
    //Semantic.EXACTLY_ONCE: transactional/idempotent writes to Kafka (also requires checkpointing)
    //Semantic.AT_LEAST_ONCE: enables the Kafka producer retry mechanism
    val kafkaSink = new FlinkKafkaProducer[(String, Int)]("defult_topic", new UserDefinedKafkaSerializationSchema, props, Semantic.AT_LEAST_ONCE)
    //3. Apply the DataStream transformation operators
    val counts = text.flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .keyBy(0)
      .sum(1)
    //4. Write the result to Kafka
    counts.addSink(kafkaSink)
    //5. Execute the streaming job
    env.execute("Window Stream WordCount")
  }
}
package com.baizhi.jsy.sink
import java.lang
import org.apache.flink.streaming.connectors.kafka.KafkaSerializationSchema
import org.apache.kafka.clients.producer.ProducerRecord
class UserDefinedKafkaSerializationSchema extends KafkaSerializationSchema[(String, Int)] {
  override def serialize(t: (String, Int), aLong: lang.Long): ProducerRecord[Array[Byte], Array[Byte]] = {
    // Every record carries its own topic, so the data goes to topic01 rather than the default topic passed to FlinkKafkaProducer
    new ProducerRecord("topic01", t._1.getBytes(), t._2.toString.getBytes())
  }
}
Legacy approach (KeyedSerializationSchema):
Records go to the defult_topic topic (the default topic), because getTargetTopic below returns null.
package com.baizhi.jsy.sink
import java.util.Properties
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.Semantic
import org.apache.kafka.clients.producer.ProducerConfig
object FlinkWordCountKafkaSink {
  def main(args: Array[String]): Unit = {
    //1. Create the stream execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(4)
    //2. Create the DataStream
    val text = env.readTextFile("hdfs://Centos:9000/demo/words")
    val props = new Properties()
    props.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "Centos:9092")
    props.setProperty(ProducerConfig.BATCH_SIZE_CONFIG, "100")
    props.setProperty(ProducerConfig.LINGER_MS_CONFIG, "500")
    //Semantic.EXACTLY_ONCE: transactional/idempotent writes to Kafka (also requires checkpointing)
    //Semantic.AT_LEAST_ONCE: enables the Kafka producer retry mechanism
    val kafkaSink = new FlinkKafkaProducer[(String, Int)]("defult_topic", new UserDefinedKeyedSerializationSchema, props, Semantic.AT_LEAST_ONCE)
    //3. Apply the DataStream transformation operators
    val counts = text.flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .keyBy(0)
      .sum(1)
    //4. Write the result to Kafka
    counts.addSink(kafkaSink)
    //5. Execute the streaming job
    env.execute("Window Stream WordCount")
  }
}
package com.baizhi.jsy.sink
import org.apache.flink.streaming.util.serialization.KeyedSerializationSchema
class UserDefinedKeyedSerializationSchema extends KeyedSerializationSchema[(String, Int)] {
  override def serializeKey(element: (String, Int)): Array[Byte] = {
    element._1.getBytes()
  }
  override def serializeValue(element: (String, Int)): Array[Byte] = {
    element._2.toString.getBytes()
  }
  // May override the target topic per record; returning null writes the data to the default topic
  override def getTargetTopic(element: (String, Int)): String = {
    null
  }
}