The Three Major Components of Flink Programming (2): Sink

A Data Sink is where the data finally lands. In a Flink job the Source supplies the data, the Compute stage in the middle is where Flink does its work, applying a series of transformations, and the computed result is then sunk somewhere. "Sink" does not necessarily mean persisting the data to storage; the official documentation uses the term Connector for the destination, which fits better. Connectors exist for MySQL, Elasticsearch, Kafka, Cassandra, RabbitMQ, and more.
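
To give a feel for what a connector sink looks like, here is a minimal sketch of writing a stream out through the Kafka connector. It assumes the flink-connector-kafka-0.11_2.11 dependency is on the classpath and that a broker is reachable at localhost:9092 with a topic named flink-demo; both are purely illustrative choices, not part of the original examples.

KafkaSinkSketch.scala

package blog.sink

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011

object KafkaSinkSketch {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val words: DataStream[String] = env.fromElements("flink", "kafka", "sink")

    // every record is serialized as a UTF-8 string and sent to the (hypothetical) topic
    words.addSink(new FlinkKafkaProducer011[String]("localhost:9092", "flink-demo", new SimpleStringSchema()))

    env.execute("Flink Kafka Sink Sketch")
  }
}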

Maven Dependencies

<properties>
	<flink.version>1.7.2</flink.version>
</properties>
<dependencies>
	<!-- Flink core APIs -->
	<dependency>
		<groupId>org.apache.flink</groupId>
		<artifactId>flink-java</artifactId>
		<version>${flink.version}</version>
	</dependency>
	<dependency>
		<groupId>org.apache.flink</groupId>
		<artifactId>flink-scala_2.11</artifactId>
		<version>${flink.version}</version>
	</dependency>
	<dependency>
		<groupId>org.apache.flink</groupId>
		<artifactId>flink-streaming-java_2.11</artifactId>
		<version>${flink.version}</version>
	</dependency>
	<dependency>
		<groupId>org.apache.flink</groupId>
		<artifactId>flink-streaming-scala_2.11</artifactId>
		<version>${flink.version}</version>
	</dependency>
</dependencies>

1. Built-in Sinks

A sink is the place where Flink sends the data after its transformations; with the built-in sinks you typically write it to a file or print it out.

Flink ships with the following kinds of built-in sinks:

  • Direct printing

    • print() / printToErr()

      Prints the toString() value of each element to standard output / standard error.

  • File-system based

    • writeAsText() / TextOutputFormat

      Writes elements line by line as strings. The string is obtained by calling toString() on each element.

    • writeAsCsv(…) / CsvOutputFormat

      Writes tuples to comma-separated value (CSV) files. Row and field delimiters are configurable. The value of each field comes from the object's toString() method.

    • write() / FileOutputFormat

      Method and base class for custom file output. Supports custom object-to-byte conversion.

    • output() / OutputFormat

      The most general output method, for sinks that are not file based (for example, storing the result in a database). A sketch of write() and output() follows this list.
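
The runnable examples below exercise print()/printToErr(), writeAsText() and writeAsCsv(), but not write() and output(), so here is a minimal sketch of those two. The local path /tmp/flink-out is an illustrative assumption.

SinkOutputFormatSketch.scala

package blog.sink

import org.apache.flink.api.java.io.TextOutputFormat
import org.apache.flink.api.scala._
import org.apache.flink.core.fs.FileSystem.WriteMode
import org.apache.flink.core.fs.Path

object SinkOutputFormatSketch {

  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val words: DataSet[String] = env.fromElements("flink", "sink", "demo")

    // write(): file-based sink driven by a FileOutputFormat
    words.write(new TextOutputFormat[String](new Path("/tmp/flink-out/write")),
      "/tmp/flink-out/write", WriteMode.OVERWRITE)

    // output(): the most general sink, takes any OutputFormat
    words.output(new TextOutputFormat[String](new Path("/tmp/flink-out/output")))

    // both sinks are lazy, so trigger the job explicitly
    env.execute("write/output sketch")
  }
}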

SinkStandardOutput.scala

package blog.sink

import org.apache.flink.api.scala._

/**
  * @Author Daniel
  * @Description Flink Sink: writing data to standard output
  *
  **/
object SinkStandardOutput {

  def main(args: Array[String]): Unit = {

    val env = ExecutionEnvironment.getExecutionEnvironment
    val stu: DataSet[(Int, String, Double)] = env.fromElements(
      (19, "Wilson", 178.8),
      (17, "Edith", 168.8),
      (18, "Joyce", 174.8),
      (18, "May", 195.8),
      (18, "Gloria", 182.7),
      (21, "Jessie", 184.8)
    )
    println("-------------sink到标准输出--------------------")
    stu.print()

    println("-------------sink到标准error输出--------------------")
    stu.printToErr()

    println("-------------sink到本地Collection--------------------")
    print(stu.collect())
  }
}
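
Note that in the DataSet API, print(), printToErr() and collect() are eager operations: each of them triggers execution of the job on its own, which is why this example never calls env.execute().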

SinkFile.scala

package blog.sink

import org.apache.flink.api.common.operators.Order
import org.apache.flink.api.scala.{DataSet, ExecutionEnvironment, _}
import org.apache.flink.core.fs.FileSystem.WriteMode

/**
  * @Author Daniel
  * @Description Flink Sink: writing data to a file
  *
  **/
object SinkFile {

  def main(args: Array[String]): Unit = {
    // Set the HDFS user name to avoid permission errors
    System.setProperty("HADOOP_USER_NAME", "hadoop")
    val env = ExecutionEnvironment.getExecutionEnvironment
    val stu: DataSet[(Int, String, Double)] = env.fromElements(
      (19, "Wilson", 178.8),
      (17, "Edith", 168.8),
      (18, "Joyce", 174.8),
      (18, "May", 195.8),
      (18, "Gloria", 182.7),
      (21, "Jessie", 184.8)
    )

    println("-------------age从小到大升序排列(0->9)----------")
    stu.sortPartition(0, Order.ASCENDING).print

    println("-------------name从大到小降序排列(z->a)----------")
    stu.sortPartition(1, Order.DESCENDING).print

    println("-------------以age升序,height降序排列----------")
    stu.sortPartition(0, Order.ASCENDING)
      .sortPartition(2, Order.DESCENDING)
      .print

    println("-------------所有字段升序排列----------")
    stu.sortPartition("_", Order.ASCENDING).print

    case class Student(name: String, age: Int)
    val ds1: DataSet[(Student, Double)] = env.fromElements(
      (Student("Wilson", 19), 178.8),
      (Student("Edith", 17), 168.8),
      (Student("Joyce", 18), 174.8),
      (Student("May", 18), 195.8),
      (Student("Gloria", 18), 182.7),
      (Student("Jessie", 21), 184.8)
    )

    // Sort by Student.age in ascending order (the key expression is "_1.age")
    // With parallelism > 1 the output path is treated as a directory; with parallelism = 1 it is treated as a file name
    val ds2 = ds1.sortPartition("_1.age", Order.ASCENDING).setParallelism(1)

    // Write to HDFS as a plain text file
    val output1 = "hdfs://bdedev/flink/Student001.txt"
    // NO_OVERWRITE fails if the file already exists; OVERWRITE replaces it
    ds2.writeAsText(output1, WriteMode.OVERWRITE)
    env.execute()

    // Write to HDFS as a CSV file
    val output2 = "hdfs://bdedev/flink/Student002.csv"
    ds2.writeAsCsv(output2, rowDelimiter = "\n", fieldDelimiter = "|||", writeMode = WriteMode.OVERWRITE)
    env.execute()

  }
}
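
Two things are easy to miss here: the print calls above execute eagerly on their own, and each call to env.execute() runs only the sinks registered since the previous call, so the text file and the CSV file are written by two separate batch jobs.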

2. Custom Sinks

Custom sinks commonly target Apache Kafka, RabbitMQ, MySQL, Elasticsearch, Apache Cassandra, Hadoop FileSystem and the like; in the same way you can also define a sink of your own.
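
Before the in-memory example below, here is a hedged sketch of what a JDBC-backed custom sink could look like. The jdbc:mysql URL, the credentials and the student table are illustrative assumptions, and a MySQL JDBC driver would need to be on the classpath; it follows the same open/invoke/close lifecycle as MySink below.

JdbcSinkSketch.scala

package blog.sink

import java.sql.{Connection, DriverManager, PreparedStatement}

import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.sink.{RichSinkFunction, SinkFunction}

class JdbcSinkSketch extends RichSinkFunction[(String, Double)] {

  private var conn: Connection = _
  private var stmt: PreparedStatement = _

  // open(): runs once per parallel sink instance, set up the connection here
  override def open(parameters: Configuration): Unit = {
    // hypothetical database, credentials and table
    conn = DriverManager.getConnection("jdbc:mysql://localhost:3306/test", "user", "password")
    stmt = conn.prepareStatement("INSERT INTO student(name, height) VALUES (?, ?)")
  }

  // invoke(): runs once per record
  override def invoke(value: (String, Double), context: SinkFunction.Context[_]): Unit = {
    stmt.setString(1, value._1)
    stmt.setDouble(2, value._2)
    stmt.executeUpdate()
  }

  // close(): release external resources when the sink shuts down
  override def close(): Unit = {
    if (stmt != null) stmt.close()
    if (conn != null) conn.close()
  }
}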

MySink.scala

package blog.sink

import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.sink.{RichSinkFunction, SinkFunction}

import scala.collection.mutable

/**
  * @Author Daniel
  * @Description Flink Sink: decide whether to add each element to a collection
  *
  **/
class MySink extends RichSinkFunction[(Boolean, (String, Double))] {
  private var resultSet: mutable.Set[(String, Double)] = _

  // Called once when the sink is initialized
  override def open(parameters: Configuration): Unit = {
    // Initialize the in-memory store
    resultSet = new mutable.HashSet[(String, Double)]
  }

  // Called once for every element
  override def invoke(v: (Boolean, (String, Double)), context: SinkFunction.Context[_]): Unit = {
    // Main logic: keep only records flagged as true
    if (v._1) {
      resultSet.add(v._2)
    }
  }

  // Called once when the sink shuts down
  override def close(): Unit = {
    // Print the collected records
    resultSet.foreach(println)
  }
}
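
Since MySink is a RichSinkFunction, open() and close() run once per parallel sink instance and invoke() once per record that reaches that instance; with a parallelism greater than one, each subtask therefore keeps and prints its own resultSet.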

TestMySink.scala

package blog.sink

import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment, _}

/**
  * @Author Daniel
  * @Description Flink Sink: using a custom sink
  *
  **/
object TestMySink {

  def main(args: Array[String]): Unit = {

    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val dS: DataStream[(Boolean, (String, Double))] = env.fromElements(
      (true, ("Wilson", 178.8)),
      (false, ("Edith", 168.8)),
      (true, ("Joyce", 174.8)),
      (true, ("May", 195.8)),
      (true, ("Gloria", 182.7)),
      (false, ("Jessie", 184.8))
    )
    // Attach the custom sink
    dS.addSink(new MySink())

    env.execute("Flink Custom Sink")
  }
}
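
With the sample data, the four records flagged true (Wilson, Joyce, May and Gloria) end up in the set and are printed when the job finishes and close() runs.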