FLINK
1 关于Flink介绍
1.1 flink的模块分布
Hadoop-MR(map/reduce) -> Tez(DAG,批式处理) -> Spark(DAG,批式,Spark Streaming:micro batch) -> Flink(batch、streaming)
(图:001.png)
1.2 flink的底层结构
(图:002.png)
1.3 常见的组件
1.3.1 JobManager
JobManager是Flink系统的协调者,负责收集所有job的状态信息,并管理集群中的所有从节点(TaskManager);同时还负责task的调度、checkpoint以及故障恢复。JobManager包含了以下3个重要组件:
- Actor system:基于Akka的通信系统,负责JobManager与TaskManager等组件之间的消息通信与调度等服务。
- Scheduler:在Flink中executor被称为task slot,每个taskmanager都需要有一个或者多个task slot。Flink在内部决定哪些task共享slot,并由scheduler负责task的调度。
- Checkpoint coordinator:负责容错,周期性地触发checkpoint并协调各算子的状态快照。
1.3.2 TaskManager
taskmanager就类似于spark中的worker,在jvm中以一个或者多个线程来执行task。task执行的并行度由taskmanager的task slot的数量决定。
(图:003.png)
1.3.3 Client
当用户向flink提交一个应用时,会先创建一个客户端(client)。client会对用户提交的flink程序进行预处理,因此客户端需要设置一些参数,比如程序要提交到的jobmanager的地址。换言之,client会将用户提交的flink程序组装成一个job并提交给jobmanager。
1.4 flink的流式处理和批式处理
flink在内部以缓存块(buffer)为单位进行网络数据传输,用户可以自己配置缓存块的超时时间。如果将缓存块的超时时间设置为0,flink的数据传输方式就是“全实时”,可以获得最低的处理延迟;如果将超时时间设置为无限大,flink在处理数据上就类似于批式处理,用延迟换取吞吐。
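下面给出一个示意性的小例子(并非权威实现),基于Scala DataStream API演示如何通过setBufferTimeout设置上面所说的缓存块超时时间;其中的数据和算子仅为演示假设:
import org.apache.flink.streaming.api.scala._
object BufferTimeoutDemo {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // 缓存块超时时间设置为0:每条记录尽快发送,延迟最低,但吞吐会下降
    env.setBufferTimeout(0)
    // 设置为较大的值(如100ms)则攒批发送,在延迟和吞吐之间折中
    env.fromElements("hello flink", "hello stream")
      .flatMap(_.split("\\s+"))
      .map((_, 1))
      .print()
    env.execute("buffer timeout demo")
  }
}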
2 安装flink(****)
2.1 local
- jdk1.8
- hadoop-2.7.6/2.8.1
##1. 安装
[root@chancechance software]# tar -zxvf flink-1.6.1-bin-hadoop27-scala_2.11.tgz -C /opt/apps/
[root@chancechance flink-1.6.1]# vi /etc/profile
export FLINK_HOME=/opt/apps/flink-1.6.1
export PATH=$PATH:$FLINK_HOME/bin
##2. 启动
[root@chancechance flink-1.6.1]# start-cluster.sh
[root@chancechance flink-1.6.1]# stop-cluster.sh
##3. 测试
ip:8081
2.2 standalone
3台虚拟机
qphone01:jobmanager
qphone02/qphone03:taskmanager
tip:我这里搭建的是伪分布式
##1. flink-conf.yaml
#==============================================================================
# Common
#==============================================================================
#jobmanager的ip
jobmanager.rpc.address: 10.206.0.4
#jobmanager的port
jobmanager.rpc.port: 6123
#jobmanager在jvm中分配的堆内存的大小
jobmanager.heap.size: 1024m
#taskmanager在jvm中分配的堆内存的大小
taskmanager.heap.size: 1024m
#指定每个taskmanager上的task slot的数量
taskmanager.numberOfTaskSlots: 1
#flink程序默认的并行度
parallelism.default: 1
#==============================================================================
# Web Frontend
#==============================================================================
#flink的web ui的端口号
rest.port: 8081
#==============================================================================
# Advanced
#==============================================================================
#flink的数据的临时保存目录
io.tmp.dirs: /opt/apps/flink-1.6.1/tmp
#在启动taskmanager的时候是否需要预分配内存给他
taskmanager.memory.preallocate: false
##2. slaves
10.206.0.4
##3. 如果是全分布式,就将flink-1.6.1的目录拷贝给其他的节点
##4. 启动集群
##4.1 方式1
[root@chancechance flink-1.6.1]# start-cluster.sh
[root@chancechance flink-1.6.1]# stop-cluster.sh
##4.2 方式2
jobmanager.sh start/start-foreground cluster | stop | stop-all
taskmanager.sh start|start-foreground | stop | stop-all
##5. 测试程序
[root@chancechance flink-1.6.1]# flink run /opt/apps/flink-1.6.1/examples/batch/WordCount.jar \
> --input /opt/apps/hive-1.2.1/logs/metastore.log \
> --output /home/output/00
2.3 yarn模式安装(*****)
至少hadoop-2.2以上
hdfs/yarn
2.3.1 Flink On Yarn的两种方式
有两种方式:
- 第一种:在yarn中开辟一块资源专门用于运行flink集群,这块资源会一直被占用,除非手动停止
- 第二种:每次提交任务时都启动一个新的flink集群,任务之间互不影响,方便以后的管理,推荐使用后者
(图:004.png)
##1. 启动yarn/hdfs
start-dfs.sh/start-yarn.sh
##2. 启动yarn模式
##2.1 第一种方式:
yarn-session.sh -n 2 -jm 1024 -tm 1024
flink run /opt/apps/flink-1.6.1/examples/batch/WordCount.jar \
--input hdfs://10.206.0.4:9000/input/1.data \
--output hdfs://10.206.0.4:9000/output/out.data
##2.2 第二种方式
flink run -m yarn-cluster -yn 2 -yjm 1024 -ytm 1024 /opt/apps/flink-1.6.1/examples/batch/WordCount.jar \
--input hdfs://10.206.0.4:9000/input/1.data \
--output hdfs://10.206.0.4:9000/output/out2.data
tip:
环境变量:HADOOP_HOME或HADOOP_CONF_DIR或YARN_CONF_DIR
2.4 ha的搭建
[root@chancechance bin]# start-zookeeper-quorum.sh/stop-zookeeper-quorum.sh
flink-conf.yaml
#==============================================================================
# High Availability
#==============================================================================
high-availability: zookeeper
high-availability.zookeeper.quorum: 10.206.0.4:2181
high-availability.zookeeper.client.acl: open
high-availability.storageDir: hdfs:///flink/ha/
2.5 flink on yarn的执行底层
(图:005.png)
2.6 flink scala shell
start-scala-shell.sh [local|remote|yarn] [options] <args>...
[root@chancechance bin]# start-cluster.sh
[root@chancechance bin]# start-scala-shell.sh remote 10.206.0.4 8081
scala> val text = benv.fromElements("hello wangjunjie", "hello junjie")
text: org.apache.flink.api.scala.DataSet[String] = org.apache.flink.api.scala.DataSet@41492479
scala> val cnts = text.flatMap(_.toLowerCase.split("\\s+")).map((_,1)).groupBy(0).sum(1)
cnts: org.apache.flink.api.scala.AggregateDataSet[(String, Int)] = org.apache.flink.api.scala.AggregateDataSet@75ff2b6d
scala> cnts.print
(hello,2)
(junjie,1)
(wangjunjie,1)
3 flink api
3.1 搭建环境
<!-- flink java -->
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-java</artifactId>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-java_2.11</artifactId>
</dependency>
<!-- flink scala -->
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-scala_2.11</artifactId>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-scala_2.11</artifactId>
</dependency>
3.2 WordCount_java
package cn.qphone.flink.day1;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.datastream.WindowedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;
public class Demo1_Wordcount_Java {
public static void main(String[] args) throws Exception {
//1. 参数准备
int port;
try {
//1. 获取到参数工具类,作用加载你传递的参数
ParameterTool parameterTool = ParameterTool.fromArgs(args);
port = parameterTool.getInt("port");
}catch (Exception e) {
System.err.println("no port set, default:port is 6666");
port = 6666;
}
String hostname = "10.206.0.4";
//2. 获取到编程的入口
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//3. 通过web socket:获取到数据
DataStreamSource<String> data = env.socketTextStream(hostname, port);
SingleOutputStreamOperator<WordWithCount> pairWords = data.flatMap(new FlatMapFunction<String, WordWithCount>() {
public void flatMap(String line, Collector<WordWithCount> out) throws Exception {
String[] split = line.split("\\s+");
for (String word : split) {
out.collect(new WordWithCount(word, 1L));
}
}
});
KeyedStream<WordWithCount, Tuple> grouped = pairWords.keyBy("word");
WindowedStream<WordWithCount, Tuple, TimeWindow> window = grouped.timeWindow(Time.seconds(2), Time.seconds(1));
SingleOutputStreamOperator<WordWithCount> cnts = window.sum("count");
cnts.print().setParallelism(1);
env.execute("wordcount");
}
public static class WordWithCount {
public String word;
public long count;
public WordWithCount() {
}
public WordWithCount(String word, long count) {
this.word = word;
this.count = count;
}
@Override
public String toString() {
return "WordWithCount{" +
"word='" + word + '\'' +
", count=" + count +
'}';
}
}
}
3.3 WordCount_scala
package cn.qphone.flink.day1
import org.apache.flink.api.java.utils.ParameterTool
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.api.windowing.time.Time
object Demo1_WordCount_Scala {
def main(args: Array[String]): Unit = {
//1. 准备参数
var port:Int = try {
ParameterTool.fromArgs(args).getInt("port")
} catch {
case e:Exception => System.err.println("no port set, default:port is 6666")
6666
}
//2. 获取数据
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
val data: DataStream[String] = env.socketTextStream("146.56.208.76", port)
//3. 导入隐式参数
import org.apache.flink.api.scala._
//4. 计算
val cnts: DataStream[WordWithScalaCount] = data.flatMap(_.split("\\s+")).map(WordWithScalaCount(_, 1)).keyBy("word")
.timeWindow(Time.seconds(2), Time.seconds(1))
.reduce((a, b) => WordWithScalaCount(a.word, a.count + b.count))
//5. 结果打印
cnts.print().setParallelism(1)
//6. 执行
env.execute("wordcount scala")
}
case class WordWithScalaCount(word:String, count:Int)
}
3.4 打包
##1. 将代码打包,然后上传到指定的服务器节点中并运行
[root@chancechance bin]# flink run /opt/software/flink-parent.jar --port 6666
##2. 异常
org.apache.flink.client.program.ProgramInvocationException: Neither a 'Main-Class', nor a 'program-class' entry was found in the jar file
原因:flink在执行这个jar包的时候找不到jar中的入口
解决方式:使用另外方式打包
##3. 使用idea自带的打包方式并指定主类。执行之后的结果输出文件位置如下:
standalone
/opt/apps/flink-1.6.1/log/flink-root-taskexecutor-0-chancechance.out
yarn
/opt/apps/hadoop-2.8.1/logs/userlogs/application_xxxx/container_xxxx_000002/taskmanager.out
3.5 窗口的理解
每隔一段时间统计一段时间间隔的数据
两个重要的参数:窗口长度、时间间隔
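下面用一个示意代码(socket地址与数据为假设)说明这两个参数:窗口长度决定一次统计覆盖多长时间的数据,时间间隔(滑动步长)决定多久统计一次:
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
object WindowParamDemo {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.socketTextStream("localhost", 6666)
      .flatMap(_.split("\\s+"))
      .map((_, 1))
      .keyBy(0)
      // 窗口长度10秒、滑动间隔5秒:每隔5秒统计一次最近10秒的数据
      .timeWindow(Time.seconds(10), Time.seconds(5))
      .sum(1)
      .print()
    env.execute("window param demo")
  }
}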
(图:006.png)
3.6 Flink API分层
(图:007.png)
4 流式API操作(*****)
4.1 DataStream的Source
4.1.1 自带的Source
package cn.qphone.flink.day2
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import scala.collection.mutable.ListBuffer
import org.apache.flink.api.scala._
object Demo1_DataStreamSource {
def main(args: Array[String]): Unit = {
//1. 获取上下文
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
//2. source
//2.1 list source
val list: ListBuffer[Int] = ListBuffer(10, 20, 30)
val dStream: DataStream[Int] = env.fromCollection(list)
val mapStream1: DataStream[Int] = dStream.map(_ * 100)
mapStream1.print().setParallelism(1)
println("=============================================")
//2.2 string source
val dStream2: DataStream[String] = env.fromElements("wangjunjie hen shuai")
dStream2.print().setParallelism(1)
println("=============================================")
//2.3 文件作为源
val dStream3: DataStream[String] = env.readTextFile("file:///C:\\real_win10\\day30-flink\\doc\\笔记.md", "utf-8")
dStream3.print().setParallelism(1)
println("=============================================")
//2.4 socket作为源
val dStream4: DataStream[String] = env.socketTextStream("146.56.208.76", 6665)
dStream4.print().setParallelism(1)
println("=============================================")
//启动
env.execute("collection source")
}
}
4.1.2 自定义的Source
自定义DataStream的Source:
- 继承SourceFunction:非并行的source,并行度只能为1(不能设置大于1的并行度),如:SocketTextStreamFunction
- 继承ParallelSourceFunction:是一个并行的SourceFunction,可以指定并行度
- 继承RichParallelSourceFunction:实现了ParallelSourceFunction,不但能够并行,还提供了其他功能,比如open、close、getRuntimeContext等方法
package cn.qphone.flink.day2
import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.functions.source.{RichParallelSourceFunction, SourceFunction}
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.configuration.Configuration
import scala.util.Random
/**
* 自定义DataStream的Source:
* 1. 继承SourceFunction:非并行,不能指定其并行度。不能指定setParallelism(1),如:socketTextStreamFunction
* 2. 继承ParallelSourceFunction:是一个并行的SourceFunction,可以指定并行度
* 3. 继承RichParallelSourceFunction:实现了ParallelSourceFunction,不但能够并行,还有其他功能,比如增加了open和close、getRuntimeContext。。。
*/
object Demo2_DataStreamCustomSource {
def main(args: Array[String]): Unit = {
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
//1. 添加自定义的source
val dataStream1: DataStream[String] = env.addSource(new MyRichParallelSourceFunction)
dataStream1.print().setParallelism(1)
env.execute("custom source")
}
}
class MySourceFunction extends SourceFunction[String] {
@volatile private var isRunning = true
/**
* 向下游产生数据
*/
override def run(ctx: SourceFunction.SourceContext[String]): Unit = {
val random = new Random()
while (isRunning) {
val num: Int = random.nextInt(100)
ctx.collect(s"random:${num}")
Thread.sleep(500)
}
}
/**
* 取消,用于控制run方法的结束
*/
override def cancel(): Unit = {
isRunning = false
}
}
class MyRichParallelSourceFunction extends RichParallelSourceFunction[String] {
@volatile private var isRunning = true
override def run(ctx: SourceFunction.SourceContext[String]): Unit = {
val random = new Random()
while (isRunning) {
val num: Int = random.nextInt(1000)
ctx.collect(s"random_rich:${num}")
Thread.sleep(1000)
}
}
override def cancel(): Unit = {
isRunning = false
}
/**
* 初始化方法
*/
override def open(parameters: Configuration): Unit = super.open(parameters)
/**
* 适合在关闭的时候处理
*/
override def close(): Unit = super.close()
}
4.1.3 自定义source-mysql
4.1.3.1 建表
create table `flink`.`stu1`(
`id` int(11) default null,
`name` varchar(32) default null
) engine=InnoDB default charset=utf8;
<!-- jdbc driver -->
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.47</version>
</dependency>
4.1.3.2 自定义source
package cn.qphone.flink.day2
import java.sql.{Connection, DriverManager, PreparedStatement, ResultSet}
import org.apache.flink.streaming.api.functions.source.{RichParallelSourceFunction, SourceFunction}
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration
import scala.beans.BeanProperty
object Demo3_DataStreamCustomSource_Mysql {
def main(args: Array[String]): Unit = {
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
val mysqlStream: DataStream[Stu1] = env.addSource(new MysqlSourceFunction)
mysqlStream.print().setParallelism(1)
env.execute("mysql source")
}
}
case class Stu1(id:Int, name:String)
class MysqlSourceFunction extends RichParallelSourceFunction[Stu1] {
@BeanProperty var ps:PreparedStatement = _
@BeanProperty var conn:Connection = _
@BeanProperty var rs:ResultSet = _
/**
* 初始化
*/
override def open(parameters: Configuration): Unit = {
super.open(parameters)
val driver = "com.mysql.jdbc.Driver"
val url = "jdbc:mysql://146.56.208.76:3306/flink?useSSL=false"
val username = "root"
val password = "wawyl1314bb*"
Class.forName(driver)
try {
conn = DriverManager.getConnection(url, username, password)
val sql = "select * from stu1"
ps = conn.prepareStatement(sql)
} catch {
case e:Exception => e.printStackTrace()
}
}
override def run(ctx: SourceFunction.SourceContext[Stu1]): Unit = {
try {
rs = ps.executeQuery
while (rs.next()) {
val stu1: Stu1 = Stu1(rs.getInt("id"), rs.getString("name"))
ctx.collect(stu1)
}
} catch {
case e:Exception => e.printStackTrace()
}
}
override def cancel(): Unit = {
}
override def close(): Unit = {
super.close()
if (rs != null) rs.close()
if (ps != null) ps.close()
if (conn != null) conn.close()
}
}
4.1.4 flink-jdbc-InputFormat
4.1.4.1 导入依赖
<!-- 如果flink是1.6.1就使用前者,否则从1.7.0开始就用后者 -->
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-jdbc</artifactId>
<version>1.6.1</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-jdbc_2.11</artifactId>
<version>1.7.0</version>
</dependency>
4.1.4.2 代码
package cn.qphone.flink.day2
import org.apache.flink.api.common.typeinfo.{BasicTypeInfo, TypeInformation}
import org.apache.flink.api.java.io.jdbc.JDBCInputFormat
import org.apache.flink.api.java.typeutils.RowTypeInfo
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.api.scala._
import org.apache.flink.types.Row
object Demo4_DataStreamJdbcInputFormat {
def main(args: Array[String]): Unit = {
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
//1. 设置inputformat
//1.1 创建rowtypeinfo
val fieldsType: Array[TypeInformation[_]] = Array[TypeInformation[_]](
BasicTypeInfo.INT_TYPE_INFO,
BasicTypeInfo.STRING_TYPE_INFO
)
val rowTypeInfo = new RowTypeInfo(fieldsType:_*)
//1.2 定义jdbcinputformat
val jdbcInputFormat: JDBCInputFormat = JDBCInputFormat.buildJDBCInputFormat()
.setDBUrl("jdbc:mysql://146.56.208.76:3306/flink?useSSL=false")
.setDrivername("com.mysql.jdbc.Driver")
.setUsername("root")
.setPassword("wawyl1314bb*")
.setQuery("select * from stu1")
.setRowTypeInfo(rowTypeInfo)
.finish()
val jdbcStream: DataStream[Row] = env.createInput(jdbcInputFormat)
jdbcStream.print().setParallelism(1)
env.execute("jdbc input format")
}
}
4.1.5 source-kafka
4.1.5.1 导入依赖(请将flink升级到1.9.1)
<!-- flink 2 kafka -->
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-kafka_2.11</artifactId>
<version>1.9.1</version>
</dependency>
4.1.5.2 consumer.properties
bootstrap.servers=146.56.208.76:9092
group.id=hzbigdata_flink
auto.offset.reset=latest
4.1.5.3 kafka
##1. 启动kafka
##2. 创建主题
[root@chancechance apps]# kafka-topics.sh --create --topic flink --zookeeper 10.206.0.4/kafka --partitions 1 --replication-factor 1
Created topic flink.
4.1.5.4 代码
package cn.qphone.flink.day2
import java.util.Properties
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
import org.apache.flink.api.scala._
/**
* kafka source
*/
object Demo5_DataStreamCusomSource_Kafka {
def main(args: Array[String]): Unit = {
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
val properties = new Properties()
properties.load(this.getClass.getClassLoader.getResourceAsStream("consumer.properties"))
val topic = "flink"
val kafkaDataStream: DataStream[String] = env.addSource(new FlinkKafkaConsumer(topic, new SimpleStringSchema(), properties))
kafkaDataStream.print("kafka source--->").setParallelism(1)
env.execute("kafka source")
}
}
4.1.5.5 开启生产者生产数据测试
[root@chancechance apps]# kafka-console-producer.sh --topic flink --broker-list 10.206.0.4:9092
4.2 DataStream的transformation
4.2.1 flatMap/map/filter/keyBy/Split/select/reduce/aggregation/union
package cn.qphone.flink.day2
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
object Demo6_Transformation_Filter {
def main(args: Array[String]): Unit = {
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
val socketStream: DataStream[String] = env.socketTextStream("146.56.208.76", 6665)
val wcStream: DataStream[WordCount] = socketStream.flatMap(_.split("\\s+")).filter(_.length > 4).map(WordCount(_, 1))
.keyBy("word").timeWindow(Time.seconds(2), Time.seconds(1))
.sum("cnt")
wcStream.print().setParallelism(1)
env.execute(this.getClass.getSimpleName)
}
case class WordCount(word:String, cnt:Int)
}
4.2.2 split和select
package cn.qphone.flink.day2
import org.apache.flink.streaming.api.scala.{DataStream, SplitStream, StreamExecutionEnvironment}
import org.apache.flink.api.scala._
/**
* split : DataStream -> SplitStream
* 作用:将DataStream拆分成多个流,用splitStream
* select : SplitStream -> DataStream
* 作用:和split搭配使用,从splitStream中选择一个或者多个流组成一个新的DataStream
*
* 1,wangjunjie,man,180,180
*/
object Demo7_Transformation_Split_Select {
def main(args: Array[String]): Unit = {
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
val socketStream: DataStream[String] = env.socketTextStream("146.56.208.76", 6665)
val splitStream: SplitStream[User] = socketStream.map(info => {
val arr: Array[String] = info.split(",")
val uid: String = arr(0).trim
val name: String = arr(1)
val sex: String = arr(2)
val height: Double = arr(3).toDouble
val weight: Double = arr(4).toDouble
User(uid, name, sex, height, weight)
}).split((user: User) => {
if (user.name.equals("wangjunjie")) Seq("old")
else Seq("new")
})
splitStream.select("old").print("wangjunjie666").setParallelism(1) // 选择被标记为old的子流打印
splitStream.select("new").print("didid").setParallelism(1) // 选择被标记为new的子流打印
env.execute("split select transformation")
}
}
case class User(id:String, name:String, sex:String, height:Double, weight:Double)
4.2.3 union和connect
4.2.3.1 union
package cn.qphone.flink.day2
import org.apache.flink.streaming.api.scala.{DataStream, SplitStream, StreamExecutionEnvironment}
import org.apache.flink.api.scala._
/**
* union: DataStream* --> DataStream
* 作用:和spark sql中的union类似
*
* connect : * --> ConnectedStream
* 作用:将两个流进行连接,两个流的类型可以不同,两个流会共享状态
*/
object Demo8_Transformation_Union_Connect {
def main(args: Array[String]): Unit = {
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
val socketStream: DataStream[String] = env.socketTextStream("146.56.208.76", 6665)
val splitStream: SplitStream[User] = socketStream.map(info => {
val arr: Array[String] = info.split(",")
val uid: String = arr(0).trim
val name: String = arr(1)
val sex: String = arr(2)
val height: Double = arr(3).toDouble
val weight: Double = arr(4).toDouble
User(uid, name, sex, height, weight)
}).split((user: User) => {
if (user.name.equals("wangjunjie")) Seq("old")
else Seq("new")
})
val oldStream: DataStream[User] = splitStream.select("old")
val newStream: DataStream[User] = splitStream.select("new")
val unionStream: DataStream[User] = oldStream.union(newStream)
unionStream.print("union 合并结果 :").setParallelism(1)
env.execute("union")
}
}
4.2.3.2 connect
package cn.qphone.flink.day2
import org.apache.flink.streaming.api.scala.{ConnectedStreams, DataStream, SplitStream, StreamExecutionEnvironment}
import org.apache.flink.api.scala._
/**
* union: DataStream* --> DataStream
* 作用:和spark sql中的union类似
*
* connect : * --> ConnectedStream
* 作用:将两个流进行连接,两个流的类型可以不同,两个流会共享状态
*/
object Demo8_Transformation_Union_Connect {
def main(args: Array[String]): Unit = {
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
val socketStream: DataStream[String] = env.socketTextStream("146.56.208.76", 6665)
val splitStream: SplitStream[User] = socketStream.map(info => {
val arr: Array[String] = info.split(",")
val uid: String = arr(0).trim
val name: String = arr(1)
val sex: String = arr(2)
val height: Double = arr(3).toDouble
val weight: Double = arr(4).toDouble
User(uid, name, sex, height, weight)
}).split((user: User) => {
if (user.name.equals("wangjunjie")) Seq("old")
else Seq("new")
})
val bigStream: DataStream[(String, String)] = splitStream.select("old").map(e => (e.name, s"大佬"))
val smallStream: DataStream[(String, String)] = splitStream.select("new").map(e => (e.name, s"马仔"))
val connectedStream: ConnectedStreams[(String, String), (String, String)] = bigStream.connect(smallStream)
//connectedStream 不能直接打印
connectedStream.map(
big => ("name is " + big._1, "info is " + big._2)
,
small => ("small is" + small._1, "info is " + small._2)
).print()
env.execute("connect")
}
}
4.2.4 keyBy和reduce
package cn.qphone.flink.day3
import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
/**
* keyBy : DataStream --> KeyedStream
* 作用:将具有相同的key的数据分配到一个区中。内部使用的是散列分区。类似于sql中的group by。后续获取到keyedStream的操作都是基于组内的操作。
* reduce : 聚合
* 作用:将数据合并成为一个新的数据,返回单个结果值。并且reduce在处理我们元素的时候总是会创建一个新的值,要使用它需要针对分组或者window来执行
*/
object Demo1_Transformation_Keyby_Reduce {
def main(args: Array[String]): Unit = {
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
env.fromElements(Tuple2(200, 33), Tuple2(100,66), Tuple2(100, 56), Tuple2(200, 666))
.keyBy(0) // 按照数组/元组第一个元素来进行分组
// .reduce((t1, t2) => (t1._1, t1._2+t2._2))
.sum(1) // 聚合算子
.print()
.setParallelism(1)
env.execute()
}
}
4.3 DataStream的Sink
writeAsText
writeAsCsv
writeUsingOutputFormat
writeToSocket
addSink
4.3.1 自带的sink
package cn.qphone.flink.day3
import java.util.Properties
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.connectors.kafka.{FlinkKafkaProducer, FlinkKafkaProducer011}
import org.apache.kafka.common.serialization.ByteArraySerializer
object Demo1_Sink_Basic {
def main(args: Array[String]): Unit = {
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
// val dStream: DataStream[(Int, Int)] = env.fromElements(Tuple2(200, 33), Tuple2(100, 66), Tuple2(100, 56), Tuple2(200, 666))
val dStream: DataStream[String] = env.fromElements("李熙", "利息")
dStream.print() // 输出到控制台
dStream.writeAsText("file:///C:\\real_win10\\day31-flink\\resource\\out\\1.txt")
dStream.writeAsCsv("file:///C:\\real_win10\\day31-flink\\resource\\out\\2.csv")
dStream.writeToSocket("146.56.208.76", 9999, new SimpleStringSchema())
dStream.print() // 输出到控制台
env.execute()
}
}
4.3.2 kafka sink
package cn.qphone.flink.day3
import java.util.Properties
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.connectors.kafka.{FlinkKafkaProducer, FlinkKafkaProducer011}
import org.apache.kafka.common.serialization.ByteArraySerializer
object Demo1_Sink_Basic {
def main(args: Array[String]): Unit = {
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
val dStream: DataStream[String] = env.fromElements("李熙", "利息")
//kafka
val topic = "flink"
val properties = new Properties()
properties.setProperty("bootstrap.servers", "146.56.208.76:9092")
properties.setProperty("key.serializer", classOf[ByteArraySerializer].getName)
properties.setProperty("value.serializer", classOf[ByteArraySerializer].getName)
val sink = new FlinkKafkaProducer[String](topic, new SimpleStringSchema(), properties)
dStream.addSink(sink)
env.execute()
}
}
4.3.3 mysql的OutputFormat
4.3.3.1 建表
create table `flink`.`obtain_employment`(
`id` int not null,
`name` varchar(32) not null,
`salary` double not null,
`address` varchar(32) not null
);
4.3.3.2 代码
package cn.qphone.flink.day3
import java.sql.{Connection, DriverManager, PreparedStatement, ResultSet}
import org.apache.flink.api.common.io.OutputFormat
import org.apache.flink.api.common.typeinfo.{BasicTypeInfo, TypeInformation}
import org.apache.flink.api.java.io.jdbc.{JDBCInputFormat, JDBCOutputFormat}
import org.apache.flink.api.java.typeutils.RowTypeInfo
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration
import scala.beans.BeanProperty
/**
* 1 liujiahao 30000 shanghai
* 2 张辉 23000 北京
* 3 程志远 25000 杭州
*/
object Demo3_Mysql_OutputFormat {
def main(args: Array[String]): Unit = {
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
val obtainEmploymentStream: DataStream[ObtainEmployment] = env.socketTextStream("146.56.208.76", 6666)
.map(line => {
val obtainEmployment: Array[String] = line.split("\\s+")
println(obtainEmployment.mkString(","))
ObtainEmployment(obtainEmployment(0).toInt, obtainEmployment(1), obtainEmployment(2).toDouble, obtainEmployment(3))
})
obtainEmploymentStream.print().setParallelism(1)
obtainEmploymentStream.writeUsingOutputFormat(new MysqlOutputFormat)
env.execute()
}
}
case class ObtainEmployment(id:Int, name:String, salary:Double, address:String)
class MysqlOutputFormat extends OutputFormat[ObtainEmployment] {
@BeanProperty var ps:PreparedStatement = _
@BeanProperty var conn:Connection = _
@BeanProperty var rs:ResultSet = _
/**
* 用于配置相关的初始化
*/
override def configure(parameters: Configuration): Unit = {
}
/**
* 业务初始化
*/
override def open(taskNumber: Int, numTasks: Int): Unit = {
val driver = "com.mysql.jdbc.Driver"
val url = "jdbc:mysql://146.56.208.76:3306/flink?useSSL=false"
val username = "root"
val password = "wawyl1314bb*"
Class.forName(driver)
try {
conn = DriverManager.getConnection(url, username, password)
} catch {
case e:Exception => e.printStackTrace()
}
}
/**
* 写记录
*/
override def writeRecord(record: ObtainEmployment): Unit = {
ps = conn.prepareStatement("insert into obtain_employment values(?, ?, ?, ?)")
ps.setInt(1, record.id)
ps.setString(2, record.name)
ps.setDouble(3, record.salary)
ps.setString(4, record.address)
ps.execute()
}
/**
* 最后被调用
*/
override def close(): Unit = {
if (rs != null) rs.close()
if (ps != null) ps.close()
if (conn != null) conn.close()
}
}
4.3.4 flink中kafka的二阶段提交
4.3.4.1 2pc
2-phase commit,简称2pc,是最基础的分布式一致性协议。在分布式系统中,为了让每个节点都能感知到其他节点的事务执行情况,引入了一个中心节点,即协调者(coordinator),用来统一处理所有节点的执行逻辑。被中心节点调度的其他业务节点称为参与者(participant)。
简单来说,2pc将分布式事务分为两个阶段:1.准备(提交请求)和2.执行(提交)。coordinator会根据participant的响应来决定是否真正执行事务:若所有participant都响应OK/Yes则提交,否则终止。需要注意的是,zookeeper中的数据一致性也采用了2pc协议。流程如下图:
(图:008.png)
4.3.4.2 2pc在flink中的应用
在flink中的2pc应用FlinkKafkaProducer。
(图:009.png)
假设一种场景,从kafka的source拉取数据;经过一次窗口聚合。最后将数据再发送到kafka的sink。
- jobmanager向source注入checkpoint barrier,开始pre-commit阶段。对于算子内部的状态,pre-commit阶段不需要额外操作,只需初始化并保存状态变量;只有当checkpoint成功时才真正写入(提交),否则终止。
- 当source收到barrier时,保存自身的状态(这里的状态指消费的每个分区的偏移量,保存位置由配置的state backend决定),然后将barrier发送给下一个组件。
- 当window算子接收到barrier后,保存自己的状态(在window这里状态指的是聚合的结果),然后将barrier继续向下游发送。sink接收到barrier之后同样先保存自己的状态,然后进行一次预提交(pre-commit)。
- 预提交成功之后,jobmanager会通知每一个组件这一轮的检查点已完成,这个时候kafka sink就会向kafka进行真正的事务提交。
以上就是两阶段提交的完整流程,提交过程中如果失败有以下两种情况:
- pre-commit失败,则恢复到最近的一次checkpoint位置
- 一旦pre-commit成功,就必须保证commit成功。因此,所有的组件必须与checkpoint达成共识:以commit为准,要么全部执行成功,要么全部终止回滚。
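下面是一个示意代码(基于flink 1.9.x的FlinkKafkaProducer,构造器签名以实际版本为准,topic、地址等均为假设值),演示开启checkpoint并以EXACTLY_ONCE语义写kafka,也就是上述两阶段提交的使用方式:
import java.util.Properties
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer
import org.apache.flink.streaming.util.serialization.KeyedSerializationSchemaWrapper
object TwoPhaseCommitKafkaDemo {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // kafka sink的两阶段提交依赖checkpoint,必须先开启,并使用EXACTLY_ONCE模式
    env.enableCheckpointing(5000, CheckpointingMode.EXACTLY_ONCE)
    val props = new Properties()
    props.setProperty("bootstrap.servers", "localhost:9092")
    // 事务超时时间不能大于kafka broker端的transaction.max.timeout.ms
    props.setProperty("transaction.timeout.ms", "60000")
    val producer = new FlinkKafkaProducer[String](
      "flink", // 默认topic
      new KeyedSerializationSchemaWrapper(new SimpleStringSchema()),
      props,
      FlinkKafkaProducer.Semantic.EXACTLY_ONCE) // 开启两阶段提交语义
    env.socketTextStream("localhost", 6666).addSink(producer)
    env.execute("2pc kafka sink demo")
  }
}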
4.3.5 redis sink
4.3.5.1 导入依赖
<!-- flink2 redis -->
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-redis_2.11</artifactId>
</dependency>
4.3.5.2 代码
package cn.qphone.flink.day4
import cn.qphone.flink.day3.ObtainEmployment
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.api.scala._
import org.apache.flink.streaming.connectors.redis.RedisSink
import org.apache.flink.streaming.connectors.redis.common.config.FlinkJedisPoolConfig
import org.apache.flink.streaming.connectors.redis.common.mapper.{RedisCommand, RedisCommandDescription, RedisMapper}
object Demo1_Sink_Reids {
def main(args: Array[String]): Unit = {
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
val obtainEmploymentStream: DataStream[ObtainEmployment] = env.socketTextStream("146.56.208.76", 6666)
.map(line => {
val obtainEmployment: Array[String] = line.split("\\s+")
println(obtainEmployment.mkString(","))
ObtainEmployment(obtainEmployment(0).toInt, obtainEmployment(1), obtainEmployment(2).toDouble, obtainEmployment(3))
})
val tStream: DataStream[(String, String)] = obtainEmploymentStream.map(oe => (oe.name, oe.address))
tStream.print().setParallelism(1)
// 创建redis sink
val config: FlinkJedisPoolConfig = new FlinkJedisPoolConfig.Builder()
.setHost("146.56.208.76")
.setPort(6379)
.build()
val sink = new RedisSink(config, new MyRedisSink)
tStream.addSink(sink)
env.execute("redis sink")
}
}
/**
* 自定义redis sink
*/
class MyRedisSink extends RedisMapper[(String, String)] {
/**
*
* @return
*/
override def getCommandDescription: RedisCommandDescription = {
new RedisCommandDescription(RedisCommand.SET, null)
}
/**
* key
* @param t
* @return
*/
override def getKeyFromData(t: (String, String)): String = {
return t._1
}
/**
* value
* @param t
* @return
*/
override def getValueFromData(t: (String, String)): String = {
return t._2
}
}
4.3.6 ElasticSearch sink
4.3.6.1 安装es
4.3.6.2 导入依赖
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-elasticsearch6_2.11</artifactId>
<version>${flink-version}</version>
</dependency>
4.3.6.3 代码
package cn.qphone.flink.day4
import java.util
import cn.qphone.flink.day3.ObtainEmployment
import org.apache.flink.api.common.functions.RuntimeContext
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.api.scala._
import org.apache.flink.streaming.connectors.elasticsearch.{ElasticsearchSinkFunction, RequestIndexer}
import org.apache.flink.streaming.connectors.elasticsearch6.ElasticsearchSink
import org.apache.http.HttpHost
import org.elasticsearch.action.index.IndexRequest
import org.elasticsearch.client.Requests
object Demo2_Sink_ElasticSearch {
def main(args: Array[String]): Unit = {
//1. source
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
val obtainEmploymentStream: DataStream[ObtainEmployment] = env.socketTextStream("146.56.208.76", 6666)
.map(line => {
val obtainEmployment: Array[String] = line.split("\\s+")
println(obtainEmployment.mkString(","))
ObtainEmployment(obtainEmployment(0).toInt, obtainEmployment(1), obtainEmployment(2).toDouble, obtainEmployment(3))
})
obtainEmploymentStream.print().setParallelism(1)
//2. 整合es sink
//2.1 集合指定es的位置
val httpHosts = new util.ArrayList[HttpHost]()
httpHosts.add(new HttpHost("146.56.208.76", 9200, "http"))
//2.2 获取到sink对象
val sink = new ElasticsearchSink.Builder[ObtainEmployment](httpHosts, new MyEsSink).build()
//2.3 添加sink
obtainEmploymentStream.addSink(sink)
env.execute("sind 2 es")
}
}
class MyEsSink extends ElasticsearchSinkFunction[ObtainEmployment] {
/**
* 当当前DataStream中每流动一个元素,此方法调用一次
* @param t 数据
* @param runtimeContext
* @param requestIndexer
*/
override def process(element: ObtainEmployment, runtimeContext: RuntimeContext, requestIndexer: RequestIndexer): Unit = {
//1. 将javabean对象的数据封装到java的map中
println(s"$element")
val map = new util.HashMap[String, String]()
map.put("name", element.name)
map.put("address", element.address)
//2. 将map中的数据构造一个IndexRequest请求
val request: IndexRequest = Requests.indexRequest()
.index("flink") // 索引库
.`type`("info") // 索引类型
.id(s"${element.id}") // docid
.source(map) // 数据
//3. 将索引请求对象传递给请求索引器
requestIndexer.add(request)
}
}
5 批操作API(*****)
5.1 常见source
package cn.qphone.flink.day4
import org.apache.flink.api.scala._
object Demo3_DataSet_Source {
def main(args: Array[String]): Unit = {
//1 获取到入口类
val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
// val ds: DataSet[String] = env.readTextFile("file:///C:\\real_win10\\day32-flink\\resource\\1.txt")
// ds.print()
val ds2: DataSet[String] = env.fromElements("lixi", "rock", "lee")
ds2.print()
val ds3: DataSet[Long] = env.generateSequence(1, 100)
ds3.print()
//2. print()已经触发了执行,这里无需再调用env.execute(),否则会因为没有新的sink而报错
}
}
5.2 常见的算子
map
flatmap
mappartition
filter
distinct
group by
reduce
max
min
sum
join
union
partitionByHash
partitionByRange
...
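下面用一个示意代码(数据为假设的演示数据)展示其中几个常用算子的用法:
import org.apache.flink.api.scala._
object DataSetTransformationDemo {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val words: DataSet[(String, Int)] = env.fromElements(("flink", 1), ("spark", 1), ("flink", 1))
    val infos: DataSet[(String, String)] = env.fromElements(("flink", "stream"), ("spark", "batch"))
    // groupBy + sum:按第一个字段分组求和
    val wc = words.groupBy(0).sum(1)
    // map + distinct:取出单词并去重
    val distinctWords = words.map(_._1).distinct()
    // join:两个DataSet按第一个字段关联
    val joined = wc.join(infos).where(0).equalTo(0)
    wc.print()
    distinctWords.print()
    joined.print()
  }
}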
5.3 Sink
writeAsText
writeAsCsv
print
write
output
5.3.1 自定义mysql的outputformat
5.3.1.1 建表
create table `flink`.`wc`(
`word` varchar(32),
`count` int(11)
)
5.3.1.2 代码
package cn.qphone.flink.day5
import java.sql.{Connection, DriverManager, PreparedStatement, ResultSet}
import org.apache.flink.api.common.io.OutputFormat
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.configuration.Configuration
import scala.beans.BeanProperty
object Demo1_DataSet_Source {
def main(args: Array[String]): Unit = {
val env = ExecutionEnvironment.getExecutionEnvironment
import org.apache.flink.api.scala._
val text: DataSet[String] = env.fromElements("i love flink very much")
val txt = text.flatMap(_.split("\\s+")).map((_, 1)).groupBy(0).sum(1).map(t => Wc(t._1, t._2))
//1. 添加自定义outputformat
txt.output(new BatchMysqlOutputFormat)
env.execute() // output()是懒执行的,必须调用execute()才会真正写入mysql
}
}
case class Wc(word:String, count:Int)
class BatchMysqlOutputFormat extends OutputFormat[Wc] {
@BeanProperty var ps:PreparedStatement = _
@BeanProperty var conn:Connection = _
@BeanProperty var rs:ResultSet = _
override def configure(parameters: Configuration): Unit = {
}
override def open(taskNumber: Int, numTasks: Int): Unit = {
val driver = "com.mysql.jdbc.Driver"
val url = "jdbc:mysql://146.56.208.76:3306/flink?useSSL=false"
val username = "root"
val password = "wawyl1314bb*"
Class.forName(driver)
try {
conn = DriverManager.getConnection(url, username, password)
} catch {
case e:Exception => e.printStackTrace()
}
}
override def writeRecord(record: Wc): Unit = {
ps = conn.prepareStatement("insert into wc values(?, ?)")
ps.setString(1, record.word)
ps.setInt(2, record.count)
ps.execute()
}
override def close(): Unit = {
if (conn != null) conn.close()
if (ps != null) ps.close()
if (rs != null) rs.close()
}
}
6 TaskManager与Task Slot(****)
6.1 基本概念
(图:010.png)
在flink中每一个TaskManager相当于是spark中的worker(yarn中的nodemanager);换言之,每个taskmanager实际上都是一个jvm进程。在一个taskmanager中会分配若干个task slot,这个task slot相当于spark中的executor(yarn中的container),主要作用是隔离不同的task对资源的要求,默认策略是均分内存。一个taskmanager能够同时接收多少个task执行,由task slot的数量决定(一个taskmanager至少有一个task slot)。
默认情况下,如果两个task在不同的task slot下他们是使用不同的资源的。但是flink也允许共享task slot,但是有一个前提条件,两个task必须得是同一个job下的task.
6.2 并行度
task的Parallelism可以在flink的不同的级别上指定。
- 算子级别:dataStream.print.setParallelism(1)
- 执行环境:StreamExecutionEnvironment和ExecutionEnvironment;env.setParallelism(1)
- 客户端
- 配置文件: parallelism.default: 1
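下面给出一个示意代码,演示执行环境级别和算子级别的并行度设置(socket地址为假设值;客户端级别可以通过 flink run -p 指定,配置文件级别即上面的parallelism.default):
import org.apache.flink.streaming.api.scala._
object ParallelismDemo {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // 执行环境级别:作业的默认并行度
    env.setParallelism(2)
    env.socketTextStream("localhost", 6666)
      .flatMap(_.split("\\s+"))
      .map((_, 1)).setParallelism(4) // 算子级别:只对map生效,优先级高于环境级别
      .keyBy(0)
      .sum(1)
      .print().setParallelism(1) // sink级别单独设置为1,方便观察输出
    env.execute("parallelism demo")
  }
}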
6.3 Operator Chain
(图:011.png)
- StreamGraph(客户端)
- JobGraph(客户端)
- ExecutionGraph
- 物理执行图
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
object Demo1_WordCount_Scala {
def main(args: Array[String]): Unit = {
//2. 获取流式执行环境:批式执行环境ExecutionEnvironment
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
//3. 导入隐式参数
import org.apache.flink.api.scala._
//4. 计算
//startNewChain:map操作符不能往前链接,但是可以往后链接。说白了就是map和print连接到一起
env.fromElements("i love you").map((_,1)).startNewChain().print()
//map操作符不能再链接到其他的操作符了——禁止链接map操作符
env.fromElements("i hate you").map((_,1)).disableChaining().print()
//map操作符放入到指定的task共享组中去共享task slot
env.fromElements("i hate you").map((_,1)).slotSharingGroup("default")
//6. 触发执行
env.execute("wordcount scala")
}
}
7 分区(***)
7.1 分区器概念
spark的RDD中有分区的概念,Flink针对DataStream也有分区的概念,通过StreamPartitioner父类完成了flink分区的实现。常见的分区器有8种:
- ShufflePartitioner:洗牌分区器,将记录随机地输出到下游的某个实例
- BroadcastPartitioner:广播分区器,将记录转发给下游的所有实例
- CustomPartitionerWrapper:自定义分区器
- ForwardPartitioner:将记录转发给本地运行的下游operator
- GlobalPartitioner:将所有记录都发送到下游的第一个实例(下标为0的分区)
- KeyGroupStreamPartitioner:通过记录的key计算所属的分区,keyBy之后使用的就是它
- RebalancePartitioner:以轮询的方式将记录均匀地分发给下游实例
- RescalePartitioner:可扩展的分区器,在上下游并行度成比例的本地范围内以轮询的方式分发记录
7.2 分区器代码
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.api.scala._
object Demo2_Partitioner {
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
val data = env.fromElements("i love you very much")
data.shuffle.print("shuffle-->").setParallelism(4) // ShufflePartitioner
data.rescale.print("rescale-->").setParallelism(4) // RescalePartitioner
data.rebalance.print("rebalance-->").setParallelism(4) // RebalancePartitioner
env.execute()
}
}
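除了上面自带的分区方式,还可以配合partitionCustom使用自定义分区器(对应CustomPartitionerWrapper)。下面是一个示意(数据与分区逻辑均为演示假设):
import org.apache.flink.api.common.functions.Partitioner
import org.apache.flink.streaming.api.scala._
object CustomPartitionerDemo {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val data = env.fromElements(("flink", 1), ("spark", 2), ("hive", 3))
    data.partitionCustom(new Partitioner[String] {
      // 简单示意:key为flink的记录发往0号分区,其余发往1号分区
      override def partition(key: String, numPartitions: Int): Int =
        if (key == "flink") 0 else 1
    }, _._1)
      .print("custom-->")
      .setParallelism(2)
    env.execute("custom partitioner demo")
  }
}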
8 state、checkpoint、state backend、savepoint(***)
8.1 State:状态
8.1.1 介绍
state:flink中的function和operator都是有状态的,在处理数据过程中存储的数据就是状态。
8.1.2 为什么要有状态
因为flink主要用来做流式计算,是7x24小时不间断的计算,并且要保证消费的数据不重复、不丢失、只被计算一次。当我们在生产中扩展并行度、应对服务器故障等情况时,都需要依赖flink的状态。flink关于状态管理的API分为两类:
KeyedState
OperatorState
| 分类 | KeyedState | OperatorState |
| --- | --- | --- |
| 使用场景 | 只能用在KeyedStream上的算子 | 可以用于所有的算子 |
| 处理方式 | 每个key对应一个state,一个operator实例可能处理多个key的state | 一个operator实例对应一个state |
| 并发改变 | state随着key在实例间迁移 | 当并发改变时,需要选择状态的重新分配方式(如均匀分配、合并后全量分配) |
| 访问方式 | 通过RuntimeContext访问,需要实现RichFunction | 需要实现CheckpointedFunction |
| 支持数据结构 | ValueState、ListState、MapState、ReducingState等 | 主要支持ListState |
8.1.3 ValueState代码
package cn.qphone.flink.day5
import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.api.common.typeinfo.{TypeHint, TypeInformation}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.util.Collector
object Dem3_State {
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
import org.apache.flink.api.scala._
env.fromElements((1,5), (1, 6), (1, 7), (2, 8), (2, 1))
.keyBy(0)
.flatMap(new MyFlatMapFunction)
.print
env.execute()
}
}
/**
* 自定义State
* * 每个算子起始都有对应*Function或Rich*Function
* * 一般我们在自定义的时候都会继承这个函数对应的富函数(AbstractRichFunction)
* * 实现这个富函数中的抽象方法
*/
class MyFlatMapFunction extends RichFlatMapFunction[(Int,Int), (Int,Int)] {
var state:ValueState[(Int, Int)] = _
/**
* 初始化
*/
override def open(parameters: Configuration): Unit = {
val descriptor = new ValueStateDescriptor[(Int, Int)](
"avg",
TypeInformation.of(new TypeHint[(Int, Int)] {}),
(0, 0)
)
state = getRuntimeContext.getState(descriptor) // int sum = 0
}
override def flatMap(value: (Int, Int), out: Collector[(Int, Int)]): Unit = {
// 获取到当前的状态
val currentState: (Int, Int) = state.value() // sum - 0
val count: Int = currentState._1 + 1 // 1
val sum: Int = currentState._2 + value._2 // 0+1
//更新状态
state.update((count, sum)) // sum=0+1
//输出状态
out.collect((value._1, sum))
}
/**
* 释放资源
*/
override def close(): Unit = super.close()
}
8.2 Checkpoint
8.2.1 checkpoint概念以及流程
checkpoint是针对flink的job进行周期性的state的快照,便于作业的恢复以及稳定。
(图:012.png)
flink会在我们的数据流中加入barrier(栅栏),栅栏从source开始流向sink,过程中每一个算子遇到barrier都会自动进行快照
(图:013.png)
8.2.2 全局配置——flink-conf.yaml
#==============================================================================
# Fault tolerance and checkpointing
#==============================================================================
# The backend that will be used to store operator state checkpoints if
# checkpointing is enabled.
#
# Supported backends are 'jobmanager', 'filesystem', 'rocksdb', or the
# <class-name-of-factory>.
#
# state.backend: filesystem
# Directory for checkpoints filesystem, when using any of the default bundled
# state backends.
#
# state.checkpoints.dir: hdfs://namenode-host:port/flink-checkpoints
# Default target directory for savepoints, optional.
#
# state.savepoints.dir: hdfs://namenode-host:port/flink-checkpoints
# Flag to enable/disable incremental checkpoints for backends that
# support incremental checkpoints (like the RocksDB state backend).
#
# state.backend.incremental: false
8.3 state backend
8.3.1 介绍
默认情况下state都是保存在taskmanager的内存中。checkpoint则会存储在jobmanager的内存中。
state backend分为3类:
- MemoryStateBackend:state本质上保存在jvm的堆中,执行checkpoint的时候会将state保存到jobmanager的内存中
- FsStateBackend:state数据保存在taskmanager的内存之中,执行checkpoint的时候会将state保存到我们配置的文件系统中
- RocksDBStateBackend:在本地文件系统中维护state,state被直接写入本地的RocksDB中;同时需要配置一个远程URI(一般是HDFS),在checkpoint的时候将数据复制到远程的文件系统中。使用RocksDB最大的好处是克服了state受限于内存大小的限制,同时又能够将state checkpoint到远程的文件系统中。
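下面给出一个设置state backend的示意代码(FsStateBackend为flink自带;RocksDBStateBackend需要额外引入flink-statebackend-rocksdb依赖;hdfs路径为假设值):
import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.streaming.api.scala._
object StateBackendDemo {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.enableCheckpointing(5000)
    // 将checkpoint写到文件系统;如需RocksDB,可换成:
    // env.setStateBackend(new RocksDBStateBackend("hdfs://10.206.0.4:9000/flink/checkpoints", true))
    env.setStateBackend(new FsStateBackend("hdfs://10.206.0.4:9000/flink/checkpoints"))
    env.fromElements("a", "b", "a").map((_, 1)).keyBy(0).sum(1).print()
    env.execute("state backend demo")
  }
}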
8.3.2 代码
package cn.qphone.flink.day5
import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.environment.CheckpointConfig
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
object Demo4_RocketsDB_Backend {
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
import org.apache.flink.api.scala._
//开启checkpoint,间隔5s(不开启checkpoint,下面的配置不会生效)
env.enableCheckpointing(5000)
//获取checkpoint的配置对象
val config: CheckpointConfig = env.getCheckpointConfig
/**
* DELETE_ON_CANCELLATION:取消作业的时候删除checkpoint
* RETAIN_ON_CANCELLATION:取消作业的时候保留checkpoint
*/
config.enableExternalizedCheckpoints(
CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION)
// 设置EXACTLY_ONCE模式
config.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
// 设置每个检查点最小间隔时间1s
config.setMinPauseBetweenCheckpoints(1000)
// 设置每次checkpoint快照必须在1分钟之内结束
config.setCheckpointTimeout(60000)
// 设置同一时间范围内只允许一个检查点
config.setMaxConcurrentCheckpoints(1)
val data = env.fromElements("i love you very much").print()
env.execute()
}
}
8.4 checkpoint和savepoint的区别
- 概念:checkpoint是用来容错的,savepoint保存的是某一时刻作业的全局状态
- 目的:checkpoint用于程序故障时自动容错、快速恢复;savepoint是针对程序修改或者程序升级的
- 用户交互:checkpoint是flink自身的行为,savepoint是用户触发的。说得再直白点,checkpoint的创建、删除和修改都由flink管理,savepoint需要用户自己创建、删除和管理
- 状态文件的保留策略:checkpoint默认在作业结束后自动删除;savepoint会一直保存,除非用户手动删除
- checkpoint使用state backend存储状态
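下面列出几条常用的savepoint操作命令作为示意(jobId、保存目录均为假设值,具体CLI参数以所用flink版本为准):
##1. 手动触发savepoint
[root@chancechance bin]# flink savepoint <jobId> hdfs:///flink/savepoints
##2. 取消作业的同时触发savepoint
[root@chancechance bin]# flink cancel -s hdfs:///flink/savepoints <jobId>
##3. 从savepoint恢复作业
[root@chancechance bin]# flink run -s hdfs:///flink/savepoints/savepoint-xxxx /opt/software/flink-parent.jar --port 6666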
9 Flink的广播变量
package cn.qphone.flink.day5
import org.apache.flink.api.common.state.MapStateDescriptor
import org.apache.flink.api.common.typeinfo.BasicTypeInfo
import org.apache.flink.streaming.api.datastream.BroadcastStream
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.util.Collector
object Demo5_Broadcast {
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
import org.apache.flink.api.scala._
val desc = new MapStateDescriptor(
"sexinfo",
BasicTypeInfo.INT_TYPE_INFO,
BasicTypeInfo.STRING_TYPE_INFO
)
val sex: DataStream[(Int, String)] = env.fromElements((1, "男"), (2, "女"))
val sexB: BroadcastStream[(Int, String)] = sex.broadcast(desc)
/**
* lixi 1
* wangjunjie 1
* lihao 2
*/
val socket: DataStream[String] = env.socketTextStream("146.56.208.76", 6666)
val map: DataStream[(String, Int)] = socket.map(line => {
val fields: Array[String] = line.split("\\s+")
val name: String = fields(0)
val sexid: Int = fields(1).toInt
(name, sexid)
})
map.connect(sexB).process(new BroadcastProcessFunction[(String, Int), (Int, String), (String, String)] {
// 处理元素
override def processElement(value: (String, Int),
ctx: BroadcastProcessFunction[(String, Int), (Int, String), (String, String)]#ReadOnlyContext,
out: Collector[(String, String)]): Unit = {
val genderid: Int = value._2 // 获取到map数据集合中的性别编号
var gender: String = ctx.getBroadcastState(desc).get(genderid) //从广播变量中通过编号获取到性别字符串
if (gender == null) gender = "人妖"
//输出
out.collect((value._1, gender))
}
// 继续处理广播元素
override def processBroadcastElement(value: (Int, String),
ctx: BroadcastProcessFunction[(String, Int), (Int, String), (String, String)]#Context,
out: Collector[(String, String)]): Unit = {
ctx.getBroadcastState(desc).put(value._1, value._2)
}
}).print()
env.execute()
}
}
10 flink的分布式缓存
package cn.qphone.flink.day5
import java.io.File
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import scala.collection.mutable
import scala.io.{BufferedSource, Source}
import scala.collection.mutable.Map
/**
* gender.txt:1 男,2 女
*/
object Demo6_Distribute_cache {
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
import org.apache.flink.api.scala._
//注册本地(或hdfs上)的文件作为分布式缓存
env.registerCachedFile("file:///c:/flink_cache/gender.txt", "info")
env.socketTextStream("146.56.208.76", 6666)
.map(new RichMapFunction[String, (String,String)] {
var bc:BufferedSource = _
val map: mutable.Map[Int, String] = Map() // 做的缓存
override def open(parameters: Configuration): Unit = {
//读取分布式缓存中的数据
val file: File = getRuntimeContext.getDistributedCache.getFile("info") // 获取分布式缓存中的数据
bc = Source.fromFile(file)
val list: List[String] = bc.getLines().toList
for(line <- list) {
val fields: Array[String] = line.split("\\s+")
val sexid: Int = fields(0).toInt
val sex: String = fields(1)
map.put(sexid, sex)
}
}
override def map(line: String): (String, String) = {
val fields: Array[String] = line.split("\\s+")
val name: String = fields(0)
val sexid: Int = fields(1).toInt
val sex: String = map.getOrElse(sexid, "妖")
(name, sex)
}
override def close(): Unit = {
if(bc != null) bc.close()
}
}).print()
env.execute()
}
}
11 累加器
import org.apache.flink.api.common.JobExecutionResult
import org.apache.flink.api.common.accumulators.IntCounter
import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.util.Collector
object Demo1_Accumulator {
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
import org.apache.flink.api.scala._
val data = env.fromElements("i love you very much")
data.flatMap(new MyFlatMapFunction).print()
//4. 获取累加的结果
val result: JobExecutionResult = env.execute()
println(result.getAccumulatorResult("count_name").toString)
}
}
class MyFlatMapFunction extends RichFlatMapFunction[String, String] {
//1. 创建累加器
private val counter = new IntCounter
override def open(parameters: Configuration): Unit = {
//2. 注册累加器(同名累加器只能注册一次,所以放在open中)
getRuntimeContext.addAccumulator("count_name", counter)
}
override def flatMap(value: String, out: Collector[String]): Unit = {
//3. 当达成了某种条件时累加
counter.add(1)
out.collect(value)
}
}
12 Window和Time(****)
12.1 window
12.1.1 场景
最近一段时间、每隔一段时间、最近多少条数据的统计
实时计算,但是对结果的实时性要求不高
对数据延迟也可以接受的场景,可以使用window
12.1.2 概念
将无界的数据流拆分为有界的数据流。flink支持大致上两种窗口:时间驱动和数据驱动
12.1.3 分类
- 时间窗口
- 滚动时间窗口
- 滑动时间窗口
- 会话窗口
- 数据窗口(计数窗口)
- 滑动计数窗口
- 滚动计数窗口
12.1.3.1 滚动窗口
特点:
时间对齐
窗口长度固定
没有重叠
比如:
求某个时间段的聚合,使用滚动就比较合适
12.1.3.2 滑动窗口
特点:
窗口长度固定
有重叠
比如:
求近几天的数据统计
12.1.3.3 代码测试
package cn.qphone.flink.day6
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows
import org.apache.flink.streaming.api.windowing.time.Time
//import org.apache.flink.streaming.api.windowing.time.Time
object Demo2_Window {
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
import org.apache.flink.api.scala._
val socket: DataStream[String] = env.socketTextStream("146.56.208.76", 6666)
/**
* 20201111 chongqing 3
* 20201112 hangzhou 2
*/
socket.map(line => {
val fields: Array[String] = line.split("\\s+")
val date: String = fields(0).trim
val province: String = fields(1)
val add: Int = fields(2).toInt
(date+"_"+province, add)
}).keyBy(0)
// .timeWindow(Time.seconds(5)) // 滚动窗口,只统计当前窗口数据
// .timeWindow(Time.seconds(5), Time.seconds(5)) // 滑动窗口
// .countWindow(3) // 滚动
// .countWindow(5,2) // 滑动
// .window(EventTimeSessionWindows.withGap(Time.milliseconds(1000)))
.sum(1).print()
env.execute()
}
}
12.1.3.4 窗口聚合函数
- sum
- reduce
package cn.qphone.flink.day6
import org.apache.flink.api.common.functions.AggregateFunction
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.api.windowing.time.Time
//import org.apache.flink.streaming.api.windowing.time.Time
object Demo2_Window {
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
import org.apache.flink.api.scala._
val socket: DataStream[String] = env.socketTextStream("146.56.208.76", 6666)
/**
* 20201111 chongqing 3
* 20201112 hangzhou 2
*/
socket.map(line => {
val fields: Array[String] = line.split("\\s+")
val date: String = fields(0).trim
val province: String = fields(1)
val add: Int = fields(2).toInt
(date+"_"+province, add)
}).keyBy(0)
.timeWindow(Time.seconds(1)) // 滚动窗口,只统计当前窗口数据
// .timeWindow(Time.seconds(5), Time.seconds(5)) // 滑动窗口
// .countWindow(3) // 滚动
// .countWindow(5,2) // 滑动
// .window(EventTimeSessionWindows.withGap(Time.milliseconds(1000)))
.aggregate(new AggregateFunction[(String,Int), (String,Int,Int), (String,Int)] {
/**
* 初始化
* @return
*/
override def createAccumulator(): (String, Int, Int) = ("", 0, 0)
/**
* 每读取一条记录,累加一次
* @return
*/
override def add(value: (String, Int), accumulator: (String, Int, Int)): (String, Int, Int) = {
val cnt: Int = accumulator._2 + 1
val sum: Int = accumulator._3 + value._2
(value._1, cnt, sum)
}
/**
* 获取结果
* @param accumulator
* @return
*/
override def getResult(accumulator: (String, Int, Int)): (String, Int) = {
(accumulator._1, accumulator._3 / accumulator._2)
}
/**
* 多个分区结果合并
* @return
*/
override def merge(partition1: (String, Int, Int), partition2: (String, Int, Int)): (String, Int, Int) = {
val cnt: Int = partition1._2 + partition2._2
val sum: Int = partition1._3 + partition2._3
(partition1._1, cnt, sum)
}
}).print()
env.execute()
}
}
12.1.3.5 窗口处理函数
package cn.qphone.flink.day6
import org.apache.flink.api.java.tuple.Tuple
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector
object Demo2_Window {
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
import org.apache.flink.api.scala._
val socket: DataStream[String] = env.socketTextStream("146.56.208.76", 6666)
/**
* 20201111 chongqing 3
* 20201112 hangzhou 2
*/
socket.map(line => {
val fields: Array[String] = line.split("\\s+")
val date: String = fields(0).trim
val province: String = fields(1)
val add: Int = fields(2).toInt
(date+"_"+province, add)
}).keyBy(0)
.timeWindow(Time.seconds(1)) // 滚动窗口,只统计当前窗口数据
// .timeWindow(Time.seconds(5), Time.seconds(5)) // 滑动窗口
// .countWindow(3) // 滚动
// .countWindow(5,2) // 滑动
// .window(EventTimeSessionWindows.withGap(Time.milliseconds(1000)))
.process[(String, Double)](new ProcessWindowFunction[(String, Int), (String, Double), Tuple, TimeWindow] {
override def process(key: Tuple, context: Context, elements: Iterable[(String, Int)], out: Collector[(String, Double)]): Unit = {
var cnt:Int = 0
var sum:Double = 0.0
//遍历记录
elements.foreach(record => {
cnt = cnt + 1
sum = sum + record._2
})
out.collect((key.getField(0), sum / cnt))
}
}).print()
env.execute()
}
}
12.1.4 触发器
trigger——触发器:触发窗口的执行操作
- EventTimeTrigger
- ProcessingTimeTrigger
- CountTrigger
如果用户没有设置触发器,flink会调用自己的默认触发器
12.1.4.1 自定义触发器
- CONTINUE : 对窗口不执行任何操作
- FIRE_AND_PURGE :对窗口的数据按照我们设计的窗口的代码来进行计算,并输出结果,最后清除窗口的数据
- FIRE :对窗口的数据按照我们设计的窗口的代码来进行计算,并输出结果。计算完毕之后窗口的数据不会被清除
- PURGE :将窗口的数据清除
package cn.qphone.flink.day6
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.triggers.{Trigger, TriggerResult}
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
object Demo3_Trigger {
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
import org.apache.flink.api.scala._
val socket: DataStream[String] = env.socketTextStream("146.56.208.76", 6666)
/**
* 20201111 chongqing 3
* 20201112 hangzhou 2
*/
socket.map(line => {
val fields: Array[String] = line.split("\\s+")
val date: String = fields(0).trim
val province: String = fields(1)
val add: Int = fields(2).toInt
(date+"_"+province, add)
}).keyBy(0)
.timeWindow(Time.seconds(5)) // 滚动窗口,只统计当前窗口数据
.trigger(new MyTrigger) // 设置触发器
.sum(1)
.print()
env.execute()
}
}
class MyTrigger extends Trigger[(String,Int), TimeWindow] {
var cnt:Int = 0
/**
* 每读取一个元素,此方法会被自动调用
*/
override def onElement(element: (String, Int), timestamp: Long, window: TimeWindow, ctx: Trigger.TriggerContext): TriggerResult = {
// 注册时间触发器
ctx.registerProcessingTimeTimer(window.maxTimestamp()) // 当前窗口的最大值
println(window.maxTimestamp())
if (cnt > 5) {
println("触发的计数窗口")
cnt = 0
TriggerResult.FIRE
}else {
cnt = cnt + 1
TriggerResult.CONTINUE
}
}
/**
* 当ProcessingTime的定时器被触发的时候调用
* @return
*/
override def onProcessingTime(time: Long, window: TimeWindow, ctx: Trigger.TriggerContext): TriggerResult = {
println("触发的时间窗口")
TriggerResult.FIRE
}
/**
* 当eventTime的定时器被触发的时候调用
*/
override def onEventTime(time: Long, window: TimeWindow, ctx: Trigger.TriggerContext): TriggerResult = {
TriggerResult.CONTINUE
}
/**
* 窗口清除的时候被调用
*/
override def clear(window: TimeWindow, ctx: Trigger.TriggerContext): Unit = {
ctx.deleteProcessingTimeTimer(window.maxTimestamp())
}
}
12.1.5 watermark——水位线
当flink以eventTime处理数据流的时候,由于网络延迟、分布式等原因,flink接收事件的先后顺序往往不是严格按照事件的eventTime顺序排列的,这就是所谓的乱序。
watermark是一种衡量eventTime进展的机制,通常与window结合使用,用来决定窗口何时触发计算。
[00:00:00, 00:00:03]
[00:00:03, 00:00:06]
[00:00:06, 00:00:09]
…
[00:00:57, 00:01:00]
12.1.5.1 有序的watermark
package cn.qphone.flink.day6
import java.text.SimpleDateFormat
import org.apache.flink.api.java.tuple.Tuple
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
import org.apache.flink.streaming.api.scala.function.RichWindowFunction
import org.apache.flink.streaming.api.watermark.Watermark
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector // 导入flink 隐式转换函数
object Demo10_WaterMark {
/**
* lixi 12312312312312312
* @param args
*/
def main(args: Array[String]): Unit = {
//1. 获取到流式的环境对象
val env = StreamExecutionEnvironment.getExecutionEnvironment
val socket: DataStream[String] = env.socketTextStream("146.56.208.76", 6666)
socket.filter(_.nonEmpty).map(line => {
val fields: Array[String] = line.split("\\s+")
//name, timestamp
(fields(0), fields(1).toLong)
}) // 分配时间戳和水位线
.assignTimestampsAndWatermarks(new AssignerWithPeriodicWatermarks[(String, Long)] {
var maxTimestamp = 0L // 迄今位置最大的窗口的时间戳
val maxLazy = 10000L // 允许最大的延迟时间
val format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
//获取当前的水位
override def getCurrentWatermark: Watermark = new Watermark(maxTimestamp - maxLazy)
//分配时间戳
override def extractTimestamp(element: (String, Long), previousElementTimestamp: Long): Long = {
val now_ts: Long = element._2 // event timestamp carried by the record
maxTimestamp = Math.max(now_ts, maxTimestamp)
val now_water_ts: Long = getCurrentWatermark.getTimestamp
println(s"eventTime->${format.format(now_ts)}")
println(s"maxTimestamp->${format.format(maxTimestamp)}")
println(s"watermark->${format.format(now_water_ts)}")
now_ts
}
}).keyBy(0).timeWindow(Time.seconds(3))
.apply(new RichWindowFunction[(String,Long),String, Tuple, TimeWindow] {
val format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
override def apply(key: Tuple, window: TimeWindow, input: Iterable[(String, Long)], out: Collector[String]): Unit = {
val lst: List[(String, Long)] = input.iterator.toList.sortBy(_._2)
val startTime: String = format.format(window.getStart)
val endTime: String = format.format(window.getEnd)
val res = s"first eventTime --> ${format.format(lst.head._2)}," +
s"last eventTime --> ${format.format(lst.last._2)}," +
s"window startTime --> ${startTime}," +
s"window endTime --> ${endTime}"
out.collect(res)
}
}).print()
env.execute()
}
}
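To try this out (assuming the same host and port as in the code), start nc -lk 6666 and feed lines of the form name timestampInMillis. With maxLazy = 10000 and 3-second windows, the window covering a given timestamp only fires once an event arrives whose timestamp is at least roughly windowEnd + 10 s, because the watermark trails the largest timestamp seen so far by 10 seconds; elements of that window arriving before this point are still included in the aggregation.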
13 Table & SQL(*****)
13.1 Dependencies
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-table</artifactId>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-table-common</artifactId>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-table-planner_2.11</artifactId>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-table-api-java-bridge_2.11</artifactId>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-table-api-scala-bridge_2.11</artifactId>
</dependency>
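The dependencies above are listed without <version> tags; in an actual pom.xml each of them needs a version matching the Flink release you run against (for instance managed through a single ${flink.version} property), and the Scala suffix (_2.11) has to match your project's Scala version.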
13.2 table quick start
package cn.qphone.flink.day6
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.table.api.Table
import org.apache.flink.table.api.scala.StreamTableEnvironment
import org.apache.flink.types.Row
object Demo4_Table_QuickStart {
def main(args: Array[String]): Unit = {
//1. get the streaming execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
import org.apache.flink.api.scala._ // Flink's implicit type information
//2. create the table environment on top of it
val tenv: StreamTableEnvironment = StreamTableEnvironment.create(env)
//3. read the source data through the streaming environment
val socket: DataStream[String] = env.socketTextStream("146.56.208.76", 6666)
val data: DataStream[(String, Int)] = socket.map(line => {
val fields: Array[String] = line.split("\\s+")
val date: String = fields(0).trim
val province: String = fields(1)
val add: Int = fields(2).toInt
(date + "_" + province, add)
})
//4. convert the DataStream into a Table (default field names _1, _2)
var table: Table = tenv.fromDataStream(data)
//5. query the table with the Table API
table = table.select("_1,_2").where("_2>2")
//6. convert the Table back into a DataStream and print it
tenv.toAppendStream[Row](table).print("table ->")
//7. run the job
env.execute()
}
}
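With input lines in the same format as before (e.g. 20201111 chongqing 3 and 20201112 hangzhou 2), every record whose second column _2 is greater than 2 is emitted as a Row, so feeding those two sample lines should print something like table ->> 20201111_chongqing,3 (possibly prefixed with a subtask index).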
13.3 Table API with named fields
package cn.qphone.flink.day6
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.table.api.Table
import org.apache.flink.table.api.scala.StreamTableEnvironment
import org.apache.flink.types.Row
object Demo5_Table_QuickStart2 {
def main(args: Array[String]): Unit = {
//1. get the streaming execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
import org.apache.flink.api.scala._ // Flink's implicit type information
//2. create the table environment on top of it
val tenv: StreamTableEnvironment = StreamTableEnvironment.create(env)
//3. read the source data through the streaming environment
val socket: DataStream[String] = env.socketTextStream("146.56.208.76", 6666)
val data: DataStream[(String, Int)] = socket.map(line => {
val fields: Array[String] = line.split("\\s+")
val date: String = fields(0).trim
val province: String = fields(1)
val add: Int = fields(2).toInt
(date + "_" + province, add)
})
//4. convert the DataStream into a Table
import org.apache.flink.table.api.scala._ // Table API implicit conversions / expression DSL
var table: Table = tenv.fromDataStream(data, 'date_province, 'cnt) // assign a column name to each tuple field during the conversion
//5. query the table with the Table API
table = table.select("date_province,cnt").where("cnt>2")
table.printSchema()
//6. convert the Table back into a DataStream and print it
tenv.toAppendStream[Row](table).print("table ->")
//7. run the job
env.execute()
}
}
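Because the fields are named explicitly this time, table.printSchema() should print a schema along the lines of root |-- date_province: String |-- cnt: Integer (exact type names depend on the Flink version), and the query can refer to date_province and cnt instead of the positional _1/_2.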
13.4 Table API on the batch (DataSet) environment
package cn.qphone.flink.day6
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.table.api.scala.BatchTableEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.table.api.Table
import org.apache.flink.table.api.scala._
import org.apache.flink.types.Row
object Demo6_Table_Batch {
def main(args: Array[String]): Unit = {
val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
val tenv: BatchTableEnvironment = BatchTableEnvironment.create(env)
val ds: DataSet[(Int, String, String, Int)] = env.fromElements("1,lixi,man,1").map(line => {
val fields: Array[String] = line.split(",")
(fields(0).toInt, fields(1), fields(2), fields(3).toInt)
})
val table: Table = tenv.fromDataSet(ds, 'id, 'name, 'sex, 'salary)
table.groupBy("name")
.select('name, 'salary.sum as 'sum_salary)
.toDataSet[Row]
.print() // DataSet.print() triggers execution itself, so a trailing env.execute() would complain that no new sinks were defined
}
}
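With the single element "1,lixi,man,1" above, the grouped aggregation should print one row, roughly lixi,1 (one group, salary sum 1).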
13.5 SQL queries
package cn.qphone.flink.day6
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.table.api.Table
import org.apache.flink.table.api.scala.StreamTableEnvironment
import org.apache.flink.types.Row
import org.apache.flink.api.scala._ // Flink's implicit type information
import org.apache.flink.table.api.scala._ // Table API implicit conversions / expression DSL
object Demo7_SQL_QuickStart {
def main(args: Array[String]): Unit = {
//1. get the streaming execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
//2. create the table environment on top of it
val tenv: StreamTableEnvironment = StreamTableEnvironment.create(env)
//3. read the source data through the streaming environment
val socket: DataStream[String] = env.socketTextStream("146.56.208.76", 6666)
val data: DataStream[DPA] = socket.map(line => {
val fields: Array[String] = line.split("\\s+")
val date: String = fields(0).trim
val province: String = fields(1)
val add: Int = fields(2).toInt
DPA(date+"_"+province, add)
})
//4. convert the DataStream into a Table (case-class fields become the columns dp and add)
var table: Table = tenv.fromDataStream[DPA](data)
//5. run a SQL query against it
tenv.sqlQuery(
s"""
|select
|*
|from
|$table
|where
|add > 2
|""".stripMargin).toAppendStream[Row].print()
//7. run the job
env.execute()
}
}
case class DPA(dp:String, add:Int)
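Feeding the same date province add lines as before, this prints every record whose add value is greater than 2. Should the SQL parser ever reject the column name add because of a keyword clash, it can be escaped with backticks (`add`), which is how Flink SQL quotes identifiers.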
13.6 SQL with a time column
package cn.qphone.flink.day6
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.table.api.Table
import org.apache.flink.table.api.scala.StreamTableEnvironment
import org.apache.flink.types.Row
import org.apache.flink.api.scala._ // Flink's implicit type information
import org.apache.flink.table.api.scala._ // Table API implicit conversions / expression DSL
object Demo8_SQL_Time {
def main(args: Array[String]): Unit = {
//1. get the streaming execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
//2. create the table environment on top of it
val tenv: StreamTableEnvironment = StreamTableEnvironment.create(env)
//3. read the source data through the streaming environment
val socket: DataStream[String] = env.socketTextStream("146.56.208.76", 6666)
val data: DataStream[DPAT] = socket.map(line => {
val fields: Array[String] = line.split("\\s+")
val date: String = fields(0).trim
val province: String = fields(1)
val add: Int = fields(2).toInt
val ts: Long = fields(3).toLong
DPAT(date+"_"+province, add, ts)
})
//4. convert the DataStream into a Table (case-class fields become the columns dp, add and ts)
var table: Table = tenv.fromDataStream[DPAT](data)
//5. sql: tumble(timeColumn, interval) defines a tumbling group window and is placed in the GROUP BY clause; see the sketch after this example
tenv.sqlQuery(
s"""
|select
|dp,
|sum(add) as sum_cnt
|from
|$table
|group by dp
|""".stripMargin).toAppendStream[Row].print()
//7. run the job
env.execute()
}
}
case class DPAT(dp:String, add:Int, ts:Long)
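The query above groups only by dp and never touches the ts column, even though the comment mentions tumble. A minimal sketch of how the time column could actually be used, assuming ts is registered as an event-time (rowtime) attribute, which also requires assigning timestamps and watermarks on the stream first, as in 12.1.5 (same imports and env/tenv/data as in Demo8_SQL_Time):
val table: Table = tenv.fromDataStream[DPAT](data, 'dp, 'add, 'ts.rowtime)
tenv.sqlQuery(
s"""
|select
|dp,
|sum(`add`) as sum_cnt
|from
|$table
|group by dp, tumble(ts, interval '5' second)
|""".stripMargin).toAppendStream[Row].print()
Because each group window closes exactly once, the result can stay an append stream here.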
13.7 wordcount
package cn.qphone.flink.day6
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.table.api.Table
import org.apache.flink.table.api.scala.StreamTableEnvironment
import org.apache.flink.types.Row
import org.apache.flink.api.scala._ // Flink's implicit type information
import org.apache.flink.table.api.scala._ // Table API implicit conversions / expression DSL
object Demo9_SQL_Wordcount{
def main(args: Array[String]): Unit = {
//1. get the streaming execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
//2. create the table environment on top of it
val tenv: StreamTableEnvironment = StreamTableEnvironment.create(env)
//3. read the source data through the streaming environment
val socket: DataStream[String] = env.socketTextStream("146.56.208.76", 6666)
val word: DataStream[(String, Int)] = socket.flatMap(_.split("\\s+")).filter(_.nonEmpty).map((_,1))
//4. convert the DataStream into a Table, naming the tuple fields word and cnt
var table: Table = tenv.fromDataStream(word, 'word, 'cnt)
//5. sql: a grouped aggregation whose result is updated continuously, hence the retract stream below
tenv.sqlQuery(
s"""
|select
|word,
|sum(cnt)
|from
|${table}
|group by word
|""".stripMargin).toRetractStream[Row].print()
//7. run the job
env.execute()
}
}
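Because the count for a word keeps changing as new lines arrive, the result is converted with toRetractStream rather than toAppendStream: each update is emitted as a (flag, row) pair where false retracts the previous value and true adds the new one. Typing hello twice, for instance, should produce output roughly like (true,hello,1), then (false,hello,1) followed by (true,hello,2).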
14 Differences between Flink DataStream and Spark Streaming
- Data model: Spark Streaming is micro-batch; Flink is a true dataflow/streaming model.
- Deployment on YARN: Spark distinguishes client and cluster mode; Flink offers yarn-session and per-job mode.
- Resource abstraction: Spark schedules work onto executors; Flink uses task slots.
- API style: the two expose different streaming APIs (DStream vs. DataStream).
- End-to-end exactly-once: Flink supports two-phase-commit (2PC) sinks; Spark Streaming does not provide this out of the box.
- Windowing: Flink offers event-time, count and session windows with custom triggers, while Spark Streaming windows are built on micro-batch intervals.
- Latency: strictly speaking, Spark Streaming can only reach second-level latency, whereas Flink can reach millisecond-level latency.
- Execution graph: Spark builds a DAG of stages; Flink chains operators into operator chains.
- Table engine: Spark SQL's optimizer is Catalyst; Flink's Table/SQL layer is based on Calcite.