Table of Contents
- 1. Common DataStream Transformation Operators
- 1.1 Map
- 1.2 FlatMap
- 1.3 Filter
- 1.4 KeyBy
- 1.5 Reduce
- 1.6 Fold
- 1.7 Aggregations
- 1.8 Union
- 1.9 Connect
- 1.10 Split + Select
- 2. Side Streams
- 2.1 Splitting with Filter
- 2.2 Splitting with Split + Select
- 2.3 Side Output
- 3. Custom Partitioner
1. Common DataStream Transformation Operators
- Sample data (data/wc.txt)
hadoop,spark,flink
spark,hadoop
spark
1.1 Map
Takes one element and produces one element. A map function that doubles the values of the input stream
- This is one of the simplest transformations: the input is a data stream and the output is also a data stream.
def main(args: Array[String]): Unit = {
  val env = StreamExecutionEnvironment.getExecutionEnvironment
  env.setParallelism(4)
  val stream = env.readTextFile("data/wc.txt")
  stream.map(_.toLowerCase)
    .print()
  env.execute(this.getClass.getSimpleName)
}
- Console output
3> hadoop,spark,flink
4> spark,hadoop
2> spark
1.2 FlatMap
Takes one element and produces zero, one, or more elements. A flatmap function that splits sentences to words
- It is like flattening the nested data first and then applying a Map operation.
- FlatMap takes one record and emits zero, one, or more records.
def main(args: Array[String]): Unit = {
  val env = StreamExecutionEnvironment.getExecutionEnvironment
  env.setParallelism(4)
  val stream = env.readTextFile("data/wc.txt")
  // split each line on "," and emit one record per word
  stream.flatMap(_.toLowerCase.split(","))
    .print()
  env.execute(this.getClass.getSimpleName)
}
- hadoop,spark,flink ==> hadoop spark flink
- Console output
1> spark
1> hadoop
4> hadoop
3> spark
4> spark
4> flink
1.3 Filter
Evaluates a boolean function for each element and retains those for which the function returns true. A filter that filters out zero values
- Filters the data, keeping only the elements for which the function returns true.
- The Filter function evaluates a condition for each element; a short sketch follows below.
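A minimal sketch, assuming the same data/wc.txt file and a made-up predicate that keeps only lines containing "spark":
def main(args: Array[String]): Unit = {
  val env = StreamExecutionEnvironment.getExecutionEnvironment
  env.setParallelism(4)
  val stream = env.readTextFile("data/wc.txt")
  // keep only the lines that contain "spark"; everything else is dropped
  stream.filter(_.toLowerCase.contains("spark"))
    .print()
  env.execute(this.getClass.getSimpleName)
}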
1.4 KeyBy
Logically partitions a stream into disjoint partitions, each partition containing elements of the same key. Internally, this is implemented with hash partitioning. See keys on how to specify keys. This transformation returns a KeyedStream
- KeyBy logically partitions the stream by key. Internally it partitions the stream with a hash function and returns a KeyedStream.
def main(args: Array[String]): Unit = {
  val env = StreamExecutionEnvironment.getExecutionEnvironment
  env.setParallelism(4)
  val stream = env.readTextFile("data/wc.txt")
  stream.flatMap(_.toLowerCase.split(","))
    .map((_, 1))
    .keyBy(_._1)
    .print()
  env.execute(this.getClass.getSimpleName)
}
- Result
4> (hadoop,1)
1> (spark,1)
4> (flink,1)
4> (hadoop,1)
1> (spark,1)
1> (spark,1)
1.5 Reduce
A "rolling" reduce on a keyed data stream. Combines the current element with the last reduced value and emits the new value.
A reduce function that creates a stream of partial sums
- Reduce returns a single result value, and the reduce operation produces a new value every time it processes an element. Common aggregations such as average, sum, min, max, and count can all be implemented with reduce (see the rolling-max sketch after the output below).
def main(args: Array[String]): Unit = {
  val env = StreamExecutionEnvironment.getExecutionEnvironment
  env.setParallelism(4)
  val stream = env.readTextFile("data/wc.txt")
  stream.flatMap(_.toLowerCase.split(","))
    .map((_, 1))
    .keyBy(_._1)
    .reduce((x, y) => {
      (x._1, x._2 + y._2)
    })
    .print()
  env.execute(this.getClass.getSimpleName)
}
- The operation above is equivalent to sum.
- Console output
1> (spark,1)
4> (hadoop,1)
4> (flink,1)
1> (spark,2)
4> (hadoop,2)
1> (spark,3)
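As mentioned above, other aggregations such as a rolling max can also be expressed with reduce. A minimal sketch on made-up (key, value) data built with fromElements:
def main(args: Array[String]): Unit = {
  val env = StreamExecutionEnvironment.getExecutionEnvironment
  env.setParallelism(1)
  // made-up in-memory data, just to illustrate a rolling max per key
  env.fromElements(("spark", 3), ("spark", 7), ("hadoop", 5), ("spark", 2))
    .keyBy(_._1)
    // keep the element with the larger value seen so far for each key
    .reduce((x, y) => if (x._2 >= y._2) x else y)
    .print()
  env.execute(this.getClass.getSimpleName)
}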
1.6 Fold
A "rolling" fold on a keyed data stream with an initial value. Combines the current element with the last folded value and emits the new value.
def fold[R: TypeInformation](initialValue: R)(fun: (R, T) => R): DataStream[R] = {
  if (fun == null) {
    throw new NullPointerException("Fold function must not be null.")
  }
  val cleanFun = clean(fun)
  val folder = new FoldFunction[T, R] {
    def fold(acc: R, v: T) = {
      cleanFun(acc, v)
    }
  }
  fold(initialValue, folder)
}
- This API uses a curried parameter list.
- The first parameter supplies the initial value; the second is a function that is applied within each key group, starting from that initial value.
def main(args: Array[String]): Unit = {
  val env = StreamExecutionEnvironment.getExecutionEnvironment
  env.setParallelism(4)
  val stream = env.readTextFile("data/wc.txt")
  stream.flatMap(_.toLowerCase.split(","))
    .map((_, 1))
    .keyBy(_._1)
    .fold(1)((x, y) => {
      x + y._2
    })
    .print()
  env.execute(this.getClass.getSimpleName)
}
- Result
1> 2
4> 2
1> 3
4> 3
4> 2
1> 4
1.7 Aggregations
Rolling aggregations on a keyed data stream. The difference between min and minBy is that min returns the minimum value, whereas minBy returns the element that has the minimum value in this field (same for max and maxBy).
The DataStream API supports aggregations such as min, max, and sum. These functions are applied to a KeyedStream to obtain rolling aggregations.
KeyedStream.sum(0)
KeyedStream.sum("key")
KeyedStream.min(0)
KeyedStream.min("key")
KeyedStream.max(0)
KeyedStream.max("key")
KeyedStream.minBy(0)
KeyedStream.minBy("key")
KeyedStream.maxBy(0)
KeyedStream.maxBy("key")
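For example, the word-count sum from the Reduce section can be written with the built-in sum aggregation; a minimal sketch, assuming the same data/wc.txt file:
def main(args: Array[String]): Unit = {
  val env = StreamExecutionEnvironment.getExecutionEnvironment
  env.setParallelism(4)
  val stream = env.readTextFile("data/wc.txt")
  stream.flatMap(_.toLowerCase.split(","))
    .map((_, 1))
    .keyBy(_._1)
    // sum(1) is a rolling sum over the count field (tuple position 1),
    // equivalent to the reduce example in section 1.5
    .sum(1)
    .print()
  env.execute(this.getClass.getSimpleName)
}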
1.8 Union
Union of two or more data streams creating a new stream containing all the elements from all the streams. Note: If you union a data stream with itself you will get each element twice in the resulting stream.
The Union function combines two or more data streams so that they can be processed together. If a stream is unioned with itself, each record appears twice in the result.
- union only works on streams whose element types match; unioning streams of different types results in an error.
- union can take several streams as arguments at once (see the sketch after the output below).
def main(args: Array[String]): Unit = {
  val env = StreamExecutionEnvironment.getExecutionEnvironment
  env.setParallelism(4)
  val stream1 = env.readTextFile("data/wc.txt")
  val stream2 = env.readTextFile("data/wc.txt")
  stream1.union(stream2)
    .print()
  env.execute(this.getClass.getSimpleName)
}
- Result
2> spark,hadoop
3> spark
4> spark
4> hadoop,spark,flink
1> spark,hadoop
1> hadoop,spark,flink
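As noted above, union accepts several streams in one call; a minimal sketch reusing stream1 and stream2 from the example, plus a hypothetical stream3 of the same element type:
// all three streams have element type String, so they can be unioned together
val stream3 = env.readTextFile("data/wc.txt")
stream1.union(stream2, stream3)
  .print()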
1.9 Connect
"Connects" two data streams retaining their types, allowing for shared state between the two streams.
- Two streams with different element types can be combined into one stream with this operator.
def main(args: Array[String]): Unit = {
  val env = StreamExecutionEnvironment.getExecutionEnvironment
  env.setParallelism(4)
  val stream1 = env.readTextFile("data/wc.txt")
  val stream2 = env.readTextFile("data/wc.txt").map(("test", _))
  stream1.connect(stream2)
    .map(x => x, y => y)
    .print()
  env.execute(this.getClass.getSimpleName)
}
- Result
2> spark
1> (test,spark)
3> (test,spark,hadoop)
4> spark,hadoop
2> (test,hadoop,spark,flink)
3> hadoop,spark,flink
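In the example above the two map functions return different types, so the element type of the result is only a common supertype. More commonly both branches are mapped to one concrete type; a minimal sketch, mapping both sides of the connected stream to String:
stream1.connect(stream2)
  // the first function handles elements of stream1, the second those of stream2;
  // both return String, so the result is a DataStream[String]
  .map(s => s, t => t._1 + ":" + t._2)
  .print()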
1.10 Split + Select
- Split
Split the stream into two or more streams according to some criterion.
- Select
Select one or more streams from a split stream.
- A concrete example is given below, together with side streams.
2. Side Streams
- Most DataStream API operators produce a single output, i.e. a stream of one data type. The split operator can split one stream into several streams, but all of them share the same data type. The side outputs feature of a process function can produce several streams whose data types may differ. A side output is defined as an OutputTag[X] object, where X is the data type of the output stream, and a process function can emit an event to one or more side outputs through its Context object.
- Suppose we have this requirement: given a dataset of province, city, and name, extract the records of a particular province.
- Sample data (data/city.txt)
安徽省,合肥市,spark
安徽省,芜湖市,hadoop
浙江省,杭州市,flink
浙江省,宁波市,hbase
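The examples below import City from com.xk.bigdata.flink.datastream.domain. Its definition is not shown here; judging from how it is used it is presumably a plain case class along these lines (the field name for the third column is an assumption):
package com.xk.bigdata.flink.datastream.domain

// province and city are used in the examples below; "name" is an assumed
// field name for the third comma-separated column of data/city.txt
case class City(province: String, city: String, name: String)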
2.1 Splitting with Filter
- This approach is easy to write, but the whole dataset has to be gone through once for every metric you want to extract.
package com.xk.bigdata.flink.datastream.sideoutput

import com.xk.bigdata.flink.datastream.domain.City
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.scala._

object SimpleSideOutput {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(4)
    val stream = env.readTextFile("data/city.txt")
    val cityStream = stream.map(x => {
      val splits = x.split(",")
      City(splits(0), splits(1), splits(2))
    })
    val zhejiang = cityStream.filter(_.province == "浙江省")
    val anhui = cityStream.filter(_.province == "安徽省")
    zhejiang.print()
    anhui.print()
    env.execute(this.getClass.getSimpleName)
  }
}
- Result
2> City(安徽省,合肥市,spark)
1> City(浙江省,宁波市,hbase)
4> City(浙江省,杭州市,flink)
2> City(安徽省,芜湖市,hadoop)
2.2 Splitting with Split + Select
- With this approach the file only needs to be read once: under the hood every record is tagged, and each select picks out only the records with the requested tag. The drawback is that a stream can only be split once. For example, after getting the Anhui data we cannot split it again to pick out the Hefei records; doing so makes the job fail at runtime.
package com.xk.bigdata.flink.datastream.sideoutput

import com.xk.bigdata.flink.datastream.domain.City
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.scala._

object SplitSideOutut {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(4)
    val stream = env.readTextFile("data/city.txt")
    val cityStream = stream.map(x => {
      val splits = x.split(",")
      City(splits(0), splits(1), splits(2))
    })
    val splitStream = cityStream.split(x => {
      if (x.province == "浙江省") {
        Seq("zhejiang")
      } else if (x.province == "安徽省") {
        Seq("anhui")
      } else {
        Seq("other")
      }
    })
    splitStream.select("anhui").print()
    splitStream.select("zhejiang").print()
    env.execute(this.getClass.getSimpleName)
  }
}
- Result
2> City(安徽省,合肥市,spark)
1> City(浙江省,宁波市,hbase)
4> City(浙江省,杭州市,flink)
2> City(安徽省,芜湖市,hadoop)
2.3 Side Output
- This is the tagging approach currently recommended by the official documentation.
package com.xk.bigdata.flink.datastream.sideoutput

import com.xk.bigdata.flink.datastream.domain.City
import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.streaming.api.scala.{OutputTag, StreamExecutionEnvironment, _}
import org.apache.flink.util.Collector

object SideOututApp {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(4)
    val stream = env.readTextFile("data/city.txt")
    val cityStream = stream.map(x => {
      val splits = x.split(",")
      City(splits(0), splits(1), splits(2))
    })
    val zhejiangTag = new OutputTag[City]("zhejiang")
    val anhuiTag = new OutputTag[City]("anhui")
    val otherTag = new OutputTag[City]("other")
    val sideStream = cityStream.process(new ProcessFunction[City, City] {
      override def processElement(value: City, ctx: ProcessFunction[City, City]#Context, out: Collector[City]): Unit = {
        if (value.province == "浙江省") {
          ctx.output(zhejiangTag, value)
        } else if (value.province == "安徽省") {
          ctx.output(anhuiTag, value)
        } else {
          ctx.output(otherTag, value)
        }
      }
    })
    val zhejiang = sideStream.getSideOutput(zhejiangTag)
    val anhui = sideStream.getSideOutput(anhuiTag)
    // Split the Anhui side output further into Hefei and Wuhu records
    val hefeiTag = new OutputTag[City]("hefei")
    val wuhuTag = new OutputTag[City]("wuhu")
    val anhuiCityStream = anhui.process(new ProcessFunction[City, City] {
      override def processElement(value: City, ctx: ProcessFunction[City, City]#Context, out: Collector[City]): Unit = {
        if (value.city == "合肥市") {
          ctx.output(hefeiTag, value)
        } else if (value.city == "芜湖市") {
          ctx.output(wuhuTag, value)
        }
      }
    })
    anhuiCityStream.getSideOutput[City](hefeiTag).print()
    env.execute(this.getClass.getSimpleName)
  }
}
- Result
3> City(安徽省,合肥市,spark)
3. Custom Partitioner
- Using the same data as above, suppose we want all records of the same province to land in the same partition. For that we need a custom partitioner.
package com.xk.bigdata.flink.datastream.partition

import com.xk.bigdata.flink.datastream.domain.City
import org.apache.flink.api.common.functions.Partitioner
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.scala._

object MyPartitionApp {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(3)
    val stream = env.readTextFile("data/city.txt")
    val cityStream = stream.map(x => {
      val splits = x.split(",")
      City(splits(0), splits(1), splits(2))
    })
    cityStream.partitionCustom[String](new MyPartition, 0)
      .map(x => {
        println("thread id is " + Thread.currentThread().getId + "values:" + x)
        x
      })
      .print()
    env.execute(this.getClass.getSimpleName)
  }
}

class MyPartition extends Partitioner[String] {

  override def partition(key: String, numPartitions: Int): Int = {
    println("==========numPartitions:" + numPartitions)
    if (key == "浙江省") {
      0
    } else if (key == "安徽省") {
      1
    } else {
      2
    }
  }
}
- Result
==========numPartitions:3
==========numPartitions:3
==========numPartitions:3
==========numPartitions:3
thread id is 72values:City(浙江省,杭州市,flink)
thread id is 73values:City(安徽省,合肥市,spark)
1> City(浙江省,杭州市,flink)
2> City(安徽省,合肥市,spark)
thread id is 72values:City(浙江省,宁波市,hbase)
thread id is 73values:City(安徽省,芜湖市,hadoop)
1> City(浙江省,宁波市,hbase)
2> City(安徽省,芜湖市,hadoop)