Table of Contents
1 Notes
2 map
3 flatMap
4 filter
5 keyBy
6 shuffle
7 Connect and Union
8 Simple rolling aggregation operators
9 reduce
10 process
11 Operators that repartition the stream
1 Notes
RichMapFunction and most other classes prefixed with Rich are "rich functions": besides the core transformation method, they let you override lifecycle methods such as open() and close().
open: called once per parallel instance; suitable for initialization work such as creating connections.
close: called once per parallel instance; suitable for cleanup work such as closing connections. It is called twice only when reading from a file.
See the map code below for concrete usage.
2 map
Consumes one element and produces exactly one element.
import com.atguigu.bean.WaterSensor;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
public class Flink01_TransForm_Map {
public static void main(String[] args) throws Exception {
//1. Get the stream execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
//2. Read data from a file (a socket source is left commented out as an alternative)
// DataStreamSource<String> streamSource = env.socketTextStream("localhost", 9999);
DataStreamSource<String> streamSource = env.readTextFile("input/sensor.txt");
System.out.println("111111111111111111");
SingleOutputStreamOperator<WaterSensor> map = streamSource.map(new MyMap());//.setParallelism(2);
System.out.println("22222222222222222222");
map.print();
env.execute();
}
public static class MyMap extends RichMapFunction<String,WaterSensor>{
/**
* Lifecycle method, called first. Called once per parallel instance; suitable for initialization or creating connections.
* @param parameters
* @throws Exception
*/
@Override
public void open(Configuration parameters) throws Exception {
System.out.println("open...");
}
/**
* Lifecycle method, called last. Called once per parallel instance; suitable for cleanup such as closing connections. Called twice only when reading from a file.
* @throws Exception
*/
@Override
public void close() throws Exception {
System.out.println("close....");
}
@Override
public WaterSensor map(String value) throws Exception {
//getRuntimeContext is the runtime context object; it exposes more information, such as the task name, job id, and state-related facilities
System.out.println(getRuntimeContext().getTaskName());
String[] split = value.split(",");
return new WaterSensor(split[0], Long.parseLong(split[1]), Integer.parseInt(split[2]));
}
}
}
3 flatMap
Consumes one element and produces zero or more elements (one in, many out, where "many" may be zero).
Note: lambda expressions are not recommended here. Because of type erasure, the concrete generic type cannot be recovered at runtime and everything is treated as Object, which is very inefficient; Flink therefore requires the type to be specified explicitly whenever a parameter involves generics.
The correct lambda version is shown after the code.
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;
public class Flink02_TranForm_FlatMap {
public static void main(String[] args) throws Exception {
//1. Get the stream execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
//2. Read data from a socket
DataStreamSource<String> streamSource = env.socketTextStream("localhost", 9999);
//TODO 3. Use flatMap to split each line on spaces and emit every word
streamSource.flatMap(new FlatMapFunction<String, String>() {
@Override
public void flatMap(String value, Collector<String> out) throws Exception {
String[] words = value.split(" ");
for (String word : words) {
out.collect(word);
}
}
}).print();
env.execute();
}
}
Correct lambda version (the snippet assumes import org.apache.flink.api.common.typeinfo.Types for the explicit .returns(Types.INT)):
env
.fromElements(1, 2, 3, 4, 5)
.flatMap((Integer value, Collector<Integer> out) -> {
out.collect(value * value);
out.collect(value * value * value);
}).returns(Types.INT)
.print();
4 filter
Purpose: filtering. Keeps the elements for which the predicate returns true and discards those for which it returns false.
import com.atguigu.bean.WaterSensor;
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
public class Filter {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStreamSource<String> streamSource = env.readTextFile("input/yao.txt");
streamSource.setParallelism(1);
//3. Use map to convert each line of the file into a JavaBean
SingleOutputStreamOperator<WaterSensor> waterSensorDStream = streamSource.map(new MapFunction<String, WaterSensor>() {
@Override
public WaterSensor map(String value) throws Exception {
String[] split = value.split(" ");
return new WaterSensor(split[0], Long.parseLong(split[1]), Integer.parseInt(split[2]));
}
});
//TODO 4. Use filter to keep only records whose id is s1
waterSensorDStream.filter(new FilterFunction<WaterSensor>() {
@Override
public boolean filter(WaterSensor value) throws Exception {
return "s1".equals(value.getId());
}
}).print();
env.execute();
}
}
5 keyBy
Purpose: elements with the same key are sent to the same partition; a single partition may hold several different keys.
import com.atguigu.bean.WaterSensor;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
public class Flink04_TranForm_Keyby {
public static void main(String[] args) throws Exception {
//1. Get the stream execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(4);
//2. Read data from a socket
DataStreamSource<String> streamSource = env.socketTextStream("localhost", 9999);
//3. Use map to convert each line read from the socket into a JavaBean
SingleOutputStreamOperator<WaterSensor> waterSensorDStream = streamSource.map(new MapFunction<String, WaterSensor>() {
@Override
public WaterSensor map(String value) throws Exception {
String[] split = value.split(",");
return new WaterSensor(split[0], Long.parseLong(split[1]), Integer.parseInt(split[2]));
}
}).setParallelism(2);
//TODO 4. Group records with the same id together
KeyedStream<WaterSensor, String> keyedStream = waterSensorDStream.keyBy(new KeySelector<WaterSensor, String>() {
@Override
public String getKey(WaterSensor value) throws Exception {
return value.getId();
}
});
waterSensorDStream.print("原始分区").setParallelism(2);
keyedStream.print("keyBy");
env.execute();
}
}
6 shuffle
Purpose: randomly redistributes the elements of the stream.
env
.fromElements(10, 3, 5, 9, 20, 8)
.shuffle()
.print();
env.execute();
7 Connect and Union
Connect is like two streams sharing a bed while dreaming different dreams: the elements are brought together, but each stream keeps its own type. Union is a complete blend: the streams are merged into one.
The streams being unioned must all have the same type; connect allows different types.
connect can only combine two streams; union can combine more than two.
import org.apache.flink.streaming.api.datastream.ConnectedStreams;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.CoMapFunction;
public class Connect {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
DataStreamSource<String> stringStream = env.fromElements("a", "b", "c", "d");
DataStreamSource<Integer> intStream = env.fromElements(1, 2, 3, 4);
ConnectedStreams<String, Integer> connect = stringStream.connect(intStream);
connect.map(new CoMapFunction<String, Integer, String>() {
@Override
public String map1(String value) throws Exception {
return value;
}
@Override
public String map2(Integer value) throws Exception {
return Integer.toString(value * 10);
}
}).print();
env.execute();
}
}
stream1
.union(stream2)
.union(stream3)
.print();
8 Simple rolling aggregation operators
sum, min, max, minBy, maxBy: all of these are methods of KeyedStream, so they must be used after keyBy; the result is a DataStream.
Notes:
1 Each incoming record is aggregated immediately (one output per input record).
2 The scope of the aggregation is always the key group.
3 Difference between the "By" variants (e.g. max vs. maxBy), with a sketch after these notes:
max: takes the running maximum of the specified field; any other (non-compared) fields keep the values of the first record.
maxBy: takes the running maximum of the specified field; the other fields come from the record that holds the maximum.
If two records tie on the maximum, the second parameter decides: true => the other fields keep the earlier record's values; false => the other fields take the latest record's values.
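A minimal sketch of max vs. maxBy, reusing the WaterSensor bean (id, ts, vc) from the earlier examples; the class name and sample data are made up for illustration:
import com.atguigu.bean.WaterSensor;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
public class MaxVsMaxBy {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
KeyedStream<WaterSensor, String> keyedStream = env
.fromElements(new WaterSensor("s1", 1L, 10),
new WaterSensor("s1", 2L, 50),
new WaterSensor("s1", 3L, 30))
.keyBy(WaterSensor::getId);
//max: vc rolls up to the running maximum, but ts keeps the first record's value
keyedStream.max("vc").print("max");
//maxBy: the whole record holding the maximum vc is emitted;
//the second argument handles ties: true => keep the earlier record's fields
keyedStream.maxBy("vc", true).print("maxBy");
env.execute();
}
}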
9 reduce
Purpose: an aggregation over a keyed stream. It merges the current element with the previously aggregated value to produce a new value; the returned stream contains the result of every aggregation step, not just the final one.
import com.atguigu.bean.WaterSensor;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
public class Reduce {
public static void main(String[] args) throws Exception {
//1. Get the stream execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
//2. Read data from a socket
DataStreamSource<String> streamSource = env.socketTextStream("localhost", 9999);
//3. Use map to convert each line read from the socket into a JavaBean
SingleOutputStreamOperator<WaterSensor> waterSensorDStream = streamSource.map(new MapFunction<String, WaterSensor>() {
@Override
public WaterSensor map(String value) throws Exception {
String[] split = value.split(",");
return new WaterSensor(split[0], Long.parseLong(split[1]), Integer.parseInt(split[2]));
}
});
//4. Group records with the same id together
KeyedStream<WaterSensor, Tuple> keyedStream = waterSensorDStream.keyBy("id");
//TODO 5. Use reduce to compute the maximum vc
keyedStream.reduce(new ReduceFunction<WaterSensor>() {
@Override
public WaterSensor reduce(WaterSensor value1, WaterSensor value2) throws Exception {
System.out.println("recude....");
return new WaterSensor(value1.getId(), value2.getTs(), Math.max(value1.getVc(), value2.getVc()));
}
}).print();
env.execute();
}
}
10 process
Purpose: can be called on many kinds of streams, and gives access to more information than just the element itself. Use it when Flink has no built-in operator for your logic, e.g. deduplication.
Note: on a stream before keyBy, use new ProcessFunction; on a stream after keyBy, use new KeyedProcessFunction.
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import java.util.HashMap;
public class Flink10_TransForm_Process {
public static void main(String[] args) throws Exception {
//1. Get the stream execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
//2. Read data from a socket
DataStreamSource<String> streamSource = env.socketTextStream("localhost", 9999);
//3. TODO Use process to implement flatMap: split each line and emit Tuple2 tuples
SingleOutputStreamOperator<Tuple2<String, Integer>> wordToOneStream = streamSource.process(new ProcessFunction<String, Tuple2<String, Integer>>() {
@Override
public void processElement(String value, Context ctx, Collector<Tuple2<String, Integer>> out) throws Exception {
String[] words = value.split(" ");
for (String word : words) {
out.collect(Tuple2.of(word, 1));
}
}
});
//4. Group records with the same word together
KeyedStream<Tuple2<String, Integer>, Tuple> keyedStream = wordToOneStream.keyBy(0);
//5. TODO Use process to implement sum
keyedStream.process(new KeyedProcessFunction<Tuple, Tuple2<String, Integer>, Tuple2<String, Integer>>() {
//An accumulator holding the last sum. Bug note: a plain field (the commented-out
//version below) would not distinguish keys: every key would read and modify the
//same value. The HashMap keeps one running sum per key instead.
// private Integer lastSum = 0;
private HashMap<String, Integer> lastSumMap = new HashMap<>();
@Override
public void processElement(Tuple2<String, Integer> value, Context ctx, Collector<Tuple2<String, Integer>> out) throws Exception {
//1. Check whether the current word already exists as a key in the map
if (lastSumMap.containsKey(value.f0)){
//2. Fetch the previous sum for this key
Integer lastSum = lastSumMap.get(value.f0);
//3. Add the current count
Integer curSum = lastSum + value.f1;
//4. Emit the new sum and write it back to the map
out.collect(Tuple2.of(value.f0, curSum));
lastSumMap.put(value.f0, curSum);
}else {
//First record for this key: not yet in the map
lastSumMap.put(value.f0, value.f1);
out.collect(Tuple2.of(value.f0, value.f1));
}
}
}
}).print();
env.execute();
}
}
11 Operators that repartition the stream
keyBy: groups by key first; the downstream partition is chosen by a double hash of the key (hashCode, then murmur hash).
shuffle: distributes the stream's elements to random partitions.
rebalance: distributes the elements evenly (round-robin) across all partitions; a performance optimization when the data is skewed.
rescale: distributes data evenly in a round-robin fashion like rebalance, but more efficiently: rescale does not need to go over the network and works entirely through local "pipelines". A minimal usage sketch follows.
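A minimal sketch of applying these repartitioning operators, assuming the same socket source as in the earlier examples (the class name is made up for illustration):
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
public class RepartitionDemo {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(4);
DataStreamSource<String> streamSource = env.socketTextStream("localhost", 9999);
//rebalance: every record goes round-robin across all downstream subtasks
streamSource.rebalance().print("rebalance");
//rescale: round-robin as well, but only among the downstream subtasks wired
//to the same upstream subtask, so no full network shuffle is needed
streamSource.rescale().print("rescale");
//shuffle: every record goes to a randomly chosen downstream subtask
streamSource.shuffle().print("shuffle");
env.execute();
}
}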