Contents
- DataStream API basics
- Execution environment
- Source operators
- 6 ways to add a data source
- Data types supported by Flink
- Transformation operators
- Basic transformation operators
- Aggregation operators
- User-defined functions (UDF)
- Physical partitioning
- Sink operators
- Writing to files
- Writing to Kafka
- Writing to Redis
- Writing to Elasticsearch
- Writing to JDBC
- Custom sinks
DataStream API basics
Execution environment
(1) StreamExecutionEnvironment.getExecutionEnvironment()
(2) StreamExecutionEnvironment.createLocalEnvironment()
(3) StreamExecutionEnvironment.createRemoteEnvironment()
Flink unifies stream and batch processing: when submitting a job you can choose batch, streaming, or automatic mode, e.g. -Dexecution.runtime-mode=BATCH.
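The mode can also be set in code; a minimal sketch, assuming Flink 1.12 or later (in practice the command-line flag is usually preferred, so the same program can run in either mode without recompiling):

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RuntimeModeExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // BATCH, STREAMING, or AUTOMATIC (AUTOMATIC decides based on whether all sources are bounded)
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);
        env.fromElements("hello", "world").print();
        env.execute();
    }
}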
Source operators
6 ways to add a data source
1. env.readTextFile(file path and name): bounded stream
2. env.fromCollection(collection): for testing
3. env.fromElements(elements): for testing
4. env.socketTextStream(hostname, port): unbounded stream
5. Reading data from Kafka
env.addSource(new FlinkKafkaConsumer<String>("topicname", new SimpleStringSchema(), properties));
Full example:
public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setParallelism(1);

    Properties properties = new Properties();
    properties.setProperty("bootstrap.servers", "hadoop102:9092");
    properties.setProperty("group.id", "consumer-group");
    properties.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    properties.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    properties.setProperty("auto.offset.reset", "latest");

    // Read data from Kafka
    DataStream<String> dataStream = env.addSource(new FlinkKafkaConsumer011<String>("sensor", new SimpleStringSchema(), properties));

    // Print output
    dataStream.print();
    env.execute();
}
6. Custom source
public class SourceTest_UDF {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        // Read data from the custom source
        DataStream<SensorReading> dataStream = env.addSource(new MySensorSource());

        // Print output
        dataStream.print();
        env.execute();
    }

    // A custom SourceFunction implementation
    public static class MySensorSource implements SourceFunction<SensorReading> {
        // Flag that controls whether data keeps being generated
        private boolean running = true;

        @Override
        public void run(SourceContext<SensorReading> ctx) throws Exception {
            // Random number generator
            Random random = new Random();

            // Initial temperatures for 10 sensors
            HashMap<String, Double> sensorTempMap = new HashMap<>();
            for (int i = 0; i < 10; i++) {
                sensorTempMap.put("sensor_" + (i + 1), 60 + random.nextGaussian() * 20);
            }

            while (running) {
                for (String sensorId : sensorTempMap.keySet()) {
                    // Random fluctuation around the current temperature
                    Double newTemp = sensorTempMap.get(sensorId) + random.nextGaussian();
                    sensorTempMap.put(sensorId, newTemp);
                    ctx.collect(new SensorReading(sensorId, System.currentTimeMillis(), newTemp));
                }
                // Throttle the output rate
                Thread.sleep(1000L);
            }
        }

        @Override
        public void cancel() {
            running = false;
        }
    }
}
Data types supported by Flink
Flink supports all the data types that Java and Scala support.
Tuple types and POJO types are the simplest and most flexible.
Flink also has a type system of its own.
The result type can be handed back explicitly, for example:
.returns(Types.TUPLE(Types.STRING, Types.STRING));
.returns(new TypeHint<Tuple2<String, String>>(){});
A TypeHint records the type details inside the generics, so it is more precise than the first approach.
Transformation operators
When using lambda expressions, beware of generic type erasure; declare the result type explicitly with the methods above, e.g. .returns(new TypeHint<String>(){}).
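A minimal sketch of the erasure problem and the fix; lines is an assumed DataStream<String>, and the word/count tuple layout is only for illustration:

// flatMap with a lambda: the Tuple2 element type is erased at runtime,
// so Flink needs an explicit hint via returns(...)
DataStream<Tuple2<String, Long>> wordAndOne = lines
        .flatMap((String line, Collector<Tuple2<String, Long>> out) -> {
            for (String word : line.split(" ")) {
                out.collect(Tuple2.of(word, 1L));
            }
        })
        .returns(Types.TUPLE(Types.STRING, Types.LONG));
        // equivalent: .returns(new TypeHint<Tuple2<String, Long>>(){})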
Basic transformation operators (see the sketch below)
- map (one to one)
- filter (one to zero or one)
- flatMap, "flatten + map", roughly map and filter combined (one to zero, one, or many)
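A minimal sketch of the three operators; dataStream and inputStream are assumed to be the SensorReading stream and the raw text stream from the earlier examples, and the 30-degree threshold is an arbitrary example value:

// map: one in, one out; extract the sensor id
DataStream<String> idStream = dataStream.map(SensorReading::getId);

// filter: one in, zero or one out; keep only readings above 30 degrees
DataStream<SensorReading> hotStream = dataStream.filter(r -> r.getTemperature() > 30.0);

// flatMap: one in, zero/one/many out; split a CSV line into individual fields
DataStream<String> fieldStream = inputStream
        .flatMap((String line, Collector<String> out) -> {
            for (String field : line.split(",")) {
                out.collect(field);
            }
        })
        .returns(Types.STRING);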
Aggregation operators (see the sketch below)
- keyBy: a logical partitioning (the key's hash code is taken modulo the number of partitions; partitioning is the prerequisite for aggregation)
- max: only updates the tracked field; the record's other fields keep the values from the first record seen
- maxBy: emits the entire record that holds the maximum of the tracked field
- reduce: rolling aggregation, expressive enough for most needs
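A minimal sketch contrasting max, maxBy and reduce, assuming the SensorReading POJO and the dataStream from the earlier examples:

KeyedStream<SensorReading, String> keyedStream = dataStream.keyBy(SensorReading::getId);

// max("temperature"): only the temperature field advances; timestamp stays at its first value
DataStream<SensorReading> maxTemp = keyedStream.max("temperature");

// maxBy("temperature"): the entire record holding the maximum temperature is emitted
DataStream<SensorReading> maxByTemp = keyedStream.maxBy("temperature");

// reduce: keep the maximum temperature but always the latest timestamp
DataStream<SensorReading> reduced = keyedStream.reduce((r1, r2) ->
        new SensorReading(r1.getId(), r2.getTimestamp(),
                Math.max(r1.getTemperature(), r2.getTemperature())));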
User-defined functions (UDF)
Custom function classes
Rich function classes: they can access the runtime context and come with lifecycle methods, so they can implement more complex logic. What makes a rich function "rich" are the extra open and close methods, which are especially suited to work such as managing database connections.
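A minimal sketch of a rich function, assuming the SensorReading POJO from earlier; the database work is only hinted at in the comments:

public static class MyRichMapper extends RichMapFunction<SensorReading, String> {
    @Override
    public void open(Configuration parameters) throws Exception {
        // Called once per parallel instance before any records arrive,
        // e.g. the place to open a database connection
        System.out.println("open: subtask " + getRuntimeContext().getIndexOfThisSubtask());
    }

    @Override
    public String map(SensorReading value) throws Exception {
        return value.getId();
    }

    @Override
    public void close() throws Exception {
        // Called once when the task shuts down, e.g. the place to close the connection
        System.out.println("close");
    }
}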
Physical partitioning
1. Random partitioning: a full reshuffle
stream.shuffle().setParallelism(4): distributes records randomly and roughly evenly; can help with data skew
2. Round-robin partitioning: dealing cards in turn
stream.rebalance().setParallelism(4): when the upstream and downstream parallelism differ, round-robin (rebalance) is the system default
3. rescale (rescaling partitioning): form groups first, then round-robin within each group
In some scenarios this is an optimization of rebalance.
For example: with 2 TaskManagers of 2 slots each, records are dealt round-robin inside each TaskManager, avoiding network transfer and improving efficiency.
4. Broadcast partitioning: every incoming record is processed by all 4 downstream partitions
stream.broadcast().setParallelism(4)
5. Global partitioning: everything is merged into a single partition (the first downstream subtask); use with caution!!
stream.global().setParallelism(4)
6. Custom repartitioning: can also address data skew, rarely used in practice (see the sketch after this list)
Difference between rebalance and rescale: rebalance round-robins across all downstream subtasks and may cross TaskManagers, while rescale round-robins only within its local group of downstream subtasks, so no data crosses the network.
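A minimal sketch of custom repartitioning with partitionCustom; the integer stream and the modulo routing rule are only for illustration:

DataStream<Integer> numbers = env.fromElements(1, 2, 3, 4, 5, 6, 7, 8);

numbers
        .partitionCustom(
                new Partitioner<Integer>() {
                    @Override
                    public int partition(Integer key, int numPartitions) {
                        // Decide the downstream channel for each key
                        return key % numPartitions;
                    }
                },
                // KeySelector: use the value itself as the key
                (KeySelector<Integer, Integer>) value -> value)
        .print()
        .setParallelism(2);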
Sink operators
Writing to files
StreamingFileSink<String> streamingFileSink = StreamingFileSink
        .<String>forRowFormat(new Path("./output"), new SimpleStringEncoder<>("UTF-8"))
        .withRollingPolicy(
                // Rolling policy: roll to a new file when the part exceeds 1 GB,
                // every 15 minutes, or after 5 minutes without writes
                DefaultRollingPolicy.builder()
                        .withMaxPartSize(1024 * 1024 * 1024)
                        .withRolloverInterval(TimeUnit.MINUTES.toMillis(15))
                        .withInactivityInterval(TimeUnit.MINUTES.toMillis(5))
                        .build())
        .build();

stream.map(data -> data.toString())
        .addSink(streamingFileSink);
Writing to Kafka
The Kafka sink can provide the highest level of state consistency (end-to-end exactly-once).
stream.addSink(new FlinkKafkaProducer<String>("hadoop102:9092", "events", new SimpleStringSchema()));
stream.addSink(new FlinkKafkaProducer<String>(brokerList, topicId, new SimpleStringSchema()));
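For end-to-end exactly-once, the constructor that takes a Semantic is used instead; a minimal sketch, assuming the universal FlinkKafkaProducer and the properties object from the Kafka source example (the transaction timeout value is an illustrative choice and must not exceed the broker's transaction.max.timeout.ms):

properties.setProperty("transaction.timeout.ms", "600000");

stream.addSink(new FlinkKafkaProducer<String>(
        "events",
        // KafkaSerializationSchema: build the ProducerRecord for each element
        (KafkaSerializationSchema<String>) (element, timestamp) ->
                new ProducerRecord<>("events", element.getBytes(StandardCharsets.UTF_8)),
        properties,
        FlinkKafkaProducer.Semantic.EXACTLY_ONCE));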
Writing to Redis
public class SinkTest2_Redis {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        // Read data from a file
        DataStream<String> inputStream = env.readTextFile("D:\\Projects\\BigData\\FlinkTutorial\\src\\main\\resources\\sensor.txt");

        // Convert to SensorReading
        DataStream<SensorReading> dataStream = inputStream.map(line -> {
            String[] fields = line.split(",");
            return new SensorReading(fields[0], new Long(fields[1]), new Double(fields[2]));
        });

        // Jedis connection configuration
        FlinkJedisPoolConfig config = new FlinkJedisPoolConfig.Builder()
                .setHost("localhost")
                .setPort(6379)
                .build();

        dataStream.addSink(new RedisSink<>(config, new MyRedisMapper()));

        env.execute();
    }

    // Custom RedisMapper
    public static class MyRedisMapper implements RedisMapper<SensorReading> {
        // The Redis command used to store the data: a hash, i.e. hset sensor_temp id temperature
        @Override
        public RedisCommandDescription getCommandDescription() {
            return new RedisCommandDescription(RedisCommand.HSET, "sensor_temp");
        }

        @Override
        public String getKeyFromData(SensorReading data) {
            return data.getId();
        }

        @Override
        public String getValueFromData(SensorReading data) {
            return data.getTemperature().toString();
        }
    }
}
Writing to Elasticsearch
Old style:
public class SinkTest3_Es {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        // Read data from a file
        DataStream<String> inputStream = env.readTextFile("D:\\Projects\\BigData\\FlinkTutorial\\src\\main\\resources\\sensor.txt");

        // Convert to SensorReading
        DataStream<SensorReading> dataStream = inputStream.map(line -> {
            String[] fields = line.split(",");
            return new SensorReading(fields[0], new Long(fields[1]), new Double(fields[2]));
        });

        // Elasticsearch connection configuration
        ArrayList<HttpHost> httpHosts = new ArrayList<>();
        httpHosts.add(new HttpHost("localhost", 9200));

        dataStream.addSink(new ElasticsearchSink.Builder<SensorReading>(httpHosts, new MyEsSinkFunction()).build());

        env.execute();
    }

    // Custom Elasticsearch write logic
    public static class MyEsSinkFunction implements ElasticsearchSinkFunction<SensorReading> {
        @Override
        public void process(SensorReading element, RuntimeContext ctx, RequestIndexer indexer) {
            // The document source to write
            HashMap<String, String> dataSource = new HashMap<>();
            dataSource.put("id", element.getId());
            dataSource.put("temp", element.getTemperature().toString());
            dataSource.put("ts", element.getTimestamp().toString());

            // Build the index request that will be sent to Elasticsearch
            IndexRequest indexRequest = Requests.indexRequest()
                    .index("sensor")
                    .type("readingdata")
                    .source(dataSource);

            // Hand the request to the indexer
            indexer.add(indexRequest);
        }
    }
}
Writing to JDBC
Old style:
public class SinkTest4_Jdbc {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        // Read data from a file (alternative source, commented out)
        // DataStream<String> inputStream = env.readTextFile("D:\\Projects\\BigData\\FlinkTutorial\\src\\main\\resources\\sensor.txt");
        //
        // // Convert to SensorReading
        // DataStream<SensorReading> dataStream = inputStream.map(line -> {
        //     String[] fields = line.split(",");
        //     return new SensorReading(fields[0], new Long(fields[1]), new Double(fields[2]));
        // });

        // Generate data with the custom source
        DataStream<SensorReading> dataStream = env.addSource(new SourceTest4_UDF.MySensorSource());

        dataStream.addSink(new MyJdbcSink());

        env.execute();
    }

    // Custom SinkFunction
    public static class MyJdbcSink extends RichSinkFunction<SensorReading> {
        // Connection and prepared statements
        Connection connection = null;
        PreparedStatement insertStmt = null;
        PreparedStatement updateStmt = null;

        @Override
        public void open(Configuration parameters) throws Exception {
            connection = DriverManager.getConnection("jdbc:mysql://localhost:3306/test", "root", "123456");
            insertStmt = connection.prepareStatement("insert into sensor_temp (id, temp) values (?, ?)");
            updateStmt = connection.prepareStatement("update sensor_temp set temp = ? where id = ?");
        }

        // For every record, execute SQL over the open connection
        @Override
        public void invoke(SensorReading value, Context context) throws Exception {
            // Try the update first; if nothing was updated, insert instead
            updateStmt.setDouble(1, value.getTemperature());
            updateStmt.setString(2, value.getId());
            updateStmt.execute();
            if (updateStmt.getUpdateCount() == 0) {
                insertStmt.setString(1, value.getId());
                insertStmt.setDouble(2, value.getTemperature());
                insertStmt.execute();
            }
        }

        @Override
        public void close() throws Exception {
            insertStmt.close();
            updateStmt.close();
            connection.close();
        }
    }
}
New style:
stream.addSink(JdbcSink.sink(
        "insert into clicks (user, url) values (?, ?)",
        (statement, event) -> {
            statement.setString(1, event.user);
            statement.setString(2, event.url);
        },
        new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
                .withUrl("jdbc:mysql://localhost:3306/test")
                .withDriverName("com.mysql.jdbc.Driver")
                .withUsername("root")
                .withPassword("123456")
                .build()
));
Custom sinks
Writing from Flink to HBase:
extend RichSinkFunction;
override three methods: open, invoke, and close (see the sketch below).
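A minimal sketch of such an HBase sink, assuming the HBase Java client is on the classpath and a table sensor_temp with column family info already exists (all connection settings, table and column names here are illustrative):

public static class MyHBaseSink extends RichSinkFunction<SensorReading> {
    private transient org.apache.hadoop.hbase.client.Connection connection;
    private transient Table table;

    @Override
    public void open(Configuration parameters) throws Exception {
        // Open the HBase connection once per parallel instance
        org.apache.hadoop.conf.Configuration hbaseConf = HBaseConfiguration.create();
        hbaseConf.set("hbase.zookeeper.quorum", "hadoop102:2181");
        connection = ConnectionFactory.createConnection(hbaseConf);
        table = connection.getTable(TableName.valueOf("sensor_temp"));
    }

    @Override
    public void invoke(SensorReading value, Context context) throws Exception {
        // One Put per record: row key = sensor id, column info:temp = temperature
        Put put = new Put(Bytes.toBytes(value.getId()));
        put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("temp"),
                Bytes.toBytes(value.getTemperature().toString()));
        table.put(put);
    }

    @Override
    public void close() throws Exception {
        // Release HBase resources
        if (table != null) table.close();
        if (connection != null) connection.close();
    }
}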