As we know, Flink is an analytics engine that is stateful by default. To avoid losing in-memory data when a Task dies mid-processing, Flink introduces the State and Checkpoint mechanisms; State is Flink's in-memory state facility, and Flink provides two basic kinds of state.

1. Basic States: Keyed State and Operator State

1) Keyed State

Keyed State: as the name suggests, this is state on a KeyedStream, bound to a specific key; every key in the KeyedStream has its own corresponding state. This means keyed state can only be used on a KeyedStream, which you obtain via stream.keyBy(...). Flink provides the following data structures for keyed state (a combined usage sketch follows the list):

(1) ValueState<T>

Keeps a value that can be updated and retrieved (scoped, as noted above, to the key of the current input element, so the operator may hold one value per key it sees). The value is updated with update(T) and read with T value().

(2) ListState<T>

Keeps a list of elements. You can append elements and retrieve an Iterable over the current list. Elements are added with add(T) or addAll(List<T>), the entire list is obtained with Iterable<T> get(), and the list can be overwritten with update(List<T>).

(3) MapState<UK,UV>

Maintains a set of mappings. You can put key/value pairs into the state and obtain an iterator over all current mappings. Entries are added with put(UK,UV) or putAll(Map<UK,UV>), and a single value is fetched with get(UK). Iterable views over the mappings, keys, and values are available via entries(), keys(), and values() respectively.

(4) ReducingState<T>

Keeps a single value that represents the aggregation of all values added to the state. The interface is similar to ListState, but elements added with add(T) are combined into the aggregate by the provided ReduceFunction.

(5) AggregatingState<IN,OUT>

Keeps a single value that represents the aggregation of all values added to the state. Unlike ReducingState, the aggregate type may differ from the type of the elements added to the state. The interface is similar to ListState, but elements added with add(IN) are aggregated with the specified AggregateFunction.

(6) FoldingState<T,ACC>

Keeps a single value that represents the aggregation of all values added to the state. As with AggregatingState, the aggregate type may differ from the element type. The interface is similar to ListState, but elements added with add(T) are folded into the aggregate by the specified FoldFunction. Note that FoldingState has been deprecated (since Flink 1.4) in favor of AggregatingState.
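As a quick illustration, here is a minimal sketch (the class name StateTypesDemo is made up for this example) showing how several of these state types are declared in a rich function's open() method and used per key:

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.state.ReducingState;
import org.apache.flink.api.common.state.ReducingStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Hypothetical demo; only usable on a KeyedStream (after stream.keyBy(...)).
class StateTypesDemo extends RichFlatMapFunction<Long, Long> {
	private transient ListState<Long> listState;
	private transient MapState<String, Long> mapState;
	private transient ReducingState<Long> sumState;

	@Override
	public void open(Configuration parameters) {
		listState = getRuntimeContext().getListState(
				new ListStateDescriptor<>("values", Long.class));
		mapState = getRuntimeContext().getMapState(
				new MapStateDescriptor<>("counts", String.class, Long.class));
		// ReducingState keeps one aggregated value per key
		sumState = getRuntimeContext().getReducingState(
				new ReducingStateDescriptor<>("sum", Long::sum, Long.class));
	}

	@Override
	public void flatMap(Long value, Collector<Long> out) throws Exception {
		listState.add(value);        // append to this key's list
		mapState.put("last", value); // upsert a map entry for this key
		sumState.add(value);         // combined into the running sum by the ReduceFunction
		out.collect(sumState.get()); // read back the current aggregate
	}
}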

2) Operator State

Operator State is not bound to a key but to an operator: each parallel operator instance holds a single piece of state. For example, Flink's KafkaConnector uses operator state: every connector instance stores the mapping of all (partition, offset) pairs for the topics it consumes.
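A common way to use managed operator state in user code is the CheckpointedFunction interface. The following minimal sketch (a buffering sink, closely following the pattern in the official documentation) snapshots an in-memory buffer into an operator-scoped ListState on each checkpoint and restores it after a failure:

import java.util.ArrayList;
import java.util.List;

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.sink.SinkFunction;

// Buffering sink sketch; the state belongs to the operator instance, not to a key.
class BufferingSink implements SinkFunction<String>, CheckpointedFunction {
	private transient ListState<String> checkpointedState;
	private final List<String> buffer = new ArrayList<>();

	@Override
	public void invoke(String value, Context context) {
		buffer.add(value); // a real sink would flush once a threshold is reached
	}

	@Override
	public void snapshotState(FunctionSnapshotContext context) throws Exception {
		// copy the in-memory buffer into operator state at every checkpoint
		checkpointedState.clear();
		checkpointedState.addAll(buffer);
	}

	@Override
	public void initializeState(FunctionInitializationContext context) throws Exception {
		checkpointedState = context.getOperatorStateStore().getListState(
				new ListStateDescriptor<>("buffered-elements", String.class));
		if (context.isRestored()) {
			for (String element : checkpointedState.get()) {
				buffer.add(element); // repopulate the buffer after recovery
			}
		}
	}
}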

2. Forms of State: Managed and Raw

Keyed state and operator state each exist in two forms: managed and raw.

1) Managed State

Managed state is represented by data structures controlled by the Flink runtime, such as internal hash tables or RocksDB; "ValueState", "ListState", and so on are examples. The Flink runtime encodes these states and writes them into checkpoints.

2) Raw State

Raw state is kept in the operator's own data structures. At checkpoint time, Flink knows nothing about its contents and simply writes a byte sequence into the checkpoint.

All datastream functions can use managed state, whereas raw state is only available when implementing operators directly. The official recommendation is managed state (rather than raw state), because Flink can redistribute managed state better when the parallelism is changed and can manage memory more effectively.

If your managed state requires custom serialization logic, do not change Flink's default serialization format, so as to preserve compatibility going forward.

3. State Time-To-Live (TTL)

A time-to-live (TTL) can be assigned to keyed state of any type. If a TTL is configured and a state value has expired, the stored value is cleaned up on a best-effort basis. All state collection types support per-entry TTLs, meaning that list elements and map entries expire independently.

To use state TTL, you first build a StateTtlConfig configuration object, then pass it to the state descriptor to enable the TTL feature:

import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;

StateTtlConfig ttlConfig = StateTtlConfig
    .newBuilder(Time.seconds(1))
    .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
    .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
    .build();
// e.g. enabling TTL on a ValueState
ValueStateDescriptor<String> stateDescriptor = new ValueStateDescriptor<>("text state", String.class);
stateDescriptor.enableTimeToLive(ttlConfig);

The TTL configuration offers the following options:
The newBuilder argument specifies how long the data stays valid and is mandatory.

There are two TTL update strategies:
OnCreateAndWrite: the TTL is refreshed only on creation and write access (default)
OnReadAndWrite: the TTL is refreshed on read access as well as on write

The visibility of data that has expired but not yet been cleaned up is configured as follows:
NeverReturnExpired: expired data behaves as if it no longer exists and is never returned, whether or not it has been physically removed (default)
ReturnExpiredIfNotCleanedUp: expired data is returned until it has been physically removed
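To make the effect concrete, here is a minimal, hypothetical sketch (the class name and state name are made up): with OnCreateAndWrite and NeverReturnExpired, value() returns null again once an entry outlives its TTL, so the key looks brand new:

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.configuration.Configuration;

// Tags each key as "first seen" or "seen before", forgetting keys older than one hour.
class FirstSeenWithinOneHour extends RichMapFunction<String, String> {
	private transient ValueState<Boolean> seen;

	@Override
	public void open(Configuration parameters) {
		StateTtlConfig ttlConfig = StateTtlConfig
				.newBuilder(Time.hours(1)) // entries live for one hour
				.setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite) // writes refresh the timer
				.setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
				.build();
		ValueStateDescriptor<Boolean> descriptor =
				new ValueStateDescriptor<>("seen", Boolean.class);
		descriptor.enableTimeToLive(ttlConfig);
		seen = getRuntimeContext().getState(descriptor);
	}

	@Override
	public String map(String key) throws Exception {
		// once the TTL elapses, value() returns null again and the key counts as new
		boolean firstSeen = seen.value() == null;
		seen.update(true);
		return key + (firstSeen ? ": first seen" : ": seen before");
	}
}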
 

4. Common State Usage Examples

Here we take MapState<UK,UV> as an example: it stores key/value mappings and is often a good fit for de-duplicating records by some field.

(1) MapState<UK,UV>

Input file orderLog.txt:

{"pin":"zhansan","orderId":"20201011231245423","skuId":"1226354","priceType":"new","requestTime":"1599931959000"}
{"pin":"lisi","orderId":"20201011231254678","skuId":"1226322","priceType":"normal","requestTime":"1599931359024"}
{"pin":"zhansan","orderId":"20201011231212768","skuId":"1226324","priceType":"back","requestTime":"1599931359011"}
{"pin":"lisi","orderId":"20201011231234567","skuId":"1226351","priceType":"normal","requestTime":"1599932029000"}
{"pin":"wanwu","orderId":"20201011231245424","skuId":"1226354","priceType":"new","requestTime":"1599931959000"}

The OrderLog entity class (Lombok @Data):

@Data
class OrderLog {
	private String orderId;
	
	private String skuId;
	
	private String priceType;
	
	private Long requestTime;
	
	private Long sum = 1L;
	
	private String pin;
}

Task: count the number of users who placed orders today, de-duplicated on the user account field pin.

The key code is as follows:

class OutPutWindowProcessFunction extends ProcessWindowFunction <OrderLog, OrderLog, String, TimeWindow>  {
	private static final long serialVersionUID = -6632888020403733197L;

	private MapState<String, Integer> mapState = null;
	
	@Override
	public void open(Configuration parameters) throws Exception {
		// build the TTL expiration config
		StateTtlConfig stateTtlConfig = StateTtlConfig.newBuilder(org.apache.flink.api.common.time.Time.days(1))
				.setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
                .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
                .build();
		
		 MapStateDescriptor<String, Integer> mapStateDescriptor = new MapStateDescriptor<>(
					 "day-pin-state", 
					 TypeInformation.of(String.class), 
					 TypeInformation.of(Integer.class)
				 );
		 // attach the TTL config to the descriptor
		 mapStateDescriptor.enableTimeToLive(stateTtlConfig);
		 // obtain the MapState handle
	     mapState = getRuntimeContext().getMapState(mapStateDescriptor);
	}
	
	@Override
	public void process(String key, ProcessWindowFunction<OrderLog, OrderLog, String, TimeWindow>.Context ctx,
			Iterable<OrderLog> it, Collector<OrderLog> collect) throws Exception {
		Iterator<OrderLog> iterator = it.iterator();
		while (iterator.hasNext()) {
			OrderLog orderLog = iterator.next();
			// if this pin is not yet in mapState, record it and emit the element downstream
			if(mapState.get(orderLog.getPin()) == null) {
				mapState.put(orderLog.getPin(), 1);
				collect.collect(orderLog);
			}
		}
	}
}

The complete code is as follows:

import java.util.Iterator;

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

import com.alibaba.fastjson.JSON;

import lombok.Data;

public class TestSideOutputStream {
	public static final OutputTag<OrderLog> LATE_OUTPUT_TAG = new OutputTag<>("LATE_OUTPUT_TAG", TypeInformation.of(OrderLog.class));
     
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        DataStreamSource<String> textDataSteam = env.readTextFile("C:\\Users\\admin\\Desktop\\orderLog.txt");
        
        SingleOutputStreamOperator<OrderLog> dayPvDataStream = textDataSteam
        		.flatMap(new OutPutMapFunction())
        		.assignTimestampsAndWatermarks(new AssignedWaterMarks(Time.seconds(3))) // assign watermarks with a 3-second out-of-orderness allowance
        		.keyBy(OrderLog::getPin)
        		.window(SlidingEventTimeWindows.of(Time.minutes(1), Time.seconds(5)))  // every 5 seconds, evaluate the most recent 1 minute
        		.allowedLateness(Time.minutes(30))      // accept data up to 30 minutes late
        		.sideOutputLateData(TestSideOutputStream.LATE_OUTPUT_TAG) // tag late data for the side output
//        		.aggregate(new CountAggregateFunction(), new OutResultWindowFunction());// alternative: aggregate incrementally and emit
        		.process(new OutPutWindowProcessFunction());
        
        dayPvDataStream.addSink(new SideOutPutSinkFunction()); // sink: persist the results
        
        dayPvDataStream.getSideOutput(TestSideOutputStream.LATE_OUTPUT_TAG) // fetch the late data via the output tag
        .keyBy(OrderLog::getPin)
        .window(TumblingEventTimeWindows.of(Time.seconds(3)))  // process late data in 3-second tumbling windows
        .process(new OutPutWindowProcessFunction())
        .addSink(new SideOutPutSinkFunction2()); // incrementally process the late data and persist it
        
        env.execute();
    }
}

/**
 * Watermark assigner that drives event-time processing
 */
class AssignedWaterMarks extends BoundedOutOfOrdernessTimestampExtractor<OrderLog> {
	private static final long serialVersionUID = 2021421640499388219L;
	
	public AssignedWaterMarks(Time maxOutOfOrderness) {
		super(maxOutOfOrderness);
	}

    @Override
    public long extractTimestamp(OrderLog orderLog) {
        return orderLog.getRequestTime();
    }
}

/**
 * FlatMap transformation: parses each JSON line into an OrderLog
 */
class OutPutMapFunction extends RichFlatMapFunction<String, OrderLog> {
	private static final long serialVersionUID = -6478853684295335571L;

	@Override
	public void flatMap(String value, Collector<OrderLog> out) throws Exception {
		OrderLog orderLog = JSON.parseObject(value, OrderLog.class);
		out.collect(orderLog);
	}
	
}

/**
 * Window function: de-duplicates orders by pin using MapState
 */
class OutPutWindowProcessFunction extends ProcessWindowFunction <OrderLog, OrderLog, String, TimeWindow>  {
	private static final long serialVersionUID = -6632888020403733197L;

	private MapState<String, Integer> mapState = null;
	
	@Override
	public void open(Configuration parameters) throws Exception {
		// build the TTL expiration config
		StateTtlConfig stateTtlConfig = StateTtlConfig.newBuilder(org.apache.flink.api.common.time.Time.days(1))
				.setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
                .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
                .build();
		
		 MapStateDescriptor<String, Integer> mapStateDescriptor = new MapStateDescriptor<>(
					 "day-pin-state", 
					 TypeInformation.of(String.class), 
					 TypeInformation.of(Integer.class)
				 );
		 // attach the TTL config to the descriptor
		 mapStateDescriptor.enableTimeToLive(stateTtlConfig);
		 // obtain the MapState handle
	     mapState = getRuntimeContext().getMapState(mapStateDescriptor);
	}
	
	@Override
	public void process(String key, ProcessWindowFunction<OrderLog, OrderLog, String, TimeWindow>.Context ctx,
			Iterable<OrderLog> it, Collector<OrderLog> collect) throws Exception {
		Iterator<OrderLog> iterator = it.iterator();
		while (iterator.hasNext()) {
			OrderLog orderLog = iterator.next();
			// if this pin is not yet in mapState, record it and emit the element downstream
			if(mapState.get(orderLog.getPin()) == null) {
				mapState.put(orderLog.getPin(), 1);
				collect.collect(orderLog);
			}
		}
	}
}

/**
 * Sink function for the main stream
 */
class SideOutPutSinkFunction extends RichSinkFunction<OrderLog>  {
	private static final long serialVersionUID = -6632888020403733197L;
	
	@Override
	public void invoke(OrderLog orderLog, Context context) throws Exception {
		// your own storage logic goes here, e.g. accumulate with Redis INCR
		System.out.println(orderLog.getPin() +"="+ orderLog.getSum());
	}
	
}
/**
 * Sink function for the late-data stream
 */
class SideOutPutSinkFunction2 extends RichSinkFunction<OrderLog>  {
	private static final long serialVersionUID = -6632888020403733197L;
	
	@Override
	public void invoke(OrderLog orderLog, Context context) throws Exception {
		// your own storage logic goes here, e.g. accumulate with Redis INCR
//		System.out.println(orderLog.getPin() +"===="+ orderLog.getSum());
	}
	
}

The sink prints the following:

lisi=1
wanwu=1
zhansan=1
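Each pin is emitted only once even though zhansan and lisi each appear twice in the input: the per-key MapState remembers a pin for one day (the TTL), so repeated orders from the same user are filtered out.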

(2) ValueState<Tuple2<Long, Long>>

Below is an example from the official documentation implementing a simple counting window. We use the first field of the tuple as the key (in the example, every key is 1). The function stores the count of elements seen so far and their running sum in a ValueState; once the count reaches 2, it emits the average downstream and clears the state to start over.

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class TestSideOutputStream3 {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        	
     // this can be used in a streaming program like this (assuming we have a StreamExecutionEnvironment env)
        env.fromElements(Tuple2.of(1L, 3L), Tuple2.of(1L, 5L), Tuple2.of(1L, 7L), Tuple2.of(1L, 4L), Tuple2.of(1L, 2L))
                .keyBy(0) // key by the first tuple field
                .flatMap(new CountWindowAverage())
                .print();

        env.execute();
    }
}

class CountWindowAverage extends RichFlatMapFunction<Tuple2<Long, Long>, Tuple2<Long, Long>> {

	private static final long serialVersionUID = -8115459900572098047L;
	
	/**
     * The ValueState handle. The first field is the count, the second field a running sum.
     */
    private transient ValueState<Tuple2<Long, Long>> sum;

    @Override
    public void open(Configuration config) {
    	
        ValueStateDescriptor<Tuple2<Long, Long>> descriptor =
                new ValueStateDescriptor<>(
                        "average", // the state name
                        TypeInformation.of(new TypeHint<Tuple2<Long, Long>>() {}) // type information
                        ); // no default value here; the null case is handled in flatMap
        sum = getRuntimeContext().getState(descriptor);
    }
    
    @Override
    public void flatMap(Tuple2<Long, Long> input, Collector<Tuple2<Long, Long>> out) throws Exception {

        // access the state value
        Tuple2<Long, Long> currentSum = sum.value();

        if(currentSum == null) {
        	currentSum = new Tuple2<>(0L, 0L); 
        }
       
        // update the count
        currentSum.f0 += 1;

        // add the second field of the input value
        currentSum.f1 += input.f1;

        // update the state
        sum.update(currentSum);

        // if the count reaches 2, emit the average and clear the state
        if (currentSum.f0 >= 2) {
            out.collect(new Tuple2<>(input.f0, currentSum.f1 / currentSum.f0));
            sum.clear();
        }
    }

}
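With the five input elements above, the count reaches 2 twice, so the job prints (1,4) (from (3 + 5) / 2) and (1,5) (from (7 + 4) / 2, integer division); the final element (1,2) remains buffered in state because the count never reaches 2 again.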