1、Flink状态去重场景
在Flink运行的时候,往往是无休止的运行,在整个Flink程序运行的长河中,往往会出现很多状态的出现,那么状态的生命周期,也就是创建、使用和销毁,那么在我们写flink程序过程中,往往不需要关注flink 状态的清理,flink内部就会对我们的状态进行清理,例如我们开一个10分钟的窗口,那么在这十分钟的窗口中,这个状态也就是会发生创建、使用和销毁,那么我这里问大家一个问题?就是窗口结束后,状态会销毁吗。这里有一个场景,也就是说当我们开一个一天的窗口,计算当天的消费人数,那么这个时候一个用户可能会有多次消费,那么肯定要去重,肯定很多人想到了Redis,Flink State。这个时候有个疑问诞生了,我们用Redis是可以去重的,并且会持久化,还可以根据key来进行判断当天日期,这是很多人的一个操作,那么flink state也是可以的,我们可以在不管是process方法还是其他的一些方法中开辟一个状态,例如MapState。类似于下面的这段代码
private transient MapState<Student, Object> mapState;
private transient final Object OBJECT_VALUE = new Object();
这个时候我们可以在窗口上开辟state,将userid放入map中来进行去重,当然使用KeyedState的时候一定要key by。当然还有non-keyed state 这里我们就不细细讨论了。那么我们现在有个想法,那就是我不开窗口我怎么实现去重的操作,因为开辟一天的窗口,在数据量很大的情况下,并且重复数据很多,真实数据并不多的场景下,开辟一天的窗口,可能压力会大一些。我的想法是,在一个keyby算子之后process方法中对数据进行去重,并不开窗口,那么随着时间的流动,状态会很大,我们要的是一分钟的数据去重,下一分钟再去重下一个分钟内的,这个操作我们怎么解决这个问题?
2、StateTtlConfig的引入
可以将生存时间(TTL)分配给任何类型的keyed State。如果配置了 TTL 并且状态值已过期,则将尽最大努力清理存储的值。
为了使用statettl,必须首先构建StateTtlConfig配置对象。然后,通过以下配置,可以在任何StateDescript 中启用TTL功能:
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;
StateTtlConfig ttlConfig = StateTtlConfig
.newBuilder(Time.seconds(1))
.setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
.setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
.build();
ValueStateDescriptor<String> stateDescriptor = new ValueStateDescriptor<>("text state", String.class);
stateDescriptor.enableTimeToLive(ttlConfig);
- Time.seconds(1) 周期过期时间
- setUpdateType 更新类型
- setStateVisibility 是否在访问state的时候返回过期值
setUpdateType:
StateTtlConfig.UpdateType.OnCreateAndWrite
- 只在创建和写的时候清除 (默认)StateTtlConfig.UpdateType.OnReadAndWrite
- 在读和写的时候清除
setStateVisibility:
StateTtlConfig.StateVisibility.NeverReturnExpired
- 从不返回过期值StateTtlConfig.StateVisibility.ReturnExpiredIfNotCleanedUp
- 如果仍然可用,则返回
在 NeverReturnExpired 的情况下,过期状态的就好像它不再存在一样,即使它未被删除。这个选项对于 TTL 之后之前的数据数据不可用。
另一个选项 ReturnExpiredIfNotCleanedUp 允许在清理之前返回数据,也就是说他ttl过期了,数据还没有被删除的时候,仍然可以访问。
那么接下来我们就看看怎么用吧:
3、实践场景
我现在有这样的一个需求,我想不开窗口在一分钟,一小时,一天的时候清理数据,也就是清理去重器。
因为是企业场景,将代码进行转换了其他业务类型不进行展示。
Hash 去重
public class HashDuplicationStrategy implements DuplicationStrategy, Serializable {
private transient MapState<Student, Object> studentSetState;
private transient MapState<TEACHERDO, Object> teacherSetState;
private transient final Object OBJECT_VALUE = new Object();
public HashDuplicationStrategy(RuntimeContext context, int cleanTime) {
StateTtlConfig stateTtlConfig = StateTtlConfig.newBuilder(Time.seconds(cleanTime)).setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
.setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
.build();
MapStateDescriptor<Student, Object> studentSetStateDes =
new MapStateDescriptor<>(StateDescriptor.STUDENT_SET_NAME_PREFIX + cleanTime, Student.class, Object.class);
MapStateDescriptor<Teacher, Object> teacherSetStateDes =
new MapStateDescriptor<>(StateDescriptor.TEACHER_SET_NAME_PREFIX+ cleanTime, Teacher.class, Object.class);
studentSetStateDes.enableTimeToLive(stateTtlConfig);
teacherSetStateDes.enableTimeToLive(stateTtlConfig);
studentSetState = context.getMapState(studentSetStateDes);
teacherSetState = context.getMapState(teacherSetStateDes);
}
@Override
public boolean doDuplications(Person person) throws Exception {
if (person instanceof Student) {
Student student = (Student) person;
if (!studentSetState.contains(student)) {
studentSetState.put(student, OBJECT_VALUE);
return true;
}
}
if (person instanceof TEACHERDO) {
Teacher teacher = (Teacher) person;
if (!teacherSetState.contains(teacher)) {
teacherSetState.put(teacher, OBJECT_VALUE);
return true;
}
}
return false;
}
}
BoomFilter 去重
public class BloomfilterDuplicationStrategy implements DuplicationStrategy {
private transient ValueState<BloomFilter> studentBmState;
private transient ValueState<BloomFilter> teacherBmState;
private int defaultExpect = 1000000;
private double defaultFpp = 0.01;
public BloomfilterDuplicationStrategy(RuntimeContext context, int cleanTime) {
StateTtlConfig stateTtlConfig = StateTtlConfig.newBuilder(Time.seconds(cleanTime)).setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
.setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
.build();
ValueStateDescriptor<BloomFilter> studentBmStateDes =
new ValueStateDescriptor<BloomFilter>(StateDescriptor.STUDENT_BM_NAME_PREFIX + cleanTime, BloomFilter.class);
ValueStateDescriptor<BloomFilter> teacherBmStateDes =
new ValueStateDescriptor<BloomFilter>(StateDescriptor.TEACHER_BM_NAME_PREFIX + cleanTime, BloomFilter.class);
studentBmStateDes.enableTimeToLive(stateTtlConfig);
teacherBmStateDes.enableTimeToLive(stateTtlConfig);
studentBmState = context.getState(studentBmStateDes);
teacherBmState = context.getState(teacherBmStateDes);
}
@Override
public boolean doDuplications(Person person) throws Exception {
BloomFilter studentBloomFilter = studentBmState.value();
if (studentBloomFilter == null) {
studentBloomFilter = BloomFilter.create(Funnels.integerFunnel(), defaultExpect, defaultFpp);
studentBmState.update(studentBloomFilter);
}
BloomFilter teacherBloomFilter = teacherBmState.value();
if (teacherBloomFilter == null) {
teacherBloomFilter = BloomFilter.create(Funnels.integerFunnel(), defaultExpect, defaultFpp);
teacherBmState.update(teacherBloomFilter);
}
if (Person instanceof Student) {
Student student = (Student) person;
//可以判断一定不包含
if (!studentBloomFilter.mightContain(student.hashCode())) {
studentBloomFilter.put(student.hashCode());
studentBmState.update(studentBloomFilter);
return true;
}
}
if (person instanceof Teacher) {
Teacher teacher = (Teacher) person;
//可以判断一定不包含
if (!teacherBloomFilter.mightContain(teacher.hashCode())) {
teacherBloomFilter.put(teacher.hashCode());
teacherBmState.update(teacherBloomFilter);
return true;
}
}
return false;
}
}