Background

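A few days ago someone in a Flink study group asked a question: they want to monitor user session behavior in real time. If a user clicks event A in the current session and does not click event B within 1 hour, the real-time stream should emit event C.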

Take an e-commerce page as an example.

[Figure: Flink real-time monitoring of user session trajectories to trigger recommendations]

Relevant Flink concepts

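1: Flink state. Because we aggregate by session, we need keyBy plus a process function.
2: State is managed inside Flink's KeyedProcessFunction.
3: The onTimer callback of KeyedProcessFunction is then used to run the check when the registered time is reached.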

Note:

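TimerService internally maintains two kinds of timers (processing-time timers and event-time timers) and queues them for execution. It deduplicates timers per key and timestamp, so each key has at most one timer per timestamp. If several timers are registered for the same key and timestamp, onTimer() is called only once.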
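The deduplication rule in the note above can be illustrated with a small toy model. This is only a sketch in plain Scala: `ToyTimerService` and `ToyTimerDemo` are hypothetical names and not part of Flink, and the model copies only the "one timer per key and timestamp" behavior, nothing else about the real TimerService.

```scala
import scala.collection.mutable

// Toy model only: NOT Flink's TimerService, just the dedup rule described above.
final class ToyTimerService {
  // Timers are stored as a set of (key, timestamp) pairs,
  // so registering the same pair twice is a no-op.
  private val timers = mutable.Set.empty[(String, Long)]

  def registerEventTimeTimer(key: String, timestamp: Long): Unit =
    timers += ((key, timestamp))

  // Fire every timer whose timestamp is <= the watermark and return the
  // fired (key, timestamp) pairs, ordered by timestamp and then by key.
  def advanceWatermark(watermark: Long): List[(String, Long)] = {
    val due = timers.filter(_._2 <= watermark).toList.sortBy(t => (t._2, t._1))
    timers --= due
    due
  }
}

object ToyTimerDemo {
  def firedAfterDuplicateRegistration(): List[(String, Long)] = {
    val service = new ToyTimerService
    service.registerEventTimeTimer("0000015", 5000L)
    service.registerEventTimeTimer("0000015", 5000L) // same key and timestamp: deduplicated
    service.registerEventTimeTimer("0000016", 5000L) // different key: a separate timer
    service.advanceWatermark(6000L)
  }

  def main(args: Array[String]): Unit =
    firedAfterDuplicateRegistration().foreach(println)
}
```

In this model the key "0000015" fires once even though it was registered twice, which is why the job further down does not need to guard against registering the same timer again for a session.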

Without further ado, here is the code.

Kafka producer code:

    import java.util.Properties

    import kafka.producer.{KeyedMessage, Producer, ProducerConfig}

    import scala.io.Source

    object kafkaProduct {

      def test1(): Unit = {
        val brokers_list = "localhost:9092"
        val topic = "flink2"
        val props = new Properties()
        props.put("group.id", "test-flink")
        props.put("metadata.broker.list", brokers_list)
        props.put("serializer.class", "kafka.serializer.StringEncoder")
        props.put("num.partitions", "4")
        val config = new ProducerConfig(props)
        val producer = new Producer[String, String](config)
        // Send every line of the test file, keyed by a random value in 0..2
        for (line <- Source.fromFile("/Users/huzechen/Downloads/flinktest/src/main/resources/cep1").getLines) {
          val aa = scala.util.Random.nextInt(3).toString
          println(aa)
          producer.send(new KeyedMessage(topic, aa, line))
        }
        producer.close()
      }

      def main(args: Array[String]): Unit = {
        test1()
      }
    }

Kafka test data (mock records you write yourself):

    {"session_id":"0000015","event_id":"A"}
    {"session_id":"0000016","event_id":"A"}
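The sample records above contain only A events, so they exercise only the "C is emitted" path. As a minimal sketch, assuming you also want sessions that receive a B event, mock lines in the same JSON shape could be generated like this. `MockSessionData` and its helpers are hypothetical names introduced for illustration.

```scala
object MockSessionData {
  // Format one record in the same JSON shape as the test data above.
  def record(sessionId: String, eventId: String): String =
    s"""{"session_id":"$sessionId","event_id":"$eventId"}"""

  // Hypothetical helper: every session gets an A event; sessions listed in
  // `withB` additionally get a B event right after it.
  def records(sessionIds: Seq[String], withB: Set[String]): Seq[String] =
    sessionIds.flatMap { id =>
      val a = record(id, "A")
      if (withB.contains(id)) Seq(a, record(id, "B")) else Seq(a)
    }

  def main(args: Array[String]): Unit =
    records(Seq("0000015", "0000016"), withB = Set("0000016")).foreach(println)
}
```

Writing the printed lines into the file that the producer reads gives one session that should trigger C and one that should not.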

Flink code: a lot of newcomers asked me to put more comments in my code, so today I'll oblige with a detailed version.

    import java.text.SimpleDateFormat
    import java.util.{Date, Properties}

    import com.alibaba.fastjson.JSON
    import org.apache.flink.api.common.serialization.SimpleStringSchema
    import org.apache.flink.api.common.state.{StateTtlConfig, ValueState, ValueStateDescriptor}
    import org.apache.flink.api.common.time.Time
    import org.apache.flink.api.java.tuple.Tuple
    import org.apache.flink.configuration.Configuration
    import org.apache.flink.streaming.api.TimeCharacteristic
    import org.apache.flink.streaming.api.functions.{AssignerWithPunctuatedWatermarks, KeyedProcessFunction}
    import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment, createTypeInformation}
    import org.apache.flink.streaming.api.watermark.Watermark
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer08
    import org.apache.flink.util.Collector

    object SessionIdKeyedProcessFunction {

      // Timeout before C is emitted. 5 seconds keeps the demo short;
      // the requirement described in the background would use 1 hour.
      val TimeoutMs = 5000L

      class MyTimeTimestampsAndWatermarks extends AssignerWithPunctuatedWatermarks[(String, String)] with Serializable {
        // Generate the timestamp (the records carry no time field, so the current system time is used)
        override def extractTimestamp(element: (String, String), previousElementTimestamp: Long): Long = {
          System.currentTimeMillis()
        }

        // Produce the watermark, 1 second behind the extracted timestamp
        override def checkAndGetNextWatermark(lastElement: (String, String), extractedTimestamp: Long): Watermark = {
          new Watermark(extractedTimestamp - 1000)
        }
      }

      case class SessionInfo(session_id: String, event_id: String, timestamp: Long)

      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

        val properties = new Properties()
        // Kafka location; older Kafka versions are configured with the ZooKeeper address
        properties.setProperty("bootstrap.servers", "localhost:9092")
        properties.setProperty("zookeeper.connect", "localhost:2181")
        properties.setProperty("group.id", "test-flink")
        val topic = "flink2"

        // Create the real-time stream that reads from Kafka
        val consumer = new FlinkKafkaConsumer08(topic, new SimpleStringSchema(), properties)
        val text: DataStream[(String, String)] = env.addSource(consumer).map(line => {
          val json = JSON.parseObject(line)
          // Return the user's session_id and the user's event_id
          (json.get("session_id").toString, json.get("event_id").toString)
        }).assignTimestampsAndWatermarks(new MyTimeTimestampsAndWatermarks())

        text.keyBy(0)
          .process(new SessionIdTimeoutFunction())
          .setParallelism(1)
          .print()

        env.execute()
      }

      // Because we aggregate by key, there is one state per key (key = session_id).
      // The timeout check is implemented in KeyedProcessFunction's onTimer method.
      class SessionIdTimeoutFunction extends KeyedProcessFunction[Tuple, (String, String), (String, String)] {

        private var state: ValueState[SessionInfo] = _
        private val sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")

        override def open(parameters: Configuration): Unit = {
          super.open(parameters)
          // State expires 5 minutes after it was created or last written
          val config = StateTtlConfig.newBuilder(Time.minutes(5))
            .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
            .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
            .build()
          val valueStateDescriptor = new ValueStateDescriptor[SessionInfo]("myState1", classOf[SessionInfo])
          valueStateDescriptor.enableTimeToLive(config)
          state = getRuntimeContext.getState(valueStateDescriptor)
        }

        override def processElement(
            message: (String, String),
            ctx: KeyedProcessFunction[Tuple, (String, String), (String, String)]#Context,
            out: Collector[(String, String)]): Unit = {
          // Behavior trajectory of the user's session_id
          if (state.value() == null) {
            val timeStamp = ctx.timestamp()
            // Emit the current event. This version does not consider the order of events.
            // Enforcing an order would require redesigning the state.
            // This is just a simple demonstration of the principle; I will write an ordered version later.
            out.collect(message)
            // If the event is A, register the callback time: TimeoutMs (5 seconds) from now
            if (message._2 == "A") {
              ctx.timerService.registerEventTimeTimer(timeStamp + TimeoutMs)
              state.update(SessionInfo(message._1, message._2, timeStamp))
            }
          }
          println("Current time: " + sdf.format(new Date(ctx.timestamp)))
          // If a B event shows up for the current session_id, record B in the state
          if (message._2 == "B") {
            state.update(SessionInfo(message._1, message._2, ctx.timestamp()))
          }
        }

        override def onTimer(
            timestamp: Long,
            ctx: KeyedProcessFunction[Tuple, (String, String), (String, String)]#OnTimerContext,
            out: Collector[(String, String)]): Unit = {
          val info = state.value()
          // The state can be null if it has already expired through the TTL
          if (info != null) {
            println("onTimer fired (state timestamp_fire time): " +
              sdf.format(new Date(info.timestamp)) + "_" + sdf.format(new Date(timestamp)))
            // If no B event arrived for this key within the timeout,
            // and this timer is the one registered for the stored A event, emit event C
            if (info.event_id != "B" && info.timestamp + TimeoutMs == timestamp) {
              out.collect(("SessionID: " + info.session_id, "no B within 5s so event C is triggered"))
            }
          }
        }
      }
    }

Verification results: as you can see, once a session_id has received event A and 5 s pass without event B arriving, event C is triggered for that session_id.

    Current time: 2019-08-29 13:11:41
    Current time: 2019-08-29 13:11:41
    2> (0000010,A)
    1> (000009,A)
    Current time: 2019-08-29 13:12:01
    Current time: 2019-08-29 13:12:01
    4> (0000012,A)
    3> (0000011,A)
    Current time: 2019-08-29 13:12:13
    Current time: 2019-08-29 13:12:13
    5> (0000013,A)
    6> (0000014,A)
    Current time: 2019-08-29 13:12:24
    Current time: 2019-08-29 13:12:24
    onTimer fired (state timestamp_fire time): 2019-08-29 13:11:41_2019-08-29 13:11:46
    onTimer fired (state timestamp_fire time): 2019-08-29 13:11:41_2019-08-29 13:11:46
    onTimer fired (state timestamp_fire time): 2019-08-29 13:12:01_2019-08-29 13:12:06
    onTimer fired (state timestamp_fire time): 2019-08-29 13:12:01_2019-08-29 13:12:06
    3> (SessionID: 0000011,no B within 5s so event C is triggered)
    4> (SessionID: 0000012,no B within 5s so event C is triggered)
    8> (0000015,A)
    7> (0000016,A)
    2> (SessionID: 0000010,no B within 5s so event C is triggered)
    1> (SessionID: 000009,no B within 5s so event C is triggered)
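To reason about which sessions should produce C without running a cluster, the intended rule can be modeled as a pure function. This is a rough sketch of the idealized rule "A followed by no B within the timeout", not of the exact Flink job: the job above ignores event order and only fires timers when the watermark advances. `SessionRuleModel`, `Event`, and `sessionsTriggeringC` are hypothetical names.

```scala
object SessionRuleModel {
  final case class Event(sessionId: String, eventId: String, timestamp: Long)

  // Returns, in sorted order, the session ids for which C would be emitted:
  // the session has an A event, and no B event for the same session falls
  // between the first A's timestamp t and t + timeoutMs (inclusive).
  def sessionsTriggeringC(events: Seq[Event], timeoutMs: Long): Seq[String] =
    events.groupBy(_.sessionId).toSeq.flatMap { case (id, evs) =>
      val firstA = evs.filter(_.eventId == "A").map(_.timestamp).sorted.headOption
      firstA.flatMap { t =>
        val sawB = evs.exists(e =>
          e.eventId == "B" && e.timestamp >= t && e.timestamp <= t + timeoutMs)
        if (sawB) None else Some(id)
      }
    }.sorted

  def main(args: Array[String]): Unit = {
    val events = Seq(
      Event("s1", "A", 0L),                          // no B at all
      Event("s2", "A", 0L), Event("s2", "B", 3000L), // B inside the 5 s window
      Event("s3", "A", 0L), Event("s3", "B", 6000L)  // B arrives too late
    )
    sessionsTriggeringC(events, timeoutMs = 5000L).foreach(println)
  }
}
```

With these sample events only s1 and s3 qualify, because s2 sees its B event inside the window.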