1. Requirement

[Figure: overview of how Flink computes the per-window statistics]

2. The requirement is to output, for each window, the 5 most-visited URLs. The ranking has to combine the window results of all keys, so there are two approaches:

(1) keyBy the window results by windowEnd and manage the ranking state yourself in a KeyedProcessFunction (the approach taken below);

(2) follow the keyed window with a windowAll step (see the sketch below).
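
A minimal sketch of approach (2), assuming the aggResult: DataStream[PageViewCount] produced by the code in section 5; the rest of this article uses approach (1):

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.ProcessAllWindowFunction
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

// gather every key's window result into one (parallelism-1) stream and rank there
aggResult
  .windowAll(TumblingEventTimeWindows.of(Time.seconds(5)))
  .process(new ProcessAllWindowFunction[PageViewCount, String, TimeWindow] {
    override def process(context: Context, elements: Iterable[PageViewCount],
                         out: Collector[String]): Unit = {
      val top5 = elements.toList.sortBy(-_.count).take(5)
      out.collect(top5.mkString("\n"))
    }
  })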

3. The normal case, ignoring out-of-order data (no 1-minute delayed window close, no side output for late data)

Event-time sliding window: size 10 min, slide 5 s.

Flink assigns each incoming element to every sliding window covering its timestamp, and a window only produces output if it received data. The windows relevant here are [10:15:50, 10:25:50), [10:15:55, 10:25:55), [10:16:00, 10:26:00) (the sketch after this setup list shows how these bounds are derived).

Watermark: 1 s delay.

Timer: windowEnd + 1 ms (the code registers `windowEnd + 1`). A timer only fires once the watermark passes it, and the watermark lags the largest seen timestamp by 1 s, so with whole-second input the timer effectively fires when an event at windowEnd + 2 s arrives.

aggStream only emits when a window fires.
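
As a sanity check on the window bounds above, a small plain-Scala sketch (no Flink required) that lists the sliding windows covering a given event timestamp, assuming the zero-offset alignment Flink's sliding assigner uses:

// all (start, end) sliding windows containing event time ts
def slidingWindowsFor(ts: Long, size: Long = 600000L, slide: Long = 5000L): Seq[(Long, Long)] = {
  val lastStart = ts - ts % slide                                    // latest aligned start <= ts
  (lastStart to (ts - size + 1) by -slide).map(start => (start, start + size))
}
// an event at 10:25:49 is assigned to 120 windows: the latest is
// [10:25:45, 10:35:45) and the earliest is [10:15:50, 10:25:50)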

The raw input:

Input (falls in [10:15:50, 10:25:50)): 89.101 - - 17/05/2015:10:25:49 +0000 GET /presedent

Output: dataStream> ApacheLogEvent(89.101,-,1421461549000,GET,/presedent)

Input: 89.101 - - 17/05/2015:10:25:50 +0000 GET /presedent

Output: dataStream> ApacheLogEvent(89.101,-,1421461550000,GET,/presedent)

Input: 89.101 - - 17/05/2015:10:25:51 +0000 GET /presedent

Output (the watermark reaches 10:25:50 and fires window [10:15:50, 10:25:50)):

dataStream> ApacheLogEvent(89.101,-,1421461551000,GET,/presedent)
aggStream> PageViewCount(/presedent,1421461550000,1)

Input:

89.101 - - 17/05/2015:10:25:52 +0000 GET /presedent

Output (timer fires):

dataStream> ApacheLogEvent(89.101,-,1421461552000,GET,/presedent)
top N> ====================================
Time: 2015-01-17 10:25:50.0
No1:  page url=/presedent  views=1
====================================

Input: an event at 10:25:56 pushes the watermark to 10:25:55, which fires window [10:15:55, 10:25:55) and emits its agg result.

Input: 89.101 - - 17/05/2015:10:25:57 +0000 GET /presedent

Output (timer fires):

dataStream> ApacheLogEvent(89.101,-,1421461557000,GET,/presedent)
aggStream> PageViewCount(/presedent,1421461555000,4)
top N> ====================================
Time: 2015-01-17 10:25:55.0
No1:  page url=/presedent  views=4   (covers all the data in [10:15:55, 10:25:55))
====================================

4. The out-of-order case (with a 1-minute delayed window close and a side output for late data)

Three safeguards for out-of-order data (wired together in the sketch after this list):

(1) watermarks;

(2) the allowedLateness method: the window closes late, and late data still goes through incremental aggregation;

(3) a side output for data that arrives after even the extended window has closed.
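
In code, the three safeguards sit together on the windowed stream; the following excerpt mirrors the full program in section 5 (same names, nothing new):

val lateTag = new OutputTag[ApacheLogEvent]("late")
val aggResult: DataStream[PageViewCount] = transforEventStream // (1) watermarks assigned upstream
  .filter(_.method == "GET")
  .keyBy(_.url)
  .timeWindow(Time.minutes(10), Time.seconds(5))
  .allowedLateness(Time.minutes(1))   // (2) keep each window open one extra minute
  .sideOutputLateData(lateTag)        // (3) route hopelessly late records aside
  .aggregate(new PageCountAgg, new PageCountWindow)
aggResult.getSideOutput(lateTag).print("late")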

Event-time sliding window: size 10 min, slide 5 s.

  [10:15:50, 10:25:50), [10:15:55, 10:25:55), [10:16:00, 10:26:00)

Watermark: 1 s delay.

Timer: windowEnd + 1 ms (as above, it effectively fires once an event at windowEnd + 2 s advances the watermark).

allowedLateness: 1 min.

Side-output tag: "late".

aggStream only emits when a window fires.

Input:

Input: 89.101 - - 17/05/2015:10:25:49 +0000 GET /presedent
Input: 89.101 - - 17/05/2015:10:25:50 +0000 GET /presedent
Input: 89.101 - - 17/05/2015:10:25:51 +0000 GET /presedent
Input: 89.101 - - 17/05/2015:10:25:52 +0000 GET /presedent
These four records behave just as in section 3, except that each window now stays open one extra minute before closing.
Input: 89.101 - - 17/05/2015:10:25:46 +0000 GET /presedent
Output: the earlier inputs already pushed the watermark to 10:25:51, so the late event immediately re-fires the aggregation of the already-emitted window [10:15:50, 10:25:50); the ranking timer, though, only fires after the watermark advances again:
dataStream> ApacheLogEvent(89.101,-,1421461546000,GET,/presedent)
aggStream> PageViewCount(/presedent,1421461550000,2)


Input: 89.101 - - 17/05/2015:10:25:53 +0000 GET /presedent


Output (the watermark advanced to 10:25:52, so the timer fired):

dataStream> ApacheLogEvent(89.101,-,1421461553000,GET,/presedent)
top N> ====================================
Time: 2015-01-17 10:25:50.0
No1:  page url=/presedent  views=2
====================================

Input:

89.101 - - 17/05/2015:10:25:31 +0000 GET /presedent


Output: up to 1 min of lateness is allowed, so the late event updates every already-fired window it belongs to. The windows ending at 10:25:35, 10:25:40 and 10:25:45 had produced nothing before simply because they contained no data until now:

dataStream> ApacheLogEvent(89.101,-,1421461531000,GET,/presedent)
aggStream> PageViewCount(/presedent,1421461550000,3)
aggStream> PageViewCount(/presedent,1421461545000,1)
aggStream> PageViewCount(/presedent,1421461540000,1)
aggStream> PageViewCount(/presedent,1421461535000,1)

The watermark now stands at 10:25:52, so every window ending at or before 10:24:52 has fully closed (its end plus the 1-minute lateness has passed); in particular [10:14:50, 10:24:50) is closed. Even so, an event at 10:24:49 would not land in the side output: it still belongs to open windows such as the one ending at 10:24:55, so it is still collected. A record is routed to the side output only when every window it belongs to has closed. A record more than about 11 minutes old (10-minute window size plus 1 minute of lateness) therefore does go to the side output, because all of its windows have closed.
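
The closing condition can be written down directly; a small sketch, assuming zero window offset and Flink's rule that a window closes once the watermark reaches (window end - 1 ms) + allowedLateness:

// does an event with timestamp ts go to the side output, given the current watermark?
def goesToSideOutput(ts: Long, watermark: Long, size: Long = 600000L,
                     slide: Long = 5000L, lateness: Long = 60000L): Boolean = {
  val lastWindowEnd = ts - ts % slide + size   // end of the latest window containing ts
  lastWindowEnd - 1 + lateness <= watermark    // true only if even that window has closed
}
// with the watermark at 10:25:52:
//   event 10:24:49 -> latest window ends 10:34:45, still open        -> collected normally
//   event 10:14:51 -> latest window ends 10:24:50, closed at 10:25:50 -> side output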

Input:

89.101 - - 17/05/2015:10:14:51 +0000 GET /presedent

Output: the record now appears on the side output.

5. Code

package flinkProject

import java.sql.Timestamp
import java.text.SimpleDateFormat

import org.apache.flink.api.common.functions.AggregateFunction
import org.apache.flink.api.common.state.{ListState, ListStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment, _}
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

import scala.collection.mutable.ListBuffer

// event class
case class ApacheLogEvent(ip: String, userid: String, timestamp: Long, method: String, url: String)

// window aggregation result
case class PageViewCount(url: String, windowEnd: Long, count: Long)

object HotNewworkLogFlow {
  def main(args: Array[String]): Unit = {
    val executionEnvironment: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    executionEnvironment.setParallelism(1)
    executionEnvironment.setStreamTimeCharacteristic(TimeCharacteristic.EventTime) // event time; watermarks are generated periodically, every 200 ms by default
    //    executionEnvironment.getConfig.setAutoWatermarkInterval(500) // change the periodic watermark interval from the default 200 ms to 500 ms
    val stream2: DataStream[String] = executionEnvironment.socketTextStream("127.0.0.1", 1111)
    val transforStream: DataStream[ApacheLogEvent] = stream2.map(data => {
      val tmpList = data.split(" ")
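      // NOTE: the intended pattern is "dd/MM/yyyy:HH:mm:ss"; with lowercase "mm"
      // (minutes) the month is never parsed and defaults to January, which is why
      // the sample output shows 2015-01-17 rather than 2015-05-17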
      val simpleDateFormat = new SimpleDateFormat("dd/mm/yy:HH:mm:ss")
      val ts = simpleDateFormat.parse(tmpList(3)).getTime
      ApacheLogEvent(tmpList(0), tmpList(1), ts, tmpList(5), tmpList(6))
    })

    // assign timestamps and watermarks; "punctuated" = per-event, "periodic" = generated at a fixed interval (the usual choice)
    val transforEventStream: DataStream[ApacheLogEvent] = transforStream.assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[ApacheLogEvent](Time.seconds(1)) {
      override def extractTimestamp(t: ApacheLogEvent) = t.timestamp
    })


    val aggResult: DataStream[PageViewCount] = transforEventStream
      .filter(_.method == "GET")
      .keyBy(_.url)
      .timeWindow(Time.minutes(10), Time.seconds(5))
      .allowedLateness(Time.minutes(1))
      .sideOutputLateData(new OutputTag[ApacheLogEvent]("late"))
      .aggregate(new PageCountAgg, new PageCountWindow)

    transforStream.print("dataStream")
    aggResult.print("aggStream")
    aggResult.getSideOutput(new OutputTag[ApacheLogEvent]("late")).print()
    aggResult.keyBy(_.windowEnd).process(new HotProcessFunction(5)).print("top N")

    executionEnvironment.execute("transform")

  }
}

class PageCountAgg extends AggregateFunction[ApacheLogEvent, Long, Long] {
  override def createAccumulator(): Long = 0L

  override def add(value: ApacheLogEvent, accumulator: Long): Long = accumulator + 1

  override def getResult(accumulator: Long): Long = accumulator

  override def merge(a: Long, b: Long): Long = a + b
}


class PageCountWindow extends WindowFunction[Long, PageViewCount, String, TimeWindow] {
  override def apply(key: String, window: TimeWindow, input: Iterable[Long], out: Collector[PageViewCount]): Unit = {
    out.collect(PageViewCount(key, window.getEnd, input.iterator.next()))

  }
}


class HotProcessFunction(topSize: Int) extends KeyedProcessFunction[Long, PageViewCount, String] {

  var itemState: ListState[PageViewCount] = _

  override def open(parameters: Configuration): Unit = {
    super.open(parameters)
    // descriptor: the state's name and type
    val itemsStateDesc = new ListStateDescriptor[PageViewCount]("itemState-state", classOf[PageViewCount])
    // obtain the state handle
    itemState = getRuntimeContext.getListState(itemsStateDesc)
  }

  override def processElement(i: PageViewCount, context: KeyedProcessFunction[Long, PageViewCount, String]#Context, collector: Collector[String]): Unit = {
    itemState.add(i)
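    // timer 1 ms after the window end; it fires once the watermark passes it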
    context.timerService().registerEventTimeTimer(i.windowEnd + 1)

  }

  override def onTimer(timestamp: Long, ctx: KeyedProcessFunction[Long, PageViewCount, String]#OnTimerContext, out: Collector[String]): Unit = {
    val allPageViewCount: ListBuffer[PageViewCount] = ListBuffer.empty
    val iterable = itemState.get().iterator()
    while (iterable.hasNext) {
      allPageViewCount += iterable.next()
    }

    itemState.clear()

    val sortedItems = allPageViewCount.sortBy(_.count)(Ordering.Long.reverse).take(topSize)

    // format the ranking as a String for printing
    val result: StringBuilder = new StringBuilder
    result.append("====================================\n")
    result.append("时间: ").append(new Timestamp(timestamp - 1)).append("\n")

    for (i <- sortedItems.indices) {
      val currentItem: PageViewCount = sortedItems(i)
      // e.g.  No1:  page url=/presedent  views=2413
      result.append("No").append(i + 1).append(":")
        .append("  页面url=").append(currentItem.url)
        .append("  浏览量=").append(currentItem.count).append("\n")
    }
    result.append("====================================\n\n")
    // throttle the output to simulate a rolling real-time display
    Thread.sleep(1000)
    out.collect(result.toString)


  }
}

6. Test data:

Test data without allowedLateness and the side output (fed line by line over the socket, e.g. with nc -lk 1111):
89.101 - - 17/05/2015:10:25:49 +0000 GET /presedent
89.101 - - 17/05/2015:10:25:50 +0000 GET /presedent
89.101 - - 17/05/2015:10:25:51 +0000 GET /presedent
89.101 - - 17/05/2015:10:25:52 +0000 GET /presedent
89.101 - - 17/05/2015:10:25:56 +0000 GET /presedent
89.101 - - 17/05/2015:10:25:57 +0000 GET /presedent
89.101 - - 17/05/2015:10:26:02 +0000 GET /presedent

Test data with allowedLateness and the side output:
89.101 - - 17/05/2015:10:25:49 +0000 GET /presedent
89.101 - - 17/05/2015:10:25:50 +0000 GET /presedent
89.101 - - 17/05/2015:10:25:51 +0000 GET /presedent
89.101 - - 17/05/2015:10:25:52 +0000 GET /presedent
89.101 - - 17/05/2015:10:25:46 +0000 GET /presedent
89.101 - - 17/05/2015:10:25:53 +0000 GET /presedent
89.101 - - 17/05/2015:10:25:31 +0000 GET /presedent
89.101 - - 17/05/2015:10:14:51 +0000 GET /presedent

7. Problem: late data is not re-ranked together with the earlier results

Having handled out-of-order data, one problem remains: when a late event arrives within the 1-minute lateness, the aggregation fires again and emits an updated result, e.g. (url1,3) becomes (url1,4). We then need (url1,4) to replace (url1,3) and the ranking to be recomputed, so all of a window's results must be kept in state and only cleared at windowEnd + 1 min, once the window can no longer re-fire.

So we replace the ListState with a MapState keyed by url, plus a second timer at windowEnd + 1 min (the essential change is excerpted below; the full listing follows).
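
The two state updates and the two timers, excerpted from the full listing below:

// in processElement: overwrite the per-url count, so (url1,4) replaces (url1,3)
hotUrlState.put(i.url, i.count)
context.timerService().registerEventTimeTimer(i.windowEnd + 1)      // ranking-output timer
context.timerService().registerEventTimeTimer(i.windowEnd + 60000L) // cleanup timer

// in onTimer: the cleanup timer clears the state once the window can no longer re-fire
if (timestamp == ctx.getCurrentKey + 60000L) { hotUrlState.clear(); return }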

With this change, late data is re-ranked together with the earlier results.

Sample output:

dataStream> ApacheLogEvent(89.101,-,1421461549000,GET,/presedent)
dataStream> ApacheLogEvent(89.101,-,1421461546000,GET,/pre)
dataStream> ApacheLogEvent(89.101,-,1421461551000,GET,/presedent)
aggStream> PageViewCount(/presedent,1421461550000,1)
aggStream> PageViewCount(/pre,1421461550000,1)
dataStream> ApacheLogEvent(89.101,-,1421461552000,GET,/presedent)
top N> ====================================
Time: 2015-01-17 10:25:50.0
No1:  page url=/pre  views=1
No2:  page url=/presedent  views=1
====================================


dataStream> ApacheLogEvent(89.101,-,1421461546000,GET,/presedent)
aggStream> PageViewCount(/presedent,1421461550000,2)
dataStream> ApacheLogEvent(89.101,-,1421461553000,GET,/presedent)
top N> ====================================
Time: 2015-01-17 10:25:50.0
No1:  page url=/presedent  views=2
No2:  page url=/pre  views=1
====================================

The revised code:

package flinkProject

import java.sql.Timestamp
import java.text.SimpleDateFormat
import java.util
import java.util.Map

import org.apache.flink.api.common.functions.AggregateFunction
import org.apache.flink.api.common.state.{ListState, ListStateDescriptor, MapState, MapStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment, _}
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

import scala.collection.mutable.ListBuffer

// event class
case class ApacheLogEvent(ip: String, userid: String, timestamp: Long, method: String, url: String)

// window aggregation result
case class PageViewCount(url: String, windowEnd: Long, count: Long)

object HotNewworkLogFlow {
  def main(args: Array[String]): Unit = {
    val executionEnvironment: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    executionEnvironment.setParallelism(1)
    executionEnvironment.setStreamTimeCharacteristic(TimeCharacteristic.EventTime) // event time; watermarks are generated periodically, every 200 ms by default
    //    executionEnvironment.getConfig.setAutoWatermarkInterval(500) // change the periodic watermark interval from the default 200 ms to 500 ms
    val stream2: DataStream[String] = executionEnvironment.socketTextStream("127.0.0.1", 1111)
    val transforStream: DataStream[ApacheLogEvent] = stream2.map(data => {
      val tmpList = data.split(" ")
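      // NOTE: the intended pattern is "dd/MM/yyyy:HH:mm:ss"; with lowercase "mm"
      // (minutes) the month is never parsed and defaults to January, which is why
      // the sample output shows 2015-01-17 rather than 2015-05-17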
      val simpleDateFormat = new SimpleDateFormat("dd/mm/yy:HH:mm:ss")
      val ts = simpleDateFormat.parse(tmpList(3)).getTime
      ApacheLogEvent(tmpList(0), tmpList(1), ts, tmpList(5), tmpList(6))
    })

    // assign timestamps and watermarks; "punctuated" = per-event, "periodic" = generated at a fixed interval (the usual choice)
    val transforEventStream: DataStream[ApacheLogEvent] = transforStream.assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[ApacheLogEvent](Time.seconds(1)) {
      override def extractTimestamp(t: ApacheLogEvent) = t.timestamp
    })


    val aggResult: DataStream[PageViewCount] = transforEventStream
      .filter(_.method == "GET")
      .keyBy(_.url)
      .timeWindow(Time.minutes(10), Time.seconds(5))
      .allowedLateness(Time.minutes(1))
      .sideOutputLateData(new OutputTag[ApacheLogEvent]("late"))
      .aggregate(new PageCountAgg, new PageCountWindow)

    transforStream.print("dataStream")
    aggResult.print("aggStream")
    aggResult.getSideOutput(new OutputTag[ApacheLogEvent]("late")).print()
    aggResult.keyBy(_.windowEnd).process(new HotProcessFunction(5)).print("top N")

    executionEnvironment.execute("transform")

  }
}

class PageCountAgg extends AggregateFunction[ApacheLogEvent, Long, Long] {
  override def createAccumulator(): Long = 0L

  override def add(value: ApacheLogEvent, accumulator: Long): Long = accumulator + 1

  override def getResult(accumulator: Long): Long = accumulator

  override def merge(a: Long, b: Long): Long = a + b
}


class PageCountWindow extends WindowFunction[Long, PageViewCount, String, TimeWindow] {
  override def apply(key: String, window: TimeWindow, input: Iterable[Long], out: Collector[PageViewCount]): Unit = {
    out.collect(PageViewCount(key, window.getEnd, input.iterator.next()))

  }
}


class HotProcessFunction(topSize: Int) extends KeyedProcessFunction[Long, PageViewCount, String] {

  var itemState: ListState[PageViewCount] = _
  var hotUrlState:MapState[String,Long]=_

  override def open(parameters: Configuration): Unit = {
    super.open(parameters)
    // descriptor: the state's name and type
//    val itemsStateDesc = new ListStateDescriptor[PageViewCount]("itemState-state", classOf[PageViewCount])
    val hotUrlStateDesc=new MapStateDescriptor[String,Long]("hot-url-state",classOf[String],classOf[Long])
    // obtain the state handle
//    itemState = getRuntimeContext.getListState(itemsStateDesc)
    hotUrlState = getRuntimeContext.getMapState(hotUrlStateDesc)
  }

  override def processElement(i: PageViewCount, context: KeyedProcessFunction[Long, PageViewCount, String]#Context, collector: Collector[String]): Unit = {
//    itemState.add(i)
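    // ranking-output timer, 1 ms after the window end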
    context.timerService().registerEventTimeTimer(i.windowEnd + 1)
    hotUrlState.put(i.url,i.count)
    // cleanup timer: when it fires the window has fully closed, no more aggregation results can arrive, and the state can be cleared
    context.timerService().registerEventTimeTimer(i.windowEnd + 60000L)
  }

  // timestamp is the time this timer was registered for; every registered timer comes through onTimer
  override def onTimer(timestamp: Long, ctx: KeyedProcessFunction[Long, PageViewCount, String]#OnTimerContext, out: Collector[String]): Unit = {
//    val allPageViewCount: ListBuffer[PageViewCount] = ListBuffer.empty
//    val iterable = itemState.get().iterator()
//    while (iterable.hasNext) {
//      allPageViewCount += iterable.next()
//    }
//
//    itemState.clear()
    if (timestamp == ctx.getCurrentKey + 60000L) { // the cleanup timer fired (the key is the windowEnd)
      hotUrlState.clear()
      return
    }
    val allPageViewCount: ListBuffer[(String, Long)] = ListBuffer()
    val entries: util.Iterator[Map.Entry[String, Long]] = hotUrlState.entries().iterator()
    while (entries.hasNext) {
      val entry: Map.Entry[String, Long] = entries.next()
      allPageViewCount += ((entry.getKey, entry.getValue))
    }

    val sortedItems = allPageViewCount.sortBy(_._2)(Ordering.Long.reverse).take(topSize)

    // format the ranking as a String for printing
    val result: StringBuilder = new StringBuilder
    result.append("====================================\n")
    result.append("时间: ").append(new Timestamp(timestamp - 1)).append("\n")

    for (i <- sortedItems.indices) {
      val currentItem: (String,Long) = sortedItems(i)
      // e.g.  No1:  page url=/presedent  views=2413
      result.append("No").append(i + 1).append(":")
        .append("  页面url=").append(currentItem._1)
        .append("  浏览量=").append(currentItem._2).append("\n")
    }
    result.append("====================================\n\n")
    // throttle the output to simulate a rolling real-time display
    Thread.sleep(1000)
    out.collect(result.toString)


  }
}