1. Requirement
2. The requirement: for every window, output the 5 most-visited URLs. Ranking a whole window's results means first collecting them all, i.e. stateful processing. There are two ways to do it (a sketch of the second follows this list):
(1) keyBy the aggregated results (e.g. by windowEnd) and keep your own state inside a ProcessFunction;
(2) follow the keyed window with a windowAll.
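A minimal sketch of approach (2), reusing the PageViewCount type and the aggResult stream from section 5; the TopNAllWindow name and the 5 s windowAll size are illustrative assumptions, not part of the original job:

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.ProcessAllWindowFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

// collects every PageViewCount landing in the same 5 s bucket (parallelism 1) and ranks them,
// replacing the keyBy(windowEnd) + KeyedProcessFunction combination used in section 5
class TopNAllWindow(topSize: Int) extends ProcessAllWindowFunction[PageViewCount, String, TimeWindow] {
  override def process(context: Context, elements: Iterable[PageViewCount], out: Collector[String]): Unit = {
    val topN = elements.toList.sortBy(-_.count).take(topSize)
    out.collect(topN.zipWithIndex
      .map { case (p, i) => s"No${i + 1}: URL=${p.url} views=${p.count}" }
      .mkString("\n"))
  }
}
// usage: aggResult.timeWindowAll(Time.seconds(5)).process(new TopNAllWindow(5)).print("top N")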
3. The normal case, ignoring out-of-order data (no 1-minute delayed window close, no side output for late data)
Event-time sliding windows: 10 min size, 5 s slide.
The window boundaries are fixed from the moment the job starts, but a window only emits output if it actually receives data: [10:15:45, 10:25:50), [10:15:50, 10:25:55), [10:15:55, 10:26:00), ...
Watermark: 1 s bounded out-of-orderness.
Timer: windowEnd + 1 (ms). A timer only fires once the watermark passes it, and the watermark lags the largest timestamp by 1 s, so the timer effectively fires about 2 s of event time after windowEnd.
aggStream only emits when a window fires.
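A worked check of those numbers for the [10:15:45, 10:25:50) window, using watermark = largest timestamp - 1 s:
the window itself fires when watermark >= 10:25:49.999, i.e. on the 10:25:51 event;
the windowEnd + 1 ms timer fires when watermark >= 10:25:50.001, i.e. on the 10:25:52 event.
This matches the trace below.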
Raw input and the resulting output (the printed dates show January because the SimpleDateFormat pattern in section 5 uses "mm", minutes, where "MM", month, was intended):
Input (falls in [10:15:45, 10:25:50)): 89.101 - - 17/05/2015:10:25:49 +0000 GET /presedent
Output: dataStream> ApacheLogEvent(89.101,-,1421461549000,GET,/presedent)
Input: 89.101 - - 17/05/2015:10:25:50 +0000 GET /presedent
Output: dataStream> ApacheLogEvent(89.101,-,1421461550000,GET,/presedent)
Input: 89.101 - - 17/05/2015:10:25:51 +0000 GET /presedent
Output (the watermark now triggers the [10:15:45, 10:25:50) window):
dataStream> ApacheLogEvent(89.101,-,1421461551000,GET,/presedent)
aggStream> PageViewCount(/presedent,1421461550000,1)
Input:
89.101 - - 17/05/2015:10:25:52 +0000 GET /presedent
Output (the timer fires):
dataStream> ApacheLogEvent(89.101,-,1421461552000,GET,/presedent)
top N> ====================================
Time: 2015-01-17 10:25:50.0
No1: URL=/presedent views=1
====================================
Input: a record at 10:25:56 triggers the [10:15:50, 10:25:55) window, which emits its agg result.
Input: 89.101 - - 17/05/2015:10:25:57 +0000 GET /presedent
Output (the timer fires):
dataStream> ApacheLogEvent(89.101,-,1421461557000,GET,/presedent)
aggStream> PageViewCount(/presedent,1421461555000,4)
top N> ====================================
Time: 2015-01-17 10:25:55.0
No1: URL=/presedent views=4   (covers all four records that fall in [10:15:50, 10:25:55))
====================================
4. Handling out-of-order data (with the 1-minute delayed window close and the side output for late data)
The three safeguards against out-of-order data (a condensed view of how they are wired up follows this list):
(1) watermarks;
(2) allowedLateness: the window closes later, and late data arriving within that period still goes through the incremental aggregation;
(3) a side output for data that is too late even for allowedLateness.
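The three safeguards as they attach to the window pipeline, condensed from the full program in section 5 (transforEventStream already carries the watermarks); lateTag is merely a named val for the OutputTag used there:

val lateTag = new OutputTag[ApacheLogEvent]("late")
val aggResult: DataStream[PageViewCount] = transforEventStream
  .filter(_.methord == "GET")
  .keyBy(_.url)
  .timeWindow(Time.minutes(10), Time.seconds(5)) // windows fire based on the watermark, safeguard (1)
  .allowedLateness(Time.minutes(1))              // (2) keep the window open 1 min past the watermark
  .sideOutputLateData(lateTag)                   // (3) anything later than that goes to the side output
  .aggregate(new PageCountAgg, new PageCountWindow)
val lateStream: DataStream[ApacheLogEvent] = aggResult.getSideOutput(lateTag)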
Event-time sliding windows: 10 min size, 5 s slide.
[10:15:45, 10:25:50), [10:15:50, 10:25:55), [10:15:55, 10:26:00), ...
Watermark: 1 s bounded out-of-orderness.
Timer: windowEnd + 1 (ms); as before, it only fires once the watermark passes it, i.e. about 2 s of event time after windowEnd.
allowedLateness: 1 min.
Side output tag: "late".
aggStream only emits when a window fires.
Input:
89.101 - - 17/05/2015:10:25:49 +0000 GET /presedent
89.101 - - 17/05/2015:10:25:50 +0000 GET /presedent
89.101 - - 17/05/2015:10:25:51 +0000 GET /presedent
89.101 - - 17/05/2015:10:25:52 +0000 GET /presedent
These four records behave just as in section 3; the only difference is that each window now stays open an extra minute before finally closing.
Input: 89.101 - - 17/05/2015:10:25:46 +0000 GET /presedent
Output: the earlier inputs already pushed the watermark to 10:25:51, so the [10:15:45, 10:25:50) window re-fires immediately with an updated count; the timer, however, needs the watermark to advance again before it fires.
dataStream> ApacheLogEvent(89.101,-,1421461546000,GET,/presedent)
aggStream> PageViewCount(/presedent,1421461550000,2)
Input: 89.101 - - 17/05/2015:10:25:53 +0000 GET /presedent
Output: the watermark advances, so the timer fires:
dataStream> ApacheLogEvent(89.101,-,1421461553000,GET,/presedent)
top N> ====================================
Time: 2015-01-17 10:25:50.0
No1: URL=/presedent views=2
====================================
Input:
89.101 - - 17/05/2015:10:25:31 +0000 GET /presedent
Output: with 1 min of allowed lateness, every window containing this record re-fires with an updated count; the windows ending at 10:25:35/40/45 emit for the first time only because they had no data until now.
dataStream> ApacheLogEvent(89.101,-,1421461531000,GET,/presedent)
aggStream> PageViewCount(/presedent,1421461550000,3)
aggStream> PageViewCount(/presedent,1421461545000,1)
aggStream> PageViewCount(/presedent,1421461540000,1)
aggStream> PageViewCount(/presedent,1421461535000,1)
With the watermark now at 10:25:52, any window ending at 10:24:50 or earlier (more than 1 min behind the watermark) has definitely closed. Yet a record stamped 10:24:49 would still not land in the side output: it also belongs to later windows, such as the one ending at 10:24:55, that are still inside their lateness period, so it is still collected into those. A record goes to the side output only once every window it belongs to has closed; its newest window ends roughly 10 min after the record's own timestamp, so only a record more than window size + lateness (about 11 min here) behind the watermark ends up there.
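A small helper, for illustration only (windowsFor is not part of the job), that mirrors how Flink's SlidingEventTimeWindows assigns a timestamp to windows with offset 0, which makes the "all windows closed" condition easy to check:

// lists the (start, end) of every sliding window containing ts
def windowsFor(ts: Long, size: Long, slide: Long): Seq[(Long, Long)] =
  Iterator.iterate(ts - ts % slide)(_ - slide) // newest window start, then step back by the slide
    .takeWhile(_ > ts - size)
    .map(start => (start, start + size))
    .toSeq

// for this job, windowsFor(ts, 10 * 60 * 1000L, 5 * 1000L) yields 120 windows; the newest one
// ends at (ts - ts % 5000) + 10 min, and the record is routed to the side output only when
// watermark >= that end + allowedLateness, i.e. about 11 min after the record's own timestamp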
Input:
89.101 - - 17/05/2015:10:14:51 +0000 GET /presedent
Output: this record is emitted on the side output.
5. Code
package flinkProject

import java.sql.Timestamp
import java.text.SimpleDateFormat

import org.apache.flink.api.common.functions.AggregateFunction
import org.apache.flink.api.common.state.{ListState, ListStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment, _}
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

import scala.collection.mutable.ListBuffer

// event type (one parsed log line)
case class ApacheLogEvent(ip: String, userid: String, timestamp: Long, methord: String, url: String)

// per-window aggregation result
case class PageViewCount(url: String, windowEnd: Long, count: Long)

object HotNewworkLogFlow {
  def main(args: Array[String]): Unit = {
    val executionEnvironment: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    executionEnvironment.setParallelism(1)
    executionEnvironment.setStreamTimeCharacteristic(TimeCharacteristic.EventTime) // watermarks are generated periodically, every 200 ms by default
    // executionEnvironment.getConfig.setAutoWatermarkInterval(500) // change the periodic interval to 500 ms if needed

    val stream2: DataStream[String] = executionEnvironment.socketTextStream("127.0.0.1", 1111)
    val transforStream: DataStream[ApacheLogEvent] = stream2.map(data => {
      val tmpList = data.split(" ")
      // caution: "mm" means minutes, not months ("MM"), so the month defaults to January here;
      // this is why the sample output in sections 3 and 4 prints dates like 2015-01-17
      val simpleDateFormat = new SimpleDateFormat("dd/mm/yy:HH:mm:ss")
      val ts = simpleDateFormat.parse(tmpList(3)).getTime
      ApacheLogEvent(tmpList(0), tmpList(1), ts, tmpList(5), tmpList(6))
    })

    // assign timestamps and watermarks; "punctuated" assigners emit per event, "periodic" ones on a
    // timer (the usual choice; BoundedOutOfOrdernessTimestampExtractor is a periodic assigner)
    val transforEventStream: DataStream[ApacheLogEvent] = transforStream.assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[ApacheLogEvent](Time.seconds(1)) {
      override def extractTimestamp(t: ApacheLogEvent) = t.timestamp
    })

    val lateOutputTag = new OutputTag[ApacheLogEvent]("late")
    val aggResult: DataStream[PageViewCount] = transforEventStream
      .filter(_.methord == "GET")
      .keyBy(_.url)
      .timeWindow(Time.minutes(10), Time.seconds(5))
      .allowedLateness(Time.minutes(1))
      .sideOutputLateData(lateOutputTag)
      .aggregate(new PageCountAgg, new PageCountWindow)

    transforStream.print("dataStream")
    aggResult.print("aggStream")
    aggResult.getSideOutput(lateOutputTag).print()
    aggResult.keyBy(_.windowEnd).process(new HotProcessFunction(5)).print("top N")
    executionEnvironment.execute("transform")
  }
}

// incremental per-key counter: the accumulator is just the running count
class PageCountAgg extends AggregateFunction[ApacheLogEvent, Long, Long] {
  override def createAccumulator(): Long = 0L
  override def add(value: ApacheLogEvent, accumulator: Long): Long = accumulator + 1
  override def getResult(accumulator: Long): Long = accumulator
  override def merge(a: Long, b: Long): Long = a + b
}

// wraps the pre-aggregated count together with its key and window end
class PageCountWindow extends WindowFunction[Long, PageViewCount, String, TimeWindow] {
  override def apply(key: String, window: TimeWindow, input: Iterable[Long], out: Collector[PageViewCount]): Unit = {
    out.collect(PageViewCount(key, window.getEnd, input.iterator.next()))
  }
}

class HotProcessFunction(topSize: Int) extends KeyedProcessFunction[Long, PageViewCount, String] {
  var itemState: ListState[PageViewCount] = _

  override def open(parameters: Configuration): Unit = {
    super.open(parameters)
    // state descriptor: the state's name and type
    val itemsStateDesc = new ListStateDescriptor[PageViewCount]("itemState-state", classOf[PageViewCount])
    // obtain the state handle
    itemState = getRuntimeContext.getListState(itemsStateDesc)
  }

  override def processElement(i: PageViewCount, context: KeyedProcessFunction[Long, PageViewCount, String]#Context, collector: Collector[String]): Unit = {
    itemState.add(i)
    // timer at windowEnd + 1 ms: fires once the watermark passes the window end
    context.timerService().registerEventTimeTimer(i.windowEnd + 1)
  }

  override def onTimer(timestamp: Long, ctx: KeyedProcessFunction[Long, PageViewCount, String]#OnTimerContext, out: Collector[String]): Unit = {
    val allPageViewCount: ListBuffer[PageViewCount] = ListBuffer.empty
    val iterable = itemState.get().iterator()
    while (iterable.hasNext) {
      allPageViewCount += iterable.next()
    }
    itemState.clear()
    val sortedItems = allPageViewCount.sortBy(_.count)(Ordering.Long.reverse).take(topSize)
    // format the ranking as a String for printing
    val result: StringBuilder = new StringBuilder
    result.append("====================================\n")
    result.append("Time: ").append(new Timestamp(timestamp - 1)).append("\n")
    for (i <- sortedItems.indices) {
      val currentItem: PageViewCount = sortedItems(i)
      // e.g. No1: URL=/presedent views=2413
      result.append("No").append(i + 1).append(":")
        .append(" URL=").append(currentItem.url)
        .append(" views=").append(currentItem.count).append("\n")
    }
    result.append("====================================\n\n")
    // throttle the output to simulate a rolling display
    Thread.sleep(1000)
    out.collect(result.toString)
  }
}
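To reproduce the traces above, start the socket source before launching the job and paste the test data from section 6 line by line, e.g. with netcat:

nc -lk 1111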
6. Test data:
Test data without allowedLateness and the side output:
89.101 - - 17/05/2015:10:25:49 +0000 GET /presedent
89.101 - - 17/05/2015:10:25:50 +0000 GET /presedent
89.101 - - 17/05/2015:10:25:51 +0000 GET /presedent
89.101 - - 17/05/2015:10:25:52 +0000 GET /presedent
89.101 - - 17/05/2015:10:25:56 +0000 GET /presedent
89.101 - - 17/05/2015:10:25:57 +0000 GET /presedent
89.101 - - 17/05/2015:10:26:02 +0000 GET /presedent
Test data with allowedLateness and the side output:
89.101 - - 17/05/2015:10:25:49 +0000 GET /presedent
89.101 - - 17/05/2015:10:25:50 +0000 GET /presedent
89.101 - - 17/05/2015:10:25:51 +0000 GET /presedent
89.101 - - 17/05/2015:10:25:52 +0000 GET /presedent
89.101 - - 17/05/2015:10:25:46 +0000 GET /presedent
89.101 - - 17/05/2015:10:25:53 +0000 GET /presedent
89.101 - - 17/05/2015:10:25:31 +0000 GET /presedent
89.101 - - 17/05/2015:10:14:51 +0000 GET /presedent
7. Problem: late data is not re-ranked against the earlier results
With out-of-orderness handled, one problem remains. When late data arrives within the 1-minute lateness period, the aggregation re-fires and emits an updated result, e.g. (url1,3) becomes (url1,4). We want (url1,4) to replace (url1,3) in the ranking and re-sort, so the state must keep all of the window's results and be cleared only once the window can no longer update, i.e. at windowEnd + 1 min.
So we switch to a MapState, keyed by url so that an updated count overwrites the old one, plus an extra timer at windowEnd + 1 min that clears the state.
Late input is now re-ranked together with the earlier results.
Sample output:
dataStream> ApacheLogEvent(89.101,-,1421461549000,GET,/presedent)
dataStream> ApacheLogEvent(89.101,-,1421461546000,GET,/pre)
dataStream> ApacheLogEvent(89.101,-,1421461551000,GET,/presedent)
aggStream> PageViewCount(/presedent,1421461550000,1)
aggStream> PageViewCount(/pre,1421461550000,1)
dataStream> ApacheLogEvent(89.101,-,1421461552000,GET,/presedent)
top N> ====================================
Time: 2015-01-17 10:25:50.0
No1: URL=/pre views=1
No2: URL=/presedent views=1
====================================
dataStream> ApacheLogEvent(89.101,-,1421461546000,GET,/presedent)
aggStream> PageViewCount(/presedent,1421461550000,2)
dataStream> ApacheLogEvent(89.101,-,1421461553000,GET,/presedent)
top N> ====================================
Time: 2015-01-17 10:25:50.0
No1: URL=/presedent views=2
No2: URL=/pre views=1
====================================
The improved code:
package flinkProject

import java.sql.Timestamp
import java.text.SimpleDateFormat
import java.util
import java.util.Map

import org.apache.flink.api.common.functions.AggregateFunction
import org.apache.flink.api.common.state.{ListState, ListStateDescriptor, MapState, MapStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment, _}
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

import scala.collection.mutable.ListBuffer

// event type (one parsed log line)
case class ApacheLogEvent(ip: String, userid: String, timestamp: Long, methord: String, url: String)

// per-window aggregation result
case class PageViewCount(url: String, windowEnd: Long, count: Long)

object HotNewworkLogFlow {
  def main(args: Array[String]): Unit = {
    val executionEnvironment: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    executionEnvironment.setParallelism(1)
    executionEnvironment.setStreamTimeCharacteristic(TimeCharacteristic.EventTime) // watermarks are generated periodically, every 200 ms by default
    // executionEnvironment.getConfig.setAutoWatermarkInterval(500) // change the periodic interval to 500 ms if needed

    val stream2: DataStream[String] = executionEnvironment.socketTextStream("127.0.0.1", 1111)
    val transforStream: DataStream[ApacheLogEvent] = stream2.map(data => {
      val tmpList = data.split(" ")
      // caution: "mm" means minutes, not months ("MM"), so the month defaults to January here;
      // this is why the sample output above prints dates like 2015-01-17
      val simpleDateFormat = new SimpleDateFormat("dd/mm/yy:HH:mm:ss")
      val ts = simpleDateFormat.parse(tmpList(3)).getTime
      ApacheLogEvent(tmpList(0), tmpList(1), ts, tmpList(5), tmpList(6))
    })

    // assign timestamps and watermarks; "punctuated" assigners emit per event, "periodic" ones on a
    // timer (the usual choice; BoundedOutOfOrdernessTimestampExtractor is a periodic assigner)
    val transforEventStream: DataStream[ApacheLogEvent] = transforStream.assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[ApacheLogEvent](Time.seconds(1)) {
      override def extractTimestamp(t: ApacheLogEvent) = t.timestamp
    })

    val lateOutputTag = new OutputTag[ApacheLogEvent]("late")
    val aggResult: DataStream[PageViewCount] = transforEventStream
      .filter(_.methord == "GET")
      .keyBy(_.url)
      .timeWindow(Time.minutes(10), Time.seconds(5))
      .allowedLateness(Time.minutes(1))
      .sideOutputLateData(lateOutputTag)
      .aggregate(new PageCountAgg, new PageCountWindow)

    transforStream.print("dataStream")
    aggResult.print("aggStream")
    aggResult.getSideOutput(lateOutputTag).print()
    aggResult.keyBy(_.windowEnd).process(new HotProcessFunction(5)).print("top N")
    executionEnvironment.execute("transform")
  }
}

// incremental per-key counter: the accumulator is just the running count
class PageCountAgg extends AggregateFunction[ApacheLogEvent, Long, Long] {
  override def createAccumulator(): Long = 0L
  override def add(value: ApacheLogEvent, accumulator: Long): Long = accumulator + 1
  override def getResult(accumulator: Long): Long = accumulator
  override def merge(a: Long, b: Long): Long = a + b
}

// wraps the pre-aggregated count together with its key and window end
class PageCountWindow extends WindowFunction[Long, PageViewCount, String, TimeWindow] {
  override def apply(key: String, window: TimeWindow, input: Iterable[Long], out: Collector[PageViewCount]): Unit = {
    out.collect(PageViewCount(key, window.getEnd, input.iterator.next()))
  }
}

class HotProcessFunction(topSize: Int) extends KeyedProcessFunction[Long, PageViewCount, String] {
  var itemState: ListState[PageViewCount] = _
  var hotUrlState: MapState[String, Long] = _

  override def open(parameters: Configuration): Unit = {
    super.open(parameters)
    // state descriptor: the state's name and type
    // val itemsStateDesc = new ListStateDescriptor[PageViewCount]("itemState-state", classOf[PageViewCount])
    val hotUrlStateDesc = new MapStateDescriptor[String, Long]("hot-url-state", classOf[String], classOf[Long])
    // obtain the state handle
    // itemState = getRuntimeContext.getListState(itemsStateDesc)
    hotUrlState = getRuntimeContext.getMapState(hotUrlStateDesc)
  }

  override def processElement(i: PageViewCount, context: KeyedProcessFunction[Long, PageViewCount, String]#Context, collector: Collector[String]): Unit = {
    // itemState.add(i)
    context.timerService().registerEventTimeTimer(i.windowEnd + 1)
    // keyed by url: an updated count for the same url overwrites the old entry
    hotUrlState.put(i.url, i.count)
    // cleanup timer: at windowEnd + 1 min the window is fully closed and can no longer
    // produce aggregate updates, so the state can safely be cleared
    context.timerService().registerEventTimeTimer(i.windowEnd + 60000L)
  }

  // timestamp is the firing timer's timestamp; every timer registered above arrives here
  override def onTimer(timestamp: Long, ctx: KeyedProcessFunction[Long, PageViewCount, String]#OnTimerContext, out: Collector[String]): Unit = {
    // val allPageViewCount: ListBuffer[PageViewCount] = ListBuffer.empty
    // val iterable = itemState.get().iterator()
    // while (iterable.hasNext) {
    //   allPageViewCount += iterable.next()
    // }
    // itemState.clear()

    // the cleanup timer fired (the current key is the windowEnd): clear the state and stop
    if (timestamp == ctx.getCurrentKey + 60000L) {
      hotUrlState.clear()
      return
    }
    val allPageViewCount: ListBuffer[(String, Long)] = ListBuffer()
    val iterable: util.Iterator[Map.Entry[String, Long]] = hotUrlState.entries().iterator()
    while (iterable.hasNext) {
      val map: Map.Entry[String, Long] = iterable.next()
      allPageViewCount += ((map.getKey, map.getValue))
    }
    // note: no clear() here; the state must survive until the cleanup timer fires
    val sortedItems = allPageViewCount.sortBy(_._2)(Ordering.Long.reverse).take(topSize)
    // format the ranking as a String for printing
    val result: StringBuilder = new StringBuilder
    result.append("====================================\n")
    result.append("Time: ").append(new Timestamp(timestamp - 1)).append("\n")
    for (i <- sortedItems.indices) {
      val currentItem: (String, Long) = sortedItems(i)
      // e.g. No1: URL=/presedent views=2413
      result.append("No").append(i + 1).append(":")
        .append(" URL=").append(currentItem._1)
        .append(" views=").append(currentItem._2).append("\n")
    }
    result.append("====================================\n\n")
    // throttle the output to simulate a rolling display
    Thread.sleep(1000)
    out.collect(result.toString)
  }
}