I. Introduction:

1. Introduction:

Spark Streaming, similar to Apache Storm, is used for processing streaming data. According to its official documentation, Spark Streaming features high throughput and strong fault tolerance. It supports many kinds of input sources, such as Kafka, Flume, Twitter, ZeroMQ, and plain TCP sockets. Once data has arrived, it can be processed with Spark's high-level primitives such as map, reduce, join, and window, and the results can be stored in many places, such as HDFS or a database. Spark Streaming also integrates seamlessly with MLlib (machine learning) and GraphX.

2. Architecture diagram: input - transform - output

3. Discretized streams: the data collected over each time interval forms an RDD

Much like Spark's own RDD concept, Spark Streaming uses a discretized stream, called a DStream, as its abstraction. A DStream is a sequence of data arriving over time. Internally, the data received in each time interval is stored as an RDD, and the DStream is the sequence made up of these RDDs (hence the name "discretized").
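Because each batch is simply an RDD, you can drop down to the RDD API from a DStream. Below is a minimal sketch (the object name is illustrative, and it assumes the same local master and node01:9999 socket source as the word-count example later in this post); foreachRDD just exposes the RDD behind each time interval:

package day06_sparkStreaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object Demo_dstreamAsRdds {
    def main(args: Array[String]): Unit = {
        val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("dstreamAsRdds")
        //  every 5-second interval becomes one RDD
        val ssc: StreamingContext = new StreamingContext(conf, Seconds(5))
        val lineDStream = ssc.socketTextStream("node01", 9999)
        //  foreachRDD is an output operation that hands you the RDD of each batch
        lineDStream.foreachRDD { rdd =>
            println(s"this batch holds ${rdd.count()} lines")
        }
        ssc.start()
        ssc.awaitTermination()
    }
}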

4. Input - transform - output:

DStreams can be created from many input sources, such as Flume, Kafka, or HDFS. A DStream supports two kinds of operations: transformations, which produce a new DStream, and output operations, which write data to an external system. DStreams offer many operations similar to those supported by RDDs, and add new time-related operations such as sliding windows.
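As a short sketch of both kinds of operations, plus a time-based one (the object name and the 30-second window with a 10-second slide are purely illustrative; the source is again the node01:9999 socket used in the example below):

package day06_sparkStreaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object Demo_windowWc {
    def main(args: Array[String]): Unit = {
        val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("windowWc")
        val ssc: StreamingContext = new StreamingContext(conf, Seconds(5))
        val lineDStream = ssc.socketTextStream("node01", 9999)
        //  transformations : each call returns a new DStream
        val pairDStream = lineDStream.flatMap(_.split(" ")).map((_, 1))
        //  time-based transformation : word counts over the last 30 seconds, recomputed every 10 seconds
        val windowedDStream = pairDStream.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
        //  output operation : writes the result out ( here it is printed ), which triggers the computation
        windowedDStream.print()
        ssc.start()
        ssc.awaitTermination()
    }
}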

5. Comparison of Spark and Storm

II. Abstraction:

1. DStream: discretized stream

  1. Unlike Storm,
  2. it is not a continuous, unbroken flow of water;
  3. it is individual drops of water, one drop at a time.

2. How to use it:

Batch interval: you choose it yourself.

3. Timeline vs. business logic:

4. Spark Streaming overall architecture:

III. Getting-started example:

1. wc: the idea

Do word count (wc) with Spark Streaming.

2. Install nc:

yum install -y nc.x86_64
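
Once installed, start a listener on the port and type lines into it (port 9999 here simply matches the code below; any free port works):

nc -lk 9999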

3. Plan:

  1. Use nc to send data to the specified port.
  2. Use Spark Streaming to monitor that port.
  3. Process the data that arrives.

4. Logging control: log4j.properties

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# Set everything to be logged to the console
log4j.rootCategory=error, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Set the default spark-shell log level to WARN. When running the spark-shell, the
# log level for this class is used to overwrite the root logger's log level, so that
# the user can have different defaults for the shell and regular Spark apps.
log4j.logger.org.apache.spark.repl.Main=WARN

# Settings to quiet third party logs that are too verbose
log4j.logger.org.spark_project.jetty=WARN
log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR

# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR

5. Code:

package day06_sparkStreaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object Demo01_wc {
    def main(args: Array[String]): Unit = {
        //  SparkConf
        val ssWc: SparkConf = new SparkConf().setMaster("local[*]").setAppName("ssWc")
        //  ssc, with a 5-second batch interval
        val ssc: StreamingContext = new StreamingContext(ssWc,Seconds(5))
        //  listen on the port : yields the data line by line ( essentially no different from textFile )
        val lineDStream: ReceiverInputDStream[String] = ssc.socketTextStream("node01",9999)
        //  split
        val wordsDStream: DStream[String] = lineDStream.flatMap(_.split(" "))
        //  map each word to a (word, 1) tuple
        val yuanDStream: DStream[(String, Int)] = wordsDStream.map((_,1))
        //  compute
        yuanDStream.reduceByKey(_+_).print()
        ssc.start()
        ssc.awaitTermination()
    }
}

6. Test:

  1. Start nc on node01.
  2. Start the program in IDEA to monitor the port.
  3. Send data from node01.

7. Expected result: data is collected once every five seconds

IV. Custom receiver:

1. Type of received data - String; default storage level - memory:

package day06_sparkStreaming

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class CustomerReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY)  {
    override def onStart(): Unit = ???
    override def onStop(): Unit = ???
}

2. Full code:

package day06_sparkStreaming

import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import java.nio.charset.StandardCharsets

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class CustomerReceiver(host: String, port: Int) extends Receiver[String](StorageLevel.MEMORY_ONLY)  {
    //  called when the receiver is started ( core code )
    override def onStart(): Unit = {
        //  start a separate thread to receive data ( onStart must return quickly, so the blocking loop runs on its own thread )
        new Thread("receiver"){
            override def run() : Unit = {
                //  receive the data and hand it to the framework
                receive()
            }
        }.start()
    }
    //  called when the receiver is stopped
    override def onStop(): Unit = {}
    //  receive the data
    def receive():Unit = {
        var socket: Socket = null
        var userInput: String = null
        try {
            //  open the connection
            socket = new Socket(host, port)
            //  wrap the input stream in a reader
            val reader = new BufferedReader(new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8))
            //  read one line
            userInput = reader.readLine()
            //  keep reading until stopped or the stream ends
            while(!isStopped && userInput != null) {
                //  store the data ( hand it over to Spark )
                store(userInput)
                //  read the next line
                userInput = reader.readLine()
            }
            reader.close()
            socket.close()
            //  try to reconnect
            restart("Trying to connect again")
        } catch {
            case e: java.net.ConnectException =>
                // restart if could not connect to server
                restart("Error connecting to " + host + ":" + port, e)
            case t: Throwable =>
                // restart if there is any other error
                restart("Error receiving data", t)
        }
    }
}

3. Using the custom receiver:

package day06_sparkStreaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object Demo01_wc {
    def main(args: Array[String]): Unit = {
        //  SparkConf
        val ssWc: SparkConf = new SparkConf().setMaster("local[*]").setAppName("ssWc")
        //  ssc, with a 5-second batch interval
        val ssc: StreamingContext = new StreamingContext(ssWc,Seconds(5))
        //  read the port line by line, now via the custom receiver instead of socketTextStream
        //  val lineDStream: ReceiverInputDStream[String] = ssc.socketTextStream("node01",9999)
        val lineDStream: ReceiverInputDStream[String] = ssc.receiverStream(new CustomerReceiver("node01",9999))
        //  split
        val wordsDStream: DStream[String] = lineDStream.flatMap(_.split(" "))
        //  map each word to a (word, 1) tuple
        val yuanDStream: DStream[(String, Int)] = wordsDStream.map((_,1))
        //  compute
        yuanDStream.reduceByKey(_+_).print()
        ssc.start()
        ssc.awaitTermination()
    }
}

4. Test:

  1. Start nc on node01.
  2. Start the program in IDEA to monitor the port.
  3. Send data from node01.
  4. Result: the word counts appear in IDEA.