Spark insert overwrite metadata not updating: Spark Streaming data sources

Reposted

mob64ca1410eb61 2024-03-10 23:37:04

Tags: FileStream, data, IDE. Category: Spark, Big Data

Basic data sources

1. File stream

Reads data from files.

lines= ssc.textFileStream("file:///usr/local/spark/mycode/streaming/logfile")

2. Socket stream

Spark Streaming can listen on a socket port, receive the data that arrives there, and then process it.

JavaReceiverInputDStream<String> lines = jsc.socketTextStream("weekend10", 9999); // 9999 is the port opened with nc -lk 9999

3. RDD queue stream

When debugging a Spark Streaming application, we can use streamingContext.queueStream(queueOfRDD) to create a DStream backed by a queue of RDDs.

Advanced data sources

1. Apache Kafka as a DStream data source

2. Apache Flume as a DStream data source

3. DStream transformation operations

4. DStream output operations

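foreachRDD operator: it exposes the underlying RDDs of a DStream one by one. Inside foreachRDD, any code that sits outside the RDD operators runs on the Driver. An action operator must be called on the extracted RDD, otherwise the RDD code never executes.

foreachRDD runs once per batch interval, so it can be used to change a broadcast variable dynamically.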
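The "once per batch" behaviour can be sketched without Spark. The following is a plain-Java model, not Spark code: `PerBatchRefreshModel`, `process` and the blacklist loader are hypothetical names, and a list of lists stands in for the DStream. It only illustrates that the outer step runs once per batch while the inner step runs once per record.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Supplier;

/** Plain-Java model (no Spark): a lookup set is reloaded once per batch. */
final class PerBatchRefreshModel {

    /** Reloads the blacklist once per batch, then filters every record of that batch. */
    static List<List<String>> process(List<List<String>> batches,
                                      Supplier<Set<String>> blacklistLoader) {
        List<List<String>> output = new ArrayList<>();
        for (List<String> batch : batches) {
            // Once per batch: models code inside foreachRDD but outside the RDD operators.
            Set<String> blacklist = blacklistLoader.get();
            List<String> kept = new ArrayList<>();
            for (String record : batch) {
                // Once per record: models the body of an RDD operator.
                if (!blacklist.contains(record)) {
                    kept.add(record);
                }
            }
            output.add(kept);
        }
        return output;
    }

    public static void main(String[] args) {
        final int[] loads = {0};
        // Hypothetical rule: "b" is blacklisted from the second load onwards.
        Supplier<Set<String>> loader = () -> {
            loads[0]++;
            Set<String> blacklist = new HashSet<>();
            if (loads[0] >= 2) {
                blacklist.add("b");
            }
            return blacklist;
        };
        List<List<String>> batches = Arrays.asList(
                Arrays.asList("a", "b"), Arrays.asList("a", "b"));
        System.out.println(process(batches, loader)); // [[a, b], [a]]
        System.out.println("loads = " + loads[0]);    // loads = 2
    }
}
```

In a real job the reloaded value would be re-broadcast on the Driver; that part is deliberately left out of this model.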

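Note:

/*
* counts is a DStream, and foreachRDD is an output operator.
* foreachRDD gives access to the RDDs inside the DStream.
* The argument of the function passed to foreachRDD is the RDD of the current batch. Any RDD of pair type is a (K, V) RDD.
* Inside foreachRDD, everything outside the RDD operators runs on the Driver.
*/
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

 /**
  * 1. In local mode the number of simulated threads must be at least 2: one thread is occupied by the receiver (the thread that receives data), and another one executes the jobs.
  * 2. The Durations setting is the latency we are willing to accept; tune it according to the cluster resources and the observed execution time of each job.
  * 3. A JavaStreamingContext can be created in two ways (from a SparkConf or from a SparkContext).
  * 4. Once the business logic is defined, an output operator is required.
  * 5. After JavaStreamingContext.start() has started the streaming framework, no further business logic can be added.
  * 6. The no-argument JavaStreamingContext.stop() also stops the SparkContext (the flag defaults to true); stop(false) leaves the SparkContext running.
  * 7. After JavaStreamingContext.stop(), start() cannot be called again.
  */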
 public class WordCountOnline {
@SuppressWarnings("deprecation")
public static void main(String[] args) {
SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("WordCountOnline");
/**
* The batch interval is set when the StreamingContext is created.
*/
JavaStreamingContext jsc = new JavaStreamingContext(conf, Durations.seconds(5));
 // JavaSparkContext sc = new JavaSparkContext(conf);
 // JavaStreamingContext jsc = new JavaStreamingContext(sc,Durations.seconds(5));
JavaReceiverInputDStream<String> lines = jsc.socketTextStream("node5", 9999);
JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
private static final long serialVersionUID = 1L;
@Override
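// Spark 1.x signature; from Spark 2.0 on, call() must return Iterator<String> (e.g. Arrays.asList(...).iterator()).
public Iterable<String> call(String s) {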
return Arrays.asList(s.split(" "));
}
});
JavaPairDStream<String, Integer> ones = words.mapToPair(new PairFunction<String, String, Integer>() {
private static final long serialVersionUID = 1L;
@Override
public Tuple2<String, Integer> call(String s) {
return new Tuple2<String, Integer>(s, 1);
}
});
JavaPairDStream<String, Integer> counts = ones.reduceByKey(new Function2<Integer, Integer, Integer>() {
private static final long serialVersionUID = 1L;
public Integer call(Integer i1, Integer i2) {
return i1 + i2;
}
});
// Output operator: at least one must be registered, otherwise start() fails.
counts.print();
   /*counts.foreachRDD(new VoidFunction<JavaPairRDD<String,Integer>>() {
private static final long serialVersionUID = 1L;
@Override
public void call(JavaPairRDD<String, Integer> pairRDD) throws Exception {
pairRDD.foreach(new VoidFunction<Tuple2<String,Integer>>() {
private static final long serialVersionUID = 1L;
@Override
public void call(Tuple2<String, Integer> tuple)
throws Exception {
System.out.println("tuple ---- "+tuple );
}
});
}
});*/
   jsc.start();
   // Wait until the Spark application is terminated
   jsc.awaitTermination();
   jsc.stop(false);
}
 }
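To make the per-batch result concrete, here is a small plain-Java model of what the pipeline above computes for a single batch (split, pair with 1, sum per key). It does not use Spark; `WordCountBatchModel` and `countWords` are illustrative names only.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java model (no Spark) of the word count applied to one batch of lines. */
final class WordCountBatchModel {

    /** Splits each line on spaces and sums a count of 1 per word. */
    static Map<String, Integer> countWords(Iterable<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {      // flatMap step
                counts.merge(word, 1, Integer::sum);   // mapToPair + reduceByKey steps
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Each batch is counted on its own; nothing carries over to the next batch.
        System.out.println(countWords(Arrays.asList("a b a", "b c"))); // {a=2, b=2, c=1}
        System.out.println(countWords(Arrays.asList("a")));            // {a=1}
    }
}
```

The second call shows the point of the model: without a stateful operator, counts restart from zero in every batch.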

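transform operator: it exposes the underlying RDDs of a DStream one by one, so that arbitrary RDD-to-RDD operations can be applied to the RDDs of the DStream.

Inside transform, any code that sits outside the RDD operators runs on the Driver.

transform runs once per batch interval, so it can also be used to change a broadcast variable dynamically.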

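import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

/**
  * transform:
  * Returns a new DStream by applying an RDD-to-RDD function to every RDD of the source DStream. This can be used to perform arbitrary RDD operations on a DStream.
  *
  */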
 public class Operate_transform {
public static void main(String[] args) {
SparkConf conf = new SparkConf().setMaster("local").setAppName("Operate_transform");
JavaStreamingContext jsc = new JavaStreamingContext(conf,Durations.seconds(5));
JavaDStream<String> textFileStream = jsc.textFileStream("data");
/*
* What transform does: it obtains each RDD of the DStream and changes its format, for example from a non-(K, V) RDD to a (K, V) RDD.
* The RDDs of the returned DStream then have the converted format.
*/


textFileStream.transform(new Function<JavaRDD<String>,JavaRDD<String>>(){
private static final long serialVersionUID = 1L;
public JavaRDD<String> call(JavaRDD<String> v1) throws Exception {
v1.foreach(new VoidFunction<String>() {
private static final long serialVersionUID = 1L;
public void call(String t) throws Exception {
System.err.println("**************"+t);
}
});
return v1;
}

}).print();
jsc.start();
jsc.awaitTermination();
jsc.close();
}
 }
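The idea of applying one batch-to-batch function to every batch can be modeled in plain Java. This sketch does not use Spark: `TransformModel`, `transform` and `toPairs` are hypothetical names, and lists stand in for RDDs. The example function performs the format change mentioned above, from plain strings to (word, 1) pairs.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Plain-Java model (no Spark): one batch-to-batch function applied to every batch. */
final class TransformModel {

    /** Applies batchFunction to each batch and collects the converted batches. */
    static <T, R> List<List<R>> transform(List<List<T>> batches,
                                          Function<List<T>, List<R>> batchFunction) {
        List<List<R>> result = new ArrayList<>();
        for (List<T> batch : batches) {
            result.add(batchFunction.apply(batch));
        }
        return result;
    }

    /** Example batch function: converts plain strings into (word, 1) pairs. */
    static List<Map.Entry<String, Integer>> toPairs(List<String> batch) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : batch) {
            pairs.add(new SimpleEntry<>(word, 1));
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<List<String>> batches = Arrays.asList(
                Arrays.asList("a", "b"), Arrays.asList("c"));
        System.out.println(transform(batches, TransformModel::toPairs)); // [[a=1, b=1], [c=1]]
    }
}
```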

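updateStateByKey operator: it is a transformation operator.

What updateStateByKey does: it keeps running statistics over all the data received since the very first batch, rather than over each batch separately. It cannot produce statistics restricted to a particular day or hour.

Note: continuously updating the state of every key necessarily involves saving that state and making it fault tolerant, so the checkpoint mechanism must be enabled.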

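import java.util.Arrays;
import java.util.List;
import com.google.common.base.Optional; // Spark 1.x; on Spark 2.x and later use org.apache.spark.api.java.Optional
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

/**
  * updateStateByKey:
  * Returns a new "state" DStream in which the value of each key is updated by applying the given func to its previous state. This can be used to maintain arbitrary state data per key.
  * Note: it operates on a DStream of (K, V) pairs.
  * 
  * Main features of updateStateByKey:
  * 1. Spark Streaming maintains a state for every key. The state can be of any type, including a custom object, and the update function can be custom as well.
  * 2. The update function keeps updating the state of each key. For every new batch, Spark Streaming updates the state of the keys that already exist
  *    (the update function is applied to newly appearing keys in the same way).
  * Continuously updating the state of every key necessarily involves saving that state and making it fault tolerant, so the checkpoint mechanism must be enabled.
  * 
  * @author root
  *
  */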
 public class Operate_updateStateByKey {
public static void main(String[] args) {
SparkConf conf = new SparkConf().setMaster("local").setAppName("Operate_count");
JavaStreamingContext jsc = new JavaStreamingContext(conf,Durations.seconds(5));
jsc.checkpoint("checkpoint");
JavaDStream<String> textFileStream = jsc.textFileStream("data");
/**
* Implements a cumulative word count.
*/
JavaPairDStream<String, Integer> mapToPair = textFileStream.flatMap(new FlatMapFunction<String, String>() {
private static final long serialVersionUID = 1L;
public Iterable<String> call(String t) throws Exception {
return Arrays.asList(t.split(" "));
}
}).mapToPair(new PairFunction<String, String, Integer>() {
private static final long serialVersionUID = 1L;
public Tuple2<String, Integer> call(String t) throws Exception {
return new Tuple2<String, Integer>(t.trim(), 1);
}
});
JavaPairDStream<String, Integer> updateStateByKey = mapToPair.updateStateByKey(new Function2<List<Integer>, Optional<Integer>, Optional<Integer>>() {
private static final long serialVersionUID = 1L;
public Optional<Integer> call(List<Integer> values, Optional<Integer> state)
throws Exception {
/**
* values: the values of this key in the current batch after grouping, e.g. [1,1,1,1,1]
* state: the state of this key before the current batch
*/
Integer updateValue = 0;
if(state.isPresent()){
updateValue = state.get();
}
for(Integer i : values){
updateValue += i;
}
return Optional.of(updateValue);
}
});
updateStateByKey.print();
jsc.start();
jsc.awaitTermination();
jsc.close();
}
 }
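The running-total semantics can be checked with a plain-Java model that feeds two batches through an update function of the same shape as the one above. This is not Spark code: `UpdateStateModel` and `applyBatch` are hypothetical names, `java.util.Optional` replaces Spark's Optional, and an in-memory map plays the role of the checkpointed state.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

/** Plain-Java model (no Spark) of per-key state that survives across batches. */
final class UpdateStateModel {

    /** New values of the batch plus the previous state give the new state. */
    static Optional<Integer> update(List<Integer> values, Optional<Integer> state) {
        int updated = state.orElse(0);
        for (int value : values) {
            updated += value;
        }
        return Optional.of(updated);
    }

    /** Folds one batch of words into the state map kept between batches. */
    static void applyBatch(Map<String, Integer> state, List<String> words) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String key : state.keySet()) {
            grouped.put(key, new ArrayList<Integer>()); // known keys are revisited even without new data
        }
        for (String word : words) {
            grouped.computeIfAbsent(word, k -> new ArrayList<Integer>()).add(1);
        }
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            Optional<Integer> next =
                    update(entry.getValue(), Optional.ofNullable(state.get(entry.getKey())));
            if (next.isPresent()) {
                state.put(entry.getKey(), next.get());
            } else {
                state.remove(entry.getKey()); // an empty result drops the key
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> state = new TreeMap<>();
        applyBatch(state, Arrays.asList("a", "b", "a"));
        System.out.println(state); // {a=2, b=1}
        applyBatch(state, Arrays.asList("a", "c"));
        System.out.println(state); // {a=3, b=1, c=1}
    }
}
```

The count for "b" survives the second batch even though "b" does not appear in it, which is the difference from a per-batch reduceByKey.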

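Window function: reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks]): processes the data received within a given period of time.

When it is called on a DStream of (K, V) pairs, the values of each key are aggregated with the given func, and a new DStream of (K, V) pairs is returned.

In the method below, Durations.seconds(20) is the window length (20 seconds) and Durations.seconds(10) is the slide interval. In other words, every 10 seconds the window function processes the data of the last 20 seconds.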

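import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class Operate_reduceByKeyAndWindow {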
public static void main(String[] args) {
SparkConf conf = new SparkConf().setMaster("local").setAppName("Operate_countByWindow");
JavaStreamingContext jsc = new JavaStreamingContext(conf,Durations.seconds(5));
jsc.checkpoint("checkpoint");
JavaDStream<String> textFileStream = jsc.textFileStream("data");
/**
* First convert textFileStream into tuples in order to count the words.
*/
JavaPairDStream<String, Integer> mapToPair = textFileStream.flatMap(new FlatMapFunction<String, String>() {

private static final long serialVersionUID = 1L;

public Iterable<String> call(String t) throws Exception {
return Arrays.asList(t.split(" "));
}
}).mapToPair(new PairFunction<String, String, Integer>() {

private static final long serialVersionUID = 1L;

public Tuple2<String, Integer> call(String t) throws Exception {
return new Tuple2<String, Integer>(t.trim(), 1);
}
});

JavaPairDStream<String, Integer> reduceByKeyAndWindow = 
mapToPair.reduceByKeyAndWindow(new Function2<Integer,Integer,Integer>(){
private static final long serialVersionUID = 1L;

public Integer call(Integer v1, Integer v2) throws Exception {
return v1+v2;
}

}, Durations.seconds(20), Durations.seconds(10));


reduceByKeyAndWindow.print();

jsc.start();
jsc.awaitTermination();
jsc.close();
}
 }

Diagram of the window operation:

[Figure: window operations on a DStream]

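Assume one batch every 5 s. In the figure above, the window length is 15 s and the slide interval is 10 s.

The window length and the slide interval must both be integer multiples of the batch interval. If they are not, the check fails with an error.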
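The numbers above (5 s batches, 15 s window, 10 s slide) can be worked through with a small plain-Java model. It is a sketch that does not use Spark: `WindowModel` and `batchesInWindow` are hypothetical names, and each batch is identified by its end time in seconds, which is a labeling assumption made for this example.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model (no Spark): which batches fall into each sliding window. */
final class WindowModel {

    /** Returns the end times (seconds) of the batches covered by the window ending at windowEnd. */
    static List<Integer> batchesInWindow(int batchSec, int windowSec, int slideSec, int windowEnd) {
        if (windowSec % batchSec != 0 || slideSec % batchSec != 0) {
            throw new IllegalArgumentException(
                    "window length and slide interval must be multiples of the batch interval");
        }
        List<Integer> batches = new ArrayList<>();
        for (int t = windowEnd - windowSec + batchSec; t <= windowEnd; t += batchSec) {
            if (t > 0) { // nothing exists before the stream started
                batches.add(t);
            }
        }
        return batches;
    }

    public static void main(String[] args) {
        // One window is evaluated per slide interval: at 10 s, 20 s, 30 s.
        for (int end = 10; end <= 30; end += 10) {
            System.out.println("window ending at " + end + "s: " + batchesInWindow(5, 15, 10, end));
        }
        // window ending at 10s: [5, 10]
        // window ending at 20s: [10, 15, 20]
        // window ending at 30s: [20, 25, 30]
    }
}
```

Because the window is longer than the slide, consecutive windows share one batch (the batches ending at 10 s and 20 s here), which is the overlap that the optimized variant avoids recomputing.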

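Optimizing the window function:

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

/**
  * reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks]):
  * windowLength: the duration of the window
  * slideInterval: the interval at which the window operation is performed
  * This is a more efficient version of the previous reduceByKeyAndWindow():
  * the reduce value of each window is computed incrementally from the reduce value of the previous window,
  * by reducing the new data that enters the sliding window and "inverse reducing" the old data that leaves it.
  * An example is "adding" and "subtracting" the counts of keys as the window slides.
  * However, it applies only to "invertible reduce functions", i.e. reduce functions that have a corresponding "inverse reduce" function (passed as the invFunc parameter).
  * As with reduceByKeyAndWindow, the number of reduce tasks can be configured through an optional argument.
  * Note that checkpointing must be enabled to use this operation, i.e. the optimized window function requires a checkpoint.
  * In other words, the reduceByKeyAndWindow that takes a single function recomputes every window from all the batches it contains, which is inefficient because the same data is processed repeatedly.
  * The reduceByKeyAndWindow that takes two functions computes the result of each key from the previously computed result (this is easiest to see in a diagram).
  * @author root
  *
  */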
 public class Operate_reduceByKeyAndWindow_2 {
public static void main(String[] args) {
SparkConf conf = new SparkConf().setMaster("local").setAppName("Operate_countByWindow");
JavaStreamingContext jsc = new JavaStreamingContext(conf,Durations.seconds(5));
jsc.checkpoint("checkpoint");
JavaDStream<String> textFileStream = jsc.textFileStream("data");
/**
* First convert textFileStream into tuples in order to count the words.
*/
JavaPairDStream<String, Integer> mapToPair = textFileStream.flatMap(new FlatMapFunction<String, String>() {
private static final long serialVersionUID = 1L;
public Iterable<String> call(String t) throws Exception {
return Arrays.asList(t.split(" "));
}
}).mapToPair(new PairFunction<String, String, Integer>() {
private static final long serialVersionUID = 1L;
public Tuple2<String, Integer> call(String t) throws Exception {
return new Tuple2<String, Integer>(t.trim(), 1);
}
});

JavaPairDStream<String, Integer> reduceByKeyAndWindow = mapToPair.reduceByKeyAndWindow(new Function2<Integer, Integer, Integer>() {
private static final long serialVersionUID = 1L;
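/**
* Here v1 is the value accumulated so far for the key (when a batch has left the window, v1 is the value returned by the second function below); v2 is the value newly read in this round. */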
public Integer call(Integer v1, Integer v2) throws Exception {
System.out.println("***********v1*************"+v1);
System.out.println("***********v2*************"+v2);
return v1+v2;
}
}, new Function2<Integer,Integer,Integer>(){

private static final long serialVersionUID = 1L;
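/**
* This second Function2 is only invoked once windowLength has elapsed. v1 is the latest aggregated value of the key as produced by the function above (the result of the previous window),
* v2 is the value of the batches that left the window when it slid.
* The returned value becomes the v1 input of the function above.
*/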
public Integer call(Integer v1, Integer v2) throws Exception {

System.out.println("^^^^^^^^^^^v1^^^^^^^^^^^^^"+v1);
System.out.println("^^^^^^^^^^^v2^^^^^^^^^^^^^"+v2);

 // return v1-v2-1; // would make every output one lower
return v1-v2;
}

}, Durations.seconds(20), Durations.seconds(10));

reduceByKeyAndWindow.print();

jsc.start();
jsc.awaitTermination();
jsc.close();
}
 }
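The arithmetic behind the incremental version can be checked with a plain-Java model. It is only a sketch and does not use Spark: `IncrementalWindowModel`, `recompute` and `slide` are hypothetical names, and the per-batch counts are made-up sample numbers for one key. Addition is the reduce function and subtraction is its inverse, as in the code above.

```java
/** Plain-Java model (no Spark): incremental window sums versus full recomputation. */
final class IncrementalWindowModel {

    /** Full recomputation: sums the counts of batches from (inclusive) to (exclusive). */
    static int recompute(int[] batchCounts, int from, int to) {
        int sum = 0;
        for (int i = from; i < to; i++) {
            sum += batchCounts[i];
        }
        return sum;
    }

    /** Incremental step: remove what left the window, then add what entered it. */
    static int slide(int previousWindow, int leaving, int entering) {
        int withoutOld = previousWindow - leaving; // inverse reduce: v1 - v2
        return withoutOld + entering;              // reduce: v1 + v2
    }

    public static void main(String[] args) {
        // Counts of one key in six consecutive 5 s batches (sample data).
        int[] counts = {3, 1, 4, 1, 5, 9};
        // 20 s window = 4 batches, 10 s slide = 2 batches.
        int first = recompute(counts, 0, 4);    // batches 0..3
        int leaving = recompute(counts, 0, 2);  // batches 0..1 slide out
        int entering = recompute(counts, 4, 6); // batches 4..5 slide in
        int second = slide(first, leaving, entering);
        System.out.println(first);                                   // 9
        System.out.println(second + " == " + recompute(counts, 2, 6)); // 19 == 19
    }
}
```

The incremental step touches only the two leaving and two entering batches, while the full recomputation reads all four batches of the window again.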

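Other Spark Streaming transformation operators

Driver HA (Standalone or Mesos)

A Spark Streaming application runs 24/7, and the Driver is just a single process that may crash, so Driver HA is necessary (Driver HA cannot be achieved in client mode; this section is about cluster mode). When a job is submitted in cluster mode on YARN, the AM (ApplicationMaster) acts as the Driver, and YARN restarts the AM automatically if it dies. The Driver HA discussed here therefore applies to Spark Standalone and Mesos resource scheduling. Making the Driver highly available takes two steps: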

First, at the job submission level: add the option when submitting the job.

Second, at the code level: use JavaStreamingContext.getOrCreate(checkpoint path, JavaStreamingContextFactory).

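The Driver metadata includes:

1. The configuration used to create the application.

2. The operation logic of the DStreams.

3. The batches whose jobs have not completed yet, i.e. the execution progress of the jobs.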
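The "restore if a checkpoint exists, otherwise create" pattern behind getOrCreate can be modeled in plain Java. This sketch is not Spark code and makes several simplifying assumptions: `GetOrCreateModel` is a hypothetical name, an in-memory map stands in for the checkpoint directory, a string stands in for the saved Driver metadata, and the metadata is saved at creation time, whereas a real job writes checkpoints while it runs.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Plain-Java model (no Spark) of the get-or-create pattern used for Driver recovery. */
final class GetOrCreateModel {

    /** Returns the metadata saved under checkpointPath, or builds and saves it on first start. */
    static String getOrCreate(Map<String, String> checkpointStore,
                              String checkpointPath,
                              Supplier<String> factory) {
        String restored = checkpointStore.get(checkpointPath);
        if (restored != null) {
            return restored; // restart: the factory is not invoked again
        }
        String created = factory.get(); // first start: build everything from scratch
        checkpointStore.put(checkpointPath, created);
        return created;
    }

    public static void main(String[] args) {
        Map<String, String> store = new HashMap<>();
        final int[] factoryCalls = {0};
        Supplier<String> factory = () -> {
            factoryCalls[0]++;
            return "config + DStream logic";
        };
        System.out.println(getOrCreate(store, "checkpoint", factory)); // config + DStream logic
        System.out.println(getOrCreate(store, "checkpoint", factory)); // config + DStream logic
        System.out.println("factory calls = " + factoryCalls[0]);      // factory calls = 1
    }
}
```

The second call models a restarted Driver: it gets the same metadata back without running the factory, which is why the stream definition has to live inside the factory rather than after the getOrCreate call.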
