Building a Flink SQL and Flink Table API Development Environment with IDEA, Maven, and VirtualBox

曾照彩云 2022-06-01 15:27:06

Copyright belongs to the author. This is an original work by 曾照彩云 on the 51CTO blog; contact the author for permission before reposting.

1. Background on the Flink Table API and Flink SQL

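In 2015 Alibaba began evaluating open-source stream processing engines and eventually decided to build its next-generation compute engine on Flink. It optimized and improved the areas where Flink fell short, and in early 2019 open-sourced the resulting code, which is the project we know as Blink. The most notable contribution Blink made on top of the original Flink is the implementation of Flink SQL.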

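Flink SQL is the user-facing API layer. Traditional stream processing systems such as Storm and Spark Streaming provide Function or DataStream style APIs, and users write their business logic in Java or Scala. This approach is flexible, but it has drawbacks: the barrier to entry is fairly high, tuning is difficult, and as versions evolve the APIs accumulate many incompatibilities.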

Flink itself is a unified framework for batch and stream processing, so the Table API and SQL are the unified high-level APIs for both.

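The Table API is a query API embedded in Java and Scala. It lets us compose queries from relational operators (such as select, filter, and join) in a very intuitive way. With Flink SQL, we write SQL directly in code to express queries. Flink's SQL support is based on Apache Calcite, an open-source Apache SQL parsing framework that implements the SQL standard.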

Whether the input is a batch or a stream, a query specified in either API has the same semantics and produces the same result.

2. Setting Up the Development Environment in Oracle VM VirtualBox

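Flink versions after 1.10 have compatibility problems on Windows. Even if you manage to get the services running, you run into various issues, such as zero available Task Slots or the TaskManager window crashing. It is best to use Linux or macOS directly. If you are used to developing on Windows, you can develop against a Linux system running in a virtual machine.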

Start Ubuntu in the virtual machine, install and configure the SSH service, and connect to the virtual machine from the host with FinalShell.

Pick a directory in the virtual machine to hold the downloaded Flink distribution. The version used here is flink-1.14.4, and the download address is:

https://www.apache.org/dyn/closer.lua/flink/flink-1.14.4/flink-1.14.4-bin-scala_2.12.tgz

After downloading, extract the archive:

tar -zxvf flink-1.14.4-bin-scala_2.12.tgz

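During development you will usually connect to middleware such as MySQL, Kafka, and Elasticsearch, which requires downloading the corresponding connector jars. They are linked from the connector pages of the official documentation, for example the Elasticsearch page (https://nightlies.apache.org/flink/flink-docs-release-1.14/zh/docs/connectors/table/elasticsearch/). The jars are typically placed in the lib directory of the Flink installation.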

Once these preparations are done, switch to Flink's bin directory and start the Flink services:

root@zwg:/home/zwg/flink-1.14.4/bin# ./start-cluster.sh 
Starting cluster.
Starting standalonesession daemon on host zwg.
Starting taskexecutor daemon on host zwg.

When you see the startup messages above, the services have started successfully. Next, start the Flink SQL client from the same directory:

  root@zwg:/home/zwg/flink-1.14.4/bin# ./sql-client.sh

At this point you can develop and debug Flink SQL.

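Of course, setting up this environment in practice is like the fate of a TV drama protagonist: it never goes smoothly and is full of strange twists. When I first installed and configured Flink, I had run start-cluster.sh several times, and after adding the dependency jars I ran stop-cluster.sh only once. Each start had launched another set of daemons, and a single stop presumably shut down only the most recently started ones, so older processes kept running without the new jars. The jars never took effect and my Flink SQL scripts kept failing. That feeling of helplessness is something I would rather not relive.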
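One way to notice this situation is to count how many Flink daemon JVMs are actually running before and after a restart. The following is a minimal sketch, not part of the original setup. It assumes Java 9 or later, and it assumes the standalone JobManager and TaskManager JVMs can be recognized by the main-class names StandaloneSessionClusterEntrypoint and TaskManagerRunner in their command lines. ProcessHandle may return an empty or truncated command line for some processes, so treat the counts as a hint rather than proof.

```java
import java.util.List;
import java.util.stream.Collectors;

public class FlinkDaemonCheckSketch {

    // Assumed markers: main-class names of the standalone session daemons.
    static final String JOB_MANAGER = "StandaloneSessionClusterEntrypoint";
    static final String TASK_MANAGER = "TaskManagerRunner";

    /** Counts the command lines that contain the given marker. */
    public static long countMatching(List<String> commandLines, String marker) {
        return commandLines.stream().filter(c -> c.contains(marker)).count();
    }

    /** Collects the command lines of all processes visible to this JVM. */
    static List<String> runningCommandLines() {
        return ProcessHandle.allProcesses()
                .map(p -> p.info().commandLine().orElse(""))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> commandLines = runningCommandLines();
        // More than one of either kind suggests start-cluster.sh was run repeatedly.
        System.out.println("JobManager JVMs:  " + countMatching(commandLines, JOB_MANAGER));
        System.out.println("TaskManager JVMs: " + countMatching(commandLines, TASK_MANAGER));
    }
}
```

If either count is greater than one, repeat stop-cluster.sh until both reach zero, then start the cluster once so that the newly added jars are picked up.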

3. Setting Up a Flink Table API Development Environment with IDEA and Maven

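There are two ways to create the project.

1. From the command line:

mvn archetype:generate -DarchetypeGroupId=org.apache.flink -DarchetypeArtifactId=flink-quickstart-java -DarchetypeVersion=1.9.2

Note that this command generates a project for Flink 1.9.2, while the cluster installed above is 1.14.4. To build against the same version as the cluster, set -DarchetypeVersion accordingly.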

2. Create a Maven project directly in IDEA, then follow the wizard step by step until it finishes.

A simple word-count example:

StreamingJob.java

/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.example;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.SlidingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

// WordWithCount is the POJO defined in WordWithCount.java in the same package.
public class StreamingJob {

  public static void main(String[] args) throws Exception {
    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Read newline-delimited text from a local socket (served by the nc command below).
    DataStreamSource<String> text = env.socketTextStream("127.0.0.1", 18081, "\n");

    DataStream<WordWithCount> windowCount = text
        // Split each line on whitespace and emit (word, 1) for every non-empty token.
        .flatMap(new FlatMapFunction<String, WordWithCount>() {
          @Override
          public void flatMap(String value, Collector<WordWithCount> out) throws Exception {
            for (String word : value.split("\\s+")) {
              if (!word.isEmpty()) {
                out.collect(new WordWithCount(word, 1L));
              }
            }
          }
        })
        // Group by word; a key selector replaces the deprecated keyBy("word").
        .keyBy(w -> w.word)
        // 5-second windows sliding every second. Naming the processing-time assigner
        // explicitly replaces the deprecated timeWindow(...), which would pick
        // event-time windows on Flink 1.12+ and never fire without watermarks.
        .window(SlidingProcessingTimeWindows.of(Time.seconds(5), Time.seconds(1)))
        .sum("count");

    windowCount.print().setParallelism(1);
    env.execute("Flink Streaming Java API Skeleton");
  }
}

WordWithCount.java

package org.example;

public class WordWithCount {
    public String word;
    public long count;
    public WordWithCount(){}
    public WordWithCount(String word, long count) {
        this.word = word;
        this.count = count;
    }

    @Override
    public String toString() {
        return "WordWithCount{" +
                "word='" + word + '\'' +
                ", count=" + count +
                '}';
    }
}
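The job above counts words over a 5-second window that slides every second, so each input word contributes to several overlapping windows. The following plain-Java sketch illustrates the arithmetic behind that assignment. It is an illustration under stated assumptions (non-negative millisecond timestamps, no window offset), not Flink's own code, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class SlidingWindowSketch {

    /**
     * Returns the start times (ms) of every sliding window [start, start + sizeMs)
     * that contains timestampMs, newest window first.
     */
    public static List<Long> windowStarts(long timestampMs, long sizeMs, long slideMs) {
        List<Long> starts = new ArrayList<>();
        // Latest window start at or before the timestamp, aligned to the slide.
        long lastStart = timestampMs - (timestampMs % slideMs);
        // Walk backwards one slide at a time while the window still covers the timestamp.
        for (long start = lastStart; start > timestampMs - sizeMs; start -= slideMs) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        // A word arriving at t = 7.3 s with size = 5 s and slide = 1 s.
        System.out.println(windowStarts(7_300L, 5_000L, 1_000L));
        // prints [7000, 6000, 5000, 4000, 3000]
    }
}
```

With a size of 5 seconds and a slide of 1 second, every word falls into five windows, which is why the same word shows up in several consecutive results after being typed once.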

Before running the job, listen on port 18081 on the local machine (the job connects to this port as a client):

nc -l -p 18081
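If nc is not available on the development machine, a small Java program can play the same role. The sketch below is an assumption-laden stand-in rather than part of the original setup: it accepts a single client on port 18081 and forwards each line typed on standard input to it. LineServerSketch and serveOnce are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.Reader;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class LineServerSketch {

    /** Accepts one client and forwards every line read from input to it. */
    public static void serveOnce(ServerSocket server, Reader input) throws IOException {
        try (Socket client = server.accept();
             PrintWriter out = new PrintWriter(
                     new OutputStreamWriter(client.getOutputStream(), StandardCharsets.UTF_8));
             BufferedReader in = new BufferedReader(input)) {
            String line;
            while ((line = in.readLine()) != null) {
                // Terminate with "\n", the delimiter the streaming job splits on.
                out.print(line);
                out.print('\n');
                out.flush();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Same port as the socketTextStream call in StreamingJob.
        try (ServerSocket server = new ServerSocket(18081)) {
            serveOnce(server, new InputStreamReader(System.in, StandardCharsets.UTF_8));
        }
    }
}
```

Start this program first, then run the streaming job, and type words into the program's console. Ending standard input closes the connection.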

Run the StreamingJob class and type some words into the nc session. The job responds by printing a WordWithCount line for each word, with the count taken over the most recent five seconds and refreshed every second.

4. Summary

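Running Flink directly in the virtual machine is fairly convenient. Running Flink in Docker containers inside the virtual machine is not recommended here, because it makes maintenance harder. Configuring the development environment is only the first step of a long march, and more problems await in later development. Before starting, make sure you thoroughly understand Flink's concepts: read the official documentation (https://nightlies.apache.org/flink/flink-docs-release-1.14/zh/docs/concepts/overview/) carefully several times and type out the official examples one by one. Doing so avoids most of the problems that come from making unfounded assumptions.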