
In a big-data environment, data arrives from all kinds of sources. This time it comes from Kochava, a global leader in mobile ad attribution, analytics and optimization that also operates the world's largest DMP; if you want to read more about the product, there is a Chinese-language profile at https://www.kchuhai.com/company/view-60.html . The task here is to pull data out of this product, write it to Kafka via Flink, and finally land it in Doris. Below is a simple data-collection demo.

I Using Python to request the data and have it mailed to an inbox

1 Calling the Kochava API

import json

import requests


class Email(object):

    headers = {
        # note: the authentication key differs per account
        'authentication-key': ##key,
        'content-type': 'application/json'
    }

    # trigger the export mail and return the request_id
    @classmethod
    def getRequestId(cls, appId, appName, sql):
        url = 'https://query.api.kochava.com/async'
        payload = {
            "use_standard_sql": False,
            "app_name"        : appName,
            "app_id"          : appId,
            "query"           : sql,
            "quota_ineligible": False,
            "delivery_method" : ##,
            "emails"          : ##email,
            "export"          : True,
            "large_results"   : True,
            "query_type"      : "custom"
        }

        r = requests.post(url, data=json.dumps(payload), headers=cls.headers)
        r.raise_for_status()  # fail fast on HTTP errors
        print("Request sent!")
        return r.json()["request_id"]
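
A quick sanity check of the call might look like this (the appId, appName and SQL values below are made-up placeholders; the response's "request_id" field is the one the code above reads):

# hypothetical invocation; substitute real credentials and a real query
rid = Email.getRequestId(appId="12345", appName="demo_app",
                         sql="select * from clicks limit 10")
print("request_id:", rid)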

The parameters for this call can be found in the Kochava query console (the screenshots of those steps are omitted here).

OK. Since the site already exposes this mail-export endpoint, step one (triggering the export email) is done; check the mailbox to confirm the mail arrived.

2 Logging into the mailbox

Here I first created a folder named kochava in my mailbox, so that every mail from Kochava is filed into it. The setup:

1. Log into the mailbox (I use NetEase enterprise mail) and create a new folder.

2. Set up an inbound-mail rule.

3. Fill in a few simple settings and that's it.

Then start coding:

    #  log into the mailbox
    @classmethod
    def connectEmailByImap(cls):
        hostname = ''   # your provider's IMAP server address
        port = 993      # the encrypted IMAP port
        username = ''
        passwd = ''

        client = IMAPClient(hostname, port, ssl=True)

        try:
            # log in with the personal account
            client.login(username, passwd)
            print("Login succeeded!")
            return client
        except client.Error:
            print('Login failed')
            sys.exit(1)

Obviously, hostname is the server address for the protocol in use and port is its port number. The password is the part that needs care: it can be your mailbox login password, or a client authorization password; how to obtain that authorization password is explained below. If any of these parameters is unclear, https://qiye.163.com/help/client-profile.html lists them for NetEase enterprise mail. The host depends on your own mailbox provider, and the encrypted port is generally the one to use; the protocol here is IMAP. This step can also go wrong: if SSL certificate verification fails, you can disable verification as follows:

    #  log into the mailbox
    @classmethod
    def connectEmailByImap(cls):
        hostname = 'imaphz.qiye.163.com'
        port = 993
        username = '*****'
        passwd = '****'

        # on a Mac, certificate verification can fail; turn it off
        # (this requires "import ssl" at the top of the file)
        ssl_context = ssl.create_default_context()
        ssl_context.check_hostname = False
        ssl_context.verify_mode = ssl.CERT_NONE

        client = IMAPClient(hostname, port, ssl=True, ssl_context=ssl_context)

        try:
            # log in with the personal account
            client.login(username, passwd)
            print("Login succeeded!")
            return client
        except client.Error:
            print('Login failed')
            sys.exit(1)

Next up: getting that password.

Go into the mailbox settings and find the client authorization entry (screenshot omitted).

Enable the client authorization password; you can have one generated randomly or define your own. Personally, if confidentiality isn't a big concern, simply using your mailbox login password works too.

3 Getting the URL

After logging in, the next step is to read the mails in our custom folder. There are plenty of mails in there, so how do we find the one we want? 1. Match on the request_id. 2. Only look at today's mails. Code first:

    # get the download URL
    @classmethod
    def getUrl(cls, client, pre_request_id):
        # 1 enter the kochava folder
        client.select_folder('kochava', readonly=True)
        # 2 only fetch today's mails
        # 2.1 the current time, as a struct_time
        localtime = time.localtime(time.time())
        messages = client.search([u'SINCE', datetime(localtime.tm_year, localtime.tm_mon, localtime.tm_mday)])
        # 3 fetch every mail in the kochava folder together with its content
        for uid, message_data in client.fetch(messages, "RFC822").items():
            email_message = email.message_from_bytes(message_data[b"RFC822"])
            # read the body
            print("======================= reading the body =======================")
            for part in email_message.walk():
                if part.get_content_type() == 'text/plain':
                    body = part.get_payload()
                    # split on ":" and strip the stray characters to get request_id, status, url
                    request_id = body.split("Request ID:")[1].replace('\n', '').replace('\r', '').replace('=', '').split(" ")[4]
                    print("parsed requestId: ", request_id)
                    status = body.split("Status:")[1].replace('\n', '').replace('\r', '').replace('=', '').split(" ")[3]
                    print("parsed status: ", status)
                    url = body.split("Link:")[1].replace("=\r", '').replace("\n", '').replace("=3D", '=').split(" ")[4]
                    print("parsed url: ", url)
                    # match: is it our request_id, and did the export complete?
                    if (request_id == pre_request_id) and (status == "Completed") and url is not None:
                        # return the url
                        print("got the url =======>", url)
                        return url
                    else:
                        print("Not the one we asked for, moving on")
                else:
                    # not a text part; skip it
                    email_message.get_payload(decode=True)

In the mail body you can clearly see a requestId. Is this ID there by accident?

Of course not. Tracing back through the earlier steps: the very first request to the endpoint returned a response.

That response also records this requestId, so matching it against the ID in the mail tells us which mail answers our request.

OK, the logic is settled. Next, a few notes on the functions involved.

First, let's see what the fetched mail message looks like (screenshot omitted):

There are four parts in total. The first is the header; its fields are easy to get, and either of the two accessors below works:

print("Subject1:",email_message["Subject"])
print("Subject2:",email_message.get("Subject"))

The deeper parts, however, can't be reached that way. The walk() method helps here: it iterates over the message and all of its subparts, yielding one part per iteration, and each part has a content type that can be inspected like this:

print(part.get_content_type())

Clearly the part I want is the text/plain one in the middle, so filtering on that content type is enough. Admittedly this is a bit of a shortcut: if a message ever contained two text/plain parts, the logic would need a rewrite. For more options, have a look at https://docs.python.org/3/library/email.message.html , which surely offers better approaches.
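
As a side note, if the split()/replace() chains in getUrl ever break, a slightly sturdier sketch is possible. The helper below is hypothetical (not part of the original code) and assumes the body keeps the "Request ID:", "Status:" and "Link:" labels shown above; get_payload(decode=True) undoes the quoted-printable artifacts ("=3D", "=\r\n" soft breaks), so the manual stripping becomes unnecessary:

import re

def parse_body(part):
    # decode the transfer encoding instead of stripping "=3D" by hand;
    # fall back to utf-8 when no charset is declared
    charset = part.get_content_charset() or "utf-8"
    body = part.get_payload(decode=True).decode(charset, errors="replace")
    fields = {}
    for label in ("Request ID", "Status", "Link"):
        m = re.search(label + r":\s*(\S+)", body)
        fields[label] = m.group(1) if m else None
    return fields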

At this point we have the URL we wanted; next, use it to download the zip archive and unpack it.

4 Downloading the file

    # download the file
    @classmethod
    def download_file_byurl(cls, url, new_name):
        # derive the old_name from the URL
        old_name = url.split("/")[3].split(".zip")[0]
        filename = wget.detect_filename(url)
        old_path = cls.file_path + "\\" + filename
        new_name = cls.file_path + "\\" + new_name
        print(filename)
        wget.download(url, cls.file_path)
        cls.unzip(filename, old_name, old_path, new_name)

5 Unpacking the file and renaming it

    @classmethod
    def unzip(cls, filename, old_name, old_path, new_name):
        # check whether the download has finished; if not, wait 10s and retry
        if os.path.exists(cls.file_path + "\\" + filename):
            print(cls.file_path + "\\" + filename)
            old_name = cls.file_path + "\\" + old_name

            zip_file = zipfile.ZipFile(old_path)
            # extract every file inside the archive
            for names in zip_file.namelist():
                zip_file.extract(names, cls.file_path)
            zip_file.close()
            os.rename(old_name, new_name)
            # delete the zip archive
            os.remove(old_path)

        else:
            print("Waiting for the download to finish (10s)")
            wait()  # plain sleep; Thread(target=wait()) would run wait() inline and start a no-op thread
            cls.unzip(filename, old_name, old_path, new_name)
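
One more caveat: the retry above recurses forever if the download never completes. A bounded polling helper is safer; the sketch below (the name wait_for_file, the retry count and the interval are all arbitrary choices, not part of the original code) shows the idea:

import os
import time

def wait_for_file(path, retries=30, interval=10):
    # poll until the file shows up, or give up after retries * interval seconds
    for _ in range(retries):
        if os.path.exists(path):
            return True
        print(f"File not ready yet, sleeping {interval}s ...")
        time.sleep(interval)
    return False

unzip() could then simply bail out when wait_for_file(old_path) returns False instead of recursing.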

6 The end, and the beginning

By now the file downloads, unpacks, and renames successfully. Here is the complete code. The entry point, Realize.py:

import datetime
import time

from SendEmail import Email
from DownLoadFile import DownLoadFile


# wait between retries
def wait():
    time.sleep(30)


def get_url(client, request_id, new_name):
    url = Email.getUrl(client, request_id)
    if url is not None:
        print("url ===================>", url)
        DownLoadFile.download_file_byurl(url, new_name=new_name)
    else:
        print("No matching url yet; the mail may still be on its way. Waiting 30s ...")
        wait()  # plain sleep instead of Thread(target=wait()), which would run wait() inline
        get_url(client, request_id, new_name)


if __name__ == '__main__':
    # sql, appId and appName
    today = datetime.date.today()
    oneday = datetime.timedelta(days=1)
    yesterday = today - oneday
    sql = "select * from  XXXX where date(date_received) = '" + str(yesterday) + "';"
    appId = ####
    appName = #####

    # trigger the export mail and get the request_id
    request_id = Email.getRequestId(appId=appId, appName=appName, sql=sql)
    # log into the mailbox to read the mails and get the url
    client = Email.connectEmailByImap()

    # download the file via the url and unpack it
    day = yesterday.strftime("%Y%m%d")
    new_name = "XXXX_" + str(day) + ".csv"

    get_url(client, request_id, new_name)

SendEmail.py

import json
import sys
import time
import requests
import email
from imapclient import IMAPClient
from datetime import datetime

class Email(object):

    headers = {
        # note: the authentication key differs per account
        'authentication-key': ###,
        'content-type': 'application/json'
    }


    # trigger the export mail and return the request_id
    @classmethod
    def getRequestId(cls, appId, appName, sql):
        url = 'https://query.api.kochava.com/async'
        payload = {
            "use_standard_sql": False,
            "app_name"        : appName,
            "app_id"          : appId,
            "query"           : sql,
            "quota_ineligible": False,
            "delivery_method" : ####,
            "emails"          : ####,
            "export"          : True,
            "large_results"   : True,
            "query_type"      : "custom"
        }

        r = requests.post(url, data=json.dumps(payload), headers=cls.headers)
        r.raise_for_status()  # fail fast on HTTP errors
        print("Request sent!")
        return r.json()["request_id"]

    #  log into the mailbox
    @classmethod
    def connectEmailByImap(cls):
        hostname = 'imaphz.qiye.163.com'
        port = 993
        username = ####
        passwd = ####

        client = IMAPClient(hostname, port, ssl=True)

        try:
            # log in with the personal account
            client.login(username, passwd)
            print("Login succeeded!")
            return client
        except client.Error:
            print('Login failed')
            sys.exit(1)



    # get the download URL
    @classmethod
    def getUrl(cls, client, pre_request_id):
        # 1 enter the kochava folder
        client.select_folder('kochava', readonly=True)
        # 2 only fetch today's mails
        # 2.1 the current time, as a struct_time
        localtime = time.localtime(time.time())
        messages = client.search([u'SINCE', datetime(localtime.tm_year, localtime.tm_mon, localtime.tm_mday)])
        # 3 fetch every mail in the kochava folder together with its content
        for uid, message_data in client.fetch(messages, "RFC822").items():
            email_message = email.message_from_bytes(message_data[b"RFC822"])
            # read the body
            print("======================= reading the body =======================")
            for part in email_message.walk():
                if part.get_content_type() == 'text/plain':
                    body = part.get_payload()
                    # split on ":" and strip the stray characters to get request_id, status, url
                    request_id = body.split("Request ID:")[1].replace('\n', '').replace('\r', '').replace('=', '').split(" ")[4]
                    print("parsed requestId: ", request_id)
                    status = body.split("Status:")[1].replace('\n', '').replace('\r', '').replace('=', '').split(" ")[3]
                    print("parsed status: ", status)
                    url = body.split("Link:")[1].replace("=\r", '').replace("\n", '').replace("=3D", '=').split(" ")[4]
                    print("parsed url: ", url)
                    # match: is it our request_id, and did the export complete?
                    if (request_id == pre_request_id) and (status == "Completed") and url is not None:
                        # return the url
                        print("got the url =======>", url)
                        return url
                    else:
                        print("Not the one we asked for, moving on")
                else:
                    # not a text part; skip it
                    email_message.get_payload(decode=True)

DownLoadFile.py

import os
import time
import zipfile
import wget

# coding by BradySuen
# time:2021/08/17


# wait 10s between retries
def wait():
    time.sleep(10)



class DownLoadFile(object):

    file_path = "E:\kochava数据1"



    # unpack the file
    @classmethod
    def unzip(cls, filename, old_name, old_path, new_name):
        # check whether the download has finished; if not, wait 10s and retry
        if os.path.exists(cls.file_path + "\\" + filename):
            print(cls.file_path + "\\" + filename)
            old_name = cls.file_path + "\\" + old_name

            zip_file = zipfile.ZipFile(old_path)
            # extract every file inside the archive
            for names in zip_file.namelist():
                zip_file.extract(names, cls.file_path)
            zip_file.close()
            os.rename(old_name, new_name)
            # delete the zip archive
            os.remove(old_path)

        else:
            print("Waiting for the download to finish (10s)")
            wait()  # plain sleep; Thread(target=wait()) would run wait() inline
            cls.unzip(filename, old_name, old_path, new_name)



    # download the file
    @classmethod
    def download_file_byurl(cls, url, new_name):
        # derive the old_name from the URL
        old_name = url.split("/")[3].split(".zip")[0]
        filename = wget.detect_filename(url)
        old_path = cls.file_path + "\\" + filename
        new_name = cls.file_path + "\\" + new_name
        print(filename)
        wget.download(url, cls.file_path)
        cls.unzip(filename, old_name, old_path, new_name)

Done? Hardly: this is only the beginning!

II Flink: streaming the CSV file into Kafka

Here we switch to Java. First the pom dependencies:

<dependencies>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-clients_2.11</artifactId>
            <version>1.13.0</version>
        </dependency>
        <!-- flink kafka -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-kafka_2.11</artifactId>
            <version>1.13.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-base</artifactId>
            <version>1.13.0</version>
        </dependency>
        <!-- json -->
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.75</version>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.49</version>
        </dependency>
        <dependency>
            <groupId>cn.hutool</groupId>
            <artifactId>hutool-all</artifactId>
            <version>5.7.7</version>
        </dependency>
        <dependency>
            <groupId>net.sf.opencsv</groupId>
            <artifactId>opencsv</artifactId>
            <version>2.3</version>
        </dependency>
        <dependency>
            <groupId>org.json</groupId>
            <artifactId>json</artifactId>
            <version>20090211</version>
        </dependency>
    </dependencies>

The job class:

package com.nana.app.csv2kafka;

import cn.hutool.core.io.resource.ResourceUtil;
import cn.hutool.core.text.csv.CsvReader;
import cn.hutool.core.text.csv.CsvUtil;
import com.alibaba.fastjson.JSONObject;
import com.nana.app.dto.ClickIdentifyDto;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.List;
import java.util.Properties;

/**
 * @author BradySuen
 * @create_time 2021/8/9
 * @description
 **/
public class StreamClickIdentifyCSV2Kafka {
    public static void main(String[] args) throws Exception {

        //create the streaming environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        //path to the downloaded CSV file
        String path  = "<path to the CSV>";

        //read the CSV and emit each row as JSON
        DataStreamSource<String> resultDataStreamSource = env.addSource(new SourceFunction<String>() {
            @Override
            public void run(SourceContext<String> sourceContext) throws Exception {
                //read the CSV file and map each row onto a DTO
                final CsvReader reader = CsvUtil.getReader();
                List<ClickIdentifyDto> list = reader.read(ResourceUtil.getUtf8Reader(path), ClickIdentifyDto.class);

                //walk the records, converting each one to JSON
                for (ClickIdentifyDto clickIdentifyDto : list) {
                    SimpleDateFormat sdf  = new SimpleDateFormat("yyyy-MM-dd");
                    //date_received starts with an epoch-seconds timestamp; format it as yyyy-MM-dd
                    clickIdentifyDto.partition_date = sdf.format(
                            new Date(Long.parseLong(clickIdentifyDto.date_received.substring(0,10)) * 1000L));
                    //serialize the object to a JSON string
                    String result = JSONObject.toJSONString(clickIdentifyDto);

                    sourceContext.collect(result);
                }
            }

            @Override
            public void cancel() {

            }
        });

        resultDataStreamSource.print();


        //sink the stream to Kafka
        Properties prop = new Properties();
        prop.setProperty("bootstrap.servers","###");
        resultDataStreamSource.addSink(new
                FlinkKafkaProducer<String>("topic", new SimpleStringSchema(),prop));

        env.execute();
    }
}
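
One caveat on this demo: the SourceFunction loads the whole CSV into memory inside run() and emits it once, so it is "streaming" in API only. For larger files, Flink's built-in env.readTextFile(path), or the newer FileSource connector, reads the file line by line and would be the more idiomatic choice, with the per-line CSV-to-JSON conversion moved into a map function.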

ClickIdentifyDto:

package com.nana.app.dto;

/**
 * @author BradySuen
 * @create_time 2021/8/11
 * @description
 **/
public class ClickIdentifyDto {
    public String click_id;
    public String date_received;
    public String identifier_type;
    public String identifier_value;
    public String partition_date;
}

III Creating a Doris Routine Load to consume Kafka in real time
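
The class below connects to the Doris FE over the MySQL protocol, uses reflection on ClickIdentifyDto to build the column list, and submits a CREATE ROUTINE LOAD job:
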
package com.nana.app.kafka2doris;

import com.nana.app.dto.ClickIdentifyDto;

import java.lang.reflect.Field;
import java.sql.*;

/**
 * @author BradySuen
 * @create_time 2021/8/11
 * @description
 **/
public class ClickIdentify2Doris {
    public static void main(String[] args) throws ClassNotFoundException, SQLException {
        String URL="jdbc:mysql://<IP>:<port, default 9030>/<database>?useUnicode=true&characterEncoding=utf-8";
        String USER="##";
        String PASSWORD="##";
        //1. load the JDBC driver
        Class.forName("com.mysql.jdbc.Driver");
        //2. open the database connection
        Connection conn= DriverManager.getConnection(URL, USER, PASSWORD);
        //3. operate on the database through the connection (using Statement)
        Statement st=conn.createStatement();


        //use reflection to list the DTO's fields (i.e. the columns)
        Field[] fields = ClickIdentifyDto.class.getDeclaredFields();
        int len = fields.length;

        //join the column names
        StringBuilder column = new StringBuilder();
        for (Field field : fields) {
            if(len == fields.length){
                column.append("`").append(field.getName()).append("`");
            }else{
                column.append(",").append("`").append(field.getName()).append("`");
            }
            len --;
        }

        //build the CREATE ROUTINE LOAD statement
        String jobName = "kochava_db.query_click_identify";
        String tableName = "kochava_click_identifiers";
        StringBuffer task = new StringBuffer();
        task.append("CREATE ROUTINE LOAD ").append(jobName).append(" ON ").append(tableName).append("\n")
                .append("COLUMNS(").append(column).append(")").append("\n")
                .append("PROPERTIES (").append("\n")
                .append("\t").append("\"desired_concurrent_number\"=\"1\"").append(",").append("\n")
                .append("\t").append("\"max_batch_interval\"=\"10\"").append(",").append("\n")
                .append("\t").append("\"max_batch_rows\"=\"300000\"").append(",").append("\n")
                .append("\t").append("\"max_batch_size\"=\"209715200\"").append(",").append("\n")
                .append("\t").append("\"strict_mode\"=\"false\"").append(",").append("\n")
                .append("\t").append("\"format\"=\"json\"").append("\n")
                .append(")").append("\n")
                .append("From KAFKA (").append("\n")
                .append("\t").append("\"kafka_broker_list\"=\"XXXX:9092\"").append(",").append("\n")
                .append("\t").append("\"kafka_topic\"=\"topic\"").append(",").append("\n")
                .append("\t").append("\"property.kafka_default_offsets\"=\"OFFSET_END\"").append(",").append("\n")
                .append("\t").append("\"property.group.id\"=\"gid_click_identify\"").append("\n")
                .append(");");

        st.execute(task.toString());

        //release resources
        st.close();
        conn.close();
    }
}
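
For reference, the statement the StringBuffer assembles (with the placeholder broker list and topic from the code) comes out roughly as:

CREATE ROUTINE LOAD kochava_db.query_click_identify ON kochava_click_identifiers
COLUMNS(`click_id`,`date_received`,`identifier_type`,`identifier_value`,`partition_date`)
PROPERTIES (
    "desired_concurrent_number"="1",
    "max_batch_interval"="10",
    "max_batch_rows"="300000",
    "max_batch_size"="209715200",
    "strict_mode"="false",
    "format"="json"
)
FROM KAFKA (
    "kafka_broker_list"="XXXX:9092",
    "kafka_topic"="topic",
    "property.kafka_default_offsets"="OFFSET_END",
    "property.group.id"="gid_click_identify"
);

Once the job exists, it can be inspected with SHOW ROUTINE LOAD, and paused or resumed with PAUSE ROUTINE LOAD FOR / RESUME ROUTINE LOAD FOR followed by the job name.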