In a big-data environment, data sources vary widely in quality. This time the data comes from Kochava, a global leader in mobile ad attribution, analytics, and optimization, and one of the world's largest DMPs; if you are interested, you can read more about the product at https://www.kchuhai.com/company/view-60.html . The task here is to pull data from this product, write it to Kafka through Flink, and finally load it into Doris. Below is a simple data-collection demo.
class Email(object):
    headers = {
        # Note: the key may differ between machines
        'authentication-key': ##key,
        'content-type': 'application/json'
    }

    # Send the query request; the result arrives by email. Returns the requestId.
    @classmethod
    def getRequestId(cls, appId, appName, sql):
        url = 'https://query.api.kochava.com/async'
        payload = {
            "use_standard_sql": False,
            "app_name": appName,
            "app_id": appId,
            "query": sql,
            "quota_ineligible": False,
            "delivery_method": ##,
            "emails": ##email,
            "export": True,
            "large_results": True,
            "query_type": "custom"
        }
        r = requests.post(url, data=json.dumps(payload), headers=cls.headers)
        print("Request sent!")
        return r.json()["request_id"]
The parameters for this request can be obtained through the steps below.
OK. Since the site already provides this send-email endpoint, the first step of triggering the email is done; check your inbox to confirm it arrived.
Here, I first created a folder named Kochava in my mailbox, so that every email from Kochava is saved into that folder. The setup is as follows:
1. Log in to the mailbox (I use NetEase enterprise mail) and create a new folder.
2. Set up an incoming-mail filter.
3. A few simple settings and you are done.
Then start coding:
    # Log in to the mailbox
    @classmethod
    def connectEmailByImap(cls):
        hostname = ''
        port =
        username = ''
        client = IMAPClient(hostname, port, ssl=True)
        try:
            # Log in to the personal account
            client.login(username, passwd)
            print("Login succeeded!")
            return client
        except IMAPClient.Error:
            print('Login failed')
            sys.exit(1)
Obviously, hostname is the server address for the protocol and port is the port number. The username is your own email address, and the port is usually the encrypted one. The password is the part that needs some care: it can be your mailbox login password, or it can be a client authorization password; how to obtain such an authorization password is explained later. If you are unsure about any of these parameters, you can look them up at https://qiye.163.com/help/client-profile.html . I am using the IMAP protocol here. This step can also run into problems, so if SSL verification gets in the way, you can disable it as follows:
    # Log in to the mailbox
    @classmethod
    def connectEmailByImap(cls):
        hostname = 'imaphz.qiye.163.com'
        port = 993
        username = '*****'
        passwd = '****'
        # On macOS, SSL verification may fail; disable certificate checks here
        ssl_context = ssl.create_default_context()
        ssl_context.check_hostname = False
        ssl_context.verify_mode = ssl.CERT_NONE
        client = IMAPClient(hostname, port, ssl=True, ssl_context=ssl_context)
        try:
            # Log in to the personal account
            client.login(username, passwd)
            print("Login succeeded!")
            return client
        except IMAPClient.Error:
            print('Login failed')
            sys.exit(1)
Next is getting the password:
Go into the mailbox settings and find the item shown below.
Enable the client authorization password; you can have it randomly generated or define your own. Personally, if confidentiality is not critical, using your mailbox login password is fine.
After logging in successfully, the next step is to read the emails in the custom folder. But there are many emails, so how do we find the one we want? 1. Match on the request_id. 2. Filter to today's emails. Code first:
    # Get the download URL
    @classmethod
    def getUrl(cls, client, pre_request_id):
        # 1. Enter the kochava folder
        client.select_folder('kochava', readonly=True)
        # 2. Fetch today's emails
        # 2.1 Get the current time (a struct_time)
        localtime = time.localtime(time.time())
        messages = client.search([u'SINCE', datetime(localtime.tm_year, localtime.tm_mon, localtime.tm_mday)])
        # 3. Walk through every email in the kochava folder and read its content
        for uid, message_data in client.fetch(messages, "RFC822").items():
            email_message = email.message_from_bytes(message_data[b"RFC822"])
            # Read the body
            print("======================= Reading the body ===================")
            for part in email_message.walk():
                if part.get_content_type() == 'text/plain':
                    body = part.get_payload()
                    # Split on ":" and strip the stray characters to get request_id, status, url
                    request_id = body.split("Request ID:")[1].replace('\n', '').replace('\r', '').replace('=', '').split(" ")[4]
                    print("Extracted requestId: ", request_id)
                    status = body.split("Status:")[1].replace('\n', '').replace('\r', '').replace('=', '').split(" ")[3]
                    print("Extracted status: ", status)
                    url = body.split("Link:")[1].replace("=\r", '').replace("\n", '').replace("=3D", '=').split(" ")[4]
                    print("Extracted url: ", url)
                    # Match: the right request_id and a completed status
                    if (request_id == pre_request_id) and (status == "Completed") and url is not None:
                        # Return the url
                        print("Got the url =======>", url)
                        return url
                    else:
                        print("Not the one we want, moving on")
                else:
                    email_message.get_payload(decode=True)
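The chained split/replace calls above are brittle against quoted-printable soft line breaks; a regex-based sketch of the same extraction (the field labels Request ID / Status / Link come from the email format above, but the sample body here is invented for illustration):

```python
import quopri
import re

def parse_kochava_body(raw_body: str):
    """Decode quoted-printable artifacts, then pull out the three fields."""
    # Undo quoted-printable encoding ("=3D" -> "=", "=\r\n" soft breaks removed)
    body = quopri.decodestring(raw_body.encode()).decode()
    request_id = re.search(r"Request ID:\s*(\S+)", body).group(1)
    status = re.search(r"Status:\s*(\S+)", body).group(1)
    url = re.search(r"Link:\s*(\S+)", body).group(1)
    return request_id, status, url

# Hypothetical body in the shape the emails above appear to use
sample = "Request ID: abc123\r\nStatus: Completed\r\nLink: https://example.com/file.zip\r\n"
print(parse_kochava_body(sample))
```

Anchoring each field to its label this way survives reordering of the fields and changes in surrounding whitespace, which the positional `split(" ")[4]` indexing does not.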
In the email you can clearly see a requestId. Is this ID there by accident?
Of course not. Going back over the earlier steps, the first request to the site returns a result,
and that result also records this requestId. So to decide which email answers my request, this requestId is what finds it.
OK, the logic is settled. Next, a few notes on the functions used:
First, let's look at what a fetched email message looks like:
There are four parts in total. The first part is the header, whose data is easy to get; either of these two ways works:
print("Subject1:",email_message["Subject"])
print("Subject2:",email_message.get("Subject"))
But the data further down cannot be reached that way. There is a method, walk(), which iterates into the deeper layers of the message: each iteration yields a sub-part, and every part has a content type that we can inspect as follows:
print(part.get_content_type())
Obviously, what I want is the middle type, so filtering for text/plain is enough. Admittedly this is a bit of a shortcut: if there were two parts of that type, the logic would need rewriting. For more methods, see https://docs.python.org/3/library/email.message.html ; there is surely a better way.
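As a minimal illustration of how walk() traverses a multipart message (the message contents here are invented):

```python
from email.message import EmailMessage

# Build a small multipart message: a plain-text body plus a zip attachment,
# roughly the shape of the Kochava notification emails
msg = EmailMessage()
msg["Subject"] = "demo"
msg.set_content("hello body")
msg.add_attachment(b"zipbytes", maintype="application", subtype="zip", filename="data.zip")

# walk() yields the multipart container first, then each sub-part in order
types = [part.get_content_type() for part in msg.walk()]
print(types)  # ['multipart/mixed', 'text/plain', 'application/zip']
```

Filtering the walk for `text/plain`, as the code above does, picks out just the readable body part.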
At this point we have the URL we want. Next is downloading the zip archive from that URL and unpacking it.
    # Download the file
    @classmethod
    def download_file_byurl(cls, url, new_name):
        # Derive the old_name from the url
        old_name = url.split("/")[3].split(".zip")[0]
        filename = wget.detect_filename(url)
        old_path = cls.file_path + "\\" + filename
        new_name = cls.file_path + "\\" + new_name
        print(filename)
        wget.download(url, cls.file_path)
        cls.unzip(filename, old_name, old_path, new_name)

    @classmethod
    def unzip(cls, filename, old_name, old_path, new_name):
        # Check whether the download has finished; if not, wait 10s and retry
        if os.path.exists(cls.file_path + "\\" + filename):
            print(cls.file_path + "\\" + filename)
            old_name = cls.file_path + "\\" + old_name
            zip_file = zipfile.ZipFile(old_path)
            # Extract every file in the archive
            for names in zip_file.namelist():
                zip_file.extract(names, cls.file_path)
            zip_file.close()
            os.rename(old_name, new_name)
            # Remove the archive
            os.remove(old_path)
        else:
            print("Waiting 10s for the download to finish")
            wait()
            cls.unzip(filename, old_name, old_path, new_name)
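The recursive "check, sleep, retry" pattern can also be written as a simple polling loop with a timeout, which avoids unbounded recursion if the download never finishes. A sketch (the 10s interval matches the text above; the 120s cap is my own assumption):

```python
import os
import time

def wait_for_file(path: str, interval: float = 10, timeout: float = 120) -> bool:
    """Poll until `path` exists, or give up after roughly `timeout` seconds."""
    waited = 0.0
    while not os.path.exists(path):
        if waited >= timeout:
            return False  # gave up
        time.sleep(interval)
        waited += interval
    return True

# Example: a path that already exists returns True immediately
print(wait_for_file(".", interval=0.1, timeout=0.5))
```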
At this point the file can be downloaded locally, unpacked, and renamed. Here is the full code. Implementation: Realize.py
import datetime
import time
from SendEmail import Email
from DownLoadFile import DownLoadFile

# Wait before retrying
def wait():
    time.sleep(30)

def get_url(client, request_id, new_name):
    url = Email.getUrl(client, request_id)
    if url is not None:
        print("url===================>", url)
        DownLoadFile.download_file_byurl(url, new_name=new_name)
    else:
        print("No matching url yet; the email may still be on its way. Waiting 30s...")
        wait()
        get_url(client, request_id, new_name)

if __name__ == '__main__':
    # sql, appId and appName
    today = datetime.date.today()
    oneday = datetime.timedelta(days=1)
    yesterday = today - oneday
    sql = "select * from XXXX where date(date_received) = '" + str(yesterday) + "';"
    appId = ####
    appName = #####
    # Send the request and get the request_id
    request_id = Email.getRequestId(appId=appId, appName=appName, sql=sql)
    # Log in to the mailbox and read the email to get the url
    client = Email.connectEmailByImap()
    # Download the file through the url and unpack it
    day = yesterday.strftime("%Y%m%d")
    new_name = "XXXX_" + str(day) + ".csv"
    get_url(client, request_id, new_name)
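The recursive get_url retry above will eventually hit Python's recursion limit if the email never arrives. A bounded-loop sketch of the same idea (fetch_url stands in for Email.getUrl and is hypothetical here):

```python
import time

def poll_for_url(fetch_url, interval: float = 30, max_attempts: int = 20):
    """Call fetch_url() up to max_attempts times, sleeping between tries."""
    for attempt in range(1, max_attempts + 1):
        url = fetch_url()
        if url is not None:
            return url
        print(f"attempt {attempt}: no url yet, waiting {interval}s")
        time.sleep(interval)
    raise TimeoutError("email with the download link never arrived")

# Example: a fake fetcher that succeeds on the third call
calls = iter([None, None, "https://example.com/file.zip"])
url = poll_for_url(lambda: next(calls), interval=0.01)
print(url)
```

With a cap on attempts, a permanently missing email raises a clear error instead of looping forever.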
SendEmail.py
import json
import sys
import time
import requests
import email
from imapclient import IMAPClient
from datetime import datetime

class Email(object):
    headers = {
        # Note: the key may differ between machines
        'authentication-key': ###,
        'content-type': 'application/json'
    }

    # Send the query request; the result arrives by email. Returns the requestId.
    @classmethod
    def getRequestId(cls, appId, appName, sql):
        url = 'https://query.api.kochava.com/async'
        payload = {
            "use_standard_sql": False,
            "app_name": appName,
            "app_id": appId,
            "query": sql,
            "quota_ineligible": False,
            "delivery_method": ####,
            "emails": ####,
            "export": True,
            "large_results": True,
            "query_type": "custom"
        }
        r = requests.post(url, data=json.dumps(payload), headers=cls.headers)
        print("Request sent!")
        return r.json()["request_id"]

    # Log in to the mailbox
    @classmethod
    def connectEmailByImap(cls):
        hostname = 'imaphz.qiye.163.com'
        port = 993
        username = ####
        passwd = ####
        client = IMAPClient(hostname, port, ssl=True)
        try:
            # Log in to the personal account
            client.login(username, passwd)
            print("Login succeeded!")
            return client
        except IMAPClient.Error:
            print('Login failed')
            sys.exit(1)

    # Get the download URL
    @classmethod
    def getUrl(cls, client, pre_request_id):
        # 1. Enter the kochava folder
        client.select_folder('kochava', readonly=True)
        # 2. Fetch today's emails
        # 2.1 Get the current time (a struct_time)
        localtime = time.localtime(time.time())
        messages = client.search([u'SINCE', datetime(localtime.tm_year, localtime.tm_mon, localtime.tm_mday)])
        # 3. Walk through every email in the kochava folder and read its content
        for uid, message_data in client.fetch(messages, "RFC822").items():
            email_message = email.message_from_bytes(message_data[b"RFC822"])
            # Read the body
            print("======================= Reading the body ===================")
            for part in email_message.walk():
                if part.get_content_type() == 'text/plain':
                    body = part.get_payload()
                    # Split on ":" and strip the stray characters to get request_id, status, url
                    request_id = body.split("Request ID:")[1].replace('\n', '').replace('\r', '').replace('=', '').split(" ")[4]
                    print("Extracted requestId: ", request_id)
                    status = body.split("Status:")[1].replace('\n', '').replace('\r', '').replace('=', '').split(" ")[3]
                    print("Extracted status: ", status)
                    url = body.split("Link:")[1].replace("=\r", '').replace("\n", '').replace("=3D", '=').split(" ")[4]
                    print("Extracted url: ", url)
                    # Match: the right request_id and a completed status
                    if (request_id == pre_request_id) and (status == "Completed") and url is not None:
                        # Return the url
                        print("Got the url =======>", url)
                        return url
                    else:
                        print("Not the one we want, moving on")
                else:
                    email_message.get_payload(decode=True)
DownLoadFile.py
import os
import time
import zipfile
import wget

# coding by BradySuen
# time: 2021/08/17

# Wait before retrying
def wait():
    time.sleep(10)

class DownLoadFile(object):
    file_path = r"E:\kochava数据1"

    # Unpack the archive
    @classmethod
    def unzip(cls, filename, old_name, old_path, new_name):
        # Check whether the download has finished; if not, wait 10s and retry
        if os.path.exists(cls.file_path + "\\" + filename):
            print(cls.file_path + "\\" + filename)
            old_name = cls.file_path + "\\" + old_name
            zip_file = zipfile.ZipFile(old_path)
            # Extract every file in the archive
            for names in zip_file.namelist():
                zip_file.extract(names, cls.file_path)
            zip_file.close()
            os.rename(old_name, new_name)
            # Remove the archive
            os.remove(old_path)
        else:
            print("Waiting 10s for the download to finish")
            wait()
            cls.unzip(filename, old_name, old_path, new_name)

    # Download the file
    @classmethod
    def download_file_byurl(cls, url, new_name):
        # Derive the old_name from the url
        old_name = url.split("/")[3].split(".zip")[0]
        filename = wget.detect_filename(url)
        old_path = cls.file_path + "\\" + filename
        new_name = cls.file_path + "\\" + new_name
        print(filename)
        wget.download(url, cls.file_path)
        cls.unzip(filename, old_name, old_path, new_name)
Done? This is only the beginning!
Now we switch to Java. First the pom file:
<dependencies>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-clients_2.11</artifactId>
        <version>1.13.0</version>
    </dependency>
    <!-- flink kafka -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-kafka_2.11</artifactId>
        <version>1.13.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-base</artifactId>
        <version>1.13.0</version>
    </dependency>
    <!-- json -->
    <dependency>
        <groupId>com.alibaba</groupId>
        <artifactId>fastjson</artifactId>
        <version>1.2.75</version>
    </dependency>
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>5.1.49</version>
    </dependency>
    <dependency>
        <groupId>cn.hutool</groupId>
        <artifactId>hutool-all</artifactId>
        <version>5.7.7</version>
    </dependency>
    <dependency>
        <groupId>net.sf.opencsv</groupId>
        <artifactId>opencsv</artifactId>
        <version>2.3</version>
    </dependency>
    <dependency>
        <groupId>org.json</groupId>
        <artifactId>json</artifactId>
        <version>20090211</version>
    </dependency>
</dependencies>
The classes:
package com.nana.app.csv2kafka;

import cn.hutool.core.io.resource.ResourceUtil;
import cn.hutool.core.text.csv.CsvReader;
import cn.hutool.core.text.csv.CsvUtil;
import com.alibaba.fastjson.JSONObject;
import com.nana.app.dto.ClickIdentifyDto;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.List;
import java.util.Properties;

/**
 * @author BradySuen
 * @create_time 2021/8/9
 * @description Read the Kochava CSV, convert each row to JSON, and sink it to Kafka
 **/
public class StreamClickIdentifyCSV2Kafka {

    public static void main(String[] args) throws Exception {
        // Create the streaming environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Path to the CSV file
        String path = "path";
        // Read and transform the data
        DataStreamSource<String> resultDataStreamSource = env.addSource(new SourceFunction<String>() {
            @Override
            public void run(SourceContext<String> sourceContext) throws Exception {
                // Read the CSV file and map each row to an object
                final CsvReader reader = CsvUtil.getReader();
                List<ClickIdentifyDto> list = reader.read(ResourceUtil.getUtf8Reader(path), ClickIdentifyDto.class);
                // Convert every row to JSON
                for (ClickIdentifyDto clickIdentifyDto : list) {
                    SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
                    clickIdentifyDto.partition_date = sdf.format(
                            new Date(Long.parseLong(clickIdentifyDto.date_received.substring(0, 10)) * 1000L));
                    // Serialize the object to a JSON string
                    String result = JSONObject.toJSONString(clickIdentifyDto);
                    sourceContext.collect(result);
                }
            }

            @Override
            public void cancel() {
            }
        });
        resultDataStreamSource.print();
        // Sink the data to Kafka
        Properties prop = new Properties();
        prop.setProperty("bootstrap.servers", "###");
        resultDataStreamSource.addSink(new FlinkKafkaProducer<String>("topic", new SimpleStringSchema(), prop));
        env.execute();
    }
}
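The partition_date conversion above takes the first 10 characters of date_received as a unix timestamp in seconds and formats it as yyyy-MM-dd. A quick Python sketch of the same arithmetic (the sample timestamp is made up; note the Java SimpleDateFormat uses the JVM's default timezone, while this sketch pins UTC):

```python
from datetime import datetime, timezone

def partition_date(date_received: str) -> str:
    """First 10 chars are a unix timestamp in seconds; format as yyyy-MM-dd."""
    ts = int(date_received[:10])
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")

print(partition_date("1628553600123"))  # 1628553600 s -> "2021-08-10"
```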
ClickIdentifyDto:
package com.nana.app.dto;

/**
 * @author BradySuen
 * @create_time 2021/8/11
 * @description
 **/
public class ClickIdentifyDto {
    public String click_id;
    public String date_received;
    public String identifier_type;
    public String identifier_value;
    public String partition_date;
}
package com.nana.app.kafka2doris;

import com.nana.app.dto.ClickIdentifyDto;

import java.lang.reflect.Field;
import java.sql.*;

/**
 * @author BradySuen
 * @create_time 2021/8/11
 * @description Create a Doris Routine Load job that consumes the Kafka topic
 **/
public class ClickIdentify2Doris {

    public static void main(String[] args) throws ClassNotFoundException, SQLException {
        String URL = "jdbc:mysql://<IP>:<port, default 9030>/<database>?useUnicode=true&characterEncoding=utf-8";
        String USER = "##";
        String PASSWORD = "##";
        // 1. Load the driver
        Class.forName("com.mysql.jdbc.Driver");
        // 2. Open the database connection
        Connection conn = DriverManager.getConnection(URL, USER, PASSWORD);
        // 3. Operate on the database through the connection (using Statement)
        Statement st = conn.createStatement();
        // Get the columns via reflection on the DTO fields
        Field[] fields = ClickIdentifyDto.class.getDeclaredFields();
        int len = fields.length;
        // Join the column names
        StringBuilder column = new StringBuilder();
        for (Field field : fields) {
            if (len == fields.length) {
                column.append("`").append(field.getName()).append("`");
            } else {
                column.append(",").append("`").append(field.getName()).append("`");
            }
            len--;
        }
        // Create the Routine Load job
        String jobName = "kochava_db.query_click_identify";
        String tableName = "kochava_click_identifiers";
        StringBuffer task = new StringBuffer();
        task.append("CREATE ROUTINE LOAD ").append(jobName).append(" ON ").append(tableName).append("\n")
                .append("COLUMNS(").append(column).append(")").append("\n")
                .append("PROPERTIES (").append("\n")
                .append("\t").append("\"desired_concurrent_number\"=\"1\"").append(",").append("\n")
                .append("\t").append("\"max_batch_interval\"=\"10\"").append(",").append("\n")
                .append("\t").append("\"max_batch_rows\"=\"300000\"").append(",").append("\n")
                .append("\t").append("\"max_batch_size\"=\"209715200\"").append(",").append("\n")
                .append("\t").append("\"strict_mode\"=\"false\"").append(",").append("\n")
                .append("\t").append("\"format\"=\"json\"").append("\n")
                .append(")").append("\n")
                .append("FROM KAFKA (").append("\n")
                .append("\t").append("\"kafka_broker_list\"=\"XXXX:9092\"").append(",").append("\n")
                .append("\t").append("\"kafka_topic\"=\"topic\"").append(",").append("\n")
                .append("\t").append("\"property.kafka_default_offsets\"=\"OFFSET_END\"").append(",").append("\n")
                .append("\t").append("\"property.group.id\"=\"gid_click_identify\"").append("\n")
                .append(");");
        st.execute(task.toString());
        // Release resources
        st.close();
        conn.close();
    }
}
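The len-countdown loop above is really just comma-joining the backticked column names for the COLUMNS(...) clause. As a sketch of the same string assembly (column names taken from ClickIdentifyDto):

```python
fields = ["click_id", "date_received", "identifier_type", "identifier_value", "partition_date"]

# Backtick each column and comma-join, as the reflection loop above does
columns = ",".join("`" + name + "`" for name in fields)
print(columns)  # `click_id`,`date_received`,`identifier_type`,`identifier_value`,`partition_date`
```

In Java the equivalent one-liner would be a `String.join` over the mapped field names, avoiding the manual counter entirely.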