I. Installing and Configuring Azkaban

1.1 Environment Preparation

1.1.1 Database Preparation


  1. Upload the installation packages to the directory /opt/software/azkaban.
  2. Extract them.

  • Extract the db package; it contains a SQL script with "all" in its name. Import that SQL file into the database.
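The import step can be sketched in shell as follows. This is only a sketch under assumptions: the directory argument, the `*all*.sql` filename pattern, the `root` MySQL user, and the `azkaban` database name (taken from the `mysql.database` setting used later in this post) are placeholders to adapt. The function prints the command instead of running it, and the database must already exist before the import.

```shell
#!/bin/bash
# Locate the "all" SQL script inside an extracted db directory and print the
# mysql command that would import it. Nothing is executed against MySQL.
# Assumptions: the script name contains "all" and ends in .sql; the target
# database is called "azkaban" and the import is done as user root.
print_import_cmd() {
    local db_dir="$1"
    local sql_file
    # Take the first matching file; sort makes the choice deterministic.
    sql_file=$(find "$db_dir" -maxdepth 1 -type f -name '*all*.sql' | sort | head -n 1)
    if [ -z "$sql_file" ]; then
        echo "no *all*.sql file found in $db_dir" >&2
        return 1
    fi
    echo "mysql -u root -p azkaban < $sql_file"
}
```

Calling `print_import_cmd /path/to/extracted/db` prints one line that can be reviewed and then pasted into a shell.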

1.1.2 Azkaban Executor Server Configuration


  1. Extract the azkaban-exec archive into the installation directory (the scripts later in this post use /opt/module/azkaban/azkaban-exec).
  2. Edit the azkaban.properties file:

# Azkaban Personalization Settings
azkaban.name=Test
azkaban.label=My Local Azkaban
azkaban.color=#FF3601
azkaban.default.servlet.path=/index
web.resource.dir=web/
default.timezone.id=Asia/Shanghai
# Azkaban UserManager class
user.manager.class=azkaban.user.XmlUserManager
user.manager.xml.file=conf/azkaban-users.xml
# Loader for projects
executor.global.properties=conf/global.properties
azkaban.project.dir=projects
# Velocity dev mode
velocity.dev.mode=false
# Azkaban Jetty server properties.
jetty.use.ssl=false
jetty.maxThreads=25
jetty.port=8081
# Where the Azkaban web server is located
azkaban.webserver.url=http://hadoop102:8081
# mail settings
mail.sender=
mail.host=
# User facing web server configurations used to construct the user facing server URLs. They are useful when there is a reverse proxy between Azkaban web servers and users.
# enduser -> myazkabanhost:443 -> proxy -> localhost:8081
# when this parameters set then these parameters are used to generate email links.
# if these parameters are not set then jetty.hostname, and jetty.port(if ssl configured jetty.ssl.port) are used.
# azkaban.webserver.external_hostname=myazkabanhost.com
# azkaban.webserver.external_ssl_port=443
# azkaban.webserver.external_port=8081
job.failure.email=
job.success.email=
lockdown.create.projects=false
cache.directory=cache
# JMX stats
jetty.connector.stats=true
executor.connector.stats=true
# Azkaban plugin settings
azkaban.jobtype.plugin.dir=plugins/jobtypes
# Azkaban mysql settings by default. Users should configure their own username and password.
database.type=mysql
mysql.port=3306
mysql.host=192.168.109.135
mysql.database=azkaban
mysql.user=azkaban
mysql.password=azkaban
mysql.numconnections=100
# Azkaban Executor settings
executor.maxThreads=50
executor.flow.threads=30
executor.port=12321

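Change into the exec installation directory first (many paths in its configuration files are relative), then start the executor:

bin/start-exec.sh

Note: if MySQL is version 8 or later, delete the default 5.1.28 MySQL driver from the lib directory and add a version 8 driver in its place.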
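A quick way to see which driver is currently in place is to list the connector jars in lib. The sketch below assumes the jars follow the usual `mysql-connector-*.jar` naming and that the lib directory is passed as an argument; both are assumptions, not something Azkaban enforces.

```shell
#!/bin/bash
# List MySQL connector jars found in a lib directory and flag any 5.x driver,
# which the note above says must be replaced when MySQL is version 8 or later.
# Assumption: driver jars are named mysql-connector-*.jar.
# Returns 1 if an old 5.x driver is present, 0 otherwise.
check_mysql_driver() {
    local lib_dir="$1"
    local jar name found_old=0
    for jar in "$lib_dir"/mysql-connector-*.jar; do
        [ -e "$jar" ] || continue          # the glob matched nothing
        name=$(basename "$jar")
        case "$name" in
            mysql-connector-java-5.*) echo "old driver: $name"; found_old=1 ;;
            *)                        echo "driver: $name" ;;
        esac
    done
    return "$found_old"
}
```

For example, `check_mysql_driver lib` run from the exec installation directory reports each driver jar it finds.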

  1. Activate the executor:
curl -G "hadoop102:12321/executor?action=activate" && echo
  2. After activation, check the executor's status in the database:
    0: not activated
    1: activated
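Because the executor may need a few seconds after start-up before it answers, the activation call can be wrapped in a small retry loop. This is a sketch, not part of Azkaban: the retry count and delay are arbitrary, and it assumes a successful call returns JSON containing `"status":"success"`.

```shell
#!/bin/bash
# Call the executor activation endpoint until it reports success or the
# attempts run out. The URL is the one used in the curl command above.
# Arguments: host [port] [attempts] [delay-seconds]
activate_executor() {
    local host="$1" port="${2:-12321}" attempts="${3:-5}" delay="${4:-2}"
    local i=1 response
    while [ "$i" -le "$attempts" ]; do
        response=$(curl -sG "${host}:${port}/executor?action=activate")
        # Assumption: success is reported as JSON containing "status":"success".
        if echo "$response" | grep -q '"status" *: *"success"'; then
            echo "executor ${host}:${port} activated"
            return 0
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    echo "executor ${host}:${port} not activated after ${attempts} attempts" >&2
    return 1
}
```

Usage: `activate_executor hadoop102` tries up to five times with a two-second pause between attempts.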

1.1.3 Azkaban Web Server Configuration


  1. Extract the web server archive into the same parent directory as the executor server.
  2. Again, edit azkaban.properties:

# Azkaban Personalization Settings
azkaban.name=Test
azkaban.label=My Local Azkaban
azkaban.color=#FF3601
azkaban.default.servlet.path=/index
web.resource.dir=web/
default.timezone.id=Asia/Shanghai
# Azkaban UserManager class
user.manager.class=azkaban.user.XmlUserManager
user.manager.xml.file=conf/azkaban-users.xml
# Loader for projects
executor.global.properties=conf/global.properties
azkaban.project.dir=projects
# Velocity dev mode
velocity.dev.mode=false
# Azkaban Jetty server properties.
jetty.use.ssl=false
jetty.maxThreads=25
jetty.port=8081
# Azkaban Executor settings
# mail settings
mail.sender=1449697757@qq.com
mail.host=smtp.qq.com
mail.user=1449697757@qq.com
mail.password=xxxxxxxxxxxx
# User facing web server configurations used to construct the user facing server URLs. They are useful when there is a reverse proxy between Azkaban web servers and users.
# enduser -> myazkabanhost:443 -> proxy -> localhost:8081
# when this parameters set then these parameters are used to generate email links.
# if these parameters are not set then jetty.hostname, and jetty.port(if ssl configured jetty.ssl.port) are used.
# azkaban.webserver.external_hostname=myazkabanhost.com
# azkaban.webserver.external_ssl_port=443
# azkaban.webserver.external_port=8081
job.failure.email=
job.success.email=
lockdown.create.projects=false
cache.directory=cache
# JMX stats
jetty.connector.stats=true
executor.connector.stats=true
# Azkaban mysql settings by default. Users should configure their own username and password.
database.type=mysql
mysql.port=3306
mysql.host=192.168.109.135
mysql.database=azkaban
mysql.user=azkaban
mysql.password=azkaban
mysql.numconnections=100
#Multiple Executor
azkaban.use.multiple.executors=true
azkaban.executorselector.filters=StaticRemainingFlowSize,CpuStatus
azkaban.executorselector.comparator.NumberOfAssignedFlowComparator=1
azkaban.executorselector.comparator.Memory=1
azkaban.executorselector.comparator.LastDispatched=1
azkaban.executorselector.comparator.CpuUsage=1
  3. Edit the users:
vim azkaban-users.xml


<azkaban-users>
<user groups="azkaban" password="azkaban" roles="admin" username="azkaban"/>
<user password="metrics" roles="metrics" username="metrics"/>

<user password="root" roles="admin" username="root"/>

<role name="admin" permissions="ADMIN"/>
<role name="metrics" permissions="METRICS"/>
</azkaban-users>

II. Basic Usage of Azkaban

2.1 Writing the Test Files

first.project

```yaml
azkaban-flow-version: 2.0
```

first.flow

```yaml
nodes:
  - name: jobA
    type: command
    config:
      command: echo "Hello World"
```

2.2 Compressing the Two Files into a .zip Package
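The packaging step can be scripted. The sketch below writes the two test files from section 2.1 into a working directory and packs them if the `zip` tool is available. The directory argument and the archive name `first.zip` are assumptions chosen for illustration.

```shell
#!/bin/bash
# Write first.project and first.flow into a working directory and, if the
# zip tool is installed, pack them into first.zip at the archive root.
# Assumptions: the directory argument and the archive name first.zip.
make_first_project() {
    local dir="$1"
    mkdir -p "$dir"
    cat > "$dir/first.project" <<'EOF'
azkaban-flow-version: 2.0
EOF
    cat > "$dir/first.flow" <<'EOF'
nodes:
  - name: jobA
    type: command
    config:
      command: echo "Hello World"
EOF
    if command -v zip >/dev/null 2>&1; then
        # Run zip inside the directory so no parent path is stored in the archive.
        (cd "$dir" && zip -q first.zip first.project first.flow)
    else
        echo "zip not found; files written to $dir but not packed" >&2
    fi
}
```

Running `make_first_project ./first` leaves `first.zip` in `./first`, ready to upload.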

Then create a project in the web UI, upload the zip package, and click Execute.

That is a fairly simple flow. A more complex one, with dependencies between jobs, looks like this:

nodes:
  - name: jobA
    type: command
    config:
      command: echo "Hello World AAA"
  - name: jobB
    type: command
    config:
      command: echo "Hello World BBB"
  - name: jobC
    type: command
    config:
      command: echo "Hello World CCC"
    dependsOn:
      - jobA
      - jobB

When this flow is executed, jobC runs only after both jobA and jobB have finished.

2.3 Azkaban Email Alerts

  1. Configure the mailbox. In the web server's azkaban.properties file, set:
# mail settings
mail.sender=1449697757@qq.com
mail.host=smtp.qq.com
mail.user=1449697757@qq.com
mail.password=xxxxxxxxxxxx
  2. Restart the Azkaban web service, then configure the notification email addresses in the web UI.

2.4 Azkaban Start/Stop Script

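#!/bin/bash
# Start/stop helper for the Azkaban web server and executors.
# Usage: <this script> {start-exec|a-exec|stop-exec|start-web|stop-web}

# Start the web server from its install directory (its config paths are relative).
start-web(){
    for i in hadoop102; do
        ssh "$i" "cd /opt/module/azkaban/azkaban-web/ && bin/start-web.sh"
    done
}

stop-web(){
    for i in hadoop102; do
        ssh "$i" "cd /opt/module/azkaban/azkaban-web/ && bin/shutdown-web.sh"
    done
}

# Start an executor on every executor node.
start-exec(){
    for i in hadoop102 hadoop103 hadoop104; do
        ssh "$i" "cd /opt/module/azkaban/azkaban-exec/ && bin/start-exec.sh"
    done
}

# Activate each executor through its HTTP endpoint.
activate-exec(){
    for i in hadoop102 hadoop103 hadoop104; do
        ssh "$i" "curl -G '$i:12321/executor?action=activate' && echo"
    done
}

stop-exec(){
    for i in hadoop102 hadoop103 hadoop104; do
        ssh "$i" "/opt/module/azkaban/azkaban-exec/bin/shutdown-exec.sh"
    done
}

case $1 in
    start-exec )
        start-exec
    ;;
    a-exec )
        activate-exec
    ;;
    stop-exec )
        stop-exec
    ;;
    start-web )
        start-web
    ;;
    stop-web )
        stop-web
    ;;
    * )
        echo "Usage: $0 {start-exec|a-exec|stop-exec|start-web|stop-web}"
    ;;
esac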
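The subcommands are usually run in a fixed order at start-up: executors first, then activation, then the web server, since in a multi-executor setup the web server generally expects at least one active executor when it starts. A thin wrapper for that order might look like the sketch below. `AZ_SCRIPT` (the path where the script above was saved) and `AZ_PAUSE` are hypothetical names introduced here, and the 5-second pause is an arbitrary stand-in for waiting until the executors are up.

```shell
#!/bin/bash
# Run the start-up steps in order: executors, activation, web server.
# AZ_SCRIPT: hypothetical path to the start/stop script shown above.
# AZ_PAUSE:  seconds to wait between starting and activating the executors.
start_all() {
    local script="${AZ_SCRIPT:-./azkaban.sh}" pause="${AZ_PAUSE:-5}"
    "$script" start-exec || return 1
    sleep "$pause"
    "$script" a-exec || return 1
    "$script" start-web
}
```

Stopping would typically go the other way round: `stop-web` first, then `stop-exec`.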

III. Scheduling the Full Pipeline with Azkaban

3.1 Preparing the Data

3.1.1 Log Data

(1) Edit application.properties under /opt/module/applog

# business date
mock.date=2020-06-20

Note: distribute the file to the other nodes that need to generate data.

[root@hadoop102 applog]$ xsync application.properties
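The edit in step (1) can also be scripted, which helps when the business date changes between runs. A minimal sketch, assuming the property sits on its own line as `mock.date=...`; the function name is made up for this example, and a `.bak` copy of the original file is kept.

```shell
#!/bin/bash
# Set mock.date in a properties file to the given date (YYYY-MM-DD).
# A backup of the original file is written next to it with a .bak suffix.
set_mock_date() {
    local file="$1" new_date="$2"
    case "$new_date" in
        [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) ;;
        *) echo "expected YYYY-MM-DD, got: $new_date" >&2; return 1 ;;
    esac
    # Replace the whole mock.date line; other properties are left untouched.
    sed -i.bak "s/^mock\.date=.*/mock.date=${new_date}/" "$file"
}
```

For example, `set_mock_date /opt/module/applog/application.properties 2020-06-21` followed by the `xsync` command above.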

(2) Generate the data

[root@hadoop102 bin]$ lg.sh

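Note: after generating the data, remember to check that it exists in HDFS.

(3) Check whether the HDFS path /origin_data/gmall/log/topic_log/2020-06-26 contains data (the date directory corresponds to the mock.date value configured in step (1)).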

3.1.2 Business Data Preparation

(1) Edit application.properties under /opt/module/db_log

mock.date=2020-06-20

(2) Generate the data

[root@hadoop102 db_log]$ java -jar gmall2020-mock-db-2020-04-01.jar

(3) In SQLyog, check that the operate_time column of the order_info table contains rows dated 2020-06-26

3.2 Starting the Scheduling

3.2.1 Writing the Configuration Files

gmall.project

azkaban-flow-version: 2.0

gmall.flow

nodes:
  - name: mysql_to_hdfs
    type: command
    config:
      command: /usr/bin/mysql_to_hdfs.sh all ${dt}

  - name: hdfs_to_ods_log
    type: command
    config:
      command: /usr/bin/hdfs_to_ods_log.sh ${dt}

  - name: hdfs_to_ods_db
    type: command
    dependsOn:
      - mysql_to_hdfs
    config:
      command: /usr/bin/hdfs_to_ods_db.sh all ${dt}

  - name: ods_to_dwd_log
    type: command
    dependsOn:
      - hdfs_to_ods_log
    config:
      command: /usr/bin/ods_to_dwd_log.sh ${dt}

  - name: ods_to_dwd_db
    type: command
    dependsOn:
      - hdfs_to_ods_db
    config:
      command: /usr/bin/ods_to_dwd_db.sh all ${dt}

  - name: dwd_to_dws
    type: command
    dependsOn:
      - ods_to_dwd_log
      - ods_to_dwd_db
    config:
      command: /usr/bin/dwd_to_dws.sh ${dt}

  - name: dws_to_dwt
    type: command
    dependsOn:
      - dwd_to_dws
    config:
      command: /usr/bin/dws_to_dwt.sh ${dt}

  - name: dwt_to_ads
    type: command
    dependsOn:
      - dws_to_dwt
    config:
      command: /usr/bin/dwt_to_ads.sh ${dt}

  - name: hdfs_to_mysql
    type: command
    dependsOn:
      - dwt_to_ads
    config:
      command: /usr/bin/hdfs_to_mysql.sh all
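With this many nodes, a mistyped name under dependsOn is easy to miss. The sketch below is a rough consistency check written for this post, not an Azkaban tool: it assumes the simple layout used above, where `- name: X` declares a node and the list items directly after `dependsOn:` are its dependencies, and it does not parse YAML in general.

```shell
#!/bin/bash
# Print every dependsOn entry in a flow file that does not match a declared
# node name. Returns 1 if any unknown dependency is found, 0 otherwise.
# Assumes "- name: X" declares a node and the "- Y" items that directly
# follow "dependsOn:" are dependencies.
check_flow_deps() {
    awk '
        /^[ \t]*-[ \t]*name:/ {
            sub(/^[ \t]*-[ \t]*name:[ \t]*/, ""); sub(/[ \t]+$/, "")
            names[$0] = 1; in_deps = 0; next
        }
        /^[ \t]*dependsOn:/ { in_deps = 1; next }
        in_deps && /^[ \t]*-[ \t]/ {
            sub(/^[ \t]*-[ \t]*/, ""); sub(/[ \t]+$/, "")
            deps[$0] = 1; next
        }
        { in_deps = 0 }   # any other line ends the dependency list
        END {
            for (d in deps) if (!(d in names)) { print "unknown dependency: " d; bad = 1 }
            exit bad + 0
        }
    ' "$1"
}
```

Running `check_flow_deps gmall.flow` before packaging prints nothing when every dependency refers to a declared node.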

Then compress these two files into a single gmall.zip package and upload it.

3.2.2 Running the Flow from the Web UI

In the web UI, execute the uploaded flow. The execution view lists all the scheduled jobs, and once they have all finished, the whole pipeline has been scheduled.