azkaban错误排查日志

一.错误一

1.azkaban运行kettle任务时报错:
02-08-2018 14:13:15 CST EXT_T_PMS_HR_EMPLOYEE_PMS INFO - 2018/08/02 14:13:15 - writeData.0 - Error updating batch
02-08-2018 14:13:15 CST EXT_T_PMS_HR_EMPLOYEE_PMS INFO - 2018/08/02 14:13:15 - writeData.0 - Column 'en_name' cannot be null
02-08-2018 14:13:15 CST EXT_T_PMS_HR_EMPLOYEE_PMS INFO - 2018/08/02 14:13:15 - writeData.0 -
02-08-2018 14:13:15 CST EXT_T_PMS_HR_EMPLOYEE_PMS INFO - 2018/08/02 14:13:15 - writeData.0 - at org.pentaho.di.core.database.Database.createKettleDatabaseBatchException(Database.java:1425)
02-08-2018 14:13:15 CST EXT_T_PMS_HR_EMPLOYEE_PMS INFO - 2018/08/02 14:13:15 - writeData.0 - at org.pentaho.di.core.database.Database.emptyAndCommit(Database.java:1414)
02-08-2018 14:13:15 CST EXT_T_PMS_HR_EMPLOYEE_PMS INFO - 2018/08/02 14:13:15 - writeData.0 - at org.pentaho.di.trans.steps.tableoutput.TableOutput.dispose(TableOutput.java:586)
02-08-2018 14:13:15 CST EXT_T_PMS_HR_EMPLOYEE_PMS INFO - 2018/08/02 14:13:15 - writeData.0 - at org.pentaho.di.trans.step.RunThread.run(RunThread.java:97)
02-08-2018 14:13:15 CST EXT_T_PMS_HR_EMPLOYEE_PMS INFO - 2018/08/02 14:13:15 - writeData.0 - at java.lang.Thread.run(Thread.java:748)

这个错误的原因是:运行kettle时,kettle将​​''​​自动识别为​​null​​,继续按照null插入表的时候,就报上述的错。但是我已经在​​kettle.properties​​文件中做修改(如果这里有不了解的同学,可以私聊问我),但仍然报null错,我就郁闷了。

我们知道​​kettle.properties​​这个文件在每个运行kettle的用户的主目录中都会存在,我修改的只是用户​​guaishou​​下的​​kettle.properties​​。却没有修改root用户下的​​kettle.properties​​文件【或者说该​​guaishou​​用户根本没有​​kettle.properties​​这个文件】。这时候,或许就有同学说,那你直接修改这个root用户主目录下的​​kettle.properties​​文件补救解决问题了吗?这么做是可以的,但是我们并不应该绕开这个问题。

对于每个问题,都需要思考一下,再给出行动方案。

我在guaishou的用户,在其下运行azkaban的job时,却没有抛出这个错误,这是为何?这么来看,任务都是对的,错误的原因可能起自:azkaban。但是azkaban又会是什么错呢?

原因1:azkaban以不同的身份运行这个任务,从而导致出错。与这个问题相似的也有​​No such directory​​错。

原因2:azkaban会调度任务,会将任务分配到不同的executor上执行,当有些节点没有布置这个kettle时,就会报​​null​​错,这是因为在两个azkaban可运行的机器中,没有相同的文件目录。

我们可以通过azkaban数据库查找executor的具体信息,如下:

mysql> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| sys_azkaban |
+--------------------+
2 rows in set (0.00 sec)

mysql> use sys_azkaban;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| sys_azkaban |
+--------------------+
2 rows in set (0.00 sec)

mysql> show tables;
+--------------------------+
| Tables_in_sys_azkaban |
+--------------------------+
| ···· |
| executors |
··· |
+--------------------------+
28 rows in set (0.00 sec)

mysql> select * from executors;
+----+----------+-------+--------+
| id | host | port | active |
+----+----------+-------+--------+
| 12 | bi_5.109 | 12321 | 1 |
| 16 | bi_5.110 | 12321 | 1 |
+----+----------+-------+--------+
1 row in set (0.00 sec)

可以看到这个bi_5.109也是一个active的executor。这个​​null​​的问题,以及脚本找不到的问题就是源自这个executor下。因为有两个executor,但是我的执行脚本只在​​bi_5.110​​这台机器上,所以导致出现​​No such directory​​。可以查看一下出现该错的日志信息:

azkaban错误排查日志_mysql

可以看到是在bi_5.109上执行该任务,所以报错。

二.错误二

2.1 azkaban运行kettle任务报错(错误描述:每当运行任务,就出现faild)。azkaban后台报错:
2018-08-02 15:34:04 INFO  ExecutorManager:265 - Successfully refreshed executor: bi_5.110:12321 (id: 16) with executor info : ExecutorInfo{remainingMemoryPercent=96.25209633680396, remainingMemoryInMB=30812, remainingFlowCapacity=30, numberOfAssignedFlows=0, lastDispatchedTime=1533192900505, cpuUsage=0.0}
2018-08-02 15:34:04 INFO ExecutorManager:1813 - Using dispatcher for execution id :4222
2018-08-02 15:34:04 ERROR ExecutorManager:1392 - Rolling back executor assignment for execution id:4222
azkaban.executor.ExecutorManagerException: java.io.IOException: java.nio.file.FileSystemException: executions/4222/azkabanJob/batch101/batch101.job -> /data/software/azkaban/exec/projects/7.5/azkabanJob/batch101/batch101.job: Operation not permitted
at azkaban.executor.ExecutorApiGateway.callWithExecutionId(ExecutorApiGateway.java:78)
at azkaban.executor.ExecutorApiGateway.callWithExecutable(ExecutorApiGateway.java:43)
at azkaban.executor.ExecutorManager.dispatch(ExecutorManager.java:1389)
at azkaban.executor.ExecutorManager.access$1500(ExecutorManager.java:65)
at azkaban.executor.ExecutorManager$QueueProcessorThread.selectExecutorAndDispatchFlow(ExecutorManager.java:1750)
at azkaban.executor.ExecutorManager$QueueProcessorThread.processQueuedFlows(ExecutorManager.java:1730)
at azkaban.executor.ExecutorManager$QueueProcessorThread.run(ExecutorManager.java:1668)

查看azkaban下的executor的权限,如下:

[root@bi_5 projects]# ll
total 0
drwxr-xr-x 2 guaishou guaishou 61 Jul 27 14:09 1.1
drwxrwxr-x 2 guaishou guaishou 21 Jul 27 14:09 11.1
drwxr-xr-x 2 guaishou guaishou 109 Jul 27 14:09 1.2
drwxr-xr-x 2 guaishou guaishou 19 Jul 27 14:09 1.3
drwxrwxr-x 2 guaishou guaishou 39 Aug 2 14:55 13.1
drwxrwxr-x 2 guaishou guaishou 24 Jul 27 14:09 2.1
drwxr-xr-x 2 guaishou guaishou 23 Jul 27 14:09 3.1
drwxr-xr-x 2 root root 39 Jul 27 20:11 4.1
drwxrwxr-x 2 guaishou guaishou 24 Jul 27 14:09 5.1
drwxr-xr-x 2 root root 43 Jul 30 15:02 6.1
drwxr-xr-x 3 root root 23 Jul 31 15:29 7.1
drwxr-xr-x 3 root root 23 Aug 1 17:36 7.3
drwxr-xr-x 3 root root 23 Aug 2 10:46 7.5
drwxrwxr-x 2 guaishou guaishou 24 Jul 27 14:09 8.1
[root@bi_5 projects]# cd ..
[root@bi_5 exec]# cd ..
[root@bi_5 azkaban]# ll
total 0
drwxr-xr-x 11 guaishou guaishou 155 Aug 2 14:31 exec

修改./exec/下的所有情况的文件权限并如下所示:
[root@bi_5 azkaban]# chown guaishou.guaishou ./exec/ -R
[root@bi_5 azkaban]# cd exec/
[root@bi_5 exec]# ll
total 12
drwxr-xr-x 2 guaishou guaishou 107 Jul 27 14:09 bin
drwxr-xr-x 2 guaishou guaishou 78 Jul 27 17:35 conf
-rw-r--r-- 1 guaishou guaishou 6 Aug 2 14:31 currentpid
drwxrwsr-x 3 guaishou guaishou 17 Aug 2 15:34 executions
-rw-r--r-- 1 guaishou guaishou 6 Aug 2 14:31 executor.port
drwxr-xr-x 2 guaishou guaishou 6 Jul 27 14:09 extlib
drwxr-xr-x 2 guaishou guaishou 4096 Jul 27 14:09 lib
drwxr-xr-x 2 guaishou guaishou 89 Jul 27 14:09 logs
drwxr-xr-x 3 guaishou guaishou 21 Jul 27 14:09 plugins
drwxr-xr-x 16 guaishou guaishou 148 Aug 2 14:55 projects
drwxr-xr-x 2 guaishou guaishou 6 Aug 2 14:55 temp

Web UI报错页面如下所示:

azkaban错误排查日志_.net_02

这个粉红色的页面,正是因为没有权限运行这些任务导致。按照上述修改即可。

三.错误三

3.1报错
24-10-2018 11:38:49 CST test INFO - Starting job test at 1540352329279
24-10-2018 11:38:49 CST test INFO - azkaban.webserver.url property was not set
24-10-2018 11:38:49 CST test INFO - job JVM args: -Dazkaban.flowid=test -Dazkaban.execid=1655119 -Dazkaban.jobid=test
24-10-2018 11:38:49 CST test INFO - Building command job executor.
24-10-2018 11:38:49 CST test INFO - 1 commands to execute.
24-10-2018 11:38:49 CST test INFO - effective user is: guaishou
24-10-2018 11:38:49 CST test INFO - Command: /wls/applications/azkaban/azkaban-exec-server-3.8.0/lib/execute-as-user guaishou echo "hello"
24-10-2018 11:38:49 CST test INFO - Environment variables: {JOB_OUTPUT_PROP_FILE=/data/app/ak/execution/1655119/test/test_output_8459281760235729997_tmp, JOB_PROP_FILE=/data/app/ak/execution/1655119/test/test_props_1267879958234978530_tmp, JOB_NAME=test}
24-10-2018 11:38:49 CST test INFO - Working directory: /data/app/ak/execution/1655119/test
24-10-2018 11:38:49 CST test INFO - Process completed unsuccessfully in 0 seconds.
24-10-2018 11:38:49 CST test ERROR - Job run failed!
java.lang.RuntimeException: java.io.IOException: Cannot run program "/wls/applications/azkaban/azkaban-exec-server-3.8.0/lib/execute-as-user" (in directory "/data/app/ak/execution/1655119/test"): error=2, No such file or directory
at azkaban.jobExecutor.ProcessJob.run(ProcessJob.java:164)
at azkaban.execapp.JobRunner.runJob(JobRunner.java:596)
at azkaban.execapp.JobRunner.run(JobRunner.java:446)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Cannot run program "/wls/applications/azkaban/azkaban-exec-server-3.8.0/lib/execute-as-user" (in directory "/data/app/ak/execution/1655119/test"): error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at azkaban.jobExecutor.utils.process.AzkabanProcess.run(AzkabanProcess.java:92)
at azkaban.jobExecutor.ProcessJob.run(ProcessJob.java:158)
... 7 more
Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
at java.lang.ProcessImpl.start(ProcessImpl.java:134)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
... 9 more
24-10-2018 11:38:49 CST test ERROR - java.io.IOException: Cannot run program "/wls/applications/azkaban/azkaban-exec-server-3.8.0/lib/execute-as-user" (in directory "/data/app/ak/execution/1655119/test"): error=2, No such file or directory cause: java.io.IOException: Cannot run program "/wls/applications/azkaban/azkaban-exec-server-3.8.0/lib/execute-as-user" (in directory "/data/app/ak/execution/1655119/test"): error=2, No such file or directory
24-10-2018 11:38:49 CST test INFO - Finishing job test attempt: 0 at 1540352329300 with status FAILED

看到这里最关键的报错:

​Cannot run program "/wls/applications/azkaban/azkaban-exec-server-3.8.0/lib/execute-as-user" (in directory "/data/app/ak/execution/1655119/test"): error=2, No such file or directory​

即需要执行的execute文件不存在。

这里有两种方法可以修改这个问题:

  • 方法一
[gs@cdh208 azkaban-exec-server-3.8.0]$ ll
total 0
lrwxrwxrwx 1 gs gs 52 Oct 24 12:00 lib -> /data/software/azkaban/azkaban-exec-server-3.8.0/lib
[gs@cdh208 azkaban-exec-server-3.8.0]$ pwd
/wls/applications/azkaban/azkaban-exec-server-3.8.0

可以看到这里的​​/wls/applications/azkaban/azkaban-exec-server-3.8.0​​下有个链接从​​lib​​ -> ​​/data/software/azkaban/azkaban-exec-server-3.8.0/lib​​。

当然,这种方式修改的azkaban是有极大冗余的,而且极其丑陋!我们需要知道这个​​/wls/application/azkaban/azkaban-exec-server-3.8.0/lib​​到底是什么玩意儿。

  • 方法二
    在azkaban目录下使用grep命令查找包含wls的字符串:
[gs@cdh208 azkaban-exec-server-3.8.0]$  grep -r  "wls" 
Binary file lib/mysql-connector-java-5.1.28.jar matches
plugins/jobtypes/commonprivate.properties:azkaban.native.lib=/wls/applications/azkaban/azkaban-exec-server-3.8.0/lib

发现存在一个​​plugins/jobtypes/commonprivate.properties​​的文件,这个文件中包含有一个该lib配置。

查看该文件如下:

[gs@cdh208 jobtypes]$ cat commonprivate.properties.bak.20181024 
execute.as.user=true
azkaban.native.lib=/wls/applications/azkaban/azkaban-exec-server-3.8.0/lib

所以我们可以将其修改成如下的样子:

[gs@cdh208 jobtypes]$ cat commonprivate.properties
execute.as.user=true
azkaban.native.lib=/data/software/azkaban/azkaban-exec-server-3.8.0/lib

再次执行该任务,发现报错如下。

四.错误四

4.1 报错如下
24-10-2018 12:14:27 CST test INFO - Starting job test at 1540354467702
24-10-2018 12:14:27 CST test INFO - azkaban.webserver.url property was not set
24-10-2018 12:14:27 CST test INFO - job JVM args: -Dazkaban.flowid=test -Dazkaban.execid=1655124 -Dazkaban.jobid=test
24-10-2018 12:14:27 CST test INFO - Building command job executor.
24-10-2018 12:14:27 CST test INFO - 1 commands to execute.
24-10-2018 12:14:27 CST test INFO - effective user is: guaishou
24-10-2018 12:14:27 CST test INFO - Command: /wls/applications/azkaban/azkaban-exec-server-3.8.0/lib/execute-as-user guaishou echo "hello"
24-10-2018 12:14:27 CST test INFO - Environment variables: {JOB_OUTPUT_PROP_FILE=/data/app/ak/execution/1655124/test/test_output_1285731661769754294_tmp, JOB_PROP_FILE=/data/app/ak/execution/1655124/test/test_props_1666794928458588286_tmp, JOB_NAME=test}
24-10-2018 12:14:27 CST test INFO - Working directory: /data/app/ak/execution/1655124/test
24-10-2018 12:14:27 CST test INFO - user does not exist: guaishou
24-10-2018 12:14:27 CST test INFO - Process completed unsuccessfully in 0 seconds.
24-10-2018 12:14:27 CST test ERROR - Job run failed!
java.lang.RuntimeException: azkaban.jobExecutor.utils.process.ProcessFailureException
at azkaban.jobExecutor.ProcessJob.run(ProcessJob.java:164)
at azkaban.execapp.JobRunner.runJob(JobRunner.java:596)
at azkaban.execapp.JobRunner.run(JobRunner.java:446)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: azkaban.jobExecutor.utils.process.ProcessFailureException
at azkaban.jobExecutor.utils.process.AzkabanProcess.run(AzkabanProcess.java:131)
at azkaban.jobExecutor.ProcessJob.run(ProcessJob.java:158)
... 7 more
24-10-2018 12:14:27 CST test ERROR - azkaban.jobExecutor.utils.process.ProcessFailureException cause: azkaban.jobExecutor.utils.process.ProcessFailureException
24-10-2018 12:14:27 CST test INFO - Finishing job test attempt: 0 at 1540354467720 with status FAILED
4.2报错原因

发现报错的原因是​​user does not exist: guaishou​​​,即不存在​​guaishou​​​这个用户。我们的azkaban登录用户是​​guaishou​​​,但是为什么这里却报​​guaishou​​​不存在呢?在azkaban执行的机器上查看所有​​guaishou​​这个用户,如下:

[root@bi_5 ~]# su - guaishou
su: user guaishou does not exist

然后手动为该系统添加guaishou用户,如下:

[root@bi_5 ~]# useradd guaishou

最后执行任务,成功。

4.3 思考

根据上面的这个操作,我们知道,**如果azkaban执行命令的用户并不存在于azkaban executor所在的系统上,那么这个任务就是不会运行成功的。**这个命题的成立是有前提的,我们查看​​commonprivate.properties​​文件

[gs@cdh208 jobtypes]$ cat commonprivate.properties
execute.as.user=true
azkaban.native.lib=/data/software/azkaban/azkaban-exec-server-3.8.0/lib

可以看到里面有个配置选项​​execute.as.user=true​​,这个配置代表的意思就是,azkaban中可以使用非登录用户执行任务。

如下图所示:

azkaban错误排查日志_java_03

这里azkaban的登录用户是guaishou,但是我们使用执行test任务的用户是gs,可以这么做的前提是**【需要针对azkaban的底层源码做一些修改,才可以配置此选项】**。

五. 错误五

5.1 报错如下:
···
2018-10-31 10:59:44 INFO XmlUserManager:321 - Group roles GsBigDataDeveloper added.
2018-10-31 10:59:44 INFO ExecutorManager:199 - Initializing multi executors from database
2018-10-31 10:59:44 ERROR ExecutorManager:221 - No active executor found
2018-10-31 10:59:44 ERROR StdOutErrRedirect:55 - Exception in thread "main"
2018-10-31 10:59:44 ERROR StdOutErrRedirect:55 - com.google.inject.ProvisionException: Unable to provision, see the following errors:

1) Error injecting constructor, azkaban.executor.ExecutorManagerException: No active executor found
at azkaban.executor.ExecutorManager.<init>(ExecutorManager.java:121)
at azkaban.executor.ExecutorManager.class(ExecutorManager.java:65)
while locating azkaban.executor.ExecutorManager
for the 3rd parameter of azkaban.webapp.AzkabanWebServer.<init>(AzkabanWebServer.java:161)
at azkaban.webapp.AzkabanWebServer.class(AzkabanWebServer.java:119)
while locating azkaban.webapp.AzkabanWebServer
Caused by: azkaban.executor.ExecutorManagerException: No active executor found
at azkaban.executor.ExecutorManager.setupExecutors(ExecutorManager.java:222)
at azkaban.executor.ExecutorManager.<init>(ExecutorManager.java:127)
at azkaban.executor.ExecutorManager$$FastClassByGuice$$e1c1dfed.newInstance(<generated>)
at com.google.inject.internal.DefaultConstructionProxyFactory$FastClassProxy.newInstance(DefaultConstructionProxyFactory.java:89)
at com.google.inject.internal.ConstructorInjector.provision(ConstructorInjector.java:111)
···

很明显是因为没有找到active azkaban executor。所以需要启动一下executor,然后去azkaban使用的mysql数据库中的executors表中查看是否存在active = 1的executor,如果没有,更新即可。

mysql> select * from executors;
+----+-----------+-------+--------+
| id | host | port | active |
+----+-----------+-------+--------+
| 3 | server4 | 33871 | 0 |
+----+-----------+-------+--------+
1 rows in set (0.00 sec)

​update executors set active =1 where id = 3 ;​

六. 错误六

6.1 报错如下
31-10-2018 22:38:08 PDT test INFO - Starting job test at 1541050688024
31-10-2018 22:38:08 PDT test INFO - job JVM args: -Dazkaban.flowid=test -Dazkaban.execid=9 -Dazkaban.jobid=test
31-10-2018 22:38:08 PDT test INFO - user.to.proxy property was not set, defaulting to submit user azkaban
31-10-2018 22:38:08 PDT test INFO - Building command job executor.
31-10-2018 22:38:08 PDT test INFO - Cannot request memory (Xms 0 kb, Xmx 0 kb) from system for job test, sleep for 60 secs and retry, attempt 1 of 720
31-10-2018 22:39:08 PDT test INFO - Cannot request memory (Xms 0 kb, Xmx 0 kb) from system for job test, sleep for 60 secs and retry, attempt 2 of 720
·····
6.2 报错原因

azkaban默认执行任务需要3GB内存,因为这个任务运行在我的(2GB)虚拟机上,所以报错,修改executor下的​​plugin/jobtype/commonprivate.properties​​​文件,添加​​memCheck.enabled=false​​即可

[root@server4 jobtypes]# cat commonprivate.properties 
# set execute-as-user
execute.as.user=false
memCheck.enabled=false

重启executor即可运行成功。

七.错误七

2018/11/06 12:10:45.329 +0800 ERROR [ExecutorManager] [Azkaban] Failed to update ExecutorInfo for executor : 10.22.6.110:12321 (id: 4)
java.util.concurrent.ExecutionException: java.net.NoRouteToHostException: No route to host (Host unreachable)
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:206)
at azkaban.executor.ExecutorManager.refreshExecutors(ExecutorManager.java:260)
at azkaban.executor.ExecutorManager.access$1500(ExecutorManager.java:66)
at azkaban.executor.ExecutorManager$QueueProcessorThread.processQueuedFlows(ExecutorManager.java:1825)
at azkaban.executor.ExecutorManager$QueueProcessorThread.run(ExecutorManager.java:1793)
Caused by: java.net.NoRouteToHostException: No route to host (Host unreachable)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at org.apache.http.conn.socket.PlainConnectionSocketFactory.connectSocket(PlainConnectionSocketFactory.java:72)
at org.apache.http.impl.conn.HttpClientConnectionOperator.connect(HttpClientConnectionOperator.java:118)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:314)

涉及到​​route​​ (路由)的情况,第一反应应该就去查找网络是否通畅。主要包括如下几个方面:


  • web 和 executor的port是否配置正确。这个时候可以去查找azkaban使用的mysql表中的数据
  • 是否是防火墙的原因导致回路不通
    这里笔者的情况是第二种,原因是小组开发人员执行了其它的脚本,开启了防火墙,导致无法同executor连通。