1). Log format analysis

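First, look at the format of the Hadoop logs. Each entry normally occupies one line (startup and shutdown banners span several lines), and its fields are, in order: date, time, level, originating class, and message. For example: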

2014-01-07 00:31:25,393 INFO org.apache.hadoop.mapred.JobTracker: SHUTDOWN_MSG: 
 /************************************************************
 SHUTDOWN_MSG: Shutting down JobTracker at hadoop1/192.168.91.101
 ************************************************************/
 2014-01-07 00:33:42,425 INFO org.apache.hadoop.mapred.JobTracker: STARTUP_MSG: 
 /************************************************************
 STARTUP_MSG: Starting JobTracker
 STARTUP_MSG:   host = hadoop1/192.168.91.101
 STARTUP_MSG:   args = []
 STARTUP_MSG:   version = 1.1.2
 STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.1 -r 1440782; compiled by 'hortonfo' on Thu Jan 31 02:03:24 UTC 2013
 ************************************************************/
 2014-01-07 00:33:43,305 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
 2014-01-07 00:33:43,358 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source MetricsSystem,sub=Stats registered.
 2014-01-07 00:33:43,359 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
 2014-01-07 00:33:43,359 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: JobTracker metrics system started
 2014-01-07 00:33:43,562 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source QueueMetrics,q=default registered.
 2014-01-07 00:33:44,118 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source ugi registered.
 2014-01-07 00:33:44,118 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Updating the current master key for generating delegation tokens
 2014-01-07 00:33:44,119 INFO org.apache.hadoop.mapred.JobTracker: Scheduler configured with (memSizeForMapSlotOnJT, memSizeForReduceSlotOnJT, limitMaxMemForMapTasks, limitMaxMemForReduceTasks) (-1, -1, -1, -1)
 2014-01-07 00:33:44,120 INFO org.apache.hadoop.util.HostsFileReader: Refreshing hosts (include/exclude) list
 2014-01-07 00:33:44,125 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Starting expired delegation token remover thread, tokenRemoverScanInterval=60 min(s)
 2014-01-07 00:33:44,125 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Updating the current master key for generating delegation tokens
 2014-01-07 00:33:44,126 INFO org.apache.hadoop.mapred.JobTracker: Starting jobtracker with owner as root
 2014-01-07 00:33:44,187 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source RpcDetailedActivityForPort9001 registered.
 2014-01-07 00:33:44,187 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source RpcActivityForPort9001 registered.
 2014-01-07 00:33:44,188 INFO org.apache.hadoop.ipc.Server: Starting SocketReader
 2014-01-07 00:33:44,490 INFO org.mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
 2014-01-07 00:33:44,805 INFO org.apache.hadoop.http.HttpServer: Added global filtersafety (class=org.apache.hadoop.http.HttpServer$QuotingInputFilter)
 2014-01-07 00:33:44,825 INFO org.apache.hadoop.http.HttpServer: Port returned by webServer.getConnectors()[0].getLocalPort() before open() is -1. Opening the listener

This is only part of the log.
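
The five-field layout described above can be captured with a regular expression. The following is a minimal sketch and not part of the original program: the class name LogLineParser and the pattern itself are assumptions made for illustration. Any line that does not start with a timestamp (banner text, for example) is treated as a continuation line.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: splits one Hadoop log line into its five fields. */
final class LogLineParser {
    // date, time, milliseconds, level, class name, then ": message"
    private static final Pattern LINE = Pattern.compile(
            "^\\s*(\\d{4}-\\d{2}-\\d{2}) (\\d{2}:\\d{2}:\\d{2}),(\\d{3}) (\\w+) ([^\\s:]+): ?(.*)$");

    /**
     * Returns {date, time, level, class, message}, or an empty Optional
     * when the line does not begin with a timestamp.
     */
    static Optional<String[]> parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(new String[] {
                m.group(1), m.group(2), m.group(4), m.group(5), m.group(6)});
    }
}
```

Unlike splitting on spaces, this keeps the whole message in one field and leaves the trailing colon out of the class name.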


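2). Program design
The program is developed in Eclipse on a personal machine. It connects to the Hadoop cluster, and the processed results are stored on a MySQL server.
The SQL statements that create the MySQL table "hadooplog", which stores the results, are as follows: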


drop table if exists  hadooplog;
create table hadooplog(
    id int(11) not null auto_increment,
    rdate varchar(50)  null,
    time varchar(50) default null,
    type varchar(50) default null,
    relateclass tinytext default null,
    information longtext default null,
    primary key (id)
) engine=innodb default charset=utf8;



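To set this up, log in to MySQL and run the SQL statements above directly (in the hive database that the program connects to); they create the hadooplog table.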

3). Program code

package com.wzl.hive;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

/**
 * This class creates the connections to Hive and MySQL. Because opening a connection is expensive, each connection is created once and reused (singleton-style design).
 */
class DBHelper {
        private static Connection connToHive = null;
        private static Connection connToMySQL = null;

        private DBHelper() {
        }

        // Return the Hive connection, creating it on first use
        public static Connection getHiveConn() throws SQLException {
                if (connToHive == null) {
                        try {
                                Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
                        } catch (ClassNotFoundException err) {
                                err.printStackTrace();
                                System.exit(1);
                        }
                        connToHive = DriverManager.getConnection("jdbc:hive://192.168.91.101:10000/default", "hive", "");
                }
                return connToHive;
        }

        // Return the MySQL connection, creating it on first use
        public static Connection getMySQLConn() throws SQLException {
                if (connToMySQL == null) {
                        try {
                                Class.forName("com.mysql.jdbc.Driver");
                        } catch (ClassNotFoundException err) {
                                err.printStackTrace();
                                System.exit(1);
                        }

                        connToMySQL = DriverManager.getConnection("jdbc:mysql://192.168.91.101:3306/hive?useUnicode=true&characterEncoding=UTF8",
                                        "root", "root"); // write the encoding as UTF8 here, not UTF-8
                }
                return connToMySQL;
        }

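        public static void closeHiveConn() throws SQLException {
                if (connToHive != null) {
                        connToHive.close();
                        connToHive = null; // so that a later getHiveConn() opens a fresh connection
                }
        }

        public static void closeMySQLConn() throws SQLException {
                if (connToMySQL != null) {
                        connToMySQL.close();
                        connToMySQL = null; // so that a later getMySQLConn() opens a fresh connection
                }
        }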
        
        public static void main(String[] args) throws SQLException {
                System.out.println(getMySQLConn());
                closeMySQLConn();
        }

}



package com.wzl.hive;

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

/**
 * 
 * Utility class for working with Hive
 */
class HiveUtil {
        // Create a table
        public static void createTable(String sql) throws SQLException {
                Connection conn = DBHelper.getHiveConn();
                Statement stmt = conn.createStatement();
                ResultSet res = stmt.executeQuery(sql);
        }

        // Query data according to the given conditions
        public static ResultSet queryData(String sql) throws SQLException {
                Connection conn = DBHelper.getHiveConn();
                Statement stmt = conn.createStatement();
                ResultSet res = stmt.executeQuery(sql);
                return res;
        }

        // Load data
        public static void loadData(String sql) throws SQLException {
                Connection conn = DBHelper.getHiveConn();
                Statement stmt = conn.createStatement();
                ResultSet res = stmt.executeQuery(sql);
        }

        // Store the query results in MySQL
        public static void hiveToMySQL(ResultSet res) throws SQLException {
                Connection conn = DBHelper.getMySQLConn();
                // A parameterized insert keeps quotes inside log messages from breaking the SQL;
                // id is omitted so that auto_increment assigns it.
                String sql = "insert into hadooplog (rdate, time, type, relateclass, information) values (?, ?, ?, ?, ?)";
                java.sql.PreparedStatement stmt = conn.prepareStatement(sql);
                try {
                        while (res.next()) {
                                stmt.setString(1, res.getString(1)); // rdate
                                stmt.setString(2, res.getString(2)); // time
                                stmt.setString(3, res.getString(3)); // type
                                stmt.setString(4, res.getString(4)); // relateclass
                                // Rejoin the three space-split message columns, skipping missing (NULL) ones
                                StringBuilder information = new StringBuilder();
                                for (int col = 5; col <= 7; col++) {
                                        String part = res.getString(col);
                                        if (part != null) {
                                                if (information.length() > 0) {
                                                        information.append(' ');
                                                }
                                                information.append(part);
                                        }
                                }
                                stmt.setString(5, information.toString());
                                stmt.executeUpdate();
                        }
                } finally {
                        stmt.close();
                }
        }
}



package com.wzl.hive;

import java.sql.ResultSet;
import java.sql.SQLException;

public class AnalyszeHadoopLog {

        public static void main(String[] args) throws SQLException {
                StringBuffer sql = new StringBuffer();

                // Step 1: create the table in Hive
                sql.append("create table if not exists loginfo( ");
                sql.append("rdate string,  ");
                sql.append("time array<string>, ");
                sql.append("type string, ");
                sql.append("relateclass string, ");
                sql.append("information1 string, ");
                sql.append("information2 string, ");
                sql.append("information3 string)  ");
                sql.append("row format delimited fields terminated by ' '  ");
                sql.append("collection items terminated by ','   ");
                sql.append("map keys terminated by  ':'");

                System.out.println(sql);
                HiveUtil.createTable(sql.toString());

                // Step 2: load the Hadoop log file
                sql.delete(0, sql.length());
                sql.append("load data local inpath ");
                sql.append("'/usr/local/hadoop/logs/hadoop-root-jobtracker-hadoop1.log'");
                sql.append(" overwrite into table loginfo");
                System.out.println(sql);
                HiveUtil.loadData(sql.toString());

                // Step 3: query the useful information
                sql.delete(0, sql.length());
                sql.append("select rdate,time[0],type,relateclass,");
                sql.append("information1,information2,information3 ");
                sql.append("from loginfo where type='INFO'");
                System.out.println(sql);
                ResultSet res = HiveUtil.queryData(sql.toString());
                // Step 4: transform the query results and save them to MySQL
                HiveUtil.hiveToMySQL(res);
                // Step 5: close the Hive connection
                DBHelper.closeHiveConn();

                // Step 6: close the MySQL connection
                DBHelper.closeMySQLConn();
        }
}
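
To see how the loginfo table definition carves up a log line, the sketch below imitates splitting a line on single spaces into seven columns. It is only an approximation written for illustration: DelimitedRowSketch is a hypothetical name, and this is not Hive's own parsing code. It assumes that fields beyond the declared columns are dropped and that missing ones come back as NULL.

```java
import java.util.Arrays;

/** Hypothetical illustration; mimics, but is not, Hive's delimited-row parsing. */
final class DelimitedRowSketch {
    /**
     * Splits a line on single spaces into exactly `columns` slots.
     * Extra fields are dropped; missing fields are left as null.
     */
    static String[] split(String line, int columns) {
        String[] parts = line.split(" ", -1); // -1 keeps trailing empty fields
        return Arrays.copyOf(parts, columns); // truncates or pads with null
    }
}
```

For the SocketReader line in the sample log, the fourth slot holds the class name with its trailing colon and the seventh slot is null. For longer messages, everything after the third message word is lost, which is a limitation of this table layout.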



4). Running the program


Things to check before running:



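  1. The Hive remote service port must be open before the program runs. Start the service with: nohup hive --service hiveserver &  If this command has not been run, the following error is common: Could not establish connection to 192.168.91.101:10000/default: java.net.ConnectException: Connection refused: connect
  2. The hadooplog table has already been created in MySQL.
  3. MySQL must allow connections from the development machine. Run: grant all privileges on *.* to root@'%' identified by 'root'; This lets the root user connect to MySQL from any IP address. If this command has not been run, connecting from Java to MySQL on Linux commonly fails with: Access denied for user 'root'@'192.168.91.1' (using password: YES)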
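
The first and third checks both come down to whether a TCP port on the server accepts connections. As a rough pre-flight test, a sketch along the following lines could be run from the development machine before starting the program. PortCheck and isReachable are hypothetical names introduced here; the sketch only confirms that something is listening on the port, not that the login or the grant is correct.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical pre-flight helper: reports whether a TCP port accepts connections. */
final class PortCheck {
    static boolean isReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;  // connection accepted
        } catch (IOException e) {
            return false; // refused, timed out, or host unreachable
        }
    }
}
```

With the addresses used in this article, PortCheck.isReachable("192.168.91.101", 10000, 2000) would test the Hive service and port 3306 would test MySQL.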


After the program finishes, the stored rows can be inspected in MySQL:

mysql> use hive;
mysql> show tables;
mysql> select * from hadooplog;



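5). Lessons learned
This example works with both the Hive data warehouse and a MySQL database. Although both are accessed through the JDBC interface, they differ in some respects, so this hands-on example illustrates well the similarities and differences between Hive and a relational database.
Doing the job directly in MapReduce would likely run more efficiently than using Hive, because Hive itself runs on top of MapReduce. However, program complexity and the amount of code would increase substantially, which is a thorny problem especially for developers unfamiliar with MapReduce programming. Hive strikes a balance between efficiency and ease of implementation: processing is reasonably efficient and the code is comparatively simple to write, which makes life easier for developers coming from traditional relational databases. This is why many commercial organizations have adopted Hive.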