Tomcat log directory: parsing logs with scripts and regular expressions

 

1. Create the Hive table: apachelog

The statement is as follows:

CREATE TABLE apachelog (

 host STRING,

 identity STRING,

 t_user STRING,

 time STRING,

 type STRING,

 http STRING,

 http_type STRING,

 status STRING,

 agent STRING

 )

 ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'

 WITH SERDEPROPERTIES (

 "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) \\[(.*?) .*?\\] \"([^ ]*) ([^ ]*) ([^ ]*)\" ([^ ]*) ([^ ]*)"

  )

  STORED AS TEXTFILE;

 

Finally, load the log file:

LOAD DATA LOCAL INPATH 'absolute path of the log file' INTO TABLE apachelog;

 

2. You can add a cron job that runs the log collection every hour:

 

crontab -e

0 * * * * /bin/sh <shell script>
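
The collection script itself is not shown here. As a rough sketch of what such a job might do, the Python below builds the LOAD DATA statement and passes it to the Hive CLI with `hive -e`. The function names (`build_load_statement`, `build_hive_command`, `main`) are hypothetical, and it assumes the `hive` command is on the PATH of the cron user.

```python
import subprocess


def build_load_statement(log_path, table="apachelog"):
    """Return the HiveQL statement that loads one local file into the table."""
    # Refuse paths containing a single quote instead of trying to escape them.
    if "'" in log_path:
        raise ValueError("log path must not contain a single quote")
    return "LOAD DATA LOCAL INPATH '{}' INTO TABLE {}".format(log_path, table)


def build_hive_command(log_path):
    """Return the argument list for running the statement through the Hive CLI."""
    # `hive -e` executes the given query string and then exits.
    return ["hive", "-e", build_load_statement(log_path)]


def main(argv):
    """Load the file named in argv[1]; return the exit code of the Hive CLI."""
    if len(argv) != 2:
        print("usage: load_log.py LOG_PATH")
        return 2
    return subprocess.call(build_hive_command(argv[1]))
```

The shell script run by cron could then call this with the absolute path of the log file. Note that LOAD DATA without OVERWRITE adds to the table, so loading the same file every hour would duplicate rows; the job should pick up only new or rotated files.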

 

The log format can look like this:

127.0.0.1 - - [24/Apr/2016:09:55:45 +0800] "GET / HTTP/1.1" 200 11418
127.0.0.1 - - [24/Apr/2016:09:55:47 +0800] "GET / HTTP/1.1" 200 11418
127.0.0.1 - - [24/Apr/2016:09:57:52 +0800] "GET / HTTP/1.1" 200 11418
0:0:0:0:0:0:0:1 - - [24/Apr/2016:09:57:56 +0800] "GET / HTTP/1.1" 200 11418
0:0:0:0:0:0:0:1 - - [24/Apr/2016:09:57:56 +0800] "GET /tomcat.css HTTP/1.1" 200 5926
0:0:0:0:0:0:0:1 - - [24/Apr/2016:09:57:56 +0800] "GET /tomcat.png HTTP/1.1" 200 5103
0:0:0:0:0:0:0:1 - - [24/Apr/2016:09:57:56 +0800] "GET /bg-nav.png HTTP/1.1" 200 1401
0:0:0:0:0:0:0:1 - - [24/Apr/2016:09:57:56 +0800] "GET /asf-logo.png HTTP/1.1" 200 17811
0:0:0:0:0:0:0:1 - - [24/Apr/2016:09:57:56 +0800] "GET /bg-middle.png HTTP/1.1" 200 1918
0:0:0:0:0:0:0:1 - - [24/Apr/2016:09:57:56 +0800] "GET /bg-button.png HTTP/1.1" 200 713
0:0:0:0:0:0:0:1 - - [24/Apr/2016:09:57:56 +0800] "GET /bg-upper.png HTTP/1.1" 200 3103
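
Before loading, it can help to check the regular expression against a sample line outside Hive. The sketch below uses Python's `re` module, which behaves the same as Java regex for this pattern, with a nine-group form of the pattern: one capture group per column of apachelog, with the request split into method, path and protocol. `COLUMNS`, `PATTERN` and `parse_line` are names made up for this illustration.

```python
import re

# Column names in the order declared by CREATE TABLE apachelog.
COLUMNS = ["host", "identity", "t_user", "time", "type",
           "http", "http_type", "status", "agent"]

# Nine-group form of the pattern (assumption: one group per column).
# The Hive string escapes (\\[ and \") are not needed in a Python raw string.
PATTERN = re.compile(
    r'([^ ]*) ([^ ]*) ([^ ]*) \[(.*?) .*?\] '
    r'"([^ ]*) ([^ ]*) ([^ ]*)" ([^ ]*) ([^ ]*)'
)


def parse_line(line):
    """Return a dict of column name to value, or None if the line does not match."""
    # RegexSerDe is understood to match the whole line, so use fullmatch here.
    m = PATTERN.fullmatch(line)
    if m is None:
        return None
    return dict(zip(COLUMNS, m.groups()))


sample = ('0:0:0:0:0:0:0:1 - - [24/Apr/2016:09:57:56 +0800] '
          '"GET /tomcat.css HTTP/1.1" 200 5926')
row = parse_line(sample)
print(row)
```

With lines in this format, the last column (agent) receives the response size, because the sample log has no user-agent field. A quick sanity check is that `PATTERN.groups` equals `len(COLUMNS)`.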