Logstash error symptoms

Trouble parsing json {:source=>"message", :raw=>"{\"@timestamp\":\"2016-05-30T14:51:27+08:00\",\"host\":\"10.139.48.166\",\"clientip\":\"180.109.110.203\",\"request_method\":\"GET\",\"size\":4286,\"responsetime\":0.000,\"upstreamtime\":\"-\",\"upstreamhost\":\"-\",\"http_host\":\"www.xxxx.com\",\"url\":\"/favicon.ico\",\"complete_url\":\"http://www.xxxx.com/favicon.ico\",\"referer\":\"-\",\"agent\":\"\\xE7\\x99\\xBE\\xE5\\xBA\\xA6HD 4.4.1 rv:4.4.1.2 (iPad; iPhone OS 8.3; zh_CN)\",\"status\":\"200\"}", :exception=>#<LogStash::Json::ParserError: Unrecognized character escape 'x' (code 120)

The key part of the error is: Unrecognized character escape 'x'

In other words, the JSON parser found a backslash followed by x, which is not an escape sequence that JSON defines.

Searching the log entry for the escape turns up this fragment: \"agent\":\"\\xE7\\x99\\xBE\\xE5\\xBA\\x

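It turns out that when a logged field such as the URL or the user agent contains Chinese characters, the JSON parser sees a string like \\xE7 and treats the x as a character being escaped. At first this looks odd: isn't \\ already a doubled (escaped) backslash? The doubling comes from the way Logstash prints the raw string in its error output, where every backslash and quote is itself escaped. The message nginx actually wrote contains a single backslash, i.e. \xE7, and \x is not a valid JSON escape.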
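The failure can be reproduced outside Logstash. The following is a minimal sketch in Python, used here only for illustration (Logstash itself uses a different JSON parser); the shortened `raw_message` and the helper `try_parse` are made up for this example.

```python
import json

# What nginx actually wrote: a single backslash followed by xE7 etc.
# (the doubled backslashes in the Logstash error are only display escaping).
raw_message = r'{"agent":"\xE7\x99\xBE\xE5\xBA\xA6HD 4.4.1","status":"200"}'


def try_parse(text):
    """Return (parsed, None) on success or (None, error text) on failure."""
    try:
        return json.loads(text), None
    except json.JSONDecodeError as exc:
        return None, str(exc)


parsed, error = try_parse(raw_message)
print(parsed)  # nothing was parsed
print(error)   # the parser complains about an invalid backslash escape
```

Any strict JSON parser should reject this input in a similar way, since JSON only defines the escapes \" \\ \/ \b \f \n \r \t and \uXXXX.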

Troubleshooting process

Environment

CentOS 6.7

Logstash 1.5

nginx log format definition

log_format json '{"@timestamp":"$time_iso8601",'
                '"host":"$server_addr",'
                '"clientip":"$remote_addr",'
                '"request_method":"$request_method",'
                '"size":$body_bytes_sent,'
                '"responsetime":$request_time,'
                '"upstreamtime":"$upstream_response_time",'
                '"upstreamhost":"$upstream_addr",'
                '"http_host":"$host",'
                '"url":"$uri",'
                '"complete_url":"$scheme://$host$request_uri",'
                '"referer":"$http_referer",'
                '"agent":"$http_user_agent",'
                '"status":"$status"}';

Logstash configuration

input {
  syslog {
    port => "12210"
  }
}

filter {
  json {
    source => "message"
  }
  geoip {
    source => "clientip"
  }
}

output {
  elasticsearch {
    host => "127.0.0.1"
    index => "nginx-logs-%{+YYYY.MM.dd}"
    index_type => "logs"
  }
}

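Fixing the unrecognized character escape 'x'

Use the mutate filter to rewrite the \\x sequences before the message is parsed as JSON.

The filter section extracted from the Logstash configuration file: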

filter {
  mutate {
    gsub => ["message", "\\x", "\\\x"]
  }
  json {
    source => "message"
  }
  geoip {
    source => "clientip"
  }
}

Explanation

gsub => ["message", "\\x", "\\\x"]

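This replaces every "\\x" in the message field with "\\\x". The second argument of gsub is a regular expression, so "\\x" matches a literal backslash followed by x. After the replacement, the backslash is itself escaped, so the json filter reads it as an ordinary backslash character instead of the start of an unknown escape.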
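The effect of the substitution can be imitated with a small Python sketch. It is not the mutate filter itself: it assumes the replacement ends up as two backslashes followed by x (which the described result implies), and it uses a plain string replace instead of a regular expression. The shortened `raw_message` is invented for the example.

```python
import json

# A shortened message as nginx wrote it: single backslashes before each x.
raw_message = r'{"url":"/aaa/\xE6\x88\x91","status":"404"}'

# Imitate the gsub: backslash-x becomes backslash-backslash-x, which JSON
# reads as an escaped literal backslash followed by the letter x.
fixed_message = raw_message.replace("\\x", "\\\\x")

event = json.loads(fixed_message)  # parses without error now
print(event["url"])                # still the literal text \xE6\x88\x91
```

Note that the value parses cleanly but is not decoded: the field now holds the literal characters backslash, x, E, 6 and so on, not the Chinese character they stand for.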

Result


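Logstash no longer prints the error. The Chinese characters in the link shown in complete_url display correctly, but the url field is still not decoded.

For comparison, here is the log entry for the same URL from before the mutate filter was added: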
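If the escaped text ever needs to be turned back into readable characters, the \xHH sequences can be decoded as UTF-8 bytes. The sketch below is a standalone Python illustration, not a Logstash filter; the function name `decode_hex_escapes` is hypothetical and the sample value is the url field from the log entry in this post.

```python
import re


def decode_hex_escapes(value):
    r"""Turn literal \xHH sequences into the UTF-8 text they encode."""
    raw = re.sub(
        rb"\\x([0-9A-Fa-f]{2})",                 # literal backslash, x, two hex digits
        lambda m: bytes([int(m.group(1), 16)]),  # replace with the byte itself
        value.encode("utf-8"),
    )
    # Invalid byte sequences are replaced rather than raising an error.
    return raw.decode("utf-8", errors="replace")


url_field = r"/aaa/\xE6\x88\x91\xE6\x98\xAF\xE4\xB8\x80\xE4\xB8\xAA\xE4\xBA\xBA"
print(decode_hex_escapes(url_field))  # /aaa/我是一个人
```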

Trouble parsing json {:source=>"message", :raw=>"{\"@timestamp\":\"2016-05-30T18:21:35+08:00\",\"host\":\"10.139.48.166\",\"clientip\":\"58.250.164.208\",\"request_method\":\"GET\",\"size\":1338,\"responsetime\":0.008,\"upstreamtime\":\"0.008\",\"upstreamhost\":\"10.139.39.45:8801\",\"http_host\":\"www.qhfax.com\",\"url\":\"/aaa/\\xE6\\x88\\x91\\xE6\\x98\\xAF\\xE4\\xB8\\x80\\xE4\\xB8\\xAA\\xE4\\xBA\\xBA\",\"complete_url\":\"https://www.qhfax.com/aaa/%E6%88%91%E6%98%AF%E4%B8%80%E4%B8%AA%E4%BA%BA\",\"referer\":\"-\",\"agent\":\"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36\",\"status\":\"404\"}", :exception=>#<LogStash::Json::ParserError: Unrecognized character escape 'x' (code 120)

Findings:

\"complete_url\":\"https://www.qhfax.com/aaa/%E6%88%91%E6%98%AF%E4%B8%80%E4%B8%AA%E4%BA%BA\"

\"url\":\"/aaa/\\xE6\\x88\\x91\\xE6\\x98\\xAF\\xE4\\xB8\\x80\\xE4\\xB8\\xAA\\xE4\\xBA\\xBA\"

Surprisingly, the two fields hold different representations of the same URL.

Analyzing the nginx configuration fragment

'"url":"$uri",'

'"complete_url":"$scheme://$host$request_uri",'

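Explanation:

$uri: the current URI of the request, without arguments (those are in $args). Unlike $request_uri, which is what the browser sent, it can be changed by internal redirects or by the index directive. It does not include the scheme or host name, for example /foo/bar.html

$request_uri: the original request URI as sent by the client, including any arguments. It cannot be modified; to change or rewrite the URI, see $uri.

In other words, $request_uri is the original request URL, while $uri is the URL after nginx has processed the request and stripped the arguments. That is why the Chinese characters in $uri are no longer percent-encoded and end up in the log as \x escapes.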
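The relationship between the two fields can be sketched in Python. This is only an approximation: `nginx_style_escape` is a hypothetical helper that imitates nginx's default log escaping (quote, backslash, and bytes below 32 or above 126 written as \xXX), and `unquote` stands in for the decoding nginx applies to obtain $uri.

```python
from urllib.parse import unquote


def nginx_style_escape(text):
    """Rough imitation of nginx's default access-log escaping (simplified)."""
    out = []
    for byte in text.encode("utf-8"):
        # Escape control bytes, non-ASCII bytes, double quote and backslash.
        if byte < 0x20 or byte > 0x7E or byte in (0x22, 0x5C):
            out.append("\\x%02X" % byte)
        else:
            out.append(chr(byte))
    return "".join(out)


# $request_uri: exactly what the client sent, still percent-encoded.
request_uri = "/aaa/%E6%88%91%E6%98%AF%E4%B8%80%E4%B8%AA%E4%BA%BA"

# $uri: decoded by nginx, so it contains the raw UTF-8 characters.
decoded_uri = unquote(request_uri)

# What the decoded value looks like once written to the access log.
logged_uri = nginx_style_escape(decoded_uri)
print(decoded_uri)
print(logged_uri)
```

Percent-encoded text contains only plain ASCII, so it passes through the log and the JSON parser untouched, whereas the decoded form picks up the \x escapes that JSON rejects.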

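Pitfall:

$uri can be changed or rewritten inside nginx, but for log output $request_uri can be used in its place. Unless there is a specific business requirement for $uri, it can be swapped out entirely.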