An Overview of Big Data Log Transport and Collection Tools (Flume, ELK Stack)

Reposted

mob6454cc6c1f4a 2024-03-15 05:34:41


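I. Introduction to ELK Stack
ELK Stack is the combination of three open-source projects: Elasticsearch, Logstash, and Kibana. The three are usually deployed together for real-time data search and analysis, and all of them were successively brought under Elastic.co, hence the abbreviation.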

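Big data refers to data sets that cannot be captured, managed, and processed with conventional software tools within an acceptable time frame. It is a massive, fast-growing, and diverse information asset that requires new processing models in order to deliver stronger decision-making, insight discovery, and process optimization.
Put simply: data about how customers access a company's services (a broad category that includes visit counts, traffic peaks, and so on) is imported and preprocessed with these new processing models, then analyzed so that problems can be identified and resolved directly.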

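ELK Stack has risen quickly in recent years to become the first choice in the open-source world for machine data analysis, or in other words real-time log processing. Compared with traditional log processing approaches, ELK Stack has the following advantages:

• Flexible processing. Elasticsearch provides real-time full-text indexing and does not require programming in advance the way Storm does;
• Easy configuration. Elasticsearch exposes everything through a JSON interface and Logstash uses a Ruby DSL, both among the most widely used configuration syntaxes in the industry;
• Efficient retrieval. Although every query is computed in real time, the design and implementation are good enough that a query over a full day of data can generally be answered within seconds;
• Linear cluster scaling. Both Elasticsearch clusters and Logstash clusters scale linearly;
• Polished front end. In the Kibana UI, a few mouse clicks are enough to run searches and aggregations and to build attractive dashboards.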
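ELK components:
Elasticsearch is an open-source distributed search engine that provides three main functions: collecting, analyzing, and storing data.
Logstash is a tool for collecting, parsing, and filtering logs, and it supports a large number of input sources.
Kibana is also a free, open-source tool. It provides a friendly web interface for analyzing the logs handled by Logstash and Elasticsearch, and helps aggregate, analyze, and search important log data.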
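Installation:
elasticsearch:
First, download elasticsearch-analysis-ik-6.1.1.zip on Windows (the IK plugin version must match the Elasticsearch version exactly);
Next, upload it to the ES cluster and unzip it;
unzip …
Then move the extracted directory into the ES plugins directory;
Then copy it to the other nodes;
Finally, restart Elasticsearch on every node.
Elasticsearch refuses to run as root, so start it from the bin directory as a non-root user, after changing ownership of the root and zpark directories:
chown -R zpark:zpark /root
chown -R zpark:zpark /home/zpark
Start it with ./elasticsearch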
Extending the dictionary (requires the IK plugin installed above):
(1) View the existing dictionaries

ll /root/apps/elasticsearch-6.3.1/plugins/ik/config

(2) Create a custom dictionary

mkdir custom
vi custom/new_word.dic
cat custom/new_word.dic
老铁
王者荣耀
洪荒之力
共有产权房
一带一路
(3) Update the configuration

vi IKAnalyzer.cfg.xml
cat IKAnalyzer.cfg.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
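    <comment>IK Analyzer extension configuration</comment>
    <!-- Configure your own extension dictionary here -->
    <entry key="ext_dict">custom/new_word.dic</entry>
    <!-- Configure your own extension stopword dictionary here -->
    <entry key="ext_stopwords"></entry>
    <!-- Configure a remote extension dictionary here -->
    <!-- <entry key="remote_ext_dict">words_location</entry> -->
    <!-- Configure a remote extension stopword dictionary here -->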
    <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>
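
After Elasticsearch has been restarted with the new dictionary, one way to check that the custom words are recognized is to send a sample phrase to the `_analyze` API using one of the IK analyzers. The following is only a minimal sketch: the helper names are hypothetical, and the node address you pass in is an assumption about your own cluster.

```python
import json
import urllib.request


def build_analyze_request(text, analyzer="ik_max_word"):
    """Build the JSON body for Elasticsearch's _analyze API."""
    return {"analyzer": analyzer, "text": text}


def analyze(base_url, text, analyzer="ik_max_word"):
    """POST the text to <base_url>/_analyze and return the token strings."""
    body = json.dumps(build_analyze_request(text, analyzer)).encode("utf-8")
    req = urllib.request.Request(
        base_url.rstrip("/") + "/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        result = json.loads(resp.read().decode("utf-8"))
    # Each entry in "tokens" describes one term produced by the analyzer.
    return [tok["token"] for tok in result.get("tokens", [])]
```

Calling `analyze("http://hdp-1:9200", "王者荣耀")` against a running node returns the list of tokens. If the custom dictionary was loaded, the phrase would be expected to come back as a single token rather than being split into smaller pieces.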

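Setting up the elasticsearch cluster:
1. Distribute the installation to each machine with scp

2. Edit the configuration file on every machine
vi elasticsearch.yml:
a. Cluster name (identical on every node)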

cluster.name: my-es

b. Node name (unique on each node)

node.name: node-1

c. Path where ES stores its data

path.data: /home/zpark/esdata/data

d. Path where ES writes its logs

path.logs: /home/zpark/esdata/log

e. IP address of the current machine

network.host: 192.168.81.129

f. Hosts to contact for discovery (the master candidates in a larger cluster)

discovery.zen.ping.unicast.hosts: ["hdp-1", "hdp-2", "hdp-3"]
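
Of the six settings above, only node.name and network.host differ from machine to machine, so the per-node files can be generated rather than edited by hand. The sketch below is one possible way to do that; the function name and the `node-2` example are hypothetical, and the defaults simply repeat the values used in this post.

```python
def render_node_config(node_name, host_ip, seed_hosts,
                       cluster_name="my-es",
                       data_path="/home/zpark/esdata/data",
                       log_path="/home/zpark/esdata/log"):
    """Return the elasticsearch.yml text for one node (6.x zen discovery settings)."""
    # Quote each host so the list is rendered as a valid YAML flow sequence.
    hosts = ", ".join('"{}"'.format(h) for h in seed_hosts)
    lines = [
        "cluster.name: {}".format(cluster_name),
        "node.name: {}".format(node_name),
        "path.data: {}".format(data_path),
        "path.logs: {}".format(log_path),
        "network.host: {}".format(host_ip),
        "discovery.zen.ping.unicast.hosts: [{}]".format(hosts),
    ]
    return "\n".join(lines) + "\n"
```

For example, `render_node_config("node-2", "192.168.81.130", ["hdp-1", "hdp-2", "hdp-3"])` produces the text to write into elasticsearch.yml on the second machine.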

Check the cluster health:

http://hdp-1:9200/_cluster/health?pretty
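
The health endpoint returns a JSON document whose `status` field is green, yellow, or red. As a rough sketch of how that response could be checked from a script (the function names are hypothetical, and only the `cluster_name`, `status`, and `number_of_nodes` fields of the response are used):

```python
import json
import urllib.request


def summarize_health(health):
    """Turn a _cluster/health response (a dict) into a one-line summary."""
    status = health.get("status", "unknown")
    meaning = {
        "green": "all primary and replica shards are allocated",
        "yellow": "all primaries are allocated, some replicas are not",
        "red": "at least one primary shard is not allocated",
    }.get(status, "status not recognized")
    return "{} ({} nodes): {} - {}".format(
        health.get("cluster_name", "?"),
        health.get("number_of_nodes", "?"),
        status,
        meaning,
    )


def fetch_health(base_url):
    """GET <base_url>/_cluster/health and return the parsed JSON."""
    url = base_url.rstrip("/") + "/_cluster/health"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the cluster running, `summarize_health(fetch_health("http://hdp-1:9200"))` gives a quick confirmation that all three nodes have joined.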

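Note: after configuring the cluster and before starting it, clear the data and log paths set in steps c and d.
Notes:
Elasticsearch does not depend on ZooKeeper; it has its own built-in election mechanism, so no election setup needs to be specified by hand.
Elasticsearch stores data as JSON documents addressed by index (comparable to a database), type (comparable to a table), and id.
Nodes in an Elasticsearch cluster maintain heartbeat connections with one another at all times.
Strength of Elasticsearch: storing and retrieving data. It can hold more data while keeping queries fast.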
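
To make the index / type / id addressing concrete, here is a small sketch of how the URL and JSON body for a single document are formed in the 6.x series used in this post. The index name `blog` and type name `article` are made up for illustration, and mapping types were deprecated and later removed in newer Elasticsearch versions, so treat this as specific to this setup.

```python
import json
from urllib.parse import quote


def document_url(base_url, index, doc_type, doc_id):
    """Build the 6.x-style URL <base>/<index>/<type>/<id> for one document."""
    # Percent-encode each path segment so unusual ids cannot break the URL.
    parts = [quote(str(p), safe="") for p in (index, doc_type, doc_id)]
    return base_url.rstrip("/") + "/" + "/".join(parts)


def document_body(fields):
    """Serialize the document fields as the JSON body Elasticsearch stores."""
    return json.dumps(fields, ensure_ascii=False, sort_keys=True)
```

Sending `document_body(...)` with an HTTP PUT to `document_url(...)` stores the document, and a GET on the same URL retrieves it.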

kibana installation and configuration:
1. Unpack the archive
2. Configure
vi kibana.yml:

server.host: "192.168.81.130"
server.name: "hdp-2"
elasticsearch.url: "http://hdp-2:9200"
kibana.index: ".kibana"
3. Start it from the bin directory

./kibana

4. Access it from outside the host
http://192.168.81.130:5601

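logstash installation and configuration:
Installation: unpack the archive and configure it

Purpose: collecting log files (similar to Flume)

(1) Console input and console output: input { stdin { } } output { stdout {} }

Command:

bin/logstash -e 'input { stdin { } } output { stdout {} }'

Typing helloyou in the console causes Logstash to collect it and print it back to the console.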

(2) Put the configuration in a file

vi console.conf

input { stdin { } } output { stdout {} }

bin/logstash -f console.conf

(3) Collect data whenever a file changes

vi file.conf

input {
    file {
        path => "/root/apps/logstash-5.6.16/data.txt"
    }
}
output {
    stdout {}
}

Start:

bin/logstash -f file.conf
Logstash + ElasticSearch + Kibana: follow a log file, ship it into ES, and view it from Kibana:
vi filetoes.conf

input {
    file {
        path => "/root/apps/logstash-5.6.16/testFile.txt"
    }
}
output {
  elasticsearch { hosts => ["hdp-4:9200"] }
  stdout { codec => rubydebug }
}

bin/logstash -f filetoes.conf

At this point Logstash keeps following testFile.txt, and any newly appended lines are collected into ES.
The collected data can then be viewed in Kibana.
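
Besides Kibana, the collected events can also be queried directly over the Elasticsearch search API. The sketch below assumes the Logstash output is left at its default index naming (`logstash-` followed by the date), which is why it searches `logstash-*`; the function names are hypothetical. Each event collected by the file input carries the original line in its `message` field.

```python
import json
import urllib.request


def build_search_body(text=None, size=10):
    """Build a _search body: newest events first, optionally filtered by message text."""
    if text is None:
        query = {"match_all": {}}
    else:
        query = {"match": {"message": text}}
    return {
        "size": size,
        "sort": [{"@timestamp": {"order": "desc"}}],
        "query": query,
    }


def search_messages(base_url, text=None, size=10, index_pattern="logstash-*"):
    """Run the search and return the 'message' field of each hit."""
    url = "{}/{}/_search".format(base_url.rstrip("/"), index_pattern)
    body = json.dumps(build_search_body(text, size)).encode("utf-8")
    req = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        hits = json.loads(resp.read().decode("utf-8"))["hits"]["hits"]
    return [hit["_source"].get("message") for hit in hits]
```

For the configuration above, `search_messages("http://hdp-4:9200")` would list the most recently collected lines from testFile.txt.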

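Package your own project as a jar (lessons learned: the return value must not start with /, and a file included through a template must not start with / either), upload it to the Linux cluster, and run it with java -jar (note: when building the jar, replace localhost in the URL in the yml file with the IP address). Verify that it is running at hdp-1:8989.
Start nginx. The goal here is to generate logs; load balancing and reverse proxying will be covered in a later update. The key part is the configuration file.
What nginx does:
1. Request forwarding (the default port is 80; hdp-8:80 needs to be forwarded to index.html, the home page of the frame project)
2. Load balancing: nginx distributes user requests evenly across the different frame.jar instances in the cluster (this is set up in the nginx configuration file)
3. Log generation (recording user actions and access history for later analysis)
(Problem: what if the log file produced by nginx grows too large? Requirement: make nginx roll its logs over periodically.)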
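
One common way to meet the log-rolling requirement is to rename the current log file and then send the nginx master process the USR1 signal, which makes it reopen its log files. The sketch below illustrates that idea only; the function name is hypothetical, the paths you pass in depend on your installation, and in practice the standard logrotate tool or a cron job would typically drive this.

```python
import os
import signal
import time


def rotate_log(log_path, pid_path, now=None, send_signal=os.kill):
    """Rename log_path with a timestamp suffix, then ask nginx to reopen its logs.

    Returns the path of the rotated file, or None if there was nothing to rotate.
    """
    if not os.path.exists(log_path):
        return None
    stamp = time.strftime("%Y%m%d-%H%M%S", time.localtime(now))
    rotated = "{}.{}".format(log_path, stamp)
    os.rename(log_path, rotated)
    # nginx keeps writing to the renamed file until its master process
    # receives USR1; it then reopens the log files, recreating log_path.
    with open(pid_path) as f:
        pid = int(f.read().strip())
    send_signal(pid, signal.SIGUSR1)
    return rotated
```

With the layout used in this post, a call might look like `rotate_log("/usr/local/nginx/logs/log.BiSheThree.access.log", "/usr/local/nginx/logs/nginx.pid")`, run by a user that is allowed to signal the nginx master process. The pid file location is an assumption based on the default shown in the configuration file.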
Configuration file:

#user  nobody;
worker_processes  1;

#error_log  logs/error.log;
#error_log  logs/error.log  notice;
#error_log  logs/error.log  info;

#pid        logs/nginx.pid;


events {
    worker_connections  1024;
}


http {
    include       mime.types;
    default_type  application/octet-stream;

    log_format  main  '$remote_addr';

    #access_log  logs/access.log  main;

    sendfile        on;
    #tcp_nopush     on;

    #keepalive_timeout  0;
    keepalive_timeout  65;

    #gzip  on;
    upstream frame-tomcat {
          server hdp-1:8989 ; 
    }
    server {
        listen       80;
        server_name  hdp-0;

        #charset koi8-r;
         access_log  logs/log.BiSheThree.access.log  main;
        #access_log  logs/log.frame.access.log  main;

        location / {
            # root   html;
            # index  index.html index.htm;
            proxy_pass http://frame-tomcat;
        }

        error_page   500 502 503 504  /50x.html;
        location = /50x.html {
            root   html;
        }
    }
    server {
        listen       80;
        server_name  localhost;

        #charset koi8-r;

        #access_log  logs/host.access.log  main;

        location / {
            root   html;
            index  index.html index.htm;
        }

        #error_page  404              /404.html;

        # redirect server error pages to the static page /50x.html
        #
        error_page   500 502 503 504  /50x.html;
        location = /50x.html {
            root   html;
        }

        # proxy the PHP scripts to Apache listening on 127.0.0.1:80
        #
        #location ~ \.php$ {
        #    proxy_pass   http://127.0.0.1;
        #}

        # pass the PHP scripts to FastCGI server listening on 127.0.0.1:9000
        #
        #location ~ \.php$ {
        #    root           html;
        #    fastcgi_pass   127.0.0.1:9000;
        #    fastcgi_index  index.php;
        #    fastcgi_param  SCRIPT_FILENAME  /scripts$fastcgi_script_name;
        #    include        fastcgi_params;
        #}

        # deny access to .htaccess files, if Apache's document root
        # concurs with nginx's one
        #
        #location ~ /\.ht {
        #    deny  all;
        #}
    }


    # another virtual host using mix of IP-, name-, and port-based configuration
    #
    #server {
    #    listen       8000;
    #    listen       somename:8080;
    #    server_name  somename  alias  another.alias;

    #    location / {
    #        root   html;
    #        index  index.html index.htm;
    #    }
    #}


    # HTTPS server
    #
    #server {
    #    listen       443;
    #    server_name  localhost;

    #    ssl                  on;
    #    ssl_certificate      cert.pem;
    #    ssl_certificate_key  cert.key;

    #    ssl_session_timeout  5m;

    #    ssl_protocols  SSLv2 SSLv3 TLSv1;
    #    ssl_ciphers  HIGH:!aNULL:!MD5;
    #    ssl_prefer_server_ciphers   on;

    #    location / {
    #        root   html;
    #        index  index.html index.htm;
    #    }
    #}

}

In the bin directory, reload nginx: ./nginx -s reload
logstash:
vi bishe.conf

input {
    file {
        path => "/usr/local/nginx/logs/log.BiSheThree.access.log"
    }
}
output {
  elasticsearch { hosts => ["hdp-1:9200"] }
  stdout { codec => rubydebug }
}

bin/logstash -f bishe.conf
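
Because the log_format in the nginx configuration above writes only $remote_addr, each line of the access log (and therefore each event's message) is a single client address. As a quick sanity check before building anything in Kibana, the requests per client can be counted straight from the file. This is a minimal sketch with hypothetical function names:

```python
from collections import Counter


def count_clients(lines):
    """Count requests per client address, assuming one address per log line."""
    counts = Counter()
    for line in lines:
        addr = line.strip()
        if addr:  # skip blank lines
            counts[addr] += 1
    return counts


def top_clients(lines, n=3):
    """Return the n most frequent client addresses with their request counts."""
    return count_clients(lines).most_common(n)
```

Opening the access log and passing the file object to `top_clients` gives the busiest clients, which can be compared against what Kibana later shows for the same period.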
elasticsearch:
chown -R zpark:zpark /root
chown -R zpark:zpark /home/zpark
Start it with ./elasticsearch
kibana:
./kibana (start it on the machine where elasticsearch is running)
Access it from outside the host:
http://192.168.81.130:5601