Preface


 


Solution overview:


Flume uses a spooling-directory (spoolDir) source to ship the files to HDFS.

One copy of each file must be kept as a backup and another copy must be parsed, so the agent uses two sinks, each fed by its own channel.

Flume's RegexExtractorExtInterceptor is rewritten from the interceptor source code. It takes the file name as a header, splits the header's value, and uses the parts to build the Hadoop directory path, so that collected files are fanned out to their designated directories.


PS: simply package RegexExtractorExtInterceptor as a jar and place it in Flume's lib folder.
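
The core of that interceptor is just a regular-expression split of the basename header. Below is a standalone sketch of the idea (BasenameSplitDemo is an illustrative name, not the actual Flume class; only the regex and the three header names come from the configuration in part two):

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of what the interceptor does with the "basename" header: split
// filename_date.suffix into three headers that the HDFS sink can then use
// in its path (%{filename}, %{date}, %{suffix}).
public class BasenameSplitDemo {
    private static final Pattern REGEX = Pattern.compile("(.*)_(.*)\\.(.*)");

    public static void main(String[] args) {
        Map<String, String> headers = new LinkedHashMap<>();
        Matcher m = REGEX.matcher("storelog_2015-03-16.log");
        if (m.matches()) {
            headers.put("filename", m.group(1)); // storelog
            headers.put("date", m.group(2));     // 2015-03-16
            headers.put("suffix", m.group(3));   // log
        }
        System.out.println(headers); // {filename=storelog, date=2015-03-16, suffix=log}
    }
}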



In addition, a program is needed to collect the log files from their locations on each server into the spoolDir folder on the machine running Flume.

Any approach works; currently a scheduled Java program fetches them, but a shell script pulling them over scp would do just as well. A minimal sketch follows.
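
A minimal sketch of such a collector in Java (the source directory /var/log/app, the class name SpoolDirCollector and the five-minute schedule are all assumptions; a real deployment would first pull the files from the remote servers with scp/rsync):

import java.io.IOException;
import java.nio.file.*;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Periodically moves rotated (closed) log files into Flume's spoolDir.
public class SpoolDirCollector {
    private static final Path SOURCE = Paths.get("/var/log/app");            // assumed log location
    private static final Path SPOOL  = Paths.get("/var/log/flume_spoolDir"); // spoolDir from the Flume config

    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(SpoolDirCollector::collect, 0, 5, TimeUnit.MINUTES);
    }

    static void collect() {
        // Only pick up files matching the filename_date.suffix convention.
        try (DirectoryStream<Path> logs = Files.newDirectoryStream(SOURCE, "*_*.log")) {
            for (Path log : logs) {
                // Stage as .tmp, then rename: the spooldir source must never
                // see a half-written file (see ignorePattern in the config).
                Path tmp = SPOOL.resolve(log.getFileName() + ".tmp");
                Files.copy(log, tmp, StandardCopyOption.REPLACE_EXISTING);
                Files.move(tmp, SPOOL.resolve(log.getFileName().toString()),
                        StandardCopyOption.ATOMIC_MOVE);
                Files.delete(log); // do not collect the same file twice
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}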






 

Part 1. Setting up a distributed Hadoop environment


 


1. JDK environment configuration (omitted)



2. SSH public/private key configuration


Reference:



3. Hosts configuration


vi /etc/hosts

192.168.183.130 hadoop130
192.168.183.131 hadoop131
192.168.183.132 hadoop132
192.168.183.133 hadoop133

4. Install Hadoop


tar -xvf hadoop-2.7.2.tar.gz
mv hadoop-2.7.2 hadoop

 


5. Set the Hadoop environment variables


vi /etc/profile

export HADOOP_HOME=/opt/hadoop/
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
 
6. Edit the Hadoop configuration files

cd /opt/hadoop/etc/hadoop
vi core-site.xml


<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/hadoop/tmp</value>
    <description>A base for other temporary directories. (Note: create the tmp folder under /usr/hadoop first.)</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://192.168.183.130:9000</value>
  </property>
</configuration>

Note: if hadoop.tmp.dir is not configured, the system falls back to the default temporary directory /tmp/hadoop-${user.name}. That directory is wiped on every reboot, so the NameNode would have to be re-formatted each time; otherwise errors occur.



vi hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/opt/hadoop/hdfs/name</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/opt/hadoop/hdfs/data</value>
  </property>
</configuration>


If you need YARN:

vi yarn-site.xml

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>192.168.183.130:9001</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>192.168.183.130</value> <!-- master IP -->
  </property>
</configuration>

Note: fill in the real IP above, not localhost, otherwise Eclipse will not be able to connect!

Set the master/slave relationship in $HADOOP_HOME/etc/hadoop/:

vi masters
192.168.183.130
(master only; the slave machines do not need this file)

vi slaves
192.168.183.131
192.168.183.132
192.168.183.133
 
 
Format the NameNode:

hadoop namenode -format
 
 
 
Start everything:

sbin/start-all.sh
 

    

 

Check the status on the master:

jps
2751 ResourceManager
2628 SecondaryNameNode
2469 NameNode
 
 

Check the status on a slave:

jps
1745 NodeManager
1658 DataNode
  
In total five Hadoop processes are running (three on the master, two on each slave).

Open this address to check the running status of HDFS:

http://192.168.183.130:50070/

 

Part 2. Flume installation and configuration


1. Install Flume

tar -xvf apache-flume-1.5.0-bin.tar.gz
mv apache-flume-1.5.0-bin flume



2. Configure the environment variables

vim /etc/profile

export FLUME_HOME=/opt/flume
export FLUME_CONF_DIR=$FLUME_HOME/conf
export PATH=$PATH:$FLUME_HOME/bin

vim /opt/flume/conf/flume-env.sh

Set the JDK path (JAVA_HOME) in this file.


3. Check that the installation succeeded

cd /opt/flume/bin/
./flume-ng version

If the version information is displayed, the installation succeeded.


4. Configure Flume for the scheme described in the preface

vim /opt/flume/conf/flume-conf.properties


 
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2

# Replicate every event to both channels
a1.sources.r1.selector.type = replicating
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /var/log/flume_spoolDir
a1.sources.r1.deletePolicy = immediate
a1.sources.r1.basenameHeader = true
# Ignore .tmp files while they are being copied in, to avoid reading a file that is still being written
#a1.sources.r1.ignorePattern = ^(.)*\\.tmp$

a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = org.apache.flume.interceptor.RegexExtractorExtInterceptor$Builder
a1.sources.r1.interceptors.i1.regex = (.*)_(.*)\\.(.*)
a1.sources.r1.interceptors.i1.extractorHeader = true
a1.sources.r1.interceptors.i1.extractorHeaderKey = basename
a1.sources.r1.interceptors.i1.serializers = s1 s2 s3
# basename's value must be filename_date.suffix, e.g. storelog_2015-03-16.log
a1.sources.r1.interceptors.i1.serializers.s1.name = filename
a1.sources.r1.interceptors.i1.serializers.s2.name = date
a1.sources.r1.interceptors.i1.serializers.s3.name = suffix

a1.sources.r1.channels = c1 c2

# Describe the backup sink (k1)
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://store.qbao.com:9000/storelog/bak/%{date}/%{filename}
a1.sinks.k1.hdfs.filePrefix = %{filename}_%{date}
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.rollInterval = 60
# File size to trigger a roll, in bytes (0: never roll based on file size)
a1.sinks.k1.hdfs.rollSize = 128000000
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.batchSize = 100
a1.sinks.k1.hdfs.idleTimeout = 60
a1.sinks.k1.hdfs.roundValue = 1
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.fileType = DataStream

# A memory channel would also work, at the cost of durability
#a1.channels.c1.type = memory
#a1.channels.c1.capacity = 1000
#a1.channels.c1.transactionCapacity = 200

a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /opt/flume/checkpoint_c1
a1.channels.c1.dataDirs = /opt/flume/dataDir_c1

# Bind the backup sink to its channel
a1.sinks.k1.channel = c1

# Describe the parsing (ETL) sink (k2)
a1.sinks.k2.type = hdfs
a1.sinks.k2.hdfs.path = hdfs://store.qbao.com:9000/storelog/etl/%{filename}
a1.sinks.k2.hdfs.filePrefix = %{filename}_%{date}
a1.sinks.k2.hdfs.round = true
a1.sinks.k2.hdfs.rollInterval = 60
# File size to trigger a roll, in bytes (0: never roll based on file size)
a1.sinks.k2.hdfs.rollSize = 128000000
a1.sinks.k2.hdfs.rollCount = 0
a1.sinks.k2.hdfs.batchSize = 100
a1.sinks.k2.hdfs.idleTimeout = 60
a1.sinks.k2.hdfs.roundValue = 1
a1.sinks.k2.hdfs.roundUnit = minute
a1.sinks.k2.hdfs.useLocalTimeStamp = true
a1.sinks.k2.hdfs.fileType = DataStream

# A memory channel would also work, at the cost of durability
#a1.channels.c2.type = memory
#a1.channels.c2.capacity = 1000
#a1.channels.c2.transactionCapacity = 200

a1.channels.c2.type = file
a1.channels.c2.checkpointDir = /opt/flume/checkpoint_c2
a1.channels.c2.dataDirs = /opt/flume/dataDir_c2

# Bind the ETL sink to its channel
a1.sinks.k2.channel = c2
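
To trace the configuration through one concrete case: when storelog_2015-03-16.log lands in the spool directory, the interceptor sets filename=storelog, date=2015-03-16 and suffix=log, so sink k1 writes the backup copy under /storelog/bak/2015-03-16/storelog/ and sink k2 writes the copy to be parsed under /storelog/etl/storelog/.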
   

  


5. Start Flume

cd /opt/flume
nohup bin/flume-ng agent -n a1 -c conf -f conf/flume-conf.properties &


   
  

PS: the required packages are here:

http://pan.baidu.com/s/1hshgS4G