Preface
Overview of the approach:
Flume uses a spooling-directory (spoolDir) source to ship files to HDFS.
Each file needs both a backup copy and a copy for parsing, so the agent uses two sinks fed by two channels.
The RegexExtractorExtInterceptor is a custom interceptor adapted from the Flume source code: it takes the file name from a header, splits that header value into parts, and uses the parts to build the Hadoop directory path, so collected files end up routed into the intended directories.
Note: package RegexExtractorExtInterceptor as a jar and drop it into Flume's lib folder.
You also need a program that collects the log files from each server into the spoolDir folder on the Flume host.
Any mechanism works; here a Java program fetches them on a schedule, but a shell script pulling them with scp would do as well.
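As a sketch of such a collector, a minimal scp-based shell version might look like this (the hostnames, remote log path, and the "storelog" naming scheme are all assumptions for illustration, not part of the original setup):

```shell
#!/usr/bin/env bash
# Hypothetical collector: pull yesterday's log from each app server into the
# Flume spoolDir. Hostnames and remote paths below are assumed examples.
SPOOL_DIR=${SPOOL_DIR:-/var/log/flume_spoolDir}

# Build the file name the interceptor expects: <name>_<yyyy-MM-dd>.<suffix>
spool_name() {
  printf '%s_%s.%s' "$1" "$2" "$3"
}

pull_logs() {
  local d
  d=$(date -d yesterday +%F)                  # e.g. 2015-03-16
  for host in hadoop131 hadoop132 hadoop133; do
    # Copy under a .tmp name first, then rename, so the spooling source
    # never picks up a half-written file.
    scp "root@${host}:/var/log/app/$(spool_name storelog "$d" log)" \
        "${SPOOL_DIR}/$(spool_name "storelog-${host}" "$d" log).tmp" &&
    mv "${SPOOL_DIR}/$(spool_name "storelog-${host}" "$d" log).tmp" \
       "${SPOOL_DIR}/$(spool_name "storelog-${host}" "$d" log)"
  done
}
```

The per-host name prefix avoids collisions when several servers ship a log for the same date.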
I. Setting up a distributed Hadoop environment
1. Configure the JDK (omitted)
2. Configure SSH public/private keys
Reference:
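A minimal sketch of the usual key-exchange steps, run on the master (assuming the root user and the hostnames configured in the next step):

```shell
# Generates a key pair if absent and pushes the public key to every slave,
# so start-all.sh can reach them without a password.
setup_passwordless_ssh() {
  [ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
  for host in hadoop131 hadoop132 hadoop133; do
    ssh-copy-id "root@${host}"
  done
}
# Verify afterwards: "ssh hadoop131 hostname" should print hadoop131
# without prompting for a password.
```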
3. Configure the hosts file
vi /etc/hosts
192.168.183.130 hadoop130
192.168.183.131 hadoop131
192.168.183.132 hadoop132
192.168.183.133 hadoop133
4. Install Hadoop
tar -xvf hadoop-2.7.2.tar.gz
mv hadoop-2.7.2 hadoop
5. Set the Hadoop environment variables
vi /etc/profile
export HADOOP_HOME=/opt/hadoop/
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
6. Edit the Hadoop configuration files (for Hadoop 2.x they live under etc/hadoop)
cd /opt/hadoop/etc/hadoop
vi core-site.xml
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/hadoop/tmp</value>
<description>A base for other temporary directories. (Note: create the tmp folder under /usr/hadoop first.)</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://192.168.183.130:9000</value>
</property>
</configuration>
Note: if hadoop.tmp.dir is not set, the system default temporary directory is /tmp/hadoop-${user.name}. That directory is wiped on every reboot, so you would have to re-run the format step each time or HDFS will fail.
vi hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/opt/hadoop/hdfs/name</value>
<final>true</final>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/opt/hadoop/hdfs/data</value>
</property>
</configuration>
If you need YARN:
vi yarn-site.xml
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>192.168.183.130:9001</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>192.168.183.130</value><!-- master host IP -->
</property>
</configuration>
Important: use the IP address above, not localhost, otherwise Eclipse will not be able to connect!
Define the master/slave relationship in the $HADOOP_HOME/etc/hadoop/ directory:
vi masters
192.168.183.130
(master only; the slaves do not need this file)
vi slaves
192.168.183.131
192.168.183.132
192.168.183.133
hadoop namenode -format
Start the cluster:
sbin/start-all.sh
Check status on the master:
jps
2751 ResourceManager
2628 SecondaryNameNode
2469 NameNode
Check status on a slave:
jps
1745 NodeManager
1658 DataNode
That makes five Hadoop daemons in total (three on the master plus two on each slave).
Check the running state of HDFS through the web UI:
http://192.168.183.130:50070/
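Before moving on to Flume, it is worth confirming that HDFS accepts writes. A small smoke test (the /storelog paths match the sink paths configured later; the _smoke file name is arbitrary):

```shell
# Requires the cluster started above; run on the master.
smoke_test_hdfs() {
  hdfs dfs -mkdir -p /storelog/bak /storelog/etl   # dirs the Flume sinks will write under
  echo "hello hdfs" | hdfs dfs -put - /storelog/_smoke
  hdfs dfs -cat /storelog/_smoke                   # should echo the line back
  hdfs dfs -rm /storelog/_smoke
}
```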
II. Installing and configuring Flume
1. Install Flume
tar -xvf apache-flume-1.5.0-bin.tar.gz
mv apache-flume-1.5.0-bin flume
2. Configure the environment variables
vim /etc/profile
export FLUME_HOME=/opt/flume
export FLUME_CONF_DIR=$FLUME_HOME/conf
export PATH=$PATH:$FLUME_HOME/bin
vim /opt/flume/conf/flume-env.sh
Point JAVA_HOME in this file at your JDK.
3. Verify the installation
cd /opt/flume/bin/
./flume-ng version
If the version information prints, the installation succeeded.
4. Configure Flume for the setup described in the preface
vim /opt/flume/conf/flume-conf.properties
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2
a1.sources.r1.selector.type = replicating
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /var/log/flume_spoolDir
a1.sources.r1.deletePolicy=immediate
a1.sources.r1.basenameHeader=true
# Ignore .tmp files still being copied in, to avoid reading a file that is being written
#a1.sources.r1.ignorePattern = ^(.)*\\.tmp$
a1.sources.r1.interceptors=i1
a1.sources.r1.interceptors.i1.type=org.apache.flume.interceptor.RegexExtractorExtInterceptor$Builder
a1.sources.r1.interceptors.i1.regex=(.*)_(.*)\\.(.*)
a1.sources.r1.interceptors.i1.extractorHeader=true
a1.sources.r1.interceptors.i1.extractorHeaderKey=basename
a1.sources.r1.interceptors.i1.serializers=s1 s2 s3
# basename's value must match filename_date.suffix, e.g. storelog_2015-03-16.log
a1.sources.r1.interceptors.i1.serializers.s1.name=filename
a1.sources.r1.interceptors.i1.serializers.s2.name=date
a1.sources.r1.interceptors.i1.serializers.s3.name=suffix
a1.sources.r1.channels = c1 c2
# Describe the sink
a1.sinks.k1.type =hdfs
a1.sinks.k1.hdfs.path=hdfs://store.qbao.com:9000/storelog/bak/%{date}/%{filename}
a1.sinks.k1.hdfs.filePrefix=%{filename}_%{date}
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.rollInterval = 60
# File size to trigger roll, in bytes (0: never roll based on file size)
a1.sinks.k1.hdfs.rollSize = 128000000
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.batchSize = 100
a1.sinks.k1.hdfs.idleTimeout=60
a1.sinks.k1.hdfs.roundValue = 1
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.fileType = DataStream
# Use a channel which buffers events in memory
#a1.channels.c1.type = memory
#a1.channels.c1.capacity = 1000
#a1.channels.c1.transactionCapacity = 200
a1.channels.c1.type = file
a1.channels.c1.checkpointDir=/opt/flume/checkpoint_c1
a1.channels.c1.dataDirs=/opt/flume/dataDir_c1
# Bind the source and sink to the channel
a1.sinks.k1.channel = c1
# Describe the sink
a1.sinks.k2.type =hdfs
a1.sinks.k2.hdfs.path=hdfs://store.qbao.com:9000/storelog/etl/%{filename}
a1.sinks.k2.hdfs.filePrefix=%{filename}_%{date}
a1.sinks.k2.hdfs.round = true
a1.sinks.k2.hdfs.rollInterval = 60
# File size to trigger roll, in bytes (0: never roll based on file size)
a1.sinks.k2.hdfs.rollSize = 128000000
a1.sinks.k2.hdfs.rollCount = 0
a1.sinks.k2.hdfs.batchSize = 100
a1.sinks.k2.hdfs.idleTimeout=60
a1.sinks.k2.hdfs.roundValue = 1
a1.sinks.k2.hdfs.roundUnit = minute
a1.sinks.k2.hdfs.useLocalTimeStamp = true
a1.sinks.k2.hdfs.fileType = DataStream
# Use a channel which buffers events in memory
#a1.channels.c2.type = memory
#a1.channels.c2.capacity = 1000
#a1.channels.c2.transactionCapacity = 200
a1.channels.c2.type = file
a1.channels.c2.checkpointDir=/opt/flume/checkpoint_c2
a1.channels.c2.dataDirs=/opt/flume/dataDir_c2
a1.sinks.k2.channel=c2
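To see what the interceptor's regex (.*)_(.*)\.(.*) does to a basename, the same greedy three-way split can be reproduced with sed (split_basename is just an illustrative helper name):

```shell
# Mimic RegexExtractorExtInterceptor: split a basename into the three
# serializer headers (filename, date, suffix) used by the sink paths above.
split_basename() {
  echo "$1" | sed -E 's/^(.*)_(.*)\.(.*)$/filename=\1 date=\2 suffix=\3/'
}
split_basename storelog_2015-03-16.log
# prints: filename=storelog date=2015-03-16 suffix=log
```

Note the greedy match: the first group absorbs everything up to the last underscore, so extra underscores end up in the filename part, not the date.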
5. Start Flume
cd /opt/flume
nohup bin/flume-ng agent -n a1 -c conf -f conf/flume-conf.properties &
Note: the packages used here can be downloaded from
http://pan.baidu.com/s/1hshgS4G
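Once both the agent and HDFS are up, the whole pipeline can be checked end to end by dropping a correctly named file into the spoolDir (file name and wait time chosen to match the config above):

```shell
# Writes a test file, moves it into the spoolDir (mv is atomic on the same
# filesystem), waits past rollInterval/idleTimeout (60 s each), then lists
# what the sinks wrote to HDFS.
test_pipeline() {
  local f="storelog_$(date +%F).log"
  echo "test line $(date)" > "/tmp/${f}"
  mv "/tmp/${f}" /var/log/flume_spoolDir/
  sleep 90
  hdfs dfs -ls -R /storelog/bak /storelog/etl
}
```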