I. Introduction

   Hadoop is a distributed computing framework developed by the Apache Software Foundation. It lets users write distributed programs without knowing the low-level details of distribution, and harness the power of a cluster for high-speed computation and storage. Hadoop implements a distributed file system, the Hadoop Distributed File System (HDFS). HDFS is highly fault tolerant, is designed to run on low-cost hardware, and provides high-throughput access to application data, which makes it a good fit for applications with very large data sets. HDFS relaxes some POSIX requirements to allow streaming access to file system data.

II. Environment

   The environment below consists of 2 NameNodes and 5 DataNodes.

1) Host-to-IP mapping:

mt-hadoop-name-vip    172.26.10.140
mt-hadoop-name1       172.26.10.130
mt-hadoop-name2       172.26.10.131
mt-hadoop-data1       172.26.10.132
mt-hadoop-data2       172.26.10.133
mt-hadoop-data3       172.26.10.134
mt-hadoop-data4       172.26.10.135
mt-hadoop-data5       172.26.10.137
mt-hadoop-hive        172.26.10.138

2) OS and core software versions:

OS        Ubuntu 12.04.3 LTS
Java      jdk-7u40-linux-x64
Hadoop    hadoop-1.2.1

3) NameNode backup strategy:

   NFS backup path: mt-hadoop-hive:/srv/hadoop/name-remote

   Checkpoint: /srv/hadoop/namesecondary on mt-hadoop-name2

III. Hadoop Installation and Configuration

Perform the following as the root user:

1) Hadoop installation

   The following steps must be performed on every node (including the hive host).

a) Set the hostname, using mt-hadoop-name1 as the example

  sudo vi /etc/hostname

mt-hadoop-name1

b) Edit the hosts file

       sudo vi /etc/hosts

127.0.0.1       localhost
172.26.10.140   mt-hadoop-name-vip
172.26.10.130   mt-hadoop-name1
172.26.10.131   mt-hadoop-name2
172.26.10.132   mt-hadoop-data1
172.26.10.133   mt-hadoop-data2
172.26.10.134   mt-hadoop-data3
172.26.10.135   mt-hadoop-data4
172.26.10.137   mt-hadoop-data5
172.26.10.138   mt-hadoop-hive

c) Create the hadoop user

   sudo addgroup hadoop

   sudo adduser -ingroup hadoop hadoop

   sudo passwd hadoop

   sudo sed -i '/^root/a\hadoop  ALL=(ALL:ALL) ALL' /etc/sudoers

d) Configure shared SSH keys for the hadoop user

Perform the following as the hadoop user:

   su hadoop

   ssh-keygen -t rsa -P ""

   umask 0177

   cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

   for h in $(cat /etc/hosts|grep -v localhost|grep -v vip|grep -vw `hostname`|awk '{print $2}');do scp -r ~/.ssh hadoop@$h:;done

   umask 0022

   Verify that SSH to every node works without a password prompt.
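
   A quick loop (a sketch, reusing the same host filtering as the copy loop above) to confirm that SSH to every other node no longer prompts:

   for h in $(cat /etc/hosts|grep -v localhost|grep -v vip|grep -vw `hostname`|awk '{print $2}');do
       ssh -o BatchMode=yes hadoop@$h hostname || echo "ssh to $h still requires a password"
   done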

e) Configure the HDFS storage disk

   Perform the following as the root user:

   sudo -s

   sed -i '$a\/dev/xvdb1 /srv/hadoop ext4 noatime,nodiratime,errors=remount-ro 0 1' /etc/fstab

   fdisk /dev/xvdb to partition the disk (creating /dev/xvdb1)

   mkfs.ext4 /dev/xvdb1

   mkdir -p /srv/hadoop

   mount /srv/hadoop

   chown -R hadoop:hadoop /srv/hadoop

   chmod -R 755 /srv/hadoop
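
   To confirm the disk is mounted and writable by the hadoop user, a few checks along these lines can be run:

   df -h /srv/hadoop                  # should show /dev/xvdb1 mounted on /srv/hadoop
   mount | grep /srv/hadoop           # should show the noatime,nodiratime options
   sudo -u hadoop touch /srv/hadoop/.write-test && sudo -u hadoop rm /srv/hadoop/.write-test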

f) Install Java

Perform the following as the root user:

   Java download: http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html

   tar -xzf jdk-7u25-linux-x64.gz

   cp -r /home/hadoop/software/jdk1.7.0_25 /usr/java

   chown -R root.root /usr/java/

   chmod -R 755 /usr/java/

   Set the environment variables

   Append the following lines to the end of /etc/profile (vi /etc/profile):

#set java environment
export JAVA_HOME=/usr/java
export CLASSPATH=.:$CLASSPATH:$JAVA_HOME/lib:$JAVA_HOME/jre/lib
export PATH=$PATH:$JAVA_HOME/bin:$JAVA_HOME/jre/bin

   Apply the environment variables

   source /etc/profile

   Verify Java

   java -version

g) Install Hadoop

Perform the following as the hadoop user:

   Hadoop download: http://www.apache.org/dist/hadoop/core/hadoop-1.2.1/hadoop-1.2.1.tar.gz

   tar -xzf hadoop-1.2.1.tar.gz

   sudo cp -r /home/hadoop/software/hadoop-1.2.1 /usr/hadoop

   sudo chown -R hadoop:hadoop /usr/hadoop

   sudo chmod -R 755 /usr/hadoop

   Set the environment variables

   Append the following lines to the end of /etc/profile (sudo vi /etc/profile):

#set hadoop path
export HADOOP_HOME=/usr/hadoop
export HADOOP_HOME_WARN_SUPPRESS=1
export PATH=$PATH:$HADOOP_HOME/bin

  Apply the environment variables

source /etc/profile
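
   To confirm the new PATH entry works, the Hadoop binary and version can be checked (hadoop-1.2.1 is expected per the environment table):

   which hadoop      # should resolve to /usr/hadoop/bin/hadoop
   hadoop version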

2) Install and configure the NFS server on mt-hadoop-hive

Perform the following as the hadoop user:

   Install the NFS server

   sudo apt-get install nfs-server

   Configure NFS

   Append the following line to the end of /etc/exports (sudo vi /etc/exports):

/srv/hadoop/name-remote mt-hadoop-name1(rw,sync,no_subtree_check) mt-hadoop-name2(rw,sync,no_subtree_check)

   Create the NFS export directory

   mkdir -p /srv/hadoop/name-remote
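
   After editing /etc/exports, the export table has to be reloaded before clients can mount the share; a minimal check, assuming the standard NFS server tools are installed:

   sudo exportfs -ra            # re-read /etc/exports and re-export everything
   sudo exportfs -v             # /srv/hadoop/name-remote should be listed for both namenodes
   showmount -e localhost       # the share should be visible to clients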

3) Install the NFS client on mt-hadoop-name1 and mt-hadoop-name2, and mount the NFS share

Perform the following as the hadoop user:

   Install the NFS client

   sudo apt-get install nfs-common

   Create the mount point

   mkdir -p /srv/hadoop/remote

   Append the following line to the end of /etc/fstab (sudo vi /etc/fstab):

mt-hadoop-hive:/srv/hadoop/name-remote /srv/hadoop/remote nfs   rw,tcp,intr

   sudo mount /srv/hadoop/remote
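
   A quick sanity check (the test file name is arbitrary) that the share is mounted and writable from the NameNodes:

   df -h /srv/hadoop/remote     # should show mt-hadoop-hive:/srv/hadoop/name-remote
   touch /srv/hadoop/remote/.nfs-write-test && rm /srv/hadoop/remote/.nfs-write-test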

4) Hadoop cluster configuration

The following steps are performed on mt-hadoop-name1.

Perform the following as the hadoop user:

   Add the Java environment variable to /usr/hadoop/conf/hadoop-env.sh

# set java environment
export JAVA_HOME=/usr/java

   Configure /usr/hadoop/conf/core-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/srv/hadoop/tmp</value>
        <description>A base for other temporary directories.</description>
    </property>
<!-- file system properties -->
    <property>
        <name>fs.default.name</name>
        <value>hdfs://mt-hadoop-name-vip:9000</value>
    </property>
    <property>
        <name>fs.trash.interval</name>
        <value>5</value>
    </property>
</configuration>

   Configure /usr/hadoop/conf/hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
    <property>
        <name>dfs.hosts</name>
        <value>/usr/hadoop/conf/datanode-allow-list</value>
    </property>
    <property>
        <name>dfs.hosts.exclude</name>
        <value>/usr/hadoop/conf/datanode-deny-list</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
    <property>
        <name>dfs.datanode.max.xcievers</name>
        <value>4096</value>
    </property>
    <property>
        <name>dfs.name.dir</name>
        <value>/srv/hadoop/name,/srv/hadoop/remote/name</value>
        <final>true</final>
    </property>
    <property>
        <name>dfs.data.dir</name>
        <value>/srv/hadoop/data</value>
        <final>true</final>
    </property>
    <property>
        <name>fs.checkpoint.dir</name>
        <value>/srv/hadoop/namesecondary</value>
        <final>true</final>
    </property>
    <property>
        <name>dfs.balance.bandwidthPerSec</name>
         <value>10485760</value>
         <description>
            Specifies the maximum bandwidth that each datanode can utilize for the balancing purpose in term of the number of bytes per second.
        </description>
    </property>
</configuration>

   Configure /usr/hadoop/conf/mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
    <property>
        <name>mapred.hosts</name>
        <value>/usr/hadoop/conf/datanode-allow-list</value>
    </property>
    <property>
        <name>mapred.hosts.exclude</name>
        <value>/usr/hadoop/conf/datanode-deny-list</value>
    </property>
    <property>
        <name>mapred.job.tracker</name>
        <value>http://mt-hadoop-name1:9001</value>
    </property>
    <property>
        <name>mapred.child.java.opts</name>
        <value>-Xmx200m</value>
    </property>
    <property>
        <name>mapred.job.shuffle.input.buffer.percent</name>
        <value>0.3</value>
    </property>
</configuration>

   Configure /usr/hadoop/conf/masters, which holds the hostname or IP of the host running the SecondaryNameNode (the checkpoint node)

mt-hadoop-name2

   Configure /usr/hadoop/conf/slaves, which holds the hostnames or IPs of the DataNodes

mt-hadoop-data1
mt-hadoop-data2
mt-hadoop-data3
mt-hadoop-data4
mt-hadoop-data5

   Configure /usr/hadoop/conf/datanode-allow-list, which holds the hostnames or IPs of the DataNodes allowed to connect (IPs can be unreliable here; hostnames are recommended)

mt-hadoop-data1
mt-hadoop-data2
mt-hadoop-data3
mt-hadoop-data4
mt-hadoop-data5

   Configure /usr/hadoop/conf/datanode-deny-list, which holds the hostnames or IPs of DataNodes that are not allowed to connect (IPs can be unreliable here; hostnames are recommended)

   Leave the file empty: an empty deny list means no DataNode is excluded.
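
   For later reference, a DataNode can be decommissioned by adding its hostname to the deny list and having the masters re-read the host lists; a sketch using mt-hadoop-data5 as a hypothetical example (run on mt-hadoop-name1 as the hadoop user):

   echo "mt-hadoop-data5" >> /usr/hadoop/conf/datanode-deny-list
   hadoop dfsadmin -refreshNodes     # NameNode re-reads dfs.hosts / dfs.hosts.exclude
   hadoop mradmin -refreshNodes      # JobTracker re-reads mapred.hosts / mapred.hosts.exclude
   hadoop dfsadmin -report           # the node shows "Decommission in progress" until its blocks are re-replicated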

   Create a helper script that synchronizes the Hadoop configuration to the other nodes

   vi /usr/hadoop/bin/sync-datanode-config.sh

#!/usr/bin/env bash
# copy hadoop configuration to datanode in datanode-allow-list or specified
if [ $# = 0 ];then
    ##sync namenode and datanode config
    hostlist=$(cat /usr/hadoop/conf/masters /usr/hadoop/conf/slaves|grep -vw `hostname`|tr "\n" " ")
    echo "starting copy hadoop config to {$hostlist}..."
    for host in $hostlist
    do
        scp /usr/hadoop/conf/* hadoop@$host:/usr/hadoop/conf > /dev/null 2>&1
        if [ $? = 0 ];then
            echo "copy to $host Successful"
        else
            echo "copy to $host Failure"
        fi
    done
    ##sync hadoop config on hive server
    host_hive=$(cat /etc/hosts|grep hive|awk '{print $2}')
    echo "starting copy hadoop config to {$host_hive}..."
    scp /usr/hadoop/conf/* hadoop@$host_hive:/usr/hadoop/conf > /dev/null
    if [ $? = 0 ];then
        echo "copy to $host_hive Successful"
    else
        echo "copy to $host_hive Failure"
    fi
elif [ $# = 1 ];then
    hostlist=$1
    echo "starting copy hadoop config to {$hostlist}..."
    for host in $hostlist
    do
        scp /usr/hadoop/conf/* hadoop@$host:/usr/hadoop/conf > /dev/null 2>&1
        if [ $? = 0 ];then
            echo "copy to $host Successful"
        else
            echo "copy to $host Failure"
        fi
    done
else
    echo "commond error."
    echo "eg1:  $0"
    echo 'eg2:  $0 "datanode1 datanode2"'
fi

chmod +x /usr/hadoop/bin/sync-datanode-config.sh

   Synchronize the configuration

   sync-datanode-config.sh

5) Hadoop cluster startup and verification

Perform the following as the hadoop user:

a) Format the HDFS filesystem

       hadoop namenode -format
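
   After formatting, both dfs.name.dir locations (the local directory and the NFS copy) should contain fresh metadata; a quick check using the paths from hdfs-site.xml above:

   ls /srv/hadoop/name/current /srv/hadoop/remote/name/current
   # both directories should contain fsimage, edits and VERSION files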

b) Start and stop Hadoop

       Bring up the VIP (automatic failover can be handled by an HA tool)

       sudo ifconfig eth0:0 172.26.10.140 netmask 255.255.255.0

       Start Hadoop

       start-all.sh

       Stop Hadoop

       stop-all.sh
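
       If mt-hadoop-name1 fails and no HA tool is managing the VIP, failover to mt-hadoop-name2 is manual. A rough sketch, assuming the NFS copy of the metadata is intact and the cluster configuration has already been synchronized to mt-hadoop-name2:

       # on mt-hadoop-name1, if still reachable: release the VIP
       sudo ifconfig eth0:0 down

       # on mt-hadoop-name2: take over the VIP and start a NameNode from the NFS metadata copy
       sudo ifconfig eth0:0 172.26.10.140 netmask 255.255.255.0
       cp -r /srv/hadoop/remote/name /srv/hadoop/name     # seed the local dfs.name.dir
       /usr/hadoop/bin/hadoop-daemon.sh start namenode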

c) Check DataNode status

   hadoop dfsadmin -report

d) Web UIs

   http://mt-hadoop-name1:50030    MapReduce admin (JobTracker)
   http://mt-hadoop-data1:50060    TaskTracker status (on each DataNode)
   http://mt-hadoop-name1:50070    HDFS status (NameNode)

e) Run a test task

   hadoop fs -mkdir /user/pset
   hadoop fs -chown pset:pset /user/pset
   hadoop fs -put /home/hadoop/software/jdk-7u25-linux-x64.gz /user/pset/
   hadoop fs -ls /user/pset
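
   To exercise MapReduce as well as HDFS, one of the example jobs bundled with the release can be run; for instance the pi estimator from the examples jar shipped with hadoop-1.2.1:

   hadoop jar $HADOOP_HOME/hadoop-examples-*.jar pi 2 100     # 2 map tasks, 100 samples each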

f) Benchmark HDFS with TestDFSIO

   Write test:

   hadoop jar $HADOOP_HOME/hadoop-test*.jar TestDFSIO -write -nrFiles 2 -fileSize 1000

   Read test:

   hadoop jar $HADOOP_HOME/hadoop-test*.jar TestDFSIO -read -nrFiles 2 -fileSize 1000

   Clean up the test data:

   hadoop jar $HADOOP_HOME/hadoop-test*.jar TestDFSIO -clean
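
   TestDFSIO appends its throughput and average I/O rate figures to TestDFSIO_results.log in the directory the command was run from, so the results can be reviewed after each pass:

   cat TestDFSIO_results.log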

IV. Hive Installation and Configuration

   In the previous chapter, Hadoop was already installed on the hive host as part of the cluster installation (the steps in Chapter III, Section 1 "Hadoop installation"). Hive is installed on top of that.

    Terminology: Metastore

   The metastore is the store for Hive's metadata. By default, the Hive service uses an embedded Derby database as the metastore, but because only one embedded Derby database can access the database files at a time, only one Hive session can be open at a time. Starting a second session fails with:

Failed to start database 'metastore_db'. To allow multiple sessions, a standalone database must be used as the metastore; MySQL is used here. The relevant passage from the Definitive Guide:

Using an embedded metastore is a simple way to get started with Hive; however, only
one embedded Derby database can access the database files on disk at any one time,
which means you can have only one Hive session open at a time that shares the same
metastore. Trying to start a second session gives the error:
Failed to start database 'metastore_db'
when it attempts to open a connection to the metastore.

   1) Install MySQL

Perform the following as the root user:

   Install MySQL

   apt-get install mysql-server

   Create the hive account (log in as root with mysql -uroot -p and run the following in the mysql system database):

   use mysql;

   delete from user where user!='root' or host!='localhost';

   update user set host='%' where user='root';

   CREATE USER 'hive' IDENTIFIED BY 'hive';

   GRANT ALL PRIVILEGES ON *.* TO 'hive'@'%' WITH GRANT OPTION;

   flush privileges;

   Log in to MySQL as the hive user and create the hive database

   mysql -uhive -phive    

   create database hive;

   2) Install Hive

Perform the following as the hadoop user:

   Extract the Hive tarball and place it under /usr/hive

   tar -zxvf hive-0.12.0.tar.gz

   sudo cp -r hive-0.12.0/ /usr/hive

   Fix the directory ownership and permissions

   sudo chown -R hadoop:hadoop /usr/hive

   sudo chmod 755 /usr/hive -R

   3) Configure environment variables

   Append the following lines to the end of /etc/profile (sudo vi /etc/profile); the Java and Hadoop variables were already configured during the Hadoop installation:

#set hive path
export HIVE_INSTALL=/usr/hive
export PATH=$PATH:$HIVE_INSTALL/bin

   source /etc/profile

   4) Hive configuration

   Copy the MySQL JDBC driver into Hive's lib directory

   cp /home/hadoop/software/mysql-connector-java-5.1.18-bin.jar /usr/hive/lib

   chmod 755 /usr/hive/lib/mysql-connector-java-5.1.18-bin.jar

   chown hadoop.hadoop /usr/hive/lib/mysql-connector-java-5.1.18-bin.jar

   Configure /usr/hive/conf/hive-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>hive.metastore.local</name>
        <value>true</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://localhost:3306/hive?characterEncoding=UTF-8</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>hive</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>hive</value>
    </property>
    <property>
        <name>hive.cli.print.header</name>
        <value>true</value>
        <description>Whether to print the names of the columns in query output.</description>
    </property>
</configuration>

   5) Verification

   Start the Hive shell

   hive

   Run the following statements to test:

   show tables;

   CREATE TABLE xp(id INT,name string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

   Create sample.txt (columns separated by a tab, to match the table's field delimiter)

   vi /tmp/sample.txt

   1       zhangsan
   2       lisi
   3       test

   Load the data into Hive

   load data local inpath '/tmp/sample.txt' overwrite into table xp;

   select * from xp;

   quit;
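
   The loaded file ends up in the table's warehouse directory on HDFS (by default under /user/hive/warehouse); a quick way to confirm from a shell on the hive host:

   hadoop fs -ls /user/hive/warehouse/xp
   hadoop fs -cat /user/hive/warehouse/xp/sample.txt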