http://www.kuqin.com/system-analysis/20081023/24034.html

1. Hadoop Installation and Deployment

1.1. Machine overview
Four machines in total: test161.sqa, test162.sqa, test163.sqa, test164.sqa
IP addresses: 192.168.207.161 through 192.168.207.164
Operating system: Red Hat Linux
root password: hello123

test161.sqa (192.168.207.161) acts as the namenode (master); the other three act as datanodes (slaves).

1.2. Ping machines by hostname
Log in as root.
On the namenode and on each slave, ping the other machines by hostname. If the hostnames do not resolve, add the following entries to /etc/hosts:
192.168.207.161 test161.sqa
192.168.207.162 test162.sqa
192.168.207.163 test163.sqa
192.168.207.164 test164.sqa
After that, pinging by hostname should work.
For the other datanodes it is enough that they can reach the namenode by hostname.

1.3. Create a hadoop system user
Hadoop requires the same deployment directory layout on every machine, along with an account with the same user name on each, so create an identically named user on all of them.
Create a hadoop user on all four machines, password hadoop, with the default home directory /home/hadoop/.

1.4. SSH setup
Hadoop needs passwordless SSH from the namenode to the datanodes, so set up passwordless public-key SSH authentication from the namenode to the other three datanodes.
First, log in to every machine (including the namenode) as the hadoop user, create a .ssh directory under /home/hadoop/, and set its permissions to drwxr-xr-x:
chmod 755 .ssh
On the namenode, run the following (logged in as the new hadoop user):
ssh-keygen -t rsa
Press Enter at each of the three prompts:
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Then append the contents of id_rsa.pub to the /home/hadoop/.ssh/authorized_keys file on every machine (including the namenode itself).
If a machine already has an authorized_keys file, append the contents of id_rsa.pub to the end of it;
if it does not, a plain cp or scp is enough.
The steps below assume that none of the machines has an authorized_keys file yet.
Concretely, on the namenode (logged in as the new hadoop user):
cp /home/hadoop/.ssh/id_rsa.pub /home/hadoop/.ssh/authorized_keys
scp authorized_keys test162.sqa:/home/hadoop/.ssh/
scp here is simply a remote copy over ssh; it prompts for the remote host's password, i.e. the password of the hadoop account on test162.sqa (hadoop).
Of course, any other way of copying authorized_keys to the other machines works just as well. Copy it to the remaining two datanodes the same way:
scp authorized_keys test163.sqa:/home/hadoop/.ssh/
scp authorized_keys test164.sqa:/home/hadoop/.ssh/
Log in to each machine as the hadoop user and change the permissions of /home/hadoop/.ssh/authorized_keys to -rw-r--r--:
cd /home/hadoop/.ssh
chmod 644 authorized_keys
Once this is done, test the ssh connections from the namenode to every node, including to itself. If ssh logs in without asking for a password, the setup succeeded.
Test the other machines the same way:
ssh test162.sqa
ssh test163.sqa
ssh test164.sqa
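As an alternative, on systems where OpenSSH ships the ssh-copy-id helper, the key distribution can be scripted in one small loop. This is only a sketch using the hostnames from this setup; it assumes ssh-copy-id is installed and the hadoop account exists on every host:

# Push the namenode's public key to each datanode; prompts for the hadoop password once per host.
for host in test162.sqa test163.sqa test164.sqa; do
    ssh-copy-id -i /home/hadoop/.ssh/id_rsa.pub hadoop@${host}
done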

1.5. Install the JDK
Download the JDK package from Sun's website and install it as root on every machine. A brief example of the installation:
Download the JDK rpm package jdk-6u6-linux-i586-rpm.bin, then run:
chmod u+x ./jdk-6u6-linux-i586-rpm.bin
./jdk-6u6-linux-i586-rpm.bin
rpm -ivh jdk-6u6-linux-i586.rpm
The installer places the JDK under /usr/java/ (here /usr/java/jdk1.6.0_07; adjust the path to whatever version you actually installed). After installation, set the JDK environment variables.
Since other system users may also need the JDK, it is best to set the variables directly in /etc/profile:
export JAVA_HOME=/usr/java/jdk1.6.0_07
export CLASSPATH=$CLASSPATH:$JAVA_HOME/lib:$JAVA_HOME/jre/lib
export PATH=$JAVA_HOME/bin:$JAVA_HOME/jre/bin:$PATH:$HOME/bin
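A quick, optional check that the new variables take effect (paths as set above):

# Reload the profile and verify the JDK is picked up.
source /etc/profile
echo $JAVA_HOME     # should print the JDK directory set above
java -version       # should report the installed JDK version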

1.6. Set up the directories and install Hadoop
Log in to the namenode as the hadoop user and create a directory to hold everything related to Hadoop.
In this example, create HadoopInstall under /home/hadoop. Download the Hadoop package from
http://apache.mirror.phpchina.com/hadoop/core/hadoop-0.16.3/hadoop-0.16.3.tar.gz, place it in /home/hadoop/HadoopInstall on the namenode (under the hadoop user), and unpack it:
tar zxvf hadoop-0.16.3.tar.gz
To make future upgrades and other maintenance easier, create a symbolic link named hadoop pointing at the hadoop-0.16.3 directory:
ln -s hadoop-0.16.3 hadoop
Create a new directory /home/hadoop/HadoopInstall/hadoop-conf and copy the files hadoop-site.xml, slaves, hadoop-env.sh
and masters from /home/hadoop/HadoopInstall/hadoop/conf into it.
Set the $HADOOP_CONF_DIR environment variable in /home/hadoop/.bashrc:
export HADOOP_CONF_DIR=$HOME/HadoopInstall/hadoop-conf/
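As a rough sketch, the whole section can be carried out with commands like the following, run on the namenode as the hadoop user (assumes wget is available; the mirror URL and version are the ones given above):

mkdir -p /home/hadoop/HadoopInstall
cd /home/hadoop/HadoopInstall
wget http://apache.mirror.phpchina.com/hadoop/core/hadoop-0.16.3/hadoop-0.16.3.tar.gz
tar zxvf hadoop-0.16.3.tar.gz
ln -s hadoop-0.16.3 hadoop                 # convenience link for future upgrades
mkdir -p /home/hadoop/HadoopInstall/hadoop-conf
cp hadoop/conf/hadoop-site.xml hadoop/conf/slaves hadoop/conf/hadoop-env.sh hadoop/conf/masters hadoop-conf/
echo 'export HADOOP_CONF_DIR=$HOME/HadoopInstall/hadoop-conf/' >> /home/hadoop/.bashrc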

1.7. Hadoop environment variables and configuration files
Set the environment variables in /home/hadoop/HadoopInstall/hadoop-conf/hadoop-env.sh (JAVA_HOME must match the JDK path from section 1.5):
export JAVA_HOME=/usr/java/jdk1.6.0_07
export HADOOP_HOME=/home/hadoop/HadoopInstall/hadoop
Set the namenode in /home/hadoop/HadoopInstall/hadoop-conf/masters; the file contains:
test161.sqa
Set the datanodes in /home/hadoop/HadoopInstall/hadoop-conf/slaves; the file contains:
test162.sqa
test163.sqa
test164.sqa
Set the Hadoop configuration in /home/hadoop/HadoopInstall/hadoop-conf/hadoop-site.xml:

<?xml version="1.0"?>
<configuration>
<property>
  <name>fs.default.name</name>
  <value>test161.sqa:9000</value>
  <description>The name of the default file system. Either the literal string "local" or a host:port for DFS.</description>
</property>
<property>
  <name>mapred.job.tracker</name>
  <value>test161.sqa:9001</value>
  <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hadoop/HadoopInstall/tmp</value>
  <description>A base for other temporary directories.</description>
</property>
<property>
  <name>dfs.name.dir</name>
  <value>/home/hadoop/HadoopInstall/filesystem/name</value>
  <description>Determines where on the local filesystem the DFS name node should store the name table. If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy.</description>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/home/hadoop/HadoopInstall/filesystem/data</value>
  <description>Determines where on the local filesystem an DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.</description>
</property>
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
</property>
</configuration>


1.8. Deploy the datanodes
Copy the Hadoop tree that was installed and configured on the namenode to all the datanodes:
scp -r /home/hadoop/HadoopInstall test162.sqa:/home/hadoop/
scp -r /home/hadoop/HadoopInstall test163.sqa:/home/hadoop/
scp -r /home/hadoop/HadoopInstall test164.sqa:/home/hadoop/
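Because the SSH keys are already in place, a quick way to verify that the copy landed on every datanode is a small loop like this sketch:

for host in test162.sqa test163.sqa test164.sqa; do
    ssh ${host} ls /home/hadoop/HadoopInstall/hadoop/bin/hadoop   # the script should exist on every datanode
done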



1.9. Start Hadoop
Format the namenode:
/home/hadoop/HadoopInstall/hadoop/bin/hadoop namenode -format
There are a number of startup scripts under /home/hadoop/HadoopInstall/hadoop/bin/; start whichever you need:
* start-all.sh starts all Hadoop daemons: namenode, datanodes, jobtracker and tasktrackers.
* stop-all.sh stops all Hadoop daemons.
* start-mapred.sh starts the Map/Reduce daemons: the jobtracker and the tasktrackers.
* stop-mapred.sh stops the Map/Reduce daemons.
* start-dfs.sh starts the Hadoop DFS daemons: the namenode and the datanodes.
* stop-dfs.sh stops the DFS daemons.

Here we simply start everything:
bin/start-all.sh
Likewise, to stop Hadoop:
bin/stop-all.sh
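To confirm the daemons actually came up, the jps tool that ships with the JDK can be used. A sketch, assuming the JDK path from section 1.5 on every machine:

/usr/java/jdk1.6.0_07/bin/jps                     # on the namenode: NameNode, SecondaryNameNode, JobTracker
ssh test162.sqa /usr/java/jdk1.6.0_07/bin/jps     # on a datanode: DataNode, TaskTracker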

1.10. HDFS test
Running the hadoop command in bin/ shows all the operations Hadoop supports and their usage; here are a few simple ones as examples.
Create a directory in HDFS:
bin/hadoop dfs -mkdir testdir
This creates a directory named testdir in HDFS.

Copy a file into HDFS:
bin/hadoop dfs -put /home/hadoop/large.zip testfile.zip
This copies the local file large.zip into the HDFS home directory /user/hadoop/ under the name testfile.zip.

List the files currently in HDFS:
bin/hadoop dfs -ls
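A few more basic operations, sketched with the same example paths (the local target /tmp/ is just an example):

bin/hadoop dfs -ls testdir                 # list the directory created above
bin/hadoop dfs -get testfile.zip /tmp/     # copy the uploaded file back to the local filesystem
bin/hadoop dfs -rm testfile.zip            # remove it from HDFS again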

1.11. C++ test programs
Write a mapper and a reducer in C++ that together count how many times each word occurs in a file.

mapper.cpp:
// c++ map reduce Mapper
// word count example
// 2008.4.18
// by iveney
#include <iostream>
#include <string>
using namespace std;

int main()
{
    string buf;
    while( cin >> buf )
        cout << buf << "\t" << 1 << endl;   // emit "word<TAB>1" for each word read
    return 0;
}

reducer.cpp:
#include <iostream>
#include <string>
#include <map>
using namespace std;

int main()
{
    map<string,int> dict;                 // word -> accumulated count
    map<string,int>::iterator iter;
    string word;
    int count;
    while( cin >> word >> count )
        dict[word] += count;
    iter = dict.begin();
    while( iter != dict.end() )
    {
        cout << iter->first << " " << iter->second << endl;
        iter++;
    }
    return 0;
}

Compile the two source files:
g++ mapper.cpp -o mapper
g++ reducer.cpp -o reducer
A quick local test of the two programs:
echo "ge abc ab df " | ./mapper | ./reducer
Output:
ab 1
abc 1
df 1
ge 1

Run the test on Hadoop:
bin/hadoop dfs -mkdir input
bin/hadoop dfs -put /home/hadoop/ap_base_session_fatmt0.txt input
bin/hadoop jar contrib/streaming/hadoop-0.16.3-streaming.jar -mapper /home/hadoop/hdfile/mapper -reducer /home/hadoop/hdfile/reducer -input input/ap_base_session_fatmt0.txt -output output
Notes:
1. When the job runs in streaming mode it executes local files, so the mapper and reducer binaries must be copied to the same directory on the namenode and on every datanode. In this example they are copied to /home/hadoop/hdfile on every machine.
2. The text file to be processed must first be uploaded to HDFS; here ap_base_session_fatmt0.txt is put into the input directory in HDFS.
3. To re-run the test, the output directory in HDFS has to be removed first, otherwise the job fails with an error saying the output directory already exists. Remove it with:
bin/hadoop dfs -rmr output
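Putting the notes together, a re-run of the streaming job looks roughly like this (same paths as above; only a sketch):

bin/hadoop dfs -rmr output
bin/hadoop jar contrib/streaming/hadoop-0.16.3-streaming.jar \
    -mapper /home/hadoop/hdfile/mapper \
    -reducer /home/hadoop/hdfile/reducer \
    -input input/ap_base_session_fatmt0.txt \
    -output output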

 ==============

http://www.51testing.com/?uid-13997-action-viewspace-itemid-90965

Many people have heard of Google's cloud computing and the MapReduce and GFS technology underneath it, but only on paper. Apache and Yahoo collaborate on a similar project, Hadoop; companies in China, such as Alimama, already use it in production, and abroad there is the Hive project.

Rather than keep talking about it, it is better to actually deploy it and try it out.

 

Hadoop requires Sun JDK 1.5 or later and a Linux platform.

For more background, see http://www.infoq.com/cn/articles/hadoop-config-tip
http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster)#Prerequisites

 

Distributed deployment

See the Hadoop Cluster Setup documentation.

There are two machines: 10.0.4.145 (in the NameNode and JobTracker roles) and 10.0.4.146 (in the DataNode and TaskTracker roles).

Create the same directory, /home/search/hadoop-0.17.1, on both.

 

On the master node (145):

[search@b2bsearch145 ~]$ cat .bash_profile

export HADOOP_HOME=/home/search/hadoop-0.17.1

export PATH=$PATH:$HADOOP_HOME/bin

 

These entries make command-line operation more convenient.
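An optional quick check that the PATH addition works (a sketch):

source ~/.bash_profile
which hadoop      # should print /home/search/hadoop-0.17.1/bin/hadoop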

1.1   SSH trust

Set up passwordless SSH public-key trust from the Master to every Slave.

 

Make sure ~/.ssh/authorized_keys has restrictive permissions; 600 is recommended (644 also works as long as the file is not group- or world-writable):

[search@b2bsearch145 hadoop-0.17.1]$ ll ~/.ssh/authorized_keys

-rw-r--r-- 1 search search 2529 6月25 10:01 /home/search/.ssh/authorized_keys

[search@b2bsearch145 hadoop-0.17.1]$ cat ~/.ssh/authorized_keys |grep 146

ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEA7naNGEpcbuon2/4M+0FDRp594MNk7jV0U3SaDLlT4vLvo0viCSP/2mEMi7iadaogkSr3FbIHryUsOhZ1MSwiDc2nv3TgxAh3K/jQkbP1MDGdHzOVvScrWcTfpFhDtL29HQJit5fpST0aZDlbCn8LsYX+y171Pun9Q4HyT9TkUL0= search@alitest146

ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAIEA06pe9YZTEEqmiutmjWQ1CgnmOWd3xh2YkqDinSuZi7t/Uyg/u/l0vJ5nv196dnYqdJJTyaVUU+ydcS7UJu+ykpeIYZGSL6XC2MqTMCpEVAtqP9WUhFXToJmq0tDrlYTfnYZOCIrDt+hjp+c7E7EH3phtEHdrlaAs9ZvcM/6/4L0= search@intl_search38146

 

1.2   Configuration files

After unpacking, go into the conf directory. The files that mainly need editing are: hadoop-env.sh, hadoop-site.xml, masters, slaves.

Of the defaults in hadoop-default.xml, only the dfs.permissions.supergroup property is changed, setting it to the current user's group name (bash -c groups prints the group name):

<property>
  <name>dfs.permissions.supergroup</name>
  <value>search</value>
  <description>The name of the group of super-users.</description>
</property>

In hadoop-env.sh, only JAVA_HOME needs to be changed:

 

# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
export JAVA_HOME=/usr/ali/jdk1.6

A working hadoop-site.xml:

 

 

[search@b2bsearch145 conf]$ vi hadoop-site.xml

 

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
  <name>fs.default.name</name>
  <value>hdfs://10.0.4.145:54310/</value>
</property>
<property>
  <name>mapred.job.tracker</name>
  <value>hdfs://10.0.4.145:54311/</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/search/hadoop-0.17.1/tmp/</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m</value>
</property>
<property>
  <name>dfs.block.size</name>
  <value>5120000</value>
  <description>The default block size for new files.</description>
</property>
</configuration>

[search@b2bsearch145 conf]$ cat masters
10.0.4.145
[search@b2bsearch145 conf]$ cat slaves
10.0.4.146

 

1.3   Format DFS and start the daemons

 

[search@b2bsearch145 hadoop-0.17.1]$ bin/hadoop namenode -format

08/08/21 19:48:54 INFO dfs.NameNode: STARTUP_MSG:

/************************************************************

STARTUP_MSG: Starting NameNode

STARTUP_MSG:  host = b2bsearch145/10.0.4.145

STARTUP_MSG:  args = [-format]

STARTUP_MSG:  version =0.17.1

STARTUP_MSG:  build = http://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.17 -r 669344; compiled by 'hadoopqa' on Thu Jun 19 01:18:25 UTC 2008

************************************************************/

08/08/21 19:48:54 INFO fs.FSNamesystem: fsOwner=search,search

08/08/21 19:48:54 INFO fs.FSNamesystem: supergroup=search

08/08/21 19:48:54 INFO fs.FSNamesystem: isPermissionEnabled=true

08/08/21 19:48:54 INFO dfs.Storage: Storage directory /home/search/hadoop-0.17.1/tmp/dfs/name has been successfully formatted.

08/08/21 19:48:54 INFO dfs.NameNode: SHUTDOWN_MSG:

/************************************************************

SHUTDOWN_MSG: Shutting down NameNode at b2bsearch145/10.0.4.145

************************************************************/

 

[search@b2bsearch145 hadoop-0.17.1]$ bin/start-dfs.sh
starting namenode, logging to /home/search/hadoop-0.17.1/bin/../logs/hadoop-search-namenode-b2bsearch145.out
10.0.4.146: starting datanode, logging to /home/search/hadoop-0.17.1/bin/../logs/hadoop-search-datanode-alitest146.out
10.0.4.145: starting secondarynamenode, logging to /home/search/hadoop-0.17.1/bin/../logs/hadoop-search-secondarynamenode-b2bsearch145.out
[search@b2bsearch145 hadoop-0.17.1]$ bin/start-mapred.sh
starting jobtracker, logging to /home/search/hadoop-0.17.1/bin/../logs/hadoop-search-jobtracker-b2bsearch145.out
 
10.0.4.146: starting tasktracker, logging to /home/search/hadoop-0.17.1/bin/../logs/hadoop-search-tasktracker-alitest146.out
 
[search@b2bsearch145 hadoop-0.17.1]$ /usr/ali/jdk1.6/bin/jps
18390 NameNode
18589 JobTracker
18721 Jps
18521 SecondaryNameNode
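The DataNode and TaskTracker run on the slave (10.0.4.146), so it is worth checking there too. A sketch, assuming the same JDK path on the slave:

ssh 10.0.4.146 /usr/ali/jdk1.6/bin/jps    # expect DataNode and TaskTracker in the output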

 

 

1.4   Load the input data and run the wordcount example

[search@b2bsearch145 hadoop-0.17.1]$ bin/hadoop dfs -copyFromLocal input/ test-in

[search@b2bsearch145 hadoop-0.17.1]$ bin/hadoop dfs -ls

Found 1 items

/user/search/test-in   <dir>          2008-08-21 19:53       rwxr-xr-x      search search

[search@b2bsearch145 hadoop-0.17.1]$ bin/hadoop dfs -ls /user/search/test-in

Found 2 items

/user/search/test-in/hadoop-default.xml <r 1>  37978  2008-08-21 19:53       rw-r--r--      search search

/user/search/test-in/hadoop-site.xml   <r 1>  178    2008-08-21 19:53       rw-r--r--      search search

[search@b2bsearch145 hadoop-0.17.1]$ bin/hadoop dfs -cat /user/search/test-in/hadoop-default.xml

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

 

[search@b2bsearch145 hadoop-0.17.1]$ bin/hadoop jar hadoop-0.17.1-examples.jar wordcount /user/search/test-in test-out

08/08/21 19:55:45 INFO mapred.FileInputFormat: Total input paths to process : 2

08/08/21 19:55:46 INFO mapred.JobClient: Running job: job_200808211951_0001

08/08/21 19:55:47 INFO mapred.JobClient: map 0% reduce 0%

08/08/21 19:55:52 INFO mapred.JobClient: map 66% reduce 0%

08/08/21 19:55:54 INFO mapred.JobClient: map 100% reduce 0%

08/08/21 19:56:01 INFO mapred.JobClient: map 100% reduce 100%

08/08/21 19:56:02 INFO mapred.JobClient: Job complete: job_200808211951_0001

08/08/21 19:56:02 INFO mapred.JobClient: Counters: 16

08/08/21 19:56:02 INFO mapred.JobClient:  File Systems

08/08/21 19:56:02 INFO mapred.JobClient:    Local bytes read=36202

08/08/21 19:56:02 INFO mapred.JobClient:    Local bytes written=72658

08/08/21 19:56:02 INFO mapred.JobClient:    HDFS bytes read=39559

08/08/21 19:56:02 INFO mapred.JobClient:    HDFS bytes written=19133

08/08/21 19:56:02 INFO mapred.JobClient:  Job Counters

08/08/21 19:56:02 INFO mapred.JobClient:    Launched map tasks=3

08/08/21 19:56:02 INFO mapred.JobClient:    Launched reduce tasks=1

08/08/21 19:56:02 INFO mapred.JobClient:    Data-local map tasks=3

08/08/21 19:56:02 INFO mapred.JobClient:  Map-Reduce Framework

08/08/21 19:56:02 INFO mapred.JobClient:    Map input records=1239

08/08/21 19:56:02 INFO mapred.JobClient:    Map output records=3888

08/08/21 19:56:02 INFO mapred.JobClient:    Map input bytes=38156

08/08/21 19:56:02 INFO mapred.JobClient:    Map output bytes=51308

08/08/21 19:56:02 INFO mapred.JobClient:    Combine input records=3888

08/08/21 19:56:02 INFO mapred.JobClient:    Combine output records=1428

08/08/21 19:56:02 INFO mapred.JobClient:    Reduce input groups=1211

08/08/21 19:56:02 INFO mapred.JobClient:    Reduce input records=1428

08/08/21 19:56:02 INFO mapred.JobClient:    Reduce output records=1211

 

1.5   Examine the job output

[search@b2bsearch145 hadoop-0.17.1]$ bin/hadoop dfs -ls  /user/search/test-out 

Found 2 items

/user/search/test-out/_logs    <dir>          2008-08-21 19:55       rwxr-xr-x      search search

/user/search/test-out/part-00000       <r 1>  19133  2008-08-21 19:55       rw-r--r--      search search

[search@b2bsearch145 hadoop-0.17.1]$ bin/hadoop dfs -cat /user/search/test-out/part-00000 |more

"_logs/history/"       1

"all".</descrīption>   1

"block"(trace  1

"dir"(trac     1

"false",       1

"local",       1

 

For more commands, see the Hadoop Shell Commands documentation.

 

 

1.6   Browse the NameNode via the Web UI

By default the NameNode web interface listens on port 50070, i.e. http://10.0.4.145:50070/ in this setup.

1.7   Browse a DataNode via the Web UI

By default each DataNode's web interface listens on port 50075, i.e. http://10.0.4.146:50075/ here.

 

1.8   Browse the JobTracker via the Web UI

By default the JobTracker web interface listens on port 50030, i.e. http://10.0.4.145:50030/ in this setup.

 

1.9   Browse a TaskTracker via the Web UI

By default each TaskTracker's web interface listens on port 50060, i.e. http://10.0.4.146:50060/ here.

1.10   Wrap-up

More notes on interesting usage experience will be posted here as they come.