2014-3-10
【Requirements】
The job at hand involves processing large volumes of data. The first step is to produce some operational statistics with existing tooling. Hadoop is the choice so that, as the data volume grows, we can simply add machines without touching the statistics logic.
The Hadoop community is very active and new tools around it keep appearing; below is a first pass over some of the popular ones:
- Data storage
hadoop: includes HDFS and MapReduce
hbase: supports very large tables; requires zk
zookeeper: distributed cluster coordination, abbreviated zk
- Data transfer
Flume/Scribe/Chukwa: distributed log collection systems that aggregate logs from many machines onto one node
sqoop: moves data between traditional databases and HDFS/HBase
- Main query interfaces
hive: a SQL query interface
pig: a scripting query interface
hadoop streaming: MapReduce over standard input/output, with the logic written in a scripting language (shell/python/php/ruby)
hadoop pipes: socket-based input/output, with the logic written in C++
- Other supporting tools
avro: a serialization tool
oozie: chains several MR jobs into a workflow
snappy: a compression library
mahout: a machine-learning toolkit
Of course, newer tools keep emerging all the time, e.g. Spark and Impala. The current requirement is to stand up a single-machine pseudo-distributed Hadoop, and then switch fairly smoothly to a multi-machine cluster once the business data volume grows.
In the internet spirit of small, fast iterations, and without much hands-on experience yet, I will start with a simple setup and keep adding and tuning new tools later.
Initial setup: hadoop (required for everything), hive (convenient querying), a thin wrapper around Hadoop Streaming (so logic can be written in a scripting language; see the sketch after this list), sqoop (importing data from traditional databases), pig (to try out, not required).
Possible later additions: zookeeper, hbase (tables with hundreds of millions of rows), mahout (machine learning), and so on.
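Since the Streaming wrapper mentioned above is the one item that is not just a package install, here is a minimal sketch of what a streaming invocation looks like. The jar path is the usual CDH4 location but may differ on other setups, the input/output paths are only illustrative, and a trivial cat/wc pair stands in for real mapper/reducer scripts:
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -input /user/danny/input \
    -output /user/danny/streaming_out \
    -mapper /bin/cat \
    -reducer /usr/bin/wc
A real wrapper would swap in python/php scripts for -mapper/-reducer and ship them to the cluster with -file.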
[PS] Working environment: 64-bit Ubuntu 12.04; offline experiments are on the desktop edition, the real deployment is on the server edition.
【Java 7】
First, install Java. Ubuntu may come with OpenJDK by default, but Oracle Java is preferable, so remove OpenJDK first:
sudo apt-get purge openjdk*
sudo apt-get install software-properties-common
sudo apt-get install python-software-properties
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java7-installer
Once installed, add JAVA_HOME to the environment. Of the many files where environment variables can be set (/etc/profile, ~/.bashrc, etc.), Ubuntu recommends /etc/environment, but I ran into all sorts of odd problems with it (possibly a formatting issue), so for now I recommend /etc/profile, which is also what Programming Hive uses.
export JAVA_HOME="/usr/lib/jvm/java-7-oracle/"
export JRE_HOME=$JAVA_HOME/jre
export CLASSPATH=.:$JAVA_HOME/lib:$JRE_HOME/lib
export PATH=$JAVA_HOME/bin:$PATH
. /etc/profile
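A quick check that the variables are visible in the current shell; the expected path assumes the installer's default location:
echo $JAVA_HOME    # should print /usr/lib/jvm/java-7-oracle/
java -version      # should report an Oracle 1.7.0_xx JVM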
【Hadoop Setup】
There are several distributions to choose from: the official Apache release, free open-source distributions from Cloudera, MapR and others, and commercial editions (not worth discussing here, since they are out of reach anyway). One data point: reportedly about 75% of companies in China use Cloudera because it is convenient (via the article 利用Cloudera实现Hadoop); the Apache release has its own official installation guide.
I first tried the Apache release, building the latest stable version 2.2.0 (as of 2014/3/10) on 64-bit Ubuntu, only to find that the bundled native libraries do not support 64-bit and have to be recompiled. That rabbit hole is deep: building a system like this always comes up short somewhere, missing either build tools or dependencies. In practice I hit one bug after another, with problems piling on top of each other, and the cost of using it this way is considerable. It took about a day to get Hadoop running, and the result felt so cobbled together that it could collapse at any moment. So I decided to switch to Cloudera's CDH4.
Cloudera's official site lists the supported hardware and software, which includes my 64-bit Ubuntu (check with uname -a and cat /etc/issue).
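The two checks just mentioned, with notes on what to look for:
uname -a          # a 64-bit kernel reports x86_64
cat /etc/issue    # should show Ubuntu 12.04 (precise), matching the CDH4 packages below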
# The official pseudo-distributed installation takes just a few steps:
wget http://archive.cloudera.com/cdh4/one-click-install/precise/amd64/cdh4-repository_1.0_all.deb
sudo dpkg -i cdh4-repository_1.0_all.deb
curl -s http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh/archive.key | sudo apt-key add -
sudo apt-get update
sudo apt-get install hadoop-conf-pseudo # this lists the packages to be installed, including zookeeper; it can be slow depending on network speed, so consider pushing it to the background with nohup
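A sketch of the nohup variant mentioned in the comment above; the log file name is arbitrary, and it assumes sudo has already cached credentials so it will not prompt from the background:
nohup sudo apt-get install -y hadoop-conf-pseudo > cdh4-install.log 2>&1 &
tail -f cdh4-install.log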
sudo -u hdfs hdfs namenode -format # format the NameNode
# Start HDFS
for x in `cd /etc/init.d ; ls hadoop-hdfs-*` ; do sudo service $x start ; done
# Create the /tmp directory and the YARN staging and log directories
sudo -u hdfs hadoop fs -rm -r /tmp
sudo -u hdfs hadoop fs -mkdir /tmp
sudo -u hdfs hadoop fs -chmod -R 1777 /tmp
sudo -u hdfs hadoop fs -mkdir -p /tmp/hadoop-yarn/staging
sudo -u hdfs hadoop fs -chmod -R 1777 /tmp/hadoop-yarn/staging
sudo -u hdfs hadoop fs -mkdir -p /tmp/hadoop-yarn/staging/history/done_intermediate
sudo -u hdfs hadoop fs -chmod -R 1777 /tmp/hadoop-yarn/staging/history/done_intermediate
sudo -u hdfs hadoop fs -chown -R mapred:mapred /tmp/hadoop-yarn/staging
sudo -u hdfs hadoop fs -mkdir -p /var/log/hadoop-yarn
sudo -u hdfs hadoop fs -chown yarn:mapred /var/log/hadoop-yarn
# Check the resulting directory tree
sudo -u hdfs hadoop fs -ls -R /
The output should look like this:
drwxrwxrwt - hdfs supergroup 0 2012-05-31 15:31 /tmp
drwxr-xr-x - hdfs supergroup 0 2012-05-31 15:31 /tmp/hadoop-yarn
drwxrwxrwt - mapred mapred 0 2012-05-31 15:31 /tmp/hadoop-yarn/staging
drwxr-xr-x - mapred mapred 0 2012-05-31 15:31 /tmp/hadoop-yarn/staging/history
drwxrwxrwt - mapred mapred 0 2012-05-31 15:31 /tmp/hadoop-yarn/staging/history/done_intermediate
drwxr-xr-x - hdfs supergroup 0 2012-05-31 15:31 /var
drwxr-xr-x - hdfs supergroup 0 2012-05-31 15:31 /var/log
drwxr-xr-x - yarn mapred 0 2012-05-31 15:31 /var/log/hadoop-yarn
# Start YARN
sudo service hadoop-yarn-resourcemanager start
sudo service hadoop-yarn-nodemanager start
sudo service hadoop-mapreduce-historyserver start
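Once these daemons are up, a quick sanity check is to open the web UIs, assuming the default ports:
# NameNode:          http://localhost:50070
# ResourceManager:   http://localhost:8088
# JobHistoryServer:  http://localhost:19888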
# Create user directories
sudo -u hdfs hadoop fs -mkdir -p /user/danny
sudo -u hdfs hadoop fs -chown danny /user/danny
The general form is:
sudo -u hdfs hadoop fs -mkdir -p /user/<user>
sudo -u hdfs hadoop fs -chown <user> /user/<user>
# Running an example application with YARN
hadoop fs -mkdir input
hadoop fs -put /etc/hadoop/conf/*.xml input
hadoop fs -ls input
export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar grep input output23 'dfs[a-z.]+'
hadoop fs -ls
hadoop fs -ls output23
hadoop fs -cat output23/part-r-00000 | head
The output should look like this:
1 dfs.safemode.min.datanodes
1 dfs.safemode.extension
1 dfs.replication
1 dfs.permissions.enabled
1 dfs.namenode.name.dir
1 dfs.namenode.checkpoint.dir
1 dfs.datanode.data.dir
【Hive】
Install Hive and MySQL (MySQL will hold the metastore):
sudo apt-get install hive hive-metastore hive-server
sudo apt-get install mysql-server
sudo service mysql start
If you need to set the root password (and harden the installation):
$ sudo /usr/bin/mysql_secure_installation
[...]
Enter current password for root (enter for none):
OK, successfully used password, moving on...
[...]
Set root password? [Y/n] y
New password:
Re-enter new password:
Remove anonymous users? [Y/n] Y
[...]
Disallow root login remotely? [Y/n] N
[...]
Remove test database and access to it [Y/n] Y
[...]
Reload privilege tables now? [Y/n] Y
All done!
To make sure the MySQL server starts at boot, install sysv-rc-conf (a replacement for chkconfig on Ubuntu):
sudo apt-get install sysv-rc-conf
sudo sysv-rc-conf mysql on
Create the metastore database, create a hive user, and grant it privileges:
$ mysql -u root -p
Enter password:
mysql> CREATE DATABASE metastore;
mysql> USE metastore;
mysql> SOURCE /usr/lib/hive/scripts/metastore/upgrade/mysql/hive-schema-0.10.0.mysql.sql;
mysql> create user 'hive'@'%' identified by 'hive';
mysql> create user 'hive'@'localhost' identified by 'hive';
mysql> revoke all privileges, grant option from 'hive'@'%';
mysql> revoke all privileges, grant option from 'hive'@'localhost';
mysql> GRANT SELECT,INSERT,UPDATE,DELETE,LOCK TABLES,EXECUTE ON metastore.* TO 'hive'@'%';
mysql> GRANT SELECT,INSERT,UPDATE,DELETE,LOCK TABLES,EXECUTE ON metastore.* TO 'hive'@'localhost';
mysql> FLUSH PRIVILEGES;
mysql> quit;
Install the MySQL JDBC connector and symlink it into /usr/lib/hive/lib/:
sudo apt-get install libmysql-java
sudo ln -s /usr/share/java/libmysql-java.jar /usr/lib/hive/lib/libmysql-java.jar
Configure the metastore service to communicate with the MySQL database by editing hive-site.xml:
sudo cp /etc/hive/conf/hive-site.xml /etc/hive/conf/hive-site.xml.bak
sudo vim /etc/hive/conf/hive-site.xml
Set the following properties:
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/metastore</value>
<description>the URL of the MySQL database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hive</value>
</property>
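Before starting the Hive services, it is worth confirming that the hive account can actually reach the metastore schema; a minimal check with the mysql client, using the credentials configured above:
mysql -u hive -phive -e 'USE metastore; SHOW TABLES;'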
Start the services and initialize the HDFS directories Hive needs:
sudo service hive-metastore start
sudo service hive-server start
sudo -u hdfs hadoop fs -mkdir /user/hive
sudo -u hdfs hadoop fs -chown hive /user/hive
sudo -u hdfs hadoop fs -mkdir /tmp # fails harmlessly if /tmp already exists (it was created earlier)
sudo -u hdfs hadoop fs -chmod 777 /tmp
sudo -u hdfs hadoop fs -chmod o+t /tmp
sudo -u hdfs hadoop fs -mkdir /data
sudo -u hdfs hadoop fs -chown hdfs /data
sudo -u hdfs hadoop fs -chmod 777 /data
sudo -u hdfs hadoop fs -chmod o+t /data
sudo chown -R hive:hive /var/lib/hive
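Finally, a quick smoke test of the stack, assuming everything above started cleanly; this only touches the metastore, so it should succeed even before any data is loaded:
hive -e 'SHOW DATABASES;' # should print at least: default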