1. Background

1. Developing business logic directly in MapReduce is tedious and painful.
2. SQL is a far more convenient way to run statistical analysis over big data.
Any framework is born because a certain class of problems exists.

2. Introduction

The Apache Hive data warehouse software supports using SQL to read, write, and manage large datasets residing in distributed storage. Structure can be projected onto data already in storage, and a command-line tool and a JDBC driver are provided to connect users to Hive.
Hive was originally developed at Facebook to solve the problem of running statistics over massive amounts of structured logs; the code was later hosted on GitHub. It began as a subproject of Hadoop and only later became a project in its own right. Hive is a data warehouse built on top of Hadoop and is suited to offline (batch) processing; the supported execution engines underneath are MR, Spark, and Tez. It is a SQL-like framework: HQL has no formal relationship with SQL, it is merely modeled on it. Hive itself is a client, not a cluster; it translates SQL into MR or Spark jobs and submits them to the Hadoop cluster for execution.
Hive provides unified metadata management, so Spark SQL, Hive, Impala, and Presto can all work with one another's data through it. So what is metadata? Metadata is data that describes data; for example, a table's name, its column names, and its column types are all metadata. Hive's data files are stored on HDFS, while the metadata is stored in MySQL.
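As a concrete illustration of the "SQL in, MapReduce out" idea, a simple HQL aggregation like the one below (the table and columns are hypothetical, not from this installation) is compiled by Hive into a MapReduce job and run on the cluster:

```sql
-- Hypothetical table access_log(city STRING, url STRING, ...).
-- Hive compiles this GROUP BY into a MapReduce job:
--   map: emit (city, 1); shuffle: group by city; reduce: sum the counts.
SELECT city, COUNT(*) AS cnt
FROM access_log
GROUP BY city;
```

The user writes only the declarative query; partitioning the input, shuffling, and aggregating are handled by the underlying engine.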

3. Installation

Upload the Hive binary tarball to the server
[hadoop@hadoop001 software]$ ll
total 12
drwxrwxr-x 10 hadoop hadoop 4096 Jul 15 19:28 hadoop-2.6.0-cdh5.15.1
drwxr-xr-x 11 hadoop hadoop 4096 Aug  9  2018 hive-1.1.0-cdh5.15.1
Extract the archive into the target directory
[hadoop@hadoop001 software]$ tar -zxvf hive-1.1.0-cdh5.15.1.tar.gz -C ~/app/
Configure the system environment variables
export HIVE_HOME=/home/hadoop/app/hive
export PATH=${HIVE_HOME}/bin:$PATH
Copy mysql-connector-java-5XXX.jar into $HIVE_HOME/lib; this is the JDBC driver that lets Hive connect to MySQL
[hadoop@hadoop001 lib]$ ll mysql-connector-java-5.1.47.jar 
-rw-rw-r-- 1 hadoop hadoop 1007502 Aug  7  2018 mysql-connector-java-5.1.47.jar
Edit the hive-site.xml file; it is usually empty to begin with, so you need to add the content yourself
[hadoop@hadoop001 conf]$ cat hive-site.xml 
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://hadoop001:3306/ruozedata_d7?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>123456</value>
</property>

<property>
  <name>hive.cli.print.current.db</name>
  <value>true</value>
</property>

<property>
  <name>hive.cli.print.header</name>
  <value>true</value>
</property>

</configuration>
Start Hive
[hadoop@hadoop001 conf]$ hive
which: no hbase in (/home/hadoop/app/hive/bin:/sbin:/home/hadoop/app/hadoop/bin:/home/hadoop/app/hadoop/sbin:/usr/loca/mysql/bin:/usr/java/jdk1.8.0_45/bin/:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin)

Logging initialized using configuration in file:/home/hadoop/software/hive-1.1.0-cdh5.15.1/conf/hive-log4j.properties
WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
hive (default)>

At this point, the Hive installation is complete.
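As a quick smoke test from the new CLI, you can create a table, load a local file, and query it. The schema below matches the ruozedata table inspected later, but the local file path is an assumption; adjust it to wherever your data actually lives:

```sql
-- Illustrative smoke test; the local path is an assumed example
CREATE TABLE ruozedata (id INT, name STRING, age INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Copies the local file into the table's warehouse directory on HDFS
LOAD DATA LOCAL INPATH '/home/hadoop/data/people.txt' INTO TABLE ruozedata;

SELECT * FROM ruozedata;
```

After the LOAD, the file appears under the table's directory in the HDFS warehouse, which is exactly what the next section inspects.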

4. Inspecting the Data

Hive's data files live on HDFS at a location determined by the parameter hive.metastore.warehouse.dir (default: /user/hive/warehouse). You can override the default path in hive-site.xml:

<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/your/new/path</value>
</property>
[hadoop@hadoop001 ~]$ hdfs dfs -ls /user/hive/warehouse/ruozedata
Found 1 items
-rwxr-xr-x   1 hadoop supergroup         36 2019-07-16 16:59 /user/hive/warehouse/ruozedata/people.txt

Hive's metadata lives in MySQL, in the database named in hive-site.xml; here the ruozedata_d7 database was created (automatically, thanks to createDatabaseIfNotExist=true).

mysql> use ruozedata_d7
Database changed
mysql> show tables;
+---------------------------+
| Tables_in_ruozedata_d7    |
+---------------------------+
| bucketing_cols            |
| cds                       |
| columns_v2                |
| database_params           |
| dbs                       |
| func_ru                   |
| funcs                     |
| global_privs              |
| part_col_stats            |
| partition_key_vals        |
| partition_keys            |
| partition_params          |
| partitions                |
| roles                     |
| sd_params                 |
| sds                       |
| sequence_table            |
| serde_params              |
| serdes                    |
| skewed_col_names          |
| skewed_col_value_loc_map  |
| skewed_string_list        |
| skewed_string_list_values |
| skewed_values             |
| sort_cols                 |
| tab_col_stats             |
| table_params              |
| tbls                      |
| version                   |
+---------------------------+
29 rows in set (0.00 sec)
Look up the HDFS path that each database maps to
mysql> select * from dbs;
+-------+-----------------------+-------------------------------------------+---------+------------+------------+
| DB_ID | DESC                  | DB_LOCATION_URI                           | NAME    | OWNER_NAME | OWNER_TYPE |
+-------+-----------------------+-------------------------------------------+---------+------------+------------+
|     1 | Default Hive database | hdfs://hadoop001:9000/user/hive/warehouse | default | public     | ROLE       |
+-------+-----------------------+-------------------------------------------+---------+------------+------------+
1 row in set (0.00 sec)
Check the Hive version information
mysql> select * from version;
+--------+-----------------+-------------------+-----------------------------------+
| VER_ID | SCHEMA_VERSION  | SCHEMA_VERSION_V2 | VERSION_COMMENT                   |
+--------+-----------------+-------------------+-----------------------------------+
|      1 | 1.1.0-cdh5.15.1 | NULL              | Set by MetaStore hadoop@10.9.0.62 |
+--------+-----------------+-------------------+-----------------------------------+
1 row in set (0.00 sec)
Look at the metadata for the tables
mysql> select * from tbls;
+--------+-------------+-------+------------------+--------+-----------+-------+-----------+---------------+--------------------+--------------------+
| TBL_ID | CREATE_TIME | DB_ID | LAST_ACCESS_TIME | OWNER  | RETENTION | SD_ID | TBL_NAME  | TBL_TYPE      | VIEW_EXPANDED_TEXT | VIEW_ORIGINAL_TEXT |
+--------+-------------+-------+------------------+--------+-----------+-------+-----------+---------------+--------------------+--------------------+
|      1 |  1563267428 |     1 |                0 | hadoop |         0 |     1 | ruozedata | MANAGED_TABLE | NULL               | NULL               |
+--------+-------------+-------+------------------+--------+-----------+-------+-----------+---------------+--------------------+--------------------+
1 row in set (0.00 sec)
Look at the column information for each table
mysql> select * from columns_v2;
+-------+---------+-------------+-----------+-------------+
| CD_ID | COMMENT | COLUMN_NAME | TYPE_NAME | INTEGER_IDX |
+-------+---------+-------------+-----------+-------------+
|     1 | NULL    | age         | int       |           2 |
|     1 | NULL    | id          | int       |           0 |
|     1 | NULL    | name        | string    |           1 |
+-------+---------+-------------+-----------+-------------+
3 rows in set (0.00 sec)
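The individual metastore tables above can also be joined to reconstruct a full table description in one query. This is a sketch against the metastore schema shown above, where tbls links to sds via SD_ID and sds links to columns_v2 via CD_ID:

```sql
-- Run against the ruozedata_d7 metastore database in MySQL:
-- list every Hive table with its database and columns, in column order
SELECT d.NAME      AS db_name,
       t.TBL_NAME  AS table_name,
       c.COLUMN_NAME,
       c.TYPE_NAME
FROM tbls t
JOIN dbs d        ON t.DB_ID = d.DB_ID
JOIN sds s        ON t.SD_ID = s.SD_ID
JOIN columns_v2 c ON s.CD_ID = c.CD_ID
ORDER BY db_name, table_name, c.INTEGER_IDX;
```

For the single table above, this should reproduce the three rows of columns_v2 (id, name, age) annotated with the default database and the ruozedata table name, sorted by INTEGER_IDX rather than alphabetically.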