hive NULL生成方式:
1 文本方式:
如果通过 load data local inpath '/opt/stu.txt' into table stu 这种上传本地文件数据方式到Hive制造hive NULL的时候,需要在 stu.txt中 用
id name
1 zhangsan
2 \N
写法,而不是
id name
1 zhangsan
2 null ----> 这样入hive后的数据是字符串 null
2 数据库导入方式:
如果通过关系库sqoop导入到hdfs后通过hive外表关联方式:
类型 | 关系型数据库 | hive |
string | null | \N |
string | "" | "" |
int等其余类型 | null | \N |
对应hive string类型查询:
如果为 \N 使用is null查询 eg: where b.name is null;
如果为 “” 使用=''查询 eg: where b.name ='';
hive中 '' 和 NULL 在表关联下 是否会产生多余数据:
先以mysql为例讲解:
insert into stu(id,name) values('1','zm1');
insert into stu(id) values('2');
insert into stu(id) values('3');
insert into stu1(id,name,age) values('4','zm1','31');
insert into stu1(id,age) values('5','32');
insert into stu1(id,age) values('6','33');
select a.*, b.* from stu a join stu1 b on a.name = b.name;
mysql> select * from stu;
+------+------+
| id | name |
+------+------+
| 1 | zm1 |
| 2 | NULL |
| 3 | NULL |
+------+------+
mysql> select * from stu1;
+------+------+------+
| id | name | age |
+------+------+------+
| 4 | zm1 | 31 |
| 5 | NULL | 32 |
| 6 | NULL | 33 |
+------+------+------+
mysql> select a.*, b.* from stu a join stu1 b on a.name = b.name;
+------+------+------+------+------+
| id | name | id | name | age |
+------+------+------+------+------+
| 1 | zm1 | 4 | zm1 | 31 |
+------+------+------+------+------+
!!!!!!!!!!!!!mysql会进行优化 将on中name为null的去掉!!!!!!!!!!!!
将上述stu stu1数据做修改, update stu set name='' where id = '2' 都修改后展示时mysql为如下样子:
mysql> select * from stu;
+------+------+
| id | name |
+------+------+
| 1 | zm1 |
| 2 | |
| 3 | |
+------+------+
3 rows in set (0.00 sec)
mysql> select * from stu1;
+------+------+------+
| id | name | age |
+------+------+------+
| 4 | zm1 | 31 |
| 5 | | 32 |
| 6 | | 33 |
+------+------+------+
3 rows in set (0.00 sec)
mysql> select a.*, b.* from stu a join stu1 b on a.name = b.name;
+------+------+------+------+------+
| id | name | id | name | age |
+------+------+------+------+------+
| 1 | zm1 | 4 | zm1 | 31 |
| 2 | | 5 | | 32 |
| 3 | | 5 | | 32 |
| 2 | | 6 | | 33 |
| 3 | | 6 | | 33 |
+------+------+------+------+------+
5 rows in set (0.00 sec)
!!!!!!!!!!!!!mysql不会对''做优化 关联查询时候会被关联到!!!!!!!!!!!!
讲mysql ‘’ 和 NULL 是因为 经过试验操作 hive操作结果和 mysql操作结果一样,
即:
''下hive关联操作的字段会被作为关联条件,这样会产生很多垃圾数据,在ETL中数据做了预处理后,
建议查询条件增加非空串判断
eg: from stu a join tea b on a.name = b.name and a.name !='' and b.name != '' ;
NULL下hive关联操作的字段不会作为关联条件.