LIMIT 语句
hive (default)> SELECT upper(name), salary,deductions["Federal Taxes"], round(salary * (1 - deductions["Federal Taxes"])) FROM employees limit 2;
JOHN DOE 100000.0 0.2 80000.0
MARY SMITH 80000.0 0.2 64000.0
列别名
hive 查询要显示列名,需要设置一下参数
hive (default)> set hive.cli.print.header=true;
hive (default)> SELECT upper(name) AS u_name, salary,deductions["Federal Taxes"] AS d_federal_taxes, round(salary * (1 - deductions["Federal Taxes"])) AS after_taxes_Percentage FROM employees limit 2;
u_name salary d_federal_taxes after_taxes_percentage
JOHN DOE 100000.0 0.2 80000.0
MARY SMITH 80000.0 0.2 64000.0
子查询
# 这里u_name 一定要是别名,如果e.name 就报错,不知道为什么,hive 1.2.1 版本
hive (default)> SELECT e.u_name, e.salary_minux_fed_taxes
> FROM
> (
> SELECT UPPER(name) as u_name,salary,deductions["Federal Taxes"] as fed_taxes,
> ROUND(salary * (1 - deductions["Federal Taxes"])) as salary_minux_fed_taxes
> FROM employees
> )e
> WHERE e.salary_minux_fed_taxes > 7000;
e.u_name e.salary_minux_fed_taxes
JOHN DOE 80000.0
MARY SMITH 64000.0
TODD JONES 59500.0
BILL KING 51000.0
BOSS MAN 140000.0
FRED FINANCE 105000.0
STACY ACCOUNTANT 51000.0
# 下面的代码是等价的
hive (default)> FROM (
> SELECT UPPER(name) as u_name,salary,deductions["Federal Taxes"] as fed_taxes,
> ROUND(salary * (1 - deductions["Federal Taxes"])) as salary_minux_fed_taxes
> FROM employees
> )e
> SELECT e.u_name, e.salary_minux_fed_taxes
> WHERE e.salary_minux_fed_taxes > 7000;
# 可以总结下 hive 子查询的语法
如果内层查询中列名使用了内嵌函数,就需要给定别名,外层查询中如果要查这个字段,需要使用表别名.列别名
FROM (
SELECT col1,UPPER(col2) AS u_col2,...
FROM T1
WHERE ....
) T2
SELECT T2.col1,T2.u_col2
WHERE T2.col1 > ....;
CASE ..WHEN..THEN 语句和if条件语句类似,用于处理单个列的查询结果
hive> SELECT name,salary,
> CASE
> WHEN salary < 50000.0 THEN 'low'
> WHEN salary >= 50000.0 AND salary < 70000.0 THEN 'middle'
> WHEN salary >= 70000.0 AND salary < 100000.0 THEN 'high'
> ELSE 'very high'
> END AS bracket FROM employees;
John Doe 100000.0 very high
Mary Smith 80000.0 high
Todd Jones 70000.0 high
Bill King 60000.0 middle
Boss Man 200000.0 very high
Fred Finance 150000.0 very high
Stacy Accountant 60000.0 middle
什么情况下hive 可以避免进行MapReduce
1、SELECT * FROM employees;
2、对于WHERE 语句中过滤条件只是分区字段这种情况(无论是否使用LIMIT语句限制输出记录条数),也是无需MapReduce过程的
SELECT * FROM employees WHERE country=’US’ AND state=’CA’ LIMIT 100;
country 和 state 两个字段是分区字段
3、set hive.exec.mode.local.auto=true。
hive会尝试使用本地模式执行其他的操作,即避免MapReduce过程
使用本地模式的3个条件
1).job的输入数据大小必须小于参数:hive.exec.mode.local.auto.inputbytes.max(默认128MB)
2).job的map数必须小于参数:hive 0.7 hive.exec.mode.local.auto.tasks.max(默认4)
hive 0.9 以后此参数改名了 set hive.exec.mode.local.auto.input.files.max(默认值是4)
3).job的reduce数必须为0或者1
where 语句中不能使用列别名
> ;
hive> SELECT name,salary,deductions["Federal Taxes"],
> salary * (1 - deductions["Federal Taxes"])
> FROM employees
> WHERE round(salary * (1 - deductions["Federal Taxes"])) > 70000;
John Doe 100000.0 0.2 80000.0
Boss Man 200000.0 0.3 140000.0
Fred Finance 150000.0 0.3 105000.0
# SELECT 语句 和WHERE 语句,计算表达式相同,能不能使用别名替换?答案是遗憾的,不能够使用别名,可以使用嵌套查询
hive> SELECT name,salary,deductions["Federal Taxes"] AS salary_minux_fed_taxes
> FROM employees
> WHERE salary_minux_fed_taxes > 70000;
FAILED: SemanticException [Error 10004]: Line 3:6 Invalid table alias or column reference 'salary_minux_fed_taxes': (possible column names are: name, salary, subordinates, deductions, address)
# 第3种嵌套查询的方法
> (SELECT name,salary,deductions["Federal Taxes"] as ded,salary * (1 - deductions["Federal Taxes"]) as salary_minux_fed_taxes
> FROM employees) e
> WHERE round(e.salary_minux_fed_taxes) > 70000;
John Doe 100000.0 0.2 80000.0
Boss Man 200000.0 0.3 140000.0
Fred Finance 150000.0 0.3 105000.0
谓词操作一般用于JOIN…ON 和 HAVING 和 WHERE 语句中
# 在低版本的hive 中浮点数的比较是有bug的,deductions["Federal Taxes"] = 0.2 的记录也被输出。这里实验hive的版本是2.3.2 最新版不存在这个问题
hive> SELECT name,salary,deductions["Federal Taxes"]
> FROM employees
> WHERE deductions["Federal Taxes"] > 0.2;
Boss Man 200000.0 0.3
Fred Finance 150000.0 0.3
hive> SELECT name,salary,deductions["Federal Taxes"]
> FROM employees;
John Doe 100000.0 0.2
Mary Smith 80000.0 0.2
Todd Jones 70000.0 0.15
Bill King 60000.0 0.15
Boss Man 200000.0 0.3
Fred Finance 150000.0 0.3
Stacy Accountant 60000.0 0.15
#低版本的hive中可以使用下面的代码实现大于0.2
hive> SELECT name,salary,deductions["Federal Taxes"]
> FROM employees
> WHERE deductions["Federal Taxes"] > CAST(0.2 AS FLOAT);
Boss Man 200000.0 0.3
Fred Finance 150000.0 0.3
#出现这样的问题的根本原因是浮点数进度的问题,默认情况下数字0.2 是DOUBLE 类型,是个近似值,0.2000000000001 而deductions["Federal Taxes"] 是float类型,转换成DOUBLE 其产生的值是:0.200000100000,这个时候问题就明朗了。
deductions["Federal Taxes"] DOUBLE 0.200000100000
0.2 DOUBLE 0.2000000000001
满足WHERE 条件,这样就会输出,所以表现出是>=。
# 经过测试在hive 1.2.2 bug依旧存在
hive (default)> SELECT name,salary,deductions["Federal Taxes"]
> FROM employees
> WHERE deductions["Federal Taxes"] > 0.2;
John Doe 100000.0 0.2
Mary Smith 80000.0 0.2
Boss Man 200000.0 0.3
Fred Finance 150000.0 0.3
hive (default)> SELECT name,salary,deductions["Federal Taxes"]
> FROM employees;
John Doe 100000.0 0.2
Mary Smith 80000.0 0.2
Todd Jones 70000.0 0.15
Bill King 60000.0 0.15
Boss Man 200000.0 0.3
Fred Finance 150000.0 0.3
Stacy Accountant 60000.0 0.15
对于浮点数的一些忠告:需要避免任何从窄类型隐式转换到更广泛类型的操作,和钱相关的都避免使用浮点数,可以使用DOUBLE
LIKE 和 RLIKE 都可以使用正则表达式,匹配列字段,只是RLIKE 使用java的正则表达式,功能更强大
hive> SELECT name,address.street FROM employees WHERE address.street LIKE '%Ave.';
John Doe 1 Michigan Ave.
Todd Jones 200 Chicago Ave.
hive> SELECT name,address.city FROM employees WHERE address.city LIKE 'O%';
Todd Jones Oak Park
Bill King Obscuria
hive> SELECT name,address.street FROM employees WHERE address.street LIKE '%Chi%';
Todd Jones 200 Chicago Ave.
# RLIKE 使用java 的正则表达式,可以实现更强大的正则表达式功能
hive> SELECT name,address.street
> FROM employees WHERE address.street RLIKE '.*(Chicago|Ontario).*';
Mary Smith 100 Ontario St.
Todd Jones 200 Chicago Ave.
hive> SELECT name,address FROM employees
> WHERE address.street LIKE '%Chicago%' OR address.street LIKE '%Ontario%';
Mary Smith {"street":"100 Ontario St.","city":"Chicago","state":"IL","zip":60601}
Todd Jones {"street":"200 Chicago Ave.","city":"Oak Park","state":"IL","zip":60700}
GROUP BY 语句通常和聚合函数一起使用,按照一个或者多个列对结果进行分组
hive (default)> SELECT year(symbol),avg(price_close) FROM stocks
> WHERE exchanges = 'NASDAQ' AND ymd = 'AAPL'
> GROUP BY year(symbol);
1984 25.578625440597534
1985 20.193676221040867
1986 32.46102808021274
1987 53.88968399108163
1988 41.540079275138766
1989 41.65976212516664
1990 37.56268799823263
1991 52.49553383386182
1992 54.80338610251119
1993 41.02671956450572
1994 34.0813495847914
1995 40.542103593311616
1996 24.917559398440865
1997 17.965850004565574
1998 30.565119240019058
1999 57.77071460844979
2000 71.74892876261757
2001 20.219112992286682
2002 19.139444423100304
2003 18.54476193019322
2004 35.52694458431668
2005 52.401745992993554
2006 70.81063753105255
2007 128.27390423049016
2008 141.9790115054888
2009 146.81412711976066
2010 204.72159912109376
# 聚合函数没有使用GROUP BY 语句会报语法错误
hive (default)> SELECT year(symbol),avg(price_close) FROM stocks
> WHERE exchanges = 'NASDAQ' AND ymd = 'AAPL';
FAILED: SemanticException Line 0:-1 Expression not in GROUP BY key 'symbol'
HAVING 语句 允许用户通过一个简单的语法完成原本需要通过子查询才能对GROUP BY 语句产生的分组进行条件过滤的任务。
hive (default)> SELECT year(symbol),avg(price_close) FROM stocks
> WHERE exchanges = 'NASDAQ' AND ymd = 'AAPL'
> GROUP BY year(symbol)
> HAVING avg(price_close) > 55.0;
1999 57.77071460844979
2000 71.74892876261757
2006 70.81063753105255
2007 128.27390423049016
2008 141.9790115054888
2009 146.81412711976066
2010 204.72159912109376
# 下面使用子查询实现
hive (default)> FROM (
> SELECT year(symbol)AS symbol_yyyy_MM_dd,avg(price_close) as price_close_avg FROM stocks
> WHERE exchanges = 'NASDAQ' AND ymd = 'AAPL'
> GROUP BY year(symbol)
> ) s
> SELECT symbol_yyyy_MM_dd , s.price_close_avg
> WHERE s.price_close_avg > 55.0;
1999 57.77071460844979
2000 71.74892876261757
2006 70.81063753105255
2007 128.27390423049016
2008 141.9790115054888
2009 146.81412711976066
2010 204.72159912109376
JOIN 语句
Hive 支持通常SQL JOIN语句,但是只支持等值连接
INNER JOIN 内连接,只有进行连接的两个表中都存在于连接表中想匹配的数据才会被保留。
hive (default)> SELECT a.symbol,a.price_close,b.price_close
> FROM stocks a JOIN stocks b ON a.symbol = b.symbol
> WHERE a.ymd = 'AAPL' AND b.ymd = 'IBM';
# 输出:
2010-02-08 194.12 121.88
2010-02-05 195.46 123.52
2010-02-04 192.05 123.0
2010-02-03 199.23 125.66
2010-02-02 195.86 125.53
2010-02-01 194.73 124.67
2010-01-29 192.06 122.39
2010-01-28 199.29 123.75
2010-01-27 207.88 126.33
2010-01-26 205.94 125.75
2010-01-25 203.07 126.12
......