目录

  • ​​hive outline​​
  • ​​hive parse_url​​
  • ​​hive parse_url_tuple​​

hive outline

​​链接​​

对以下数据建表:tb_url

1 http://facebook.com/path/p1.php?query=1
2 http://tongji.baidu.com/news/index.jsp?uuid=frank
3 http://www.jdwz.com/index?source=baidu
4 http://www.itcast.cn/index?source=alibaba

hive parse_url

该函数属于UDF类型

语法:

parse_url(url, partToExtract[, key]) - extracts a part from a URL
Parts: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO key

需求:实现对URL进行分析,从URL中获取每个ID对应HOST、PATH以及QUERY

id

url

1

​http://facebook.com/path/p1.php?query=1​

2

​http://tongji.baidu.com/news/index.jsp?uuid=frank​

3

​http://www.jdwz.com/index?source=baidu​

4

​http://www.itcast.cn/index?source=alibaba​

select id,
parse_url(url, "HOST") as host,
parse_url(url, "PATH") as path,
parse_url(url, "QUERY") as query
from tb_url;

id

host

path

query

1

facebook.com

/path/p1.php

query=1

2

tongji.baidu.com

/news/index.jsp

uuid=frank

3

www.jdwz.com

/index

source=baidu

4

www.itcast.cn

/index

source=alibaba

​注意:​​使用parse_url函数每次只能解析一个参数,导致构建多列需要调用多次该函数,我们希望能实现调用一次函数,就可以将多个参数进行解析,得到多列结果。parse_url_tuple可以实现该效果

hive parse_url_tuple

该函数属于UDTF类型(常结合lateral view侧视图)

语法:

parse_url_tuple(url, partname1, partname2, ..., partnameN) - extracts N (N>=1) parts from a URL.
It takes a URL and one or multiple partnames, and returns a tuple. All the input parameters and output column types are string.
Partname: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO, QUERY:<KEY_NAME>
parse_url在使用时可以指定多个参数
第一个参数:url:指定要解析的URL
第二个参数:key1:指定要解析的内容1
……
第N个参数:keyN:指定要解析的内容N
select a.id,
b.host,
b.path,
b.query
from tb_url a lateral view parse_url_tuple(url, "HOST", "PATH", "QUERY") b as host, path, query;

id

host

path

query

1

facebook.com

/path/p1.php

query=1

2

tongji.baidu.com

/news/index.jsp

uuid=frank

3

www.jdwz.com

/index

source=baidu

4

www.itcast.cn

/index

source=alibaba