Using Avro in Hive

Reposted

xd502djj 2021-08-05 15:10:16


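Author: 过往记忆
Reposting is permitted, provided that the original source, the author information, and the copyright notice are indicated with a hyperlink.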

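Avro (pronounced roughly [ævrə]) began as a subproject of Hadoop, led by Hadoop's creator, Doug Cutting. Avro is a data serialization system designed for applications that exchange large volumes of data. Its main features are a binary serialization format that handles large amounts of data quickly and conveniently, and friendliness to dynamic languages: Avro provides mechanisms that let dynamic languages work with Avro data easily.
In Hive, data can be stored in Avro format. This article uses avro-1.7.1.jar as the example.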

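To use Avro in Hive, place the following four jars in the $HIVE_HOME/lib directory: avro-1.7.1.jar, avro-tools-1.7.4.jar, jackson-core-asl-1.8.8.jar, and jackson-mapper-asl-1.8.8.jar. You can also keep them in another directory, but in that case all four jars must be on the CLASSPATH.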
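Before starting Hive it can help to confirm that the four jars are really in place. The following is a minimal sketch of such a check, assuming the jars live in $HIVE_HOME/lib; the function name `missing_jars` is illustrative and not part of Hive.

```python
import os

# The four jars named in this article.
REQUIRED_JARS = [
    "avro-1.7.1.jar",
    "avro-tools-1.7.4.jar",
    "jackson-core-asl-1.8.8.jar",
    "jackson-mapper-asl-1.8.8.jar",
]


def missing_jars(lib_dir, required=REQUIRED_JARS):
    """Return the required jar names that are not present in lib_dir."""
    present = set(os.listdir(lib_dir)) if os.path.isdir(lib_dir) else set()
    return [jar for jar in required if jar not in present]


if __name__ == "__main__":
    # Assumption: HIVE_HOME is set in the environment.
    lib_dir = os.path.join(os.environ.get("HIVE_HOME", ""), "lib")
    missing = missing_jars(lib_dir)
    if missing:
        print("Missing from", lib_dir + ":", ", ".join(missing))
    else:
        print("All required jars found in", lib_dir)
```

If the jars are kept elsewhere, pass that directory to `missing_jars` instead.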

To parse Avro data, we can create the Hive table with the following statement:

hive> CREATE EXTERNAL TABLE tweets
    > COMMENT "A table backed by Avro data with the Avro schema embedded in the CREATE TABLE statement"
    > ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
    > STORED AS
    > INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
    > LOCATION '/user/wyp/examples/input/'
    > TBLPROPERTIES (
    > 'avro.schema.literal'='{
    > "type": "record",
    > "name": "Tweet",
    > "namespace": "com.miguno.avro",
    > "fields": [
    > { "name":"username", "type":"string"},
    > { "name":"tweet", "type":"string"},
    > { "name":"timestamp", "type":"long"}
    > ]
    > }'
    > );
Time taken: 0.076 seconds

hive> describe tweets;
username string from deserializer
tweet string from deserializer
timestamp bigint from deserializer
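The DESCRIBE output shows that the column types were derived from the Avro schema: Avro `string` became Hive `string`, and Avro `long` became Hive `bigint`. The Python sketch below illustrates that mapping for primitive types only. The names `AVRO_TO_HIVE` and `hive_columns` are illustrative and are not part of Hive; complex types (nested records, arrays, maps, unions) are deliberately left out.

```python
import json

# Primitive Avro types and the Hive types they appear as.
# Assumption: "bytes" maps to "binary" on Hive 0.12.0 and later.
AVRO_TO_HIVE = {
    "boolean": "boolean",
    "int": "int",
    "long": "bigint",
    "float": "float",
    "double": "double",
    "bytes": "binary",
    "string": "string",
}

# The same schema that is embedded in avro.schema.literal above.
SCHEMA_LITERAL = """
{
  "type": "record",
  "name": "Tweet",
  "namespace": "com.miguno.avro",
  "fields": [
    {"name": "username", "type": "string"},
    {"name": "tweet", "type": "string"},
    {"name": "timestamp", "type": "long"}
  ]
}
"""


def hive_columns(schema_literal):
    """Return (column name, Hive type) pairs for a flat Avro record schema."""
    schema = json.loads(schema_literal)
    if schema.get("type") != "record":
        raise ValueError("expected an Avro record schema")
    columns = []
    for field in schema["fields"]:
        avro_type = field["type"]
        if not isinstance(avro_type, str) or avro_type not in AVRO_TO_HIVE:
            raise ValueError("unsupported type for field %r" % field["name"])
        columns.append((field["name"], AVRO_TO_HIVE[avro_type]))
    return columns


if __name__ == "__main__":
    for name, hive_type in hive_columns(SCHEMA_LITERAL):
        print(name, hive_type)
```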

Next, we compress the data with Snappy. This is the data before compression:

{
  "username": "miguno",
  "tweet": "Rock: Nerf paper, scissors is fine.",
  "timestamp": 1366150681
}
{
  "username": "BlizzardCS",
  "tweet": "Works as intended. Terran is IMBA.",
  "timestamp": 1366154481
}
{
  "username": "DarkTemplar",
  "tweet": "From the shadows I come!",
  "timestamp": 1366154681
}
{
  "username": "VoidRay",
  "tweet": "Prismatic core online!",
  "timestamp": 1366160000
}
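The article does not show how the Snappy-compressed Avro file is produced. One common route, assumed here, is the `fromjson` command of avro-tools, which reads JSON records and writes an Avro container file. The sketch below checks each record against the field types of the schema, writes the records one per line, and builds (but does not run) the command line. The file names `twitter.json` and `twitter.avsc` and the jar location are assumptions.

```python
import json

# Python types expected for each field of the Tweet schema.
FIELD_TYPES = {"username": str, "tweet": str, "timestamp": int}

RECORDS = [
    {"username": "miguno", "tweet": "Rock: Nerf paper, scissors is fine.", "timestamp": 1366150681},
    {"username": "BlizzardCS", "tweet": "Works as intended. Terran is IMBA.", "timestamp": 1366154481},
    {"username": "DarkTemplar", "tweet": "From the shadows I come!", "timestamp": 1366154681},
    {"username": "VoidRay", "tweet": "Prismatic core online!", "timestamp": 1366160000},
]


def validate(record):
    """Raise ValueError if the record does not match the Tweet schema."""
    if set(record) != set(FIELD_TYPES):
        raise ValueError("unexpected field set: %s" % sorted(record))
    for name, py_type in FIELD_TYPES.items():
        value = record[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, py_type):
            raise ValueError("field %r has the wrong type" % name)


def write_json_lines(records, path):
    """Validate the records and write them to path, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as out:
        for record in records:
            validate(record)
            out.write(json.dumps(record) + "\n")


def fromjson_command(jar, schema_file, json_file):
    """Build the avro-tools argument list; stdout must be redirected to the .avro file."""
    return ["java", "-jar", jar, "fromjson",
            "--codec", "snappy", "--schema-file", schema_file, json_file]


if __name__ == "__main__":
    write_json_lines(RECORDS, "twitter.json")
    cmd = fromjson_command("avro-tools-1.7.4.jar", "twitter.avsc", "twitter.json")
    print(" ".join(cmd), "> twitter.avro")
```

Running the printed command requires Java and the avro-tools jar; the script itself only prepares the input file.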

Suppose the compressed data is stored in the file /home/wyp/twitter.avro. We copy it to the /user/wyp/examples/input/ directory in HDFS:

hadoop fs -put /home/wyp/twitter.avro /user/wyp/examples/input/

Now the data can be queried from Hive:

hive> select * from tweets limit 5;
miguno Rock: Nerf paper, scissors is fine. 1366150681
BlizzardCS Works as intended. Terran is IMBA. 1366154481
DarkTemplar From the shadows I come! 1366154681
VoidRay Prismatic core online! 1366160000
Time taken: 0.495 seconds, Fetched: 4 row(s)

Alternatively, we can take the schema given in avro.schema.literal:

{
  "type": "record",
  "name": "Tweet",
  "namespace": "com.miguno.avro",
  "fields": [
    {
      "name": "username",
      "type": "string"
    },
    {
      "name": "tweet",
      "type": "string"
    },
    {
      "name": "timestamp",
      "type": "long"
    }
  ]
}

and store it in a file, for example twitter.avsc. The CREATE TABLE statement above then becomes:

CREATE EXTERNAL TABLE tweets
COMMENT "A table backed by Avro data with the Avro schema stored in HDFS"
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/user/wyp/examples/input/'
TBLPROPERTIES (
'avro.schema.url'='hdfs:///user/wyp/examples/schema/twitter.avsc'
);

The result is the same as with the embedded schema.
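To keep the schema file and the table definition in sync, both can be generated from a single schema object. The following is a minimal sketch under that assumption; the helper names `write_schema` and `create_table_ddl` are hypothetical. The generated file still has to be uploaded to the HDFS path named in `avro.schema.url`, for example with `hadoop fs -put`.

```python
import json

SCHEMA = {
    "type": "record",
    "name": "Tweet",
    "namespace": "com.miguno.avro",
    "fields": [
        {"name": "username", "type": "string"},
        {"name": "tweet", "type": "string"},
        {"name": "timestamp", "type": "long"},
    ],
}


def write_schema(schema, path):
    """Write the Avro schema to a local .avsc file as JSON."""
    with open(path, "w", encoding="utf-8") as out:
        json.dump(schema, out, indent=2)


def create_table_ddl(table, location, schema_url):
    """Render a CREATE TABLE statement that points at an external schema file."""
    return "\n".join([
        "CREATE EXTERNAL TABLE %s" % table,
        "ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'",
        "STORED AS",
        "INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'",
        "OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'",
        "LOCATION '%s'" % location,
        "TBLPROPERTIES ('avro.schema.url'='%s');" % schema_url,
    ])


if __name__ == "__main__":
    write_schema(SCHEMA, "twitter.avsc")
    print(create_table_ddl(
        "tweets",
        "/user/wyp/examples/input/",
        "hdfs:///user/wyp/examples/schema/twitter.avsc",
    ))
```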
