greenplumhadoop集成 greenplum partition

转载

技术极客之光 2023-07-13 10:45:39

文章标签 greenplumhadoop集成数据 ci Distributed 文章分类 Hadoop 大数据

一、环境准备

GP集群环境情况如下图所示，master实现容错，配置standby master。两个节点为segment节点，其中每个几点配置两个segment，未配置mirror segment。

greenplumhadoop集成 greenplum partition_ci

二、使用案例

2.1登录数据库

登录Greenplum数据库，默认的数据库为postgres

[gpadmin@sparkproject1 ~]$ psql -d testDB psql (8.2.15) Type "help" for help. testDB=#

2.2创建数据库

testDB=# create database mydb; CREATE DATABASE

2.3创建表

Greenplum中创建表语句与普通的数据库的建表语句区别不大，不同之处主要包括以下几点：

在greenplum中建表时需要指定表的分布键（DISTRIBUTED BY）。
如果表需要用某个字段分区，可以使用PARTITION BY将表建成分区表。
使用like操作创建一样表结构的表。

Greenplum有两种数据分布策略：

Hash分布。通过hash值路由到特定的Segment，语法为DISTRIBUTED BY（..），如果不指定分布键，则默认第一个字段为分布键。
随机分布。也叫平均分布，在执行关联等操作时性能较差，语法为Distributed randomly

以下两种方式结果一样，都是以id作为分布键。

testDB=# create table test1(id int,name varchar(128)); NOTICE: Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table. HINT: The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew. CREATE TABLE

testDB=# create table test2(id int,name varchar(128)) distributed by (id); CREATE TABLE

描述表信息

testDB=# \d test1 Table "public.test1" Column \| Type \| Modifiers --------+------------------------+----------- id \| integer \| name \| character varying(128) \| Distributed by: (id)

下面的创建表语句采用了随机分布的方式

testDB=# create table test3(id int,name varchar(128)) distributed randomly; CREATE TABLE

使用like创建表

testDB=# create table test3_like(like test3); NOTICE: Table doesn't have 'distributed by' clause, defaulting to distribution columns from LIKE table CREATE TABLE

2.4显示数据库

testDB=# \l List of databases Name \| Owner \| Encoding \| Access privileges -----------+---------+----------+--------------------- mydb \| gpadmin \| UTF8 \| postgres \| gpadmin \| UTF8 \| template0 \| gpadmin \| UTF8 \| =c/gpadmin : gpadmin=CTc/gpadmin template1 \| gpadmin \| UTF8 \| =c/gpadmin : gpadmin=CTc/gpadmin testDB \| gpadmin \| UTF8 \| (5 rows)

2.5显示表

testDB=# \dt List of relations Schema \| Name \| Type \| Owner \| Storage --------+------------+-------+---------+--------- public \| test01 \| table \| gpadmin \| heap public \| test1 \| table \| gpadmin \| heap public \| test2 \| table \| gpadmin \| heap public \| test3 \| table \| gpadmin \| heap public \| test3_like \| table \| gpadmin \| heap (5 rows)

2.6插入数据

testDB=# insert into test1 values(1,'tom'),(2,'jack'),(3,'Bob'); INSERT 0 3

2.7查询数据

Greenplum的数据分布在所有的Segment上，当从一个表查询数据的时候，Master的数据展现顺序是以Master先接收到的数据的顺序，每个segment的数据到达Master的顺序是随机的，所以Select的结果的顺序是随机的。

testDB=# select * from test1; id \| name ----+------ 3 \| Bob 1 \| tom 2 \| jack (3 rows)

2.8create table as与select into

create table as与select into功能一样，是根据select的结果创建一个新表，在临时分析数据的时候十分方便。在创建表时如果默认不指定分布键，那么Greenplum根据执行的select得到的结果集来选择，不用再次重分布数据的字段作为表的分布键

testDB=# create table test4 as select * from test1; NOTICE: Table doesn't have 'DISTRIBUTED BY' clause -- Using column(s) named 'id' as the Greenplum Database data distribution key for this table. HINT: The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew. SELECT 3

可以手动指定分布键

testDB=# create table test5 as select * from test1 distributed by(name); SELECT 3

查看表结构信息

testDB=# \d test5 Table "public.test5" Column \| Type \| Modifiers --------+------------------------+----------- id \| integer \| name \| character varying(128) \| Distributed by: (name)

Select into比create table as简单，但是select into不能指定分布键，只能使用默认的分布键。

testDB=# select * into test6 from test1; NOTICE: Table doesn't have 'DISTRIBUTED BY' clause -- Using column(s) named 'id' as the Greenplum Database data distribution key for this table. HINT: The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew. SELECT 3

2.9 Explain

Explain用于查询一个表的执行计划，它在SQL优化的时候经常要用到。

testDB=# explain select * from orders join customer on orders.cid=customer.id; QUERY PLAN ----------------------------------------------------------------------------------------------- Gather Motion 4:1 (slice2; segments: 4) (cost=3.07..6.21 rows=1 width=21) -> Hash Join (cost=3.07..6.21 rows=1 width=21) Hash Cond: orders.cid = customer.id -> Redistribute Motion 4:4 (slice1; segments: 4) (cost=0.00..3.09 rows=1 width=13) Hash Key: orders.cid -> Seq Scan on orders (cost=0.00..3.03 rows=1 width=13) -> Hash (cost=3.03..3.03 rows=1 width=8) -> Seq Scan on customer (cost=0.00..3.03 rows=1 width=8) (8 rows)

2.10分析函数之开窗函数

testDB=# select depname,empno,salary,rank() over (partition by depname order by salary desc),row_number() over (partition by depname order by salary desc) from empsalary; depname \| empno \| salary \| rank \| row_number -----------+-------+--------+------+------------ develop \| 9 \| 10000 \| 1 \| 1 develop \| 7 \| 9000 \| 2 \| 2 develop \| 10 \| 9000 \| 2 \| 3 develop \| 5 \| 7000 \| 4 \| 4 personnel \| 6 \| 5600 \| 1 \| 1 personnel \| 3 \| 3900 \| 2 \| 2 sales \| 8 \| 8000 \| 1 \| 1 sales \| 4 \| 5500 \| 2 \| 2 sales \| 1 \| 5000 \| 3 \| 3 sales \| 2 \| 4500 \| 4 \| 4 (10 rows)

2.11分区表

按照时间分区，创建2018725~2018730

CREATE TABLE test_partition_range ( id NUMERIC, name CHARACTER VARYING(32), dw_end_date DATE ) DISTRIBUTED BY (id) PARTITION BY range (dw_end_date) ( PARTITION p2018725 start('2018-7-25'::DATE) END ('2018-7-26'::DATE), PARTITION p2018726 start('2018-7-26'::DATE) END ('2018-7-27'::DATE), PARTITION p2018727 start('2018-7-27'::DATE) END ('2018-7-28'::DATE), PARTITION p2018728 start('2018-7-28'::DATE) END ('2018-7-29'::DATE), PARTITION p2018729 start('2018-7-29'::DATE) END ('2018-7-30'::DATE), PARTITION p2018730 start('2018-7-30'::DATE) END ('2018-7-31'::DATE) );

查看表信息

testDB=# \d+ test_partition_range; Table "public.test_partition_range" Column \| Type \| Modifiers \| Storage \| Description -------------+-----------------------+-----------+----------+------------- id \| numeric \| \| main \| name \| character varying(32) \| \| extended \| dw_end_date \| date \| \| plain \| Child tables: test_partition_range_1_prt_p2018725, test_partition_range_1_prt_p2018726, test_partition_range_1_prt_p2018727, test_partition_range_1_prt_p2018728, test_partition_range_1_prt_p2018729, test_partition_range_1_prt_p2018730 Has OIDs: no Distributed by: (id)

使用every创建2018725~2018730分区表

CREATE TABLE test_partition_every ( id NUMERIC, name CHARACTER VARYING(32), dw_end_date DATE ) DISTRIBUTED BY (id) PARTITION BY range (dw_end_date) (PARTITION p201807 start('2018-7-25'::DATE) END ('2018-7-30'::DATE) every('1 days'::interval));

查看表信息

testDB=# \d+ test_partition_every; Table "public.test_partition_every" Column \| Type \| Modifiers \| Storage \| Description -------------+-----------------------+-----------+----------+------------- id \| numeric \| \| main \| name \| character varying(32) \| \| extended \| dw_end_date \| date \| \| plain \| Child tables: test_partition_every_1_prt_p201807_1, test_partition_every_1_prt_p201807_2, test_partition_every_1_prt_p201807_3, test_partition_every_1_prt_p201807_4, test_partition_every_1_prt_p201807_5 Has OIDs: no Distributed by: (id)

创建list分区

CREATE TABLE test_partition_list ( id NUMERIC, city CHARACTER VARYING(32) ) DISTRIBUTED BY (id) PARTITION BY list (city) ( PARTITION shanghai VALUES ('shanghai'), PARTITION beijing VALUES ('beijing'), PARTITION guangzhou VALUES ('guangzhou'), DEFAULT PARTITION other_city ) ;

查看表信息

testDB=# \d+ test_partition_list; Table "public.test_partition_list" Column \| Type \| Modifiers \| Storage \| Description --------+-----------------------+-----------+----------+------------- id \| numeric \| \| main \| city \| character varying(32) \| \| extended \| Child tables: test_partition_list_1_prt_beijing, test_partition_list_1_prt_guangzhou, test_partition_list_1_prt_other_city, test_partition_list_1_prt_shanghai Has OIDs: no Distributed by: (id)

2.12外部表

Greenplum在数据加载上有一个明显的优势，就是支持数据并发加载，gpfdist就是并发加载的工具，在数据库中对应的就是外部表。
gpfdist的实现架构图如下所示

greenplumhadoop集成 greenplum partition_数据_02

启动gpfdist及创建外部表的步骤如下：

首先在文件服务器上启动gpfdist的服务，指定文件目录及端口

$ gpfdist -d /home/gpadmin -p 8081 -l /home/gpadmin/log &

查看是否成功

[gpadmin@sparkproject1 tmp]$ cat gpfdist.log nohup: ignoring input Serving HTTP on port 8888, directory /home/gpadmin

查看端口号

[gpadmin@sparkproject1 greenplum-db]$ lsof -i:8888 COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME gpfdist 51700 gpadmin 6u IPv6 124806 0t0 TCP *:ddi-tcp-1 (LISTEN)

准备需要加载的数据放在/home/gpadmin目录下，或者其子目录下，并创建外部表。

CREATE EXTERNAL TABLE test_ext ( id INTEGER, name VARCHAR(128) ) location ('gpfdist://192.168.10.136:8888/test_ext.txt') format 'text' (delimiter AS '\|' NULL AS '' ESCAPE 'off') encoding 'GB18030' Log errors INTO test_ext_err segment reject limit 10 rows;

查看表信息

testDB=# \d test_ext External table "public.test_ext" Column \| Type \| Modifiers --------+------------------------+----------- id \| integer \| name \| character varying(128) \| Type: readable Encoding: GB18030 Format type: text Format options: delimiter '\|' null '' escape 'off' External location: gpfdist://192.168.10.136:8888/test_ext.txt Segment reject limit: 10 rows Error table: test_ext_err

本文章为转载内容，我们尊重原作者对文章享有的著作权。如有内容错误或侵权问题，欢迎原作者联系我们进行内容更正或删除文章。

上一篇：gpio python 状态 gpio inittypedef

下一篇：ios 移动端样式 css移动端布局

提问和评论都可以，用心的回复会被更多人看到评论

发布评论

相关文章

官方博客	全部文章	热门标签	班级博客
了解我们	网站地图	意见反馈

鸿蒙开发者社区	51CTO学堂
51CTO	软考资讯