一、环境准备

GP集群环境情况如下图所示,master实现容错,配置standby master。两个节点为segment节点,其中每个几点配置两个segment,未配置mirror segment。

greenplumhadoop集成 greenplum partition_ci

 

二、使用案例

2.1登录数据库

登录Greenplum数据库,默认的数据库为postgres

[gpadmin@sparkproject1 ~]$ psql -d testDB

psql (8.2.15)

Type "help" for help.

 

testDB=#

2.2创建数据库

testDB=# create database mydb;

CREATE DATABASE

2.3创建表

Greenplum中创建表语句与普通的数据库的建表语句区别不大,不同之处主要包括以下几点:

  1. 在greenplum中建表时需要指定表的分布键(DISTRIBUTED BY)。
  2. 如果表需要用某个字段分区,可以使用PARTITION BY将表建成分区表。
  3. 使用like操作创建一样表结构的表。

Greenplum有两种数据分布策略:

  1. Hash分布。通过hash值路由到特定的Segment,语法为DISTRIBUTED BY(..),如果不指定分布键,则默认第一个字段为分布键。
  2. 随机分布。也叫平均分布,在执行关联等操作时性能较差,语法为Distributed randomly

以下两种方式结果一样,都是以id作为分布键。

testDB=# create table test1(id int,name varchar(128));

NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table.

HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.

CREATE TABLE

testDB=# create table test2(id int,name varchar(128)) distributed by (id);

CREATE TABLE

描述表信息

testDB=# \d test1

            Table "public.test1"

 Column |          Type          | Modifiers

--------+------------------------+-----------

 id     | integer                |

 name   | character varying(128) |

Distributed by: (id)

下面的创建表语句采用了随机分布的方式

testDB=# create table test3(id int,name varchar(128)) distributed randomly;

CREATE TABLE

使用like创建表

testDB=# create table test3_like(like test3);

NOTICE:  Table doesn't have 'distributed by' clause, defaulting to distribution columns from LIKE table

CREATE TABLE

2.4显示数据库

 

testDB=# \l

                  List of databases

   Name    |  Owner  | Encoding |  Access privileges 

-----------+---------+----------+---------------------

 mydb      | gpadmin | UTF8     |

 postgres  | gpadmin | UTF8     |

 template0 | gpadmin | UTF8     | =c/gpadmin         

                                : gpadmin=CTc/gpadmin

 template1 | gpadmin | UTF8     | =c/gpadmin         

                                : gpadmin=CTc/gpadmin

 testDB    | gpadmin | UTF8     |

(5 rows)

2.5显示表

testDB=# \dt

                List of relations

 Schema |    Name    | Type  |  Owner  | Storage

--------+------------+-------+---------+---------

 public | test01     | table | gpadmin | heap

 public | test1      | table | gpadmin | heap

 public | test2      | table | gpadmin | heap

 public | test3      | table | gpadmin | heap

 public | test3_like | table | gpadmin | heap

(5 rows)

2.6插入数据

testDB=# insert into test1 values(1,'tom'),(2,'jack'),(3,'Bob');

INSERT 0 3

2.7查询数据

Greenplum的数据分布在所有的Segment上,当从一个表查询数据的时候,Master的数据展现顺序是以Master先接收到的数据的顺序,每个segment的数据到达Master的顺序是随机的,所以Select的结果的顺序是随机的。

testDB=# select * from test1;

 id | name

----+------

  3 | Bob

  1 | tom

  2 | jack

(3 rows)

2.8create table as与select into

create table as与select into功能一样,是根据select的结果创建一个新表,在临时分析数据的时候十分方便。在创建表时如果默认不指定分布键,那么Greenplum根据执行的select得到的结果集来选择,不用再次重分布数据的字段作为表的分布键

testDB=# create table test4 as select * from test1;

NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column(s) named 'id' as the Greenplum Database data distribution key for this table.

HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.

SELECT 3

可以手动指定分布键

testDB=# create table test5 as select * from test1 distributed by(name);

SELECT 3

查看表结构信息

testDB=# \d test5

            Table "public.test5"

 Column |          Type          | Modifiers

--------+------------------------+-----------

 id     | integer                |

 name   | character varying(128) |

Distributed by: (name)

Select into比create table as简单,但是select into不能指定分布键,只能使用默认的分布键。

testDB=# select * into test6 from test1;

NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column(s) named 'id' as the Greenplum Database data distribution key for this table.

HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.

SELECT 3

2.9 Explain

Explain用于查询一个表的执行计划,它在SQL优化的时候经常要用到。

testDB=# explain select * from orders join customer on orders.cid=customer.id;

                                          QUERY PLAN                                          

-----------------------------------------------------------------------------------------------

 Gather Motion 4:1  (slice2; segments: 4)  (cost=3.07..6.21 rows=1 width=21)

   ->  Hash Join  (cost=3.07..6.21 rows=1 width=21)

         Hash Cond: orders.cid = customer.id

         ->  Redistribute Motion 4:4  (slice1; segments: 4)  (cost=0.00..3.09 rows=1 width=13)

               Hash Key: orders.cid

               ->  Seq Scan on orders  (cost=0.00..3.03 rows=1 width=13)

         ->  Hash  (cost=3.03..3.03 rows=1 width=8)

               ->  Seq Scan on customer  (cost=0.00..3.03 rows=1 width=8)

(8 rows)

2.10分析函数之开窗函数

testDB=# select depname,empno,salary,rank() over (partition by depname order by salary desc),row_number() over (partition by depname order by salary desc) from empsalary;

  depname  | empno | salary | rank | row_number

-----------+-------+--------+------+------------

 develop   |     9 |  10000 |    1 |          1

 develop   |     7 |   9000 |    2 |          2

 develop   |    10 |   9000 |    2 |          3

 develop   |     5 |   7000 |    4 |          4

 personnel |     6 |   5600 |    1 |          1

 personnel |     3 |   3900 |    2 |          2

 sales     |     8 |   8000 |    1 |          1

 sales     |     4 |   5500 |    2 |          2

 sales     |     1 |   5000 |    3 |          3

 sales     |     2 |   4500 |    4 |          4

(10 rows)

2.11分区表

按照时间分区,创建2018725~2018730

CREATE TABLE test_partition_range (

        id NUMERIC,

        name CHARACTER VARYING(32),

        dw_end_date DATE

        ) DISTRIBUTED BY (id) PARTITION BY range (dw_end_date) (

        PARTITION p2018725 start('2018-7-25'::DATE) END ('2018-7-26'::DATE),

        PARTITION p2018726 start('2018-7-26'::DATE) END ('2018-7-27'::DATE),

        PARTITION p2018727 start('2018-7-27'::DATE) END ('2018-7-28'::DATE),

        PARTITION p2018728 start('2018-7-28'::DATE) END ('2018-7-29'::DATE),

        PARTITION p2018729 start('2018-7-29'::DATE) END ('2018-7-30'::DATE),

        PARTITION p2018730 start('2018-7-30'::DATE) END ('2018-7-31'::DATE)

        );

查看表信息

testDB=# \d+ test_partition_range;

                   Table "public.test_partition_range"

   Column    |         Type          | Modifiers | Storage  | Description

-------------+-----------------------+-----------+----------+-------------

 id          | numeric               |           | main     |

 name        | character varying(32) |           | extended |

 dw_end_date | date                  |           | plain    |

Child tables: test_partition_range_1_prt_p2018725,

              test_partition_range_1_prt_p2018726,

              test_partition_range_1_prt_p2018727,

              test_partition_range_1_prt_p2018728,

              test_partition_range_1_prt_p2018729,

              test_partition_range_1_prt_p2018730

Has OIDs: no

Distributed by: (id)

使用every创建2018725~2018730分区表

CREATE TABLE test_partition_every (

        id NUMERIC,

        name CHARACTER VARYING(32),

        dw_end_date DATE

        ) DISTRIBUTED BY (id) PARTITION BY range (dw_end_date) (PARTITION p201807 start('2018-7-25'::DATE) END ('2018-7-30'::DATE) every('1 days'::interval));

查看表信息

testDB=# \d+ test_partition_every;

                   Table "public.test_partition_every"

   Column    |         Type          | Modifiers | Storage  | Description

-------------+-----------------------+-----------+----------+-------------

 id          | numeric               |           | main     |

 name        | character varying(32) |           | extended |

 dw_end_date | date                  |           | plain    |

Child tables: test_partition_every_1_prt_p201807_1,

              test_partition_every_1_prt_p201807_2,

              test_partition_every_1_prt_p201807_3,

              test_partition_every_1_prt_p201807_4,

              test_partition_every_1_prt_p201807_5

Has OIDs: no

Distributed by: (id)

创建list分区

CREATE TABLE test_partition_list (

        id NUMERIC,

        city CHARACTER VARYING(32)

        ) DISTRIBUTED BY (id) PARTITION BY list (city) (

        PARTITION shanghai VALUES ('shanghai'),

        PARTITION beijing VALUES ('beijing'),

        PARTITION guangzhou VALUES ('guangzhou'),

        DEFAULT PARTITION other_city

        ) ;

查看表信息

testDB=# \d+ test_partition_list;

                 Table "public.test_partition_list"

 Column |         Type          | Modifiers | Storage  | Description

--------+-----------------------+-----------+----------+-------------

 id     | numeric               |           | main     |

 city   | character varying(32) |           | extended |

Child tables: test_partition_list_1_prt_beijing,

              test_partition_list_1_prt_guangzhou,

              test_partition_list_1_prt_other_city,

              test_partition_list_1_prt_shanghai

Has OIDs: no

Distributed by: (id)

2.12外部表

Greenplum在数据加载上有一个明显的优势,就是支持数据并发加载,gpfdist就是并发加载的工具,在数据库中对应的就是外部表。
gpfdist的实现架构图如下所示

greenplumhadoop集成 greenplum partition_数据_02

启动gpfdist及创建外部表的步骤如下:

  1. 首先在文件服务器上启动gpfdist的服务,指定文件目录及端口

$ gpfdist -d /home/gpadmin -p 8081 -l /home/gpadmin/log &

       查看是否成功

[gpadmin@sparkproject1 tmp]$ cat gpfdist.log

nohup: ignoring input

Serving HTTP on port 8888, directory /home/gpadmin

       查看端口号

[gpadmin@sparkproject1 greenplum-db]$ lsof -i:8888

COMMAND   PID    USER   FD   TYPE DEVICE SIZE/OFF NODE NAME

gpfdist 51700 gpadmin    6u  IPv6 124806      0t0  TCP *:ddi-tcp-1 (LISTEN)

  1. 准备需要加载的数据放在/home/gpadmin目录下,或者其子目录下,并创建外部表。

CREATE EXTERNAL TABLE test_ext (

id INTEGER,

name VARCHAR(128)

) location ('gpfdist://192.168.10.136:8888/test_ext.txt') format 'text' (delimiter AS  '|' NULL AS '' ESCAPE 'off') encoding 'GB18030' Log errors

INTO test_ext_err segment reject limit 10 rows;

查看表信息

testDB=# \d test_ext

      External table "public.test_ext"

 Column |          Type          | Modifiers

--------+------------------------+-----------

 id     | integer                |

 name   | character varying(128) |

Type: readable

Encoding: GB18030

Format type: text

Format options: delimiter '|' null '' escape 'off'

External location: gpfdist://192.168.10.136:8888/test_ext.txt

Segment reject limit: 10 rows

Error table: test_ext_err