gvcf文件建立索引的begin和end值怎么确定 gp创建索引

转载

huatechinfo 2024-04-07 07:26:07

文章标签 模型创建索引 PostgreSQL ide postgresql 文章分类 数据仓库大数据

前言

gvcf文件建立索引的begin和end值怎么确定 gp创建索引_postgresql

在昨天创建 GIN索引时，我们使用了pg_trgm插件，pg_trgm主要提供的是基于三元模型(trigram)匹配的字母数字文本相似度的函数和操作符。以及支持快速搜索相似字符串的索引操作符类。简而言之，trigram是三个词的序列如"中国/人民/万岁"，"打开/你的/心扉"。这与自然语言模型相关。在没有使用插件直接创建 GIN索引的情况下，默认情况下会报错，后面的例子会介绍到。

ERROR: data type text has no default operator class for access method "gin" HINT: You must specify an operator class for the index or define a default operator class for the data typ

Full Text Search

在没有通过pg_trgm插件创建 GIN索引的情况下，需要使用 FTS，即全文搜索。PostgreSQL提供两种数据类型进行全文搜索，tsvector类型表示一个为文本搜索优化的形式下的文档，tsquery类型表示一个文本查询。

让我们来建张表和索引测试一下。

create table documents(doc text, doc_tsv tsvector);

insert into documents(doc) values
('Whats new in PostgreSQL 12'), 
('SQL and developer-related features'),
('All relevant features will be covered'),
('PostgreSQL 12 provides some new features that are especially important to developers'), 
('The PostgreSQL documentation has traditionally been very good'), 
('PostgreSQL 12 has more features that can simplify development'),
('PostgreSQL has a better solution to the problem'),
('Suppose we want to store data in kilometers and nautical miles'),
('For the sake of simplicity');

pgbench=# update documents set doc_tsv=to_tsvector(doc);
UPDATE 9

create table documents(doc text, doc_tsv tsvector);

insert into documents(doc) values
('Whats new in PostgreSQL 12'), 
('SQL and developer-related features'),
('All relevant features will be covered'),
('PostgreSQL 12 provides some new features that are especially important to developers'), 
('The PostgreSQL documentation has traditionally been very good'), 
('PostgreSQL 12 has more features that can simplify development'),
('PostgreSQL has a better solution to the problem'),
('Suppose we want to store data in kilometers and nautical miles'),
('For the sake of simplicity');

pgbench=# update documents set doc_tsv=to_tsvector(doc);
UPDATE 9

生成数据之后，我们可以在表中看到 tsvector类型的数据。

pgbench=# select doc_tsv from documents;
                                                                 doc_tsv                                                                 
-------------------------------------------------------------------------------------------------------------------
 '12':5 'in':3 'new':2 'postgresql':4 'whats':1
 'and':2 'developer':4 'developer-related':3 'features':6 'related':5 'sql':1
 'all':1 'be':5 'covered':6 'features':3 'relevant':2 'will':4
 '12':2 'are':8 'developers':12 'especially':9 'features':6 'important':10 'new':5 'postgresql':1 'provides':3 'some':4 'that':7 'to':11
 'been':6 'documentation':3 'good':8 'has':4 'postgresql':2 'the':1 'traditionally':5 'very':7
 '12':2 'can':7 'development':9 'features':5 'has':3 'more':4 'postgresql':1 'simplify':8 'that':6
 'a':3 'better':4 'has':2 'postgresql':1 'problem':8 'solution':5 'the':7 'to':6
 'and':9 'data':6 'in':7 'kilometers':8 'miles':11 'nautical':10 'store':5 'suppose':1 'to':4 'want':3 'we':2
 'for':1 'of':4 'sake':3 'simplicity':5 'the':2
(9 rows)

pgbench=# select doc_tsv from documents;
                                                                 doc_tsv                                                                 
-------------------------------------------------------------------------------------------------------------------
 '12':5 'in':3 'new':2 'postgresql':4 'whats':1
 'and':2 'developer':4 'developer-related':3 'features':6 'related':5 'sql':1
 'all':1 'be':5 'covered':6 'features':3 'relevant':2 'will':4
 '12':2 'are':8 'developers':12 'especially':9 'features':6 'important':10 'new':5 'postgresql':1 'provides':3 'some':4 'that':7 'to':11
 'been':6 'documentation':3 'good':8 'has':4 'postgresql':2 'the':1 'traditionally':5 'very':7
 '12':2 'can':7 'development':9 'features':5 'has':3 'more':4 'postgresql':1 'simplify':8 'that':6
 'a':3 'better':4 'has':2 'postgresql':1 'problem':8 'solution':5 'the':7 'to':6
 'and':9 'data':6 'in':7 'kilometers':8 'miles':11 'nautical':10 'store':5 'suppose':1 'to':4 'want':3 'we':2
 'for':1 'of':4 'sake':3 'simplicity':5 'the':2
(9 rows)

我们会发现它记录下每一个单词的位置。例如， Whats new in PostgreSQL 12。将单词 Whats记录在第1位， new记录在第2位。in位于第3位， PostgreSQL位于第4位，而12位于第5位。

然后创建一个 GIN索引。

pgbench=# CREATE INDEX idx_gin_doc ON documents USING gin(doc);
ERROR:  data type text has no default operator class for access method "gin"
HINT:  You must specify an operator class for the index or define a default operator class for the data type.

pgbench=# CREATE INDEX idx_gin_doc ON documents USING gin(doc);
ERROR:  data type text has no default operator class for access method "gin"
HINT:  You must specify an operator class for the index or define a default operator class for the data type.

在这里，直接在 text列上创建将报data type类型错误，它只能建在tsvector类型上。实际上，您也可以建在text文本列，但您必须使用to_tsvector函数进行转换，这就相当于是GIN的函数索引。为便于演示，这里我在类型为 tsvector的列上创建它。然后使用@@ to_tsquery进行文本匹配查询，能使用创建的GIN索引。

pgbench=# CREATE INDEX idx_gin_doc ON documents USING gin(doc_tsv);
CREATE INDEX

pgbench=# select doc from documents WHERE doc_tsv @@ to_tsquery('new');
                                         doc                                          
--------------------------------------------------------------------------------------
 Whats new in PostgreSQL 12
 PostgreSQL 12 provides some new features that are especially important to developers
(2 rows)

pgbench=# set enable_seqscan=off;
SET
pgbench=# explain analyze select * from documents WHERE doc_tsv @@ to_tsquery('new');
                                                     QUERY PLAN                                                     
--------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on documents  (cost=8.25..12.51 rows=1 width=64) (actual time=0.020..0.020 rows=2 loops=1)
   Recheck Cond: (doc_tsv @@ to_tsquery('new'::text))
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on idx_gin_doc  (cost=0.00..8.25 rows=1 width=0) (actual time=0.014..0.014 rows=2 loops=1)
         Index Cond: (doc_tsv @@ to_tsquery('new'::text))
 Planning Time: 0.096 ms
 Execution Time: 0.046 ms
(7 rows)

pgbench=# CREATE INDEX idx_gin_doc ON documents USING gin(doc_tsv);
CREATE INDEX

pgbench=# select doc from documents WHERE doc_tsv @@ to_tsquery('new');
                                         doc                                          
--------------------------------------------------------------------------------------
 Whats new in PostgreSQL 12
 PostgreSQL 12 provides some new features that are especially important to developers
(2 rows)

pgbench=# set enable_seqscan=off;
SET
pgbench=# explain analyze select * from documents WHERE doc_tsv @@ to_tsquery('new');
                                                     QUERY PLAN                                                     
--------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on documents  (cost=8.25..12.51 rows=1 width=64) (actual time=0.020..0.020 rows=2 loops=1)
   Recheck Cond: (doc_tsv @@ to_tsquery('new'::text))
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on idx_gin_doc  (cost=0.00..8.25 rows=1 width=0) (actual time=0.014..0.014 rows=2 loops=1)
         Index Cond: (doc_tsv @@ to_tsquery('new'::text))
 Planning Time: 0.096 ms
 Execution Time: 0.046 ms
(7 rows)

通过查询new关键字，可以快速查找到相关行，从而实现文本的全文检索。可以在此to_tsquery函数中使用多种运算符。例如&,|等运算符。让我们看看一些示例，例如，我希望同时搜索 new和 important。我会用&。

pgbench=# select doc from documents WHERE doc_tsv @@ to_tsquery('new & important');
                                         doc                                          
--------------------------------------------------------------------------------------
 PostgreSQL 12 provides some new features that are especially important to developers
(1 row)

pgbench=# select doc from documents WHERE doc_tsv @@ to_tsquery('new & important');
                                         doc                                          
--------------------------------------------------------------------------------------
 PostgreSQL 12 provides some new features that are especially important to developers
(1 row)

我想建索又有new，又有for的就使用|运算符。

pgbench=# select doc from documents WHERE doc_tsv @@ to_tsquery('new | for');      
                                         doc                                          
--------------------------------------------------------------------------------------
 Whats new in PostgreSQL 12
 PostgreSQL 12 provides some new features that are especially important to developers
 For the sake of simplicity
(3 rows)

pgbench=# select doc from documents WHERE doc_tsv @@ to_tsquery('new | for');      
                                         doc                                          
--------------------------------------------------------------------------------------
 Whats new in PostgreSQL 12
 PostgreSQL 12 provides some new features that are especially important to developers
 For the sake of simplicity
(3 rows)

中文支持

用英文分词很简单，如果用中文呢？让我们测试一下。

pgbench=# insert into documents(doc) values('在创建索引方面，您可以选择GIN或GIST索引');
INSERT 0 1
pgbench=# 
pgbench=# select * from documents;
                   doc                   | doc_tsv 
-----------------------------------------+---------
 在创建索引方面，您可以选择GIN或GIST索引 | 
(1 row)

pgbench=# update documents set doc_tsv=to_tsvector(doc);
UPDATE 1
pgbench=# select * from documents;
                   doc                   |                    doc_tsv                     
-----------------------------------------+------------------------------------------------
 在创建索引方面，您可以选择GIN或GIST索引 | '在创建索引方面':1 '您可以选择gin或gist索引':2
(1 row)

pgbench=# insert into documents(doc) values('在创建索引方面，您可以选择GIN或GIST索引');
INSERT 0 1
pgbench=# 
pgbench=# select * from documents;
                   doc                   | doc_tsv 
-----------------------------------------+---------
 在创建索引方面，您可以选择GIN或GIST索引 | 
(1 row)

pgbench=# update documents set doc_tsv=to_tsvector(doc);
UPDATE 1
pgbench=# select * from documents;
                   doc                   |                    doc_tsv                     
-----------------------------------------+------------------------------------------------
 在创建索引方面，您可以选择GIN或GIST索引 | '在创建索引方面':1 '您可以选择gin或gist索引':2
(1 row)

如果我们再用pg_trgm测试一下，就会发现它对中文分析支持也不友好。

pgbench=# select show_trgm('创建索引方面，您可以选择GIN或GIST索引');
                                                                                          show_trgm                                                                                           
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 {0x9e1f65,0xa9ca35,0xada657,0xb24211,0xba5259,0xbcacb2,0xc5dbb9,0xd3850e,0xdd1b67,0xe36df1,0xe48fb5,0x0856a5,0x14bc15,0x15ab82,0x250642,0x272771,0x46a8bf,0x54dce1,0x5c2f58,gin,gis,ist,v1P}
(1 row)

pgbench=# select show_trgm('创建索引方面，您可以选择GIN或GIST索引');
                                                                                          show_trgm                                                                                           
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 {0x9e1f65,0xa9ca35,0xada657,0xb24211,0xba5259,0xbcacb2,0xc5dbb9,0xd3850e,0xdd1b67,0xe36df1,0xe48fb5,0x0856a5,0x14bc15,0x15ab82,0x250642,0x272771,0x46a8bf,0x54dce1,0x5c2f58,gin,gis,ist,v1P}
(1 row)

这里查询的时候转换成了16进制，也是把所有单独的中文字分开了。例如，(您可以选择)，分词就会产生您/可/以/选/择。很明显，这是不对的，因为选择在这里是一个词。对于中文分词，需要通过专业的中文分析插件来实现，比如有一个插件叫 zhparser，但是作者有一段时间没更新了。