find_in_set index optimization: the RUM index

Reposted

mob64ca140c75c7 2024-03-20 21:55:19


RUM

The RUM index can be seen as an extension of GIN designed to make full-text search faster. It is available from https://github.com/postgrespro/rum.

Limitations of GIN indexes

GIN indexes allow fast full-text search over the tsvector and tsquery types. However, full-text search with a GIN index has several problems:

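Slow ranking. Ranking needs the positions of lexemes, but a GIN index does not store them. After the index scan, an extra heap scan is therefore needed to retrieve the lexeme positions.
Slow phrase search. This is the same problem as with ranking: phrase search also needs positional information.
Slow ordering by timestamp. A GIN index cannot store related information alongside the lexemes in the index, so an additional heap scan is required, with its extra cost.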

The figure below shows the difference between GIN and RUM: RUM adds an addinfo field that can store positions or timestamps.

(Figure: GIN and RUM index structures compared)

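The drawback of RUM is that building the index and inserting data are slower than with GIN. This is because additional information has to be stored besides the keys, and because RUM uses generic WAL records.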

Phrase search

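A full-text search query can contain special operators that take the distance between lexemes into account. For example, we can find documents in which «hand» and «thigh» are separated by two other words: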

postgres=# select to_tsvector('Clap your hands, slap your thigh') @@
                  to_tsquery('hand <3> thigh');
 ?column?
----------
 t
(1 row)

We can also require that the words follow one another immediately:

postgres=# select to_tsvector('Clap your hands, slap your thigh') @@
                  to_tsquery('hand <-> slap');
 ?column?
----------
 t
(1 row)

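A regular GIN index can return the documents that contain both lexemes, but the distance between them can only be checked by looking into the tsvector: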

postgres=# select to_tsvector('Clap your hands, slap your thigh');
             to_tsvector              
--------------------------------------
 'clap':1 'hand':3 'slap':4 'thigh':6
(1 row)
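As a minimal sketch of how the positions decide the match, the query below varies the distance in the phrase operator. It assumes the 'english' text search configuration (the source relies on the session default, which we name explicitly here). Since 'hand' is at position 3 and 'thigh' at position 6, only a distance of exactly 3 matches.

```sql
-- Assumption: the 'english' configuration, stated explicitly for clarity.
-- Positions are 'hand':3 and 'thigh':6, so the distance must be exactly 3.
select to_tsvector('english', 'Clap your hands, slap your thigh')
         @@ to_tsquery('english', 'hand <2> thigh') as dist_2,  -- false
       to_tsvector('english', 'Clap your hands, slap your thigh')
         @@ to_tsquery('english', 'hand <3> thigh') as dist_3;  -- true

-- phraseto_tsquery builds the distance operators from plain text.
select phraseto_tsquery('english', 'slap your thigh');
```

Both queries use only core PostgreSQL functions, so they can be tried without the RUM extension installed.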

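In a RUM index, each lexeme does not just reference table rows: each TID is also supplied with the list of positions at which the lexeme occurs in the document.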

postgres=# create extension rum;

postgres=# create index on ts using rum(doc_tsv);

(Figure: RUM index with lexeme positions attached to each TID)

In the figure above, the gray squares contain the positional information of the lexemes.

postgres=# select ctid, left(doc,20), doc_tsv from ts;
  ctid |         left         |                         doc_tsv                         
-------+----------------------+---------------------------------------------------------
 (0,1) | Can a sheet slitter  | 'sheet':3,6 'slit':5 'slitter':4
 (0,2) | How many sheets coul | 'could':4 'mani':2 'sheet':3,6 'slit':8 'slitter':7
 (0,3) | I slit a sheet, a sh | 'sheet':4,6 'slit':2,8
 (1,1) | Upon a slitted sheet | 'sheet':4 'sit':6 'slit':3 'upon':1
 (1,2) | Whoever slit the she | 'good':7 'sheet':4,8 'slit':2 'slitter':9 'whoever':1
 (1,3) | I am a sheet slitter | 'sheet':4 'slitter':5
 (2,1) | I slit sheets.       | 'sheet':3 'slit':2
 (2,2) | I am the sleekest sh | 'ever':8 'sheet':5,10 'sleekest':4 'slit':9 'slitter':6
 (2,3) | She slits the sheet  | 'sheet':4 'sit':6 'slit':2
(9 rows)

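GIN also provides deferred insertion when the fastupdate parameter is set. This functionality has been removed from RUM.
Let's look at an example closer to a production environment, using data from the mail_messages table.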

fts=# alter table mail_messages add column tsv tsvector;

fts=# set default_text_search_config = default;

fts=# update mail_messages
set tsv = to_tsvector(body_plain);
...
UPDATE 356125

The execution plan for a search using the GIN index is as follows:

fts=# create index tsv_gin on mail_messages using gin(tsv);

fts=# explain (costs off, analyze)
select * from mail_messages where tsv @@ to_tsquery('hello <-> hackers');
                                   QUERY PLAN                                    
---------------------------------------------------------------------------------
 Bitmap Heap Scan on mail_messages (actual time=2.490..18.088 rows=259 loops=1)
   Recheck Cond: (tsv @@ to_tsquery('hello <-> hackers'::text))
   Rows Removed by Index Recheck: 1517
   Heap Blocks: exact=1503
   ->  Bitmap Index Scan on tsv_gin (actual time=2.204..2.204 rows=1776 loops=1)
         Index Cond: (tsv @@ to_tsquery('hello <-> hackers'::text))
 Planning time: 0.266 ms
 Execution time: 18.151 ms
(8 rows)

The plan shows that the GIN index returned 1776 potential matches, of which 1517 were discarded at the recheck stage, leaving 259.

Let's drop the GIN index and create a RUM index instead:

fts=# drop index tsv_gin;

fts=# create index tsv_rum on mail_messages using rum(tsv);

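In the plan for the RUM index, the index itself holds all the information needed to evaluate the phrase: the index scan returns exactly the 259 matching rows, which makes the query faster: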

fts=# explain (costs off, analyze)
select * from mail_messages
where tsv @@ to_tsquery('hello <-> hackers');
                                   QUERY PLAN                                  
--------------------------------------------------------------------------------
 Bitmap Heap Scan on mail_messages (actual time=2.798..3.015 rows=259 loops=1)
   Recheck Cond: (tsv @@ to_tsquery('hello <-> hackers'::text))
   Heap Blocks: exact=250
   ->  Bitmap Index Scan on tsv_rum (actual time=2.768..2.768 rows=259 loops=1)
         Index Cond: (tsv @@ to_tsquery('hello <-> hackers'::text))
 Planning time: 0.245 ms
 Execution time: 3.053 ms
(7 rows)

Ordering

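To return documents conveniently in the required order, the RUM index supports ordering operators, which we discussed in the article on GiST. RUM defines the <=> operator, which returns the distance between a document («tsvector») and a query («tsquery»). For example: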

fts=# select to_tsvector('Can a sheet slitter slit sheets?') <=> to_tsquery('slit');
 ?column?
----------
  16.4493
(1 row)

fts=# select to_tsvector('Can a sheet slitter slit sheets?') <=> to_tsquery('sheet');
 ?column?
----------
  13.1595
(1 row)

Let's compare GIN and RUM on a relatively large data set, starting with GIN:

fts=# explain (costs off, analyze)
select * from mail_messages 
where tsv @@ to_tsquery('hello & hackers') 
order by ts_rank(tsv,to_tsquery('hello & hackers')) 
limit 10;
                                         QUERY PLAN
---------------------------------------------------------------------------------------------
 Limit (actual time=27.076..27.078 rows=10 loops=1)
   ->  Sort (actual time=27.075..27.076 rows=10 loops=1)
         Sort Key: (ts_rank(tsv, to_tsquery('hello & hackers'::text)))
         Sort Method: top-N heapsort  Memory: 29kB
         ->  Bitmap Heap Scan on mail_messages (actual ... rows=1776 loops=1)
               Recheck Cond: (tsv @@ to_tsquery('hello & hackers'::text))
               Heap Blocks: exact=1503
               ->  Bitmap Index Scan on tsv_gin (actual ... rows=1776 loops=1)
                     Index Cond: (tsv @@ to_tsquery('hello & hackers'::text))
 Planning time: 0.276 ms
 Execution time: 27.121 ms
(11 rows)

The GIN index returned 1776 rows, which were then sorted to select the top 10.

With the RUM index, the query is executed by a simple index scan: no extra rows are examined and no separate sort is needed.

fts=# explain (costs off, analyze)
select * from mail_messages
where tsv @@ to_tsquery('hello & hackers')
order by tsv <=> to_tsquery('hello & hackers')
limit 10;
                                         QUERY PLAN
--------------------------------------------------------------------------------------------
 Limit (actual time=5.083..5.171 rows=10 loops=1)
   ->  Index Scan using tsv_rum on mail_messages (actual ... rows=10 loops=1)
         Index Cond: (tsv @@ to_tsquery('hello & hackers'::text))
         Order By: (tsv <=> to_tsquery('hello & hackers'::text))
 Planning time: 0.244 ms
 Execution time: 5.207 ms
(6 rows)

Additional information

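Both RUM and GIN indexes can be built on several columns. But while GIN stores the lexemes of each column independently, RUM lets us associate the main column (the tsvector in this case) with an additional one. To do this, we need to use the special operator class «rum_tsvector_addon_ops»: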

fts=# create index on mail_messages using rum(tsv RUM_TSVECTOR_ADDON_OPS, sent)
  WITH (ATTACH='sent', TO='tsv');

We can use this index to return results ordered by the additional column:

fts=# select id, sent, sent <=> '2017-01-01 15:00:00'
from mail_messages
where tsv @@ to_tsquery('hello')
order by sent <=> '2017-01-01 15:00:00'
limit 10;
   id    |        sent         | ?column? 
---------+---------------------+----------
 2298548 | 2017-01-01 15:03:22 |      202
 2298547 | 2017-01-01 14:53:13 |      407
 2298545 | 2017-01-01 13:28:12 |     5508
 2298554 | 2017-01-01 18:30:45 |    12645
 2298530 | 2016-12-31 20:28:48 |    66672
 2298587 | 2017-01-02 12:39:26 |    77966
 2298588 | 2017-01-02 12:43:22 |    78202
 2298597 | 2017-01-02 13:48:02 |    82082
 2298606 | 2017-01-02 15:50:50 |    89450
 2298628 | 2017-01-02 18:55:49 |   100549
(10 rows)

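Here we search for the rows closest to '2017-01-01 15:00:00', whether earlier or later, ordered by the time difference. To get only the results before (or after) the specified date, use the <=| (or |=>) operator.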

fts=# explain (costs off)
select id, sent, sent <=> '2017-01-01 15:00:00' 
from mail_messages
where tsv @@ to_tsquery('hello')
order by sent <=> '2017-01-01 15:00:00'
limit 10;
                                   QUERY PLAN
---------------------------------------------------------------------------------
 Limit
   ->  Index Scan using mail_messages_tsv_sent_idx on mail_messages
         Index Cond: (tsv @@ to_tsquery('hello'::text))
         Order By: (sent <=> '2017-01-01 15:00:00'::timestamp without time zone)
(4 rows)

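If the index had been created without the additional information attached to the column, a similar query would have to sort all the results of the index scan.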
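The directional operators mentioned above can be sketched as follows. This is only an illustration that assumes the mail_messages table and the RUM index on (tsv, sent) created earlier; it restricts the ordering to messages sent before the given moment, and no output is shown because it depends on the data.

```sql
-- Sketch: the ten matching messages sent before the given timestamp,
-- nearest first. Swap <=| for |=> to look at messages sent after it.
select id, sent, sent <=| '2017-01-01 15:00:00' as distance
from mail_messages
where tsv @@ to_tsquery('hello')
order by sent <=| '2017-01-01 15:00:00'
limit 10;
```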

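Besides dates, columns of other data types can be added to a RUM index; almost all base types are supported. For example, an online store can quickly show goods by novelty (date), price (numeric), or popularity or discount value (int, float).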

Other operator classes

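Let's start with rum_tsvector_hash_ops and rum_tsvector_hash_addon_ops. They are similar to the already discussed rum_tsvector_ops and rum_tsvector_addon_ops, but the index stores hash codes of the lexemes rather than the lexemes themselves. This can reduce the index size, but the search becomes less precise and requires a recheck. In addition, the index no longer supports search by partial match.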
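A minimal sketch of creating such an index, assuming the mail_messages table from the earlier examples; the index name tsv_rum_hash is made up for illustration.

```sql
-- Hypothetical index name; the operator class is the one named in the text.
create index tsv_rum_hash on mail_messages
  using rum(tsv rum_tsvector_hash_ops);

-- Ordinary lexeme queries are written exactly as before.
select count(*) from mail_messages
where tsv @@ to_tsquery('hello & hackers');
```

Because only hash codes are stored, prefix queries such as to_tsquery('hack:*') should not be expected to benefit from this index.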

The rum_tsquery_ops operator class lets us solve the «inverse» problem: finding the queries that match a given document. For example:

fts=# create table categories(query tsquery, category text);

fts=# insert into categories values
  (to_tsquery('vacuum | autovacuum | freeze'), 'vacuum'),
  (to_tsquery('xmin | xmax | snapshot | isolation'), 'mvcc'),
  (to_tsquery('wal | (write & ahead & log) | durability'), 'wal');

fts=# create index on categories using rum(query);

fts=# select array_agg(category)
from categories
where to_tsvector(
  'Hello hackers, the attached patch greatly improves performance of tuple
   freezing and also reduces size of generated write-ahead logs.'
) @@ query;
  array_agg  
--------------
 {vacuum,wal}
(1 row)

The remaining operator classes, rum_anyarray_ops and rum_anyarray_addon_ops, are designed for working with arrays.
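As a rough sketch of the array operator class, modelled on the example in the RUM README as far as we recall it, the following indexes an array column and orders overlapping rows by distance. The table and index names are purely illustrative, and the exact set of supported operators should be checked against the README.

```sql
-- Illustrative table and index names (assumptions, not from the article).
create table test_array (i int2[]);
insert into test_array values
  ('{}'), ('{0}'), ('{1,2,3,4}'), ('{1,2,3}'), ('{1,2}'), ('{1}');

create index idx_array on test_array using rum (i rum_anyarray_ops);

-- Rows whose array overlaps {1}, nearest to {1} first.
select * from test_array
where i && '{1}'
order by i <=> '{1}' asc;
```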

Index size and the write-ahead log (WAL)

Clearly, since RUM stores more information than GIN, a RUM index is larger.

  rum   |  gin   |  gist  | btree
--------+--------+--------+--------
 457 MB | 179 MB | 125 MB | 546 MB

As we can see, the RUM index is considerably larger than GIN (457 MB versus 179 MB); this is the price of the faster search.
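The article does not show how the sizes were obtained. One way to collect such figures, sketched here under the assumption that all the indexes being compared exist on mail_messages at the same time, is to query the system catalogs:

```sql
-- List every index on mail_messages with its access method and size.
select c.relname  as index_name,
       am.amname  as access_method,
       pg_size_pretty(pg_relation_size(c.oid)) as size
from pg_index i
join pg_class c  on c.oid = i.indexrelid
join pg_am    am on am.oid = c.relam
where i.indrelid = 'mail_messages'::regclass
order by pg_relation_size(c.oid) desc;
```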

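RUM is an extension, that is, it can be installed without any changes to the system core. However, changes to a RUM index generate more WAL than the same changes to a GIN index. To see how much, we run several rounds of inserts and deletes below and look at the amount of log produced.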

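The amount of WAL generated can be measured as the difference between the log positions reported by the pg_current_wal_location function (pg_current_xlog_location in earlier versions). Note that in released PostgreSQL 10 and later this function is named pg_current_wal_lsn.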

fts=# select pg_current_wal_location() as start_lsn \gset

fts=# insert into mail_messages(parent_id, sent, subject, author, body_plain, tsv)
  select parent_id, sent, subject, author, body_plain, tsv
  from mail_messages where id % 100 = 0;
INSERT 0 3576

fts=# delete from mail_messages where id % 100 = 99;
DELETE 3590

fts=# vacuum mail_messages;

fts=# insert into mail_messages(parent_id, sent, subject, author, body_plain, tsv)
  select parent_id, sent, subject, author, body_plain, tsv
  from mail_messages where id % 100 = 1;
INSERT 0 3605

fts=# delete from mail_messages where id % 100 = 98;
DELETE 3637

fts=# vacuum mail_messages;

fts=# insert into mail_messages(parent_id, sent, subject, author, body_plain, tsv)
  select parent_id, sent, subject, author, body_plain, tsv from mail_messages
  where id % 100 = 2;
INSERT 0 3625

fts=# delete from mail_messages where id % 100 = 97;
DELETE 3668

fts=# vacuum mail_messages;

fts=# select pg_current_wal_location() as end_lsn \gset
fts=# select pg_size_pretty(:'end_lsn'::pg_lsn - :'start_lsn'::pg_lsn);
 pg_size_pretty
----------------
 3114 MB
(1 row)

So about 3 GB of WAL was generated. If we repeat the same experiment with a GIN index, only about 700 MB is produced.
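On PostgreSQL 10 or later, the same measurement can be sketched with the current function names. This assumes a psql session (the \gset meta-command is psql-specific) and uses pg_wal_lsn_diff instead of subtracting pg_lsn values directly; both approaches give the number of bytes between two log positions.

```sql
-- Remember the starting WAL position in a psql variable.
select pg_current_wal_lsn() as start_lsn \gset

-- ... run the inserts, deletes and vacuums being measured ...

-- Bytes of WAL written since the starting position.
select pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), :'start_lsn'::pg_lsn)
       ) as wal_generated;
```

The figure includes WAL written by every session on the server during the interval, so the experiment is best run on an otherwise idle instance.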

Access method properties

 amname |     name      | pg_indexam_has_property
--------+---------------+-------------------------
 rum    | can_order     | f
 rum    | can_unique    | f
 rum    | can_multi_col | t
 rum    | can_exclude   | t -- f for gin
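A table like the one above can be produced with the pg_indexam_has_property function. The query below is a sketch that assumes the rum extension is installed, so that 'rum' appears in pg_am, and lists GIN alongside for comparison.

```sql
-- Access-method-level properties for RUM and GIN side by side.
select a.amname, p.name, pg_indexam_has_property(a.oid, p.name)
from pg_am a,
     unnest(array['can_order', 'can_unique',
                  'can_multi_col', 'can_exclude']) p(name)
where a.amname in ('rum', 'gin')
order by a.amname, p.name;
```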

Index-level properties

     name      | pg_index_has_property
---------------+-----------------------
 clusterable   | f
 index_scan    | t -- f for gin
 bitmap_scan   | t
 backward_scan | f

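Note that, unlike GIN, RUM supports index scans. Otherwise it would not be possible to return exactly the required number of results in queries with a limit clause. Accordingly, there is no need for a counterpart of the «gin_fuzzy_search_limit» parameter. As a consequence, the index can also be used to support exclusion constraints.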

Column-level properties:

        name        | pg_index_column_has_property
--------------------+------------------------------
 asc                | f
 desc               | f
 nulls_first        | f
 nulls_last         | f
 orderable          | f
 distance_orderable | t -- f for gin
 returnable         | f
 search_array       | f
 search_nulls       | f
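The column-level properties can be checked in the same spirit with pg_index_column_has_property. This sketch assumes the tsv_rum index from the earlier examples still exists and inspects its first (and only) column.

```sql
-- Column-level properties of column 1 of the tsv_rum index.
select p.name,
       pg_index_column_has_property('tsv_rum'::regclass, 1, p.name)
from unnest(array['asc', 'desc', 'nulls_first', 'nulls_last',
                  'orderable', 'distance_orderable', 'returnable',
                  'search_array', 'search_nulls']) p(name);
```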

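The difference here is that RUM supports ordering operators. However, this is not true for all operator classes: for example, it is false for the tsquery operator class («rum_tsquery_ops»).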

Reference: https://github.com/postgrespro/rum
