I want a random selection of rows in PostgreSQL. I tried this:
select * from table where random() < 0.01;
But others recommend this:
select * from table order by random() limit 1000;
I have a very large table with 500 million rows, and I want the query to be fast.
Which approach is better? What are the differences? What is the best way to select random rows?
Reply #1
Reference: https://stackoom.com/question/AoGO/选择随机行PostgreSQL的最佳方法
Reply #2
PostgreSQL ORDER BY random(), selecting rows in random order:
select your_columns from your_table ORDER BY random()
PostgreSQL ORDER BY random() with DISTINCT:
select * from
(select distinct your_columns from your_table) table_alias
ORDER BY random()
PostgreSQL ORDER BY random(), limited to one row:
select your_columns from your_table ORDER BY random() limit 1
Reply #3
Say, for example, that you don't want duplicates in the randomized values that are returned. You will then need to set a boolean value on the primary table containing your (non-randomized) set of values.
Assuming this is the input table:
id_values
 id | used
----+-------
  1 | FALSE
  2 | FALSE
  3 | FALSE
  4 | FALSE
  5 | FALSE
...
Populate the ID_VALUES table as needed. Then, as described by Erwin, create a materialized view that randomizes the ID_VALUES table once:
CREATE MATERIALIZED VIEW id_values_randomized AS
SELECT id
FROM id_values
ORDER BY random();
Note that the materialized view does not contain the used column, because it would quickly become out of date. Nor does the view need to contain other columns that may be in the id_values table.
To obtain (and "consume") random values, use an UPDATE-RETURNING on id_values, selecting ids from id_values_randomized through a join with id_values, and applying the desired criteria to obtain only the relevant candidates. For example:
UPDATE id_values
SET used = TRUE
WHERE id_values.id IN
    (SELECT i.id
     FROM id_values_randomized r INNER JOIN id_values i ON i.id = r.id
     WHERE (NOT i.used)
     LIMIT 5)
RETURNING id;
Change LIMIT as necessary; if you only need one random value at a time, change LIMIT to 1.
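One caveat worth illustrating: SQL does not guarantee the order of rows returned by a query without ORDER BY, so a join followed by LIMIT is not certain to follow the order in which the materialized view was built. The following is a minimal sketch of a variant that stores the random position explicitly; the view name id_values_randomized_pos and the column pos are hypothetical and not part of the original answer.

```sql
-- Variant of the materialized view that records each id's random position.
-- id_values_randomized_pos and pos are hypothetical names for illustration.
CREATE MATERIALIZED VIEW id_values_randomized_pos AS
SELECT id,
       row_number() OVER (ORDER BY random()) AS pos
FROM id_values;

-- Consume five unused ids, explicitly following the stored random order.
UPDATE id_values
SET used = TRUE
WHERE id IN
    (SELECT i.id
     FROM id_values_randomized_pos r
     INNER JOIN id_values i ON i.id = r.id
     WHERE NOT i.used
     ORDER BY r.pos
     LIMIT 5)
RETURNING id;
```

The extra column costs some storage, but it makes the intended order explicit instead of relying on how the planner happens to scan the view.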
With the proper indexes on id_values, I believe the UPDATE-RETURNING should execute very quickly with little load. It returns randomized values with one database round-trip. The criteria for "eligible" rows can be as complex as required. New rows can be added to the id_values table at any time, and they will become accessible to the application as soon as the materialized view is refreshed (which can likely be run at an off-peak time). Creation and refresh of the materialized view will be slow, but it only needs to be executed when new IDs are added to the id_values table.
Reply #4
If you want just one row, you can use a calculated offset derived from count.
select * from table_name limit 1
offset floor(random() * (select count(*) from table_name));
Reply #5
Add a column called r with type serial. Index r.
Assume we have 200,000 rows. We are going to generate a random number n, where 0 < n <= 200,000.
Select the rows with r > n, sort them by r in ascending order, and take the first one.
Code:
select * from YOUR_TABLE
where r > (
    select (
        select reltuples::bigint AS estimate
        from pg_class
        where oid = 'public.YOUR_TABLE'::regclass) * random()
)
order by r asc limit(1);
At the application level, you need to execute the statement again if n is greater than the number of rows (in which case no row is returned) or if you need to select multiple rows.
Reply #6
Starting with PostgreSQL 9.5, there's a new syntax dedicated to getting random elements from a table:
SELECT * FROM mytable TABLESAMPLE SYSTEM (5);
This example will give you approximately 5% of the rows from mytable. The SYSTEM method samples whole storage blocks, so the exact number of rows returned varies between runs.
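A few variations on the same clause, as a sketch. The percentages, the seed value 42, and the row count 1000 are arbitrary choices for illustration.

```sql
-- BERNOULLI samples individual rows rather than whole blocks:
-- slower than SYSTEM, but the sample is less clustered.
SELECT * FROM mytable TABLESAMPLE BERNOULLI (5);

-- REPEATABLE fixes the seed, so the same sample is returned
-- as long as the table has not changed in the meantime.
SELECT * FROM mytable TABLESAMPLE SYSTEM (5) REPEATABLE (42);

-- To get an exact number of rows: oversample cheaply,
-- then shuffle the sample and cap it with LIMIT.
SELECT * FROM mytable TABLESAMPLE SYSTEM (1)
ORDER BY random()
LIMIT 1000;
```

In the last query, the sampling percentage has to be large enough that the sample usually contains at least as many rows as the LIMIT; otherwise fewer rows are returned.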