I want a random selection of rows in PostgreSQL. I tried this:

select * from table where random() < 0.01;

But others recommend this:

select * from table order by random() limit 1000;

I have a very large table with 500 million rows, and I want it to be fast.

Which approach is better? What are the differences? What is the best way to select random rows?


Reply #1

Reference: https://stackoom.com/question/AoGO/选择随机行PostgreSQL的最佳方法


Reply #2

PostgreSQL ORDER BY random(), to select rows in random order:

select your_columns from your_table ORDER BY random()

PostgreSQL ORDER BY random() with DISTINCT:

select * from 
  (select distinct your_columns from your_table) table_alias
ORDER BY random()

PostgreSQL ORDER BY random(), limited to one row:

select your_columns from your_table ORDER BY random() limit 1
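
To make the difference between ORDER BY random() LIMIT n and the where random() < 0.01 filter from the question concrete, here is a small Python simulation. It is only a sketch and does not talk to PostgreSQL; the function names are made up for illustration. The filter keeps each row independently, so the number of rows returned varies from run to run. ORDER BY random() LIMIT n returns exactly n rows, but a random key has to be generated for every row before any row can be picked.

```python
import random


def bernoulli_filter(rows, p, rng):
    """Mimic WHERE random() < p: keep each row independently with probability p."""
    return [row for row in rows if rng.random() < p]


def shuffle_and_limit(rows, k, rng):
    """Mimic ORDER BY random() LIMIT k: key every row, sort by key, keep k rows."""
    keyed = [(rng.random(), row) for row in rows]
    keyed.sort()  # every row takes part in the sort, not only the k kept
    return [row for _, row in keyed[:k]]


rng = random.Random(0)          # fixed seed so the run is repeatable
rows = list(range(100_000))     # stand-in for the table's rows

filtered = bernoulli_filter(rows, 0.01, rng)
limited = shuffle_and_limit(rows, 1000, rng)

print("filter returned", len(filtered), "rows (around 1000, varies)")
print("limit returned", len(limited), "rows (always exactly 1000)")
```

Both forms still read every row once; the second additionally has to order them, which is the part that hurts on a very large table.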

Reply #3

 

Suppose, for example, that you don't want duplicates among the random values that are returned. In that case you need a boolean flag on the primary table that holds your (non-randomized) set of values.

Assume this is the input table:

Table id_values:

 id | used
----+-------
 1  | FALSE
 2  | FALSE
 3  | FALSE
 4  | FALSE
 5  | FALSE
 ...

Populate the id_values table as needed. Then, as described by Erwin, create a materialized view that randomizes the id_values table once:

CREATE MATERIALIZED VIEW id_values_randomized AS
  SELECT id
  FROM id_values
  ORDER BY random();

Note that the materialized view does not contain the used column, because it would quickly become out of date. Nor does the view need to contain any other columns that may be in the id_values table.

To obtain (and "consume") random values, use an UPDATE-RETURNING on id_values, selecting ids from id_values_randomized with a join, and applying the desired criteria so that only relevant candidates are considered. For example:

UPDATE id_values
SET used = TRUE
WHERE id_values.id IN 
  (SELECT i.id
    FROM id_values_randomized r INNER JOIN id_values i ON i.id = r.id
    WHERE (NOT i.used)
    LIMIT 5)
RETURNING id;

Change LIMIT as necessary; if you only need one random value at a time, change LIMIT to 1.

With the proper indexes on id_values, I believe the UPDATE-RETURNING should execute very quickly with little load. It returns randomized values in a single database round trip. The criteria for "eligible" rows can be as complex as required. New rows can be added to the id_values table at any time, and they become accessible to the application as soon as the materialized view is refreshed (which can likely be done at an off-peak time). Creating and refreshing the materialized view will be slow, but it only needs to be done when new IDs are added to the id_values table.
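
The consume-once behaviour described in this answer can be mimicked in a few lines of Python. This is a sketch only: the list stands in for the materialized view, the dict stands in for the used column, and the names make_randomized_view and consume are invented here. It also assumes that rows are taken in the view's stored order, which SQL does not guarantee for a join without an ORDER BY.

```python
import random


def make_randomized_view(ids, rng):
    """One-time shuffle of the ids, standing in for the materialized view."""
    view = list(ids)
    rng.shuffle(view)
    return view


def consume(view, used, limit):
    """Mark up to `limit` unused ids as used and return them,
    like UPDATE ... SET used = TRUE ... RETURNING id."""
    picked = []
    for id_ in view:
        if len(picked) == limit:
            break
        if not used[id_]:
            used[id_] = True
            picked.append(id_)
    return picked


rng = random.Random(1)
used = {i: False for i in range(1, 11)}   # ids 1..10, none used yet
view = make_randomized_view(used, rng)    # shuffled once, reused afterwards

first = consume(view, used, 5)
second = consume(view, used, 5)
third = consume(view, used, 5)            # nothing left to hand out

print(first, second, third)
```

Each id is handed out at most once, and once all ids are used the call returns an empty list, which is what an application would see when the UPDATE matches no rows.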


Reply #4

If you want just one row, you can use a calculated offset derived from count.

select * from table_name limit 1
offset floor(random() * (select count(*) from table_name));
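
As a rough illustration of what this query does, here is the same idea in Python (a sketch, not PostgreSQL; random_row_by_offset is a made-up name). Because random() is always below 1, the floored offset always falls between 0 and count - 1, so a row is always found as long as the table is not empty. The skipped rows still have to be walked past, which is also true of OFFSET in PostgreSQL, so a large offset can be slow.

```python
import math
import random
from itertools import islice


def random_row_by_offset(rows, rng):
    """Pick one row via LIMIT 1 OFFSET floor(random() * count)."""
    count = len(rows)                            # SELECT count(*)
    offset = math.floor(rng.random() * count)    # 0 <= offset <= count - 1
    # islice steps over `offset` rows before yielding the next one
    return next(islice(iter(rows), offset, offset + 1))


rng = random.Random(2)
rows = ["a", "b", "c"]
draws = [random_row_by_offset(rows, rng) for _ in range(200)]
print(sorted(set(draws)))
```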

Reply #5

Add a column called r with type serial, and index r.

Assume we have 200,000 rows. We generate a random number n, where 0 < n <= 200,000.

Select the rows with r > n, sort them in ascending order, and take the smallest one.

Code:

select * from YOUR_TABLE 
where r > (
    select (
        select reltuples::bigint AS estimate
        from   pg_class
        where  oid = 'public.YOUR_TABLE'::regclass) * random()
    )
order by r asc limit(1);

 

At the application level, you need to execute the statement again if n is greater than the number of rows (so that no row is returned) or if you need to select multiple rows.
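
Here is a small Python sketch of the same lookup, including the retry. It is an illustration rather than the query itself: a sorted list stands in for the indexed r column, and pick_by_threshold and pick_with_retry are names invented for this example. One thing the simulation makes visible is that if r has gaps (for example after deletes), the row that follows a gap is picked more often than the others.

```python
import bisect
import random
from collections import Counter


def pick_by_threshold(sorted_r, estimate, rng):
    """Return the smallest r strictly greater than n = estimate * random(),
    or None when no such r exists."""
    n = estimate * rng.random()
    i = bisect.bisect_right(sorted_r, n)   # index of the first r > n
    return sorted_r[i] if i < len(sorted_r) else None


def pick_with_retry(sorted_r, estimate, rng, max_tries=100):
    """Run the lookup again until it returns a row, up to max_tries times."""
    for _ in range(max_tries):
        r = pick_by_threshold(sorted_r, estimate, rng)
        if r is not None:
            return r
    return None


rng = random.Random(4)
r_values = [1, 2, 3, 10]   # r = 4..9 are missing, e.g. deleted rows
counts = Counter(pick_with_retry(r_values, 10, rng) for _ in range(10_000))
print(counts)              # r = 10 is returned far more often than r = 1
```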


Reply #6

Starting with PostgreSQL 9.5, there is a new syntax dedicated to getting random elements from a table:

SELECT * FROM mytable TABLESAMPLE SYSTEM (5);

This example returns approximately 5% of the rows in mytable.
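
TABLESAMPLE has two built-in sampling methods: SYSTEM, which picks whole storage blocks, and BERNOULLI, which picks individual rows. The Python simulation below is a simplified model of that difference, not PostgreSQL's actual implementation; the page size of 100 rows and the function names are assumptions made for the example. It shows why the result is only roughly 5%, and why SYSTEM returns rows in clusters.

```python
import random


def sample_system(pages, percent, rng):
    """Block-level sampling: keep or drop each whole page with probability percent/100."""
    p = percent / 100
    return [row for page in pages if rng.random() < p for row in page]


def sample_bernoulli(pages, percent, rng):
    """Row-level sampling: keep or drop each row independently."""
    p = percent / 100
    return [row for page in pages for row in page if rng.random() < p]


rng = random.Random(5)
# 1000 pages of 100 rows each (assumed sizes); row ids run from 0 to 99,999
pages = [[page_no * 100 + i for i in range(100)] for page_no in range(1000)]

system_rows = sample_system(pages, 5, rng)
bernoulli_rows = sample_bernoulli(pages, 5, rng)

print(len(system_rows), "rows via block-level sampling")
print(len(bernoulli_rows), "rows via row-level sampling")
```

In this model the block-level sample always contains complete pages, so its size fluctuates more and neighbouring rows tend to appear together, while the row-level sample is spread evenly but has to look at every row.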