测试环境
- 个人 MacBook 笔记本
- 内存 32 GB
- CPU 6 cores
- 本地 SSD
- GPDB 分配 5个 Segment Instance
- CK 单 Server
NYC taxi data 简单测试对比
首先我们今天测试的参考内容源自这里 -> https://clickhouse.com/docs/en/tutorial
1. CK 数据加载
首先我们根据官网指导,创建表:
CREATE TABLE trips
(
`trip_id` UInt32,
`vendor_id` Enum8('1' = 1, '2' = 2, '3' = 3, '4' = 4, 'CMT' = 5, 'VTS' = 6, 'DDS' = 7, 'B02512' = 10, 'B02598' = 11, 'B02617' = 12, 'B02682' = 13, 'B02764' = 14, '' = 15),
`pickup_date` Date,
`pickup_datetime` DateTime,
`dropoff_date` Date,
`dropoff_datetime` DateTime,
`store_and_fwd_flag` UInt8,
`rate_code_id` UInt8,
`pickup_longitude` Float64,
`pickup_latitude` Float64,
`dropoff_longitude` Float64,
`dropoff_latitude` Float64,
`passenger_count` UInt8,
`trip_distance` Float64,
`fare_amount` Float32,
`extra` Float32,
`mta_tax` Float32,
`tip_amount` Float32,
`tolls_amount` Float32,
`ehail_fee` Float32,
`improvement_surcharge` Float32,
`total_amount` Float32,
`payment_type` Enum8('UNK' = 0, 'CSH' = 1, 'CRE' = 2, 'NOC' = 3, 'DIS' = 4),
`trip_type` UInt8,
`pickup` FixedString(25),
`dropoff` FixedString(25),
`cab_type` Enum8('yellow' = 1, 'green' = 2, 'uber' = 3),
`pickup_nyct2010_gid` Int8,
`pickup_ctlabel` Float32,
`pickup_borocode` Int8,
`pickup_ct2010` String,
`pickup_boroct2010` String,
`pickup_cdeligibil` String,
`pickup_ntacode` FixedString(4),
`pickup_ntaname` String,
`pickup_puma` UInt16,
`dropoff_nyct2010_gid` UInt8,
`dropoff_ctlabel` Float32,
`dropoff_borocode` UInt8,
`dropoff_ct2010` String,
`dropoff_boroct2010` String,
`dropoff_cdeligibil` String,
`dropoff_ntacode` FixedString(4),
`dropoff_ntaname` String,
`dropoff_puma` UInt16
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(pickup_date)
ORDER BY pickup_datetime;
接下来,我们加载数据,CK 提供 S3 上的数据操作函数,可以一条命令下载并插入数据到表中,还是比较方便的:
INSERT INTO trips
SELECT * FROM s3(
'https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/trips_{1..2}.gz',
'TabSeparatedWithNames'
)
我家里的网络条件一般般,最后下载入库花了大概 4 分钟:
Query id: 62eeb743-7f34-441a-8e8b-3bd7f82163db
Ok.
0 rows in set. Elapsed: 221.145 sec. Processed 2.00 million rows, 974.44 MB (9.04 thousand rows/s., 4.41 MB/s.)
ClickHouse.localdomain :) select count(*) from trips;
SELECT count(*)
FROM trips
Query id: 7c0d770f-a161-4ad5-a59b-b9320a192d90
┌─count()─┐
│ 1999657 │
└─────────┘
1 row in set. Elapsed: 0.002 sec.
ClickHouse.localdomain :)
如果你家里的网络条件不好,也可以选择将文件下载到本地再入库,方法如下(上面如果已经将数据入库,请忽略这部分内容,直接看第2部分):
先下载两个文件,然后再解压缩,导入 CK:
wget https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/trips_{1..2}.gz
gunzip trips_1.gz
gunzip trips_2.gz
clickhouse-client --query "INSERT INTO trips FORMAT TabSeparatedWithNames" --max_insert_block_size=100000 --password < trips_1
clickhouse-client --query "INSERT INTO trips FORMAT TabSeparatedWithNames" --max_insert_block_size=100000 --password < trips_2
2. GP 数据加载
同样,在 GP 里也创建表,因为 GP 中默认没有 ENUM 类型,所以这里索性直接使用 int 替代 CK 表结构中的 ENUM,其他数据类型也都做了简单的对应转换:
CREATE TABLE trips
(
trip_id int,
vendor_id text,
pickup_date Date,
pickup_datetime timestamp(0),
dropoff_date Date,
dropoff_datetime timestamp(0),
store_and_fwd_flag smallint,
rate_code_id smallint,
pickup_longitude numeric,
pickup_latitude numeric,
dropoff_longitude numeric,
dropoff_latitude numeric,
passenger_count smallint,
trip_distance numeric,
fare_amount numeric,
extra numeric,
mta_tax numeric,
tip_amount numeric,
tolls_amount numeric,
ehail_fee numeric,
improvement_surcharge numeric,
total_amount numeric,
payment_type text,
trip_type smallint,
pickup char(25),
dropoff char(25),
cab_type text,
pickup_nyct2010_gid smallint,
pickup_ctlabel numeric,
pickup_borocode smallint,
pickup_ct2010 text,
pickup_boroct2010 text,
pickup_cdeligibil text,
pickup_ntacode char(4),
pickup_ntaname text,
pickup_puma smallint,
dropoff_nyct2010_gid smallint,
dropoff_ctlabel numeric,
dropoff_borocode smallint,
dropoff_ct2010 text,
dropoff_boroct2010 text,
dropoff_cdeligibil text,
dropoff_ntacode char(4),
dropoff_ntaname text,
dropoff_puma smallint,
primary key(trip_id,pickup_datetime)
);
接下来我们将数据装载到 GP 中,因为数据量不大,为了简化步骤就不使用gpload或者gpfdist这种入库方式了,直接用最简单的 COPY 入库:
下载数据并解压:
wget https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/trips_{1..2}.gz
gunzip trips_1.gz
gunzip trips_2.gz
这里注意,我看了一下文件的编码格式,是 latin1的,所以需要先转换一下,改为 UTF8:
iconv -f latin1 -t utf8 -o trips_1.tsv trips_1
iconv -f latin1 -t utf8 -o trips_2.tsv trips_2
然后进入 psql 使用 COPY 加载数据:
postgres=# select count(*) from trips;
count
-------
0
(1 row)
postgres=# copy trips from '/home/odb1/trips_1' with header;
ERROR: invalid byte sequence for encoding "UTF8": 0x95
HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".
CONTEXT: COPY trips, line 4
postgres=# copy trips from '/home/odb1/trips_1.tsv' with header;
COPY 999832
postgres=# copy trips from '/home/odb1/trips_2.tsv' with header;
COPY 999825
postgres=# select count(*) from trips;
count
---------
1999657
(1 row)
3. 对比测试及分析
如果大家感觉到 SQL 对比执行过程较为繁琐,可以直接跳转到第4部分查看对比结果。
3.1 count
ClickHouse.localdomain :) SELECT count() FROM trips;
SELECT count()
FROM trips
Query id: 8ac52d6d-3515-47ed-9afb-14efedaa1cb9
┌─count()─┐
│ 1999657 │
└─────────┘
1 row in set. Elapsed: 0.002 sec.
postgres=# SELECT count(*) FROM trips;
count
---------
1999657
(1 row)
Time: 75.702 ms
分析:CK维护了 count 索引,count 简直是 PostgreSQL 生态数据库的痛啊,性能差了37倍。
3.2 DISTINCT
SELECT DISTINCT(pickup_ntaname) FROM trips;
190 rows in set. Elapsed: 0.033 sec. Processed 2.00 million rows, 60.32 MB (61.25 million rows/s., 1.85 GB/s.)
SELECT DISTINCT(pickup_ntaname) FROM trips;
Time: 161.792 ms
分析:GP 慢了将近5倍
3.3 avg
ClickHouse.localdomain :) SELECT avg(tip_amount) FROM trips
SELECT avg(tip_amount)
FROM trips
Query id: 5d316394-9263-45c6-a8a3-d135701f71c3
┌────avg(tip_amount)─┐
│ 1.6847585845085191 │
└────────────────────┘
1 row in set. Elapsed: 0.010 sec. Processed 2.00 million rows, 8.00 MB (193.95 million rows/s., 775.81 MB/s.)
postgres=# SELECT avg(tip_amount) FROM trips;
avg
--------------------
1.6847585810966581
(1 row)
Time: 114.005 ms
分析:GP 慢了11倍
3.4 group by + avg
ClickHouse.localdomain :) SELECT
passenger_count,
ceil(avg(total_amount),2) AS average_total_amount
FROM trips
GROUP BY passenger_count;
SELECT
passenger_count,
ceil(avg(total_amount), 2) AS average_total_amount
FROM trips
GROUP BY passenger_count
Query id: 1c875b66-88bf-466c-aa66-bf474333dd03
┌─passenger_count─┬─average_total_amount─┐
│ 0 │ 22.69 │
│ 1 │ 15.97 │
│ 2 │ 17.15 │
│ 3 │ 16.76 │
│ 4 │ 17.33 │
│ 5 │ 16.35 │
│ 6 │ 16.04 │
│ 7 │ 59.8 │
│ 8 │ 36.41 │
│ 9 │ 9.81 │
└─────────────────┴──────────────────────┘
10 rows in set. Elapsed: 0.029 sec. Processed 2.00 million rows, 10.00 MB (68.40 million rows/s., 342.01 MB/s.)
postgres=# SELECT
passenger_count,
round(avg(total_amount),2) AS average_total_amount
FROM trips
GROUP BY passenger_count;
passenger_count | average_total_amount
-----------------+----------------------
7 | 59.79
6 | 16.03
4 | 17.32
3 | 16.75
5 | 16.35
9 | 9.80
8 | 36.40
1 | 15.97
2 | 17.14
0 | 22.68
(10 rows)
Time: 169.630 ms
分析:GP 慢了将近6倍
3.5 group by + order by
SELECT
pickup_date,
pickup_ntaname,
SUM(1) AS number_of_trips
FROM trips
GROUP BY pickup_date, pickup_ntaname
ORDER BY pickup_date ASC;
8751 rows in set. Elapsed: 0.032 sec. Processed 2.00 million rows, 64.32 MB (61.54 million rows/s., 1.98 GB/s.)
SELECT
pickup_date,
pickup_ntaname,
SUM(1) AS number_of_trips
FROM trips
GROUP BY pickup_date, pickup_ntaname
ORDER BY pickup_date ASC;
Time: 201.707 ms
分析:GP 慢了6倍多
3.6 filter + group by + order by + avg + count
ClickHouse.localdomain :) SELECT
avg(tip_amount) AS avg_tip,
avg(fare_amount) AS avg_fare,
avg(passenger_count) AS avg_passenger,
count() AS count,
truncate(date_diff('second', pickup_datetime, dropoff_datetime)/3600) as trip_minutes
FROM trips
WHERE trip_minutes > 0
GROUP BY trip_minutes
ORDER BY trip_minutes DESC
SELECT
avg(tip_amount) AS avg_tip,
avg(fare_amount) AS avg_fare,
avg(passenger_count) AS avg_passenger,
count() AS count,
truncate(dateDiff('second', pickup_datetime, dropoff_datetime) / 3600) AS trip_minutes
FROM trips
WHERE trip_minutes > 0
GROUP BY trip_minutes
ORDER BY trip_minutes DESC
Query id: a9375802-23cf-408f-9173-faf08b70a23b
┌────────────avg_tip─┬───────────avg_fare─┬──────avg_passenger─┬─count─┬─trip_minutes─┐
│ 0.9800000190734863 │ 10 │ 1.5 │ 2 │ 458 │
│ 1.1823678949841587 │ 14.493377928590297 │ 2.060200668896321 │ 1495 │ 23 │
│ 2.1159574744549206 │ 23.22872340425532 │ 2.4680851063829787 │ 47 │ 22 │
│ 1.1218181523409756 │ 13.681818181818182 │ 1.9090909090909092 │ 11 │ 21 │
│ 0.3218181837688793 │ 18.045454545454547 │ 2.3636363636363638 │ 11 │ 20 │
│ 2.1490000009536745 │ 17.55 │ 1.5 │ 10 │ 19 │
│ 4.537058907396653 │ 37 │ 1.7647058823529411 │ 17 │ 18 │
│ 1.425000006189713 │ 30.38846177321214 │ 1.4230769230769231 │ 26 │ 17 │
│ 1.770499986410141 │ 20.05 │ 2.2 │ 20 │ 16 │
│ 1.4908695765163587 │ 22.782608695652176 │ 2.130434782608696 │ 23 │ 15 │
│ 1.44157895602678 │ 23.209999887566816 │ 2.3157894736842106 │ 19 │ 14 │
│ 7.708461495546194 │ 27.576923076923077 │ 1.7692307692307692 │ 13 │ 13 │
│ 2.814444406165017 │ 23.072222179836697 │ 1.9444444444444444 │ 18 │ 12 │
│ 2.031333351135254 │ 17.733333333333334 │ 1.9333333333333333 │ 15 │ 11 │
│ 1.7359999671578408 │ 22.275 │ 2.3 │ 20 │ 10 │
│ 1.9605555633703868 │ 32.486111111111114 │ 2.3055555555555554 │ 36 │ 9 │
│ 1.9634782464607903 │ 25.608695652173914 │ 1.9130434782608696 │ 23 │ 8 │
│ 2.5812499995032945 │ 26.041666666666668 │ 1.5833333333333333 │ 24 │ 7 │
│ 1.5506451552914036 │ 22.838709677419356 │ 2.225806451612903 │ 31 │ 6 │
│ 1.6681250001420267 │ 43.46875 │ 2.25 │ 32 │ 5 │
│ 0.5895161282631659 │ 19.252580604245587 │ 0.6451612903225806 │ 62 │ 4 │
│ 1.0776315849078328 │ 44.26315789473684 │ 1.763157894736842 │ 38 │ 3 │
│ 5.2876000171899795 │ 61.835500000016765 │ 1.875 │ 200 │ 2 │
│ 6.039333844442844 │ 54.97734304437347 │ 1.8110571598943692 │ 11739 │ 1 │
└────────────────────┴────────────────────┴────────────────────┴───────┴──────────────┘
24 rows in set. Elapsed: 0.034 sec. Processed 2.00 million rows, 33.99 MB (58.78 million rows/s., 999.18 MB/s.)
ClickHouse.localdomain :)
postgres=# select
avg(tip_amount) AS avg_tip,
avg(fare_amount) AS avg_fare,
avg(passenger_count) AS avg_passenger,
count(*) AS count,
trip_minutes
from
(
SELECT
tip_amount,
fare_amount,
passenger_count,
trunc(
(date_part('hour', dropoff_datetime-pickup_datetime)*3600
+ date_part('minute', dropoff_datetime-pickup_datetime)*60
+ date_part('second', dropoff_datetime-pickup_datetime)
)/3600
) as trip_minutes
FROM trips
)foo
WHERE trip_minutes > 0
group by trip_minutes
ORDER BY trip_minutes DESC;
avg_tip | avg_fare | avg_passenger | count | trip_minutes
------------------------+---------------------+-------------------+-------+--------------
1.1823678929765886 | 14.4933779264214047 | 2.06020066889632 | 1495 | 23
2.1159574468085106 | 23.2287234042553191 | 2.46808510638298 | 47 | 22
1.1218181818181818 | 13.6818181818181818 | 1.90909090909091 | 11 | 21
0.32181818181818181818 | 18.0454545454545455 | 2.36363636363636 | 11 | 20
2.1490000000000000 | 17.5500000000000000 | 1.5 | 10 | 19
4.5370588235294118 | 37.0000000000000000 | 1.76470588235294 | 17 | 18
1.4250000000000000 | 30.3884615384615385 | 1.42307692307692 | 26 | 17
1.7705000000000000 | 20.0500000000000000 | 2.2 | 20 | 16
1.4908695652173913 | 22.7826086956521739 | 2.1304347826087 | 23 | 15
1.4415789473684211 | 23.2100000000000000 | 2.31578947368421 | 19 | 14
7.7084615384615385 | 27.5769230769230769 | 1.76923076923077 | 13 | 13
2.8144444444444444 | 23.0722222222222222 | 1.94444444444444 | 18 | 12
2.0313333333333333 | 17.7333333333333333 | 1.93333333333333 | 15 | 11
1.7360000000000000 | 22.2750000000000000 | 2.3 | 20 | 10
1.9605555555555556 | 32.4861111111111111 | 2.30555555555556 | 36 | 9
1.9634782608695652 | 25.6086956521739130 | 1.91304347826087 | 23 | 8
2.5812500000000000 | 26.0416666666666667 | 1.58333333333333 | 24 | 7
1.5506451612903226 | 22.8387096774193548 | 2.2258064516129 | 31 | 6
1.6681250000000000 | 43.4687500000000000 | 2.25 | 32 | 5
0.58951612903225806452 | 19.2525806451612903 | 0.645161290322581 | 62 | 4
1.0776315789473684 | 44.2631578947368421 | 1.76315789473684 | 38 | 3
5.2449504950495050 | 61.3222772277227723 | 1.87128712871287 | 202 | 2
6.0393338444501235 | 54.9773430445523469 | 1.81105715989437 | 11739 | 1
(23 rows)
Time: 306.016 ms
postgres=#
分析:GP 慢了9倍
3.7 group by + order by
SELECT
pickup_ntaname,
toHour(pickup_datetime) as pickup_hour,
SUM(1) AS pickups
FROM trips
WHERE pickup_ntaname != ''
GROUP BY pickup_ntaname, pickup_hour
ORDER BY pickup_ntaname, pickup_hour
3120 rows in set. Elapsed: 0.045 sec. Processed 2.00 million rows, 68.32 MB (44.22 million rows/s., 1.51 GB/s.)
SELECT
pickup_ntaname,
date_part('hour',pickup_datetime) as pickup_hour,
SUM(1) AS pickups
FROM trips
WHERE pickup_ntaname != ''
GROUP BY pickup_ntaname, pickup_hour
ORDER BY pickup_ntaname, pickup_hour;
Time: 298.298 ms
分析:GP 慢了6倍多
3.8 LaGuardia or JFK 机场数据计算
SELECT
pickup_datetime,
dropoff_datetime,
total_amount,
pickup_nyct2010_gid,
dropoff_nyct2010_gid,
CASE
WHEN dropoff_nyct2010_gid = 138 THEN 'LGA'
WHEN dropoff_nyct2010_gid = 132 THEN 'JFK'
END AS airport_code,
EXTRACT(YEAR FROM pickup_datetime) AS year,
EXTRACT(DAY FROM pickup_datetime) AS day,
EXTRACT(HOUR FROM pickup_datetime) AS hour
FROM trips
WHERE dropoff_nyct2010_gid IN (132, 138)
ORDER BY pickup_datetime;
45299 rows in set. Elapsed: 0.034 sec. Processed 2.00 million rows, 27.99 MB (59.16 million rows/s., 828.16 MB/s.)
SELECT
pickup_datetime,
dropoff_datetime,
total_amount,
pickup_nyct2010_gid,
dropoff_nyct2010_gid,
CASE
WHEN dropoff_nyct2010_gid = 138 THEN 'LGA'
WHEN dropoff_nyct2010_gid = 132 THEN 'JFK'
END AS airport_code,
EXTRACT(YEAR FROM pickup_datetime) AS year,
EXTRACT(DAY FROM pickup_datetime) AS day,
EXTRACT(HOUR FROM pickup_datetime) AS hour
FROM trips
WHERE dropoff_nyct2010_gid IN (132, 138)
ORDER BY pickup_datetime;
Time: 208.689 ms
分析:GP 慢了6倍
4.总结
下面给大家总结了一张图,整体来看差距还是比较大的,CK 由于其单 server 单表的一些优秀特性,比如内存中索引的维护、稀疏索引、SIMD等,执行速度特别的快。