Why consider introducing ClickHouse:

1. Single-table query performance is very high (log queries don't need JOINs, which happens to sidestep its weak point)

2. High compression ratio

 

The data sync scheme currently used in production
1. Flink cleans the data from the microservices' topics and writes it into a new Kafka topic
2. Two CK nodes are deployed completely independently, each consuming the data with clickhouse_sinker (kept alive by supervisor)
3. An SLB sits in front of the two CK nodes; redash can be used for the frontend (though it's better to build an in-house query UI, which can enforce a query time range and keep bad SQL from OOMing CK)
 
[Image: architecture of the ClickHouse-based log pipeline]
 
 
 
 
 
How the table is created in CK
 
Step 1: create the table.
Run the same statement on both instance 1 and instance 2:
CREATE TABLE default.microsvc_local
(
    `timestamp` UInt64,
    `env` String,
    `service` String,
    `level` String,
    `trace` String,
    `span` String,
    `version` String,
    `requestId` String,
    `parent` String,
    `pid` String,
    `host` String,
    `thread` String,
    `lineNumber` String,
    `classText` String,
    `tags` String,
    `message` String,
    `path` String,
    `status` String,
    `stack_trace` String
)
ENGINE = MergeTree()
PARTITION BY toDate(floor(timestamp/1000))
ORDER BY timestamp
SETTINGS index_granularity = 8192;

Note the toDate(floor(timestamp/1000)) part: the timestamps coming from Kafka are in Java's default format (13-digit milliseconds), so they have to be converted to 10-digit second-precision values to build the partition key.
Our logs rarely gain new columns, since the microservice framework wraps everything, so DDL changes are infrequent.
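The partition expression can be sanity-checked directly in clickhouse-client; the millisecond value below is just an illustrative literal:

-- 13-digit Java epoch millis -> 10-digit seconds -> the Date used as the partition key
SELECT
    toDate(floor(1609416000123 / 1000)) AS partition_day,
    toDateTime(intDiv(1609416000123, 1000)) AS event_time;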
 
 
 
 
 
Some simple query tests
select count(*) from microsvc_local ;

select count(*) from microsvc_local where `timestamp` > ? ;

select timestamp, service from microsvc_local order by `timestamp` desc limit 10 ;


# Count rows within a time range:
select count(*) from microsvc_local where timestamp  between ? and ?  ;

select timestamp from microsvc_local  where timestamp  >? limit 1  \G




A few queries simulating production conditions (all filtered by a time range):
select * from microsvc_local
where timestamp
between ? and ?
and service='xxxx-service'
LIMIT 100
;
100 rows in set. Elapsed: 0.054 sec. Processed 2.65 million rows, 81.69 MB (49.55 million rows/s., 1.53 GB/s.)



select count(*) from microsvc_local
where timestamp
between toUnixTimestamp('2020-12-31 00:00:00') and toUnixTimestamp(now())  
and service='xxxx-service'
and message LIKE '%SpiderEvent%'
;
1 rows in set. Elapsed: 2.662 sec. Processed 83.72 million rows, 10.98 GB (31.45 million rows/s., 4.12 GB/s.)



select * from microsvc_local
where timestamp
between toUnixTimestamp('2020-12-31 00:00:00') and toUnixTimestamp(now())  
and service='xxxx-service'
and message LIKE '%SpiderEvent%'
order by `timestamp` desc
LIMIT 500
;
500 rows in set. Elapsed: 0.424 sec. Processed 12.16 million rows, 616.66 MB (28.70 million rows/s., 1.46 GB/s.)


select * from microsvc_local 
where timestamp 
between toUnixTimestamp('2020-12-31 10:35:00') and toUnixTimestamp(now()) 
and service='xxxx-service' 
and requestId='ac8fc0fb-7efd-4d6a-a176-54327b9b5b37'  \G
1 rows in set. Elapsed: 0.724 sec. Processed 27.31 million rows, 2.14 GB (37.74 million rows/s., 2.95 GB/s.)


select * from microsvc_local 
where timestamp between toUnixTimestamp('2020-12-31 10:35:00') and toUnixTimestamp(now()) 
and service='xxxx-service' 
and host='app11' 
order by timestamp desc 
limit 100 \G
100 rows in set. Elapsed: 0.611 sec. Processed 25.48 million rows, 1.79 GB (41.68 million rows/s., 2.93 GB/s.)


select * from microsvc_local 
where timestamp between toUnixTimestamp('2020-12-31 10:35:00') and toUnixTimestamp(now()) 
and service='xxxx-service' 
and host='app022' 
and requestId LIKE 'aaab2c96-ce40-41af-a9ca-8a94b6a89fa%' 
order by timestamp desc 
limit 500 \G
342 rows in set. Elapsed: 1.363 sec. Processed 27.62 million rows, 3.21 GB (20.26 million rows/s., 2.35 GB/s.)


select *  from microsvc_local 
where timestamp between toUnixTimestamp('2020-12-31 11:00:00') and toUnixTimestamp('2020-12-31 14:00:00') 
and service='xxx-service'  
and host='xxx-service-fd94d7b5-tn7g8' 
order by timestamp desc 
limit 10 ;
10 rows in set. Elapsed: 0.041 sec. Processed 211.90 thousand rows, 43.88 MB (5.23 million rows/s., 1.08 GB/s.)


As long as the query's time range is not too wide, the performance is acceptable.
 
 
 
Partition operations
List the current partitions:
SELECT partition,name,partition_id FROM system.parts WHERE table = 'microsvc_local' LIMIT 20 ;
┌─partition──┬─name─────────────────────┬─partition_id─┐
│ 2021-03-07 │ 20210307_1_12873_7       │ 20210307     │
│ 2021-03-07 │ 20210307_12875_25819_7   │ 20210307     │
│ 2021-03-07 │ 20210307_25821_36943_7   │ 20210307     │
│ 2021-03-07 │ 20210307_36945_39408_6   │ 20210307     │
│ 2021-03-07 │ 20210307_39410_40600_6   │ 20210307     │
│ 2021-03-07 │ 20210307_40602_41425_5   │ 20210307     │
│ 2021-03-07 │ 20210307_41427_41679_4   │ 20210307     │
│ 2021-03-07 │ 20210307_41681_42187_5   │ 20210307     │
│ 2021-03-07 │ 20210307_42189_43145_6   │ 20210307     │
│ 2021-03-07 │ 20210307_43147_43173_2   │ 20210307     │
│ 2021-03-07 │ 20210307_43175_43175_0   │ 20210307     │
│ 2021-03-07 │ 20210307_43202_43202_0   │ 20210307     │
│ 2021-03-08 │ 20210308_7606_49861_11   │ 20210308     │
│ 2021-03-08 │ 20210308_49862_65750_8   │ 20210308     │
│ 2021-03-08 │ 20210308_65751_75740_7   │ 20210308     │
│ 2021-03-08 │ 20210308_75741_93936_7   │ 20210308     │
│ 2021-03-08 │ 20210308_93938_97957_6   │ 20210308     │
│ 2021-03-08 │ 20210308_97959_110001_7  │ 20210308     │
│ 2021-03-08 │ 20210308_110004_112581_6 │ 20210308     │
│ 2021-03-08 │ 20210308_112584_113615_5 │ 20210308     │
└────────────┴──────────────────────────┴──────────────┘


How to drop a partition:
alter table microsvc_local drop partition '2021-03-07';

Reference: https://clickhouse.tech/docs/zh/engines/table-engines/mergetree-family/mergetree/
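If a fixed retention period is acceptable, the manual partition drops can also be replaced with a table TTL; a minimal sketch, assuming a 30-day retention (the interval is illustrative, not from the original setup):

-- expire rows 30 days after the event time encoded in the millisecond timestamp
ALTER TABLE microsvc_local MODIFY TTL toDateTime(intDiv(timestamp, 1000)) + INTERVAL 30 DAY;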
 
 
Disk space comparison
ClickHouse: 40 GB over 2 days, about 20 GB per day; counting the double write (each node holds a full copy), that is roughly 40 GB of disk per day with the replica included.
Under the same conditions, ES ingests 94 GB per day, which with replicas comes to roughly 180 GB of disk per day, more than 4x ClickHouse.
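The daily footprint can be read per partition straight from system.parts, for example:

SELECT partition, formatReadableSize(sum(bytes_on_disk)) AS size_on_disk, sum(rows) AS total_rows
  FROM system.parts
 WHERE table = 'microsvc_local' AND active
 GROUP BY partition
 ORDER BY partition;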
 
Write performance
Writing into CK via clickhouse_sinker runs at roughly 530k rows per minute, about 9k rows per second, which is quite impressive. Meanwhile, CK's load barely changes during ingestion.
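A rough server-side cross-check of that rate is to sample the cumulative insert counters in system.events twice, a minute apart, and diff the values (these are standard ProfileEvents counters):

SELECT event, value
  FROM system.events
 WHERE event IN ('InsertQuery', 'InsertedRows', 'InsertedBytes');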
 
 
 
 
Frontend UI
Frontend work is not my strong suit, so I borrowed a module from colleagues who specialize in it and tweaked it; it looks roughly as follows.
The key point is that a start time, an end time, and a service name are all mandatory, which is basically enough to keep a query's memory usage under control.
[Image: query UI of the ClickHouse-based log system]
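The query the backend fires on behalf of that UI could look roughly like the sketch below; the time range and service name come from the mandatory form fields, and the SETTINGS values are illustrative guard-rails rather than the real configuration:

SELECT timestamp, service, level, host, message
  FROM microsvc_local
 WHERE timestamp BETWEEN 1609416000000 AND 1609419600000   -- mandatory start/end time, in ms
   AND service = 'xxxx-service'                             -- mandatory service name
 ORDER BY timestamp DESC
 LIMIT 100
SETTINGS max_execution_time = 30, max_memory_usage = 4000000000;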
 
 
 
 
 
Of course, with Flink in place we can also run a separate real-time aggregation job that tracks per-minute error counts for each microservice. That part is more on the Java development side. Here is what a colleague on our architecture team built:
[Image: per-minute error statistics built on Flink]
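When Flink isn't handy, a similar per-minute error breakdown can also be pulled straight out of ClickHouse; a sketch, assuming the level column carries values such as 'ERROR':

SELECT
    service,
    toStartOfMinute(toDateTime(intDiv(timestamp, 1000))) AS minute,
    count() AS errors
  FROM microsvc_local
 WHERE level = 'ERROR'
   AND timestamp >= toUnixTimestamp(now() - INTERVAL 1 HOUR) * 1000
 GROUP BY service, minute
 ORDER BY minute DESC, errors DESC;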
 
 
 
 
 
 
 
Appendix:
 
View connections:
SHOW PROCESSLIST;
SELECT query_id, user, address, elapsed, query   FROM system.processes ORDER BY query_id;

View more detailed connection information:
SELECT * FROM system.metrics WHERE metric LIKE '%Connection';


Kill a query (look up the query_id first):
SHOW PROCESSLIST;  
KILL QUERY WHERE query_id='query_id';



SELECT * FROM system.mutations;
KILL MUTATION WHERE mutation_id = 'trx_id';


Disk space usage:
SELECT database, table, partition, name part_name, active, bytes_on_disk
  FROM system.parts ORDER BY database, table, partition, name;


SELECT database, sum(bytes_on_disk)
  FROM system.parts
GROUP BY database;



The compression factor matters a lot, and it differs from column to column. This query inspects the space usage of each column:
SELECT database, table, column, any(type),
       sum(column_data_compressed_bytes) compressed,
       sum(column_data_uncompressed_bytes) uncompressed,
       uncompressed/compressed ratio,
       compressed/sum(rows) bpr,
       sum(rows)
  FROM system.parts_columns
WHERE active AND database <> 'system'
GROUP BY database, table, column
ORDER BY database, table, column;


By default, CH does not track executed queries, but tracking can be enabled by setting log_queries = 1 at the session level or in the users.xml configuration file. I strongly recommend enabling it.
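For the session-level variant it is just a SET statement (the permanent way is the profile setting in users.xml mentioned above):

SET log_queries = 1;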


To find the longest-running queries:
SELECT  user,
    client_hostname AS host,
    client_name AS client,
    formatDateTime(query_start_time, '%T') AS started,
    query_duration_ms / 1000 AS sec,
    round(memory_usage / 1048576) AS MEM_MB,
    result_rows AS RES_CNT,
    toDecimal32(result_bytes / 1048576, 6) AS RES_MB,
    read_rows AS R_CNT,
    round(read_bytes / 1048576) AS R_MB,
    written_rows AS W_CNT,
    round(written_bytes / 1048576) AS W_MB,
    query
  FROM system.query_log
WHERE type = 2
ORDER BY query_duration_ms DESC
LIMIT 10;



Once the long-running queries have been identified, you can start optimizing them. The main techniques are choosing the right ORDER BY columns, codecs, and encodings. For more details, see the Altinity webinars.
Materialized views are a particularly useful feature for improving performance: they let you define alternative views of the data, which can aggregate it or sort it differently. Analyzing the heaviest and most frequent queries is what drives the design of a materialized view.
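As an illustration of the idea (not something that exists in the original setup), a materialized view could pre-aggregate per-minute log counts by service and level:

CREATE MATERIALIZED VIEW microsvc_by_minute
ENGINE = SummingMergeTree()
PARTITION BY toDate(minute)
ORDER BY (service, level, minute)
AS SELECT
    service,
    level,
    toStartOfMinute(toDateTime(intDiv(timestamp, 1000))) AS minute,
    count() AS cnt
FROM microsvc_local
GROUP BY service, level, minute;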
If you are on a recent ClickHouse version (20.3.x), it stores metric history in the system.metric_log table, which lets you get at OS-level data with SQL:
SELECT toStartOfMinute(event_time) AS time,
       sum(ProfileEvent_UserTimeMicroseconds) AS user_time,
       bar(user_time, 0, 60000000, 80) AS bar
  FROM system.metric_log
WHERE event_date = today()
GROUP BY time ORDER BY time;



More ClickHouse nodes can be configured and connected into a cluster.
A ClickHouse cluster provides HA replication (high availability) as well as concurrent access and sharding, for distributed queries and higher INSERT throughput. The configuration is very flexible, since replication and sharding are defined per table.
ClickHouse replication is asynchronous and multi-master (it uses ZooKeeper internally for quorum). Its main goal is HA, but if something goes wrong, the query below checks the various 'bad' situations that can occur:

SELECT database, table, is_leader, total_replicas, active_replicas
  FROM system.replicas
WHERE is_readonly
    OR is_session_expired
    OR future_parts > 20
    OR parts_to_check > 10
    OR queue_size > 20
    OR inserts_in_queue > 10
    OR log_max_index - log_pointer > 10
    OR total_replicas < 2
    OR active_replicas < total_replicas;
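For reference, since replication and sharding are declared per table, a replicated variant of the log table plus a Distributed wrapper would look roughly like this sketch (the cluster name, ZooKeeper path, and {shard}/{replica} macros are assumptions, not part of the original two-node double-write setup):

CREATE TABLE default.microsvc_replicated
(
    `timestamp` UInt64,
    `service` String,
    `message` String
    -- ... remaining columns as in microsvc_local
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/microsvc_replicated', '{replica}')
PARTITION BY toDate(floor(timestamp/1000))
ORDER BY timestamp;

-- Distributed table for reading/writing across shards
CREATE TABLE default.microsvc_all AS default.microsvc_replicated
ENGINE = Distributed('my_cluster', 'default', 'microsvc_replicated', rand());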