|NO.Z.00005|——————————|Deployment|——|Hadoop&OLAP

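Original

yanqi_vip 2022-04-19 16:31:59 ©Copyright

©Copyright belongs to the author: this is an original work by 51CTO blog author yanqi_vip. Contact the author for permission to reprint; otherwise legal liability will be pursued.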

I. HDFS

### --- HDFS

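~~~     This engine integrates ClickHouse with the Apache Hadoop ecosystem by allowing data on HDFS to be managed through ClickHouse.
~~~     It is similar to the File and URL engines, but provides Hadoop-specific features.

### --- Usage
~~~     The URI parameter is the full URI of the file in HDFS. The format parameter specifies one of the available file formats.
~~~     To run SELECT queries, the format must be supported for input,
~~~     and to run INSERT queries, it must be supported for output. The available formats are listed in the Formats section of the ClickHouse documentation.
~~~     The path part of the URI may contain globs. In that case the table is read-only.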

ENGINE = HDFS(URI, format)
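
~~~     The read-only rule for glob paths can be checked with a small shell sketch before a table is created.
~~~     table_mode is a hypothetical helper, not part of ClickHouse; it only looks for the glob characters listed in this post (*, ?, {).

```shell
#!/bin/sh
# Hypothetical helper: reports whether a table defined on this URI would be
# read-only because its path contains a glob character.
table_mode() {
    case "$1" in
        *[*?{]*) echo "read-only" ;;    # path contains *, ? or {
        *)       echo "read-write" ;;   # plain path, INSERT is possible
    esac
}

table_mode 'hdfs://hadoop01:9000/clickhouse/hdfs_engine_table'
table_mode 'hdfs://hadoop01:9000/{clickhouse,another}_dir/some_file_{1..3}'
```

~~~     The first call prints read-write and the second prints read-only; the check is a heuristic on the URI text and does not contact HDFS.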

II. HDFS example

### --- Create the clickhouse directory on HDFS and set its ownership

~~~     # Create the clickhouse directory in HDFS
[root@hadoop01 ~]# hdfs dfs -mkdir /clickhouse
~~~     # Change the owner and group of the clickhouse directory
[root@hadoop01 ~]# hdfs dfs -chown clickhouse:clickhouse /clickhouse

### --- Create a ClickHouse table backed by HDFS: hdfs_engine_table

~~~     # Define the hdfs_engine_table table
hadoop01 :) CREATE TABLE hdfs_engine_table (
            name String, 
            value UInt32
            ) ENGINE=HDFS('hdfs://hadoop01:9000/clickhouse/hdfs_engine_table', 'TSV');
~~~     # Output
CREATE TABLE hdfs_engine_table
(
    `name` String,
    `value` UInt32
)
ENGINE = HDFS('hdfs://hadoop01:9000/clickhouse/hdfs_engine_table', 'TSV')

Ok.

~~~     # Insert data

hadoop01 :) INSERT INTO hdfs_engine_table VALUES ('one', 1), ('two', 2), ('three', 3);

~~~     # Output
INSERT INTO hdfs_engine_table VALUES

Ok.

### --- Query the HDFS data through ClickHouse:

~~~     # Read the HDFS-backed data from ClickHouse
hadoop01 :) SELECT * FROM hdfs_engine_table LIMIT 2;

┌─name─┬─value─┐
│ one  │     1 │
│ two  │     2 │
└──────┴───────┘

### --- View the file contents on HDFS

~~~     # Print the file stored on HDFS
[root@hadoop01 ~]# hdfs dfs -cat /clickhouse/hdfs_engine_table
one 1
two 2
three   3

III. Implementation details

### --- Reads and writes can be parallel. Not supported:

~~~     # ALTER and SELECT...SAMPLE operations.
~~~     Indexes.
~~~     Replication.

### --- Globs in path

~~~     Multiple path components can contain globs. To be processed, a file must exist and match the whole path pattern.
~~~     The list of files is determined at SELECT time (not at CREATE time).
~~~     * : substitutes any number of any characters except /, including the empty string.
~~~     ? : substitutes any single character.
~~~     {some_string,another_string,yet_another_one} : substitutes any of the strings 'some_string', 'another_string', 'yet_another_one'.
~~~     {N..M} : substitutes any number in the range from N to M, including both borders.
~~~     Constructions with {} are similar to the remote table function.
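
~~~     To make the {a,b} and {N..M} rules concrete, the sketch below lists the candidate paths covered by the pattern {clickhouse,another}_dir/some_file_{1..3}.
~~~     It mimics the pattern with plain loops and does not query HDFS; list_candidates is a hypothetical name used only for illustration.

```shell
#!/bin/sh
# Enumerate the paths covered by {clickhouse,another}_dir/some_file_{1..3}.
list_candidates() {
    for dir in clickhouse another; do       # {clickhouse,another}
        n=1
        while [ "$n" -le 3 ]; do            # {1..3}, both borders included
            echo "/${dir}_dir/some_file_${n}"
            n=$((n + 1))
        done
    done
}

list_candidates
```

~~~     Two alternatives times three numbers give six candidate paths; only those that actually exist on HDFS at SELECT time would be read.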

IV. Hands-on examples

### --- Suppose we have several TSV files on HDFS with the following URIs:

~~~     # TSV-format data
~~~     # Create the files on HDFS and give the clickhouse user and group ownership
[root@hadoop01 ~]# hdfs dfs -mkdir /clickhouse_dir
[root@hadoop01 ~]# hdfs dfs -put some_file_1 /clickhouse_dir
[root@hadoop01 ~]# hdfs dfs -put some_file_2 /clickhouse_dir
[root@hadoop01 ~]# hdfs dfs -put some_file_3 /clickhouse_dir
[root@hadoop01 ~]# hdfs dfs -chown -R clickhouse:clickhouse /clickhouse_dir

~~~     # Create the same files in a second HDFS directory and give the clickhouse user and group ownership

[root@hadoop01 ~]# hdfs dfs -mkdir /another_dir
[root@hadoop01 ~]# hdfs dfs -put some_file_1 /another_dir
[root@hadoop01 ~]# hdfs dfs -put some_file_2 /another_dir
[root@hadoop01 ~]# hdfs dfs -put some_file_3 /another_dir
[root@hadoop01 ~]# hdfs dfs -chown -R clickhouse:clickhouse /another_dir
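
~~~     The post does not show how the local files some_file_1 to some_file_3 were produced before the hdfs dfs -put steps above.
~~~     A minimal sketch, assuming their contents match the rows shown in the query results of this post (one to nine) and using a temporary directory as a stand-in for the working directory:

```shell
#!/bin/sh
# Assumed file contents, inferred from the query results shown in this post.
workdir=$(mktemp -d)

# printf reuses its format for every pair of arguments: one TSV row per pair.
printf '%s\t%s\n' one 1 two 2 three 3    > "$workdir/some_file_1"
printf '%s\t%s\n' four 4 five 5 six 6    > "$workdir/some_file_2"
printf '%s\t%s\n' seven 7 eight 8 nine 9 > "$workdir/some_file_3"

# Show one file; name and value are separated by a single tab.
cat "$workdir/some_file_1"
```

~~~     Each file holds three tab-separated rows, which is the layout the TSV format expects for the (name String, value UInt32) schema.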

### --- Method 1: a table made of multiple files

~~~     # Method 1: create a table consisting of all six files:
hadoop01 :) CREATE TABLE table_with_range (
            name String, 
            value UInt32
            ) ENGINE =HDFS('hdfs://hadoop01:9000/{clickhouse,another}_dir/some_file_{1..3}', 'TSV');
~~~     # Output
CREATE TABLE table_with_range
(
    `name` String,
    `value` UInt32
)
ENGINE = HDFS('hdfs://hadoop01:9000/{clickhouse,another}_dir/some_file_{1..3}', 'TSV')

Ok.

~~~     # Query the data

hadoop01 :) select * from table_with_range;

┌─name──┬─value─┐
│ one   │     1 │
│ two   │     2 │
│ three │     3 │
└───────┴───────┘
┌─name─┬─value─┐
│ four │     4 │
│ five │     5 │
│ six  │     6 │
└──────┴───────┘
┌─name──┬─value─┐
│ seven │     7 │
│ eight │     8 │
│ nine  │     9 │
└───────┴───────┘
┌─name──┬─value─┐
│ one   │     1 │
│ two   │     2 │
│ three │     3 │
└───────┴───────┘
┌─name─┬─value─┐
│ four │     4 │
│ five │     5 │
│ six  │     6 │
└──────┴───────┘
┌─name──┬─value─┐
│ seven │     7 │
│ eight │     8 │
│ nine  │     9 │
└───────┴───────┘

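### --- Method 2: a table made of multiple files

~~~     # Method 2: create a table from the files matched with the ? glob.
~~~     # The pattern below names {some,another}_dir, but the setup above created /clickhouse_dir and /another_dir,
~~~     # so only the three files under /another_dir match; use {clickhouse,another}_dir to cover all six.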
hadoop01 :) CREATE TABLE table_with_question_mark (
            name String, 
            value UInt32
            ) ENGINE =HDFS('hdfs://hadoop01:9000/{some,another}_dir/some_file_?', 'TSV');
~~~     # Output
CREATE TABLE table_with_question_mark
(
    `name` String,
    `value` UInt32
)
ENGINE = HDFS('hdfs://hadoop01:9000/{some,another}_dir/some_file_?', 'TSV')

Ok.

~~~     # Query the data

hadoop01 :) select * from table_with_question_mark;

┌─name──┬─value─┐
│ one   │     1 │
│ two   │     2 │
│ three │     3 │
└───────┴───────┘
┌─name─┬─value─┐
│ four │     4 │
│ five │     5 │
│ six  │     6 │
└──────┴───────┘
┌─name──┬─value─┐
│ seven │     7 │
│ eight │     8 │
│ nine  │     9 │
└───────┴───────┘

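### --- Method 3: a table made of multiple files

~~~     # The table consists of all files in both directories (every file must match the format and schema described in the query).
~~~     # As written, {some,another}_dir matches only /another_dir among the directories created above: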
hadoop01 :) CREATE TABLE table_with_asterisk (
            name String, 
            value UInt32
            ) ENGINE =HDFS('hdfs://hadoop01:9000/{some,another}_dir/*', 'TSV');
~~~     # Output
CREATE TABLE table_with_asterisk
(
    `name` String,
    `value` UInt32
)
ENGINE = HDFS('hdfs://hadoop01:9000/{some,another}_dir/*', 'TSV')

Ok.

~~~     # Query the data

hadoop01 :) select * from table_with_asterisk;

┌─name──┬─value─┐
│ one   │     1 │
│ two   │     2 │
│ three │     3 │
└───────┴───────┘
┌─name─┬─value─┐
│ four │     4 │
│ five │     5 │
│ six  │     6 │
└──────┴───────┘
┌─name──┬─value─┐
│ seven │     7 │
│ eight │     8 │
│ nine  │     9 │
└───────┴───────┘

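### --- Warning
~~~     If the list of files contains number ranges with leading zeros, use a separate brace construction for each digit, or use ?.
~~~     Example: create a table over files named file000, file001, ... , file999:

hadoop01 :) CREATE TABLE big_table (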
            name String, 
            value UInt32
            ) ENGINE =HDFS('hdfs://hadoop01:9000/big_dir/file{0..9}{0..9}{0..9}', 'CSV');
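
~~~     The three single-digit ranges in file{0..9}{0..9}{0..9} cover exactly the zero-padded names file000 to file999.
~~~     The sketch below generates the same name set with printf padding to confirm the count; padded_names is a hypothetical helper and nothing is read from HDFS.

```shell
#!/bin/sh
# Generate the names covered by file{0..9}{0..9}{0..9}.
# %03d pads each number to three digits, matching one brace range per digit.
padded_names() {
    i=0
    while [ "$i" -le 999 ]; do
        printf 'file%03d\n' "$i"
        i=$((i + 1))
    done
}

padded_names | sed -n '1p'    # first name
padded_names | sed -n '$p'    # last name
padded_names | wc -l          # number of names
```

~~~     The first and last names are file000 and file999, 1000 names in total. A plain numeric range carries no padding by itself, which is presumably why the warning asks for one range per digit.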

### --- Virtual columns

_path — Path to the file.
_file — Name of the file.
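
~~~     Virtual columns are not returned by SELECT * and have to be named explicitly. A minimal sketch that builds such a query for the tables defined in this post:
~~~     virtual_columns_query is a hypothetical helper; the name and value columns are the ones used throughout the examples.

```shell
#!/bin/sh
# Build a query that selects the virtual columns next to the regular ones.
virtual_columns_query() {
    printf 'SELECT _path, _file, name, value FROM %s\n' "$1"
}

virtual_columns_query table_with_range

# To run it, assuming clickhouse-client is installed and can reach the server:
#   clickhouse-client --query "$(virtual_columns_query table_with_range)"
```

~~~     On a glob-based table this would be expected to show, for each row, which HDFS file it came from.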
