Druid Concepts


Official site: http://druid.io/docs/0.9.0/design/

Druid is an open source data store designed for OLAP queries on event data.



OLAP
As database technology has evolved, the volume of stored data has grown from the megabytes and gigabytes of the 1980s to today's terabytes and petabytes, and query workloads have grown correspondingly complex: instead of reading or updating a few records in a single relational table, users need to analyze and synthesize tens of millions of records across many tables, a requirement that relational database systems alone cannot fully satisfy. Many vendors have responded by building front-end products on top of relational database management systems, attempting to unify scattered application logic and answer the complex queries of non-specialist users within a short time.
An online analytical processing (OLAP) system is the principal application of a data warehouse. It is designed specifically to support complex analytical operations for decision makers and senior management: it can execute large, complex queries quickly and flexibly as analysts require, and present the results in an intuitive, understandable form, so that decision makers can accurately grasp the state of the business, understand demand, and formulate sound plans.

Druid data consists of three distinct kinds of columns: a timestamp column (Timestamp column), dimension columns (Dimension columns), and metric columns (Metric columns).





Roll-up

The individual events in our example data set are not very interesting because there may be trillions of such events. However, summarizations of this type of data can yield many useful insights. Druid summarizes this raw data at ingestion time using a process we refer to as "roll-up". Roll-up is a first-level aggregation operation over a selected set of dimensions, equivalent to (in pseudocode):


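The roll-up operation can be sketched in Python as a GROUP BY over the timestamp plus selected dimensions. This is a toy sketch; the field names (`page`, `added`) are hypothetical, not taken from a specific data set:

```python
from collections import defaultdict

# Toy event stream; field names are illustrative, not from a real data set.
events = [
    {"timestamp": "2015-09-12T01:00:00Z", "page": "A", "added": 10},
    {"timestamp": "2015-09-12T01:00:00Z", "page": "A", "added": 5},
    {"timestamp": "2015-09-12T01:00:00Z", "page": "B", "added": 7},
]

def rollup(events, dimensions, metric):
    """First-level aggregation over a selected set of dimensions:
    group by timestamp + dimensions, counting rows and summing the metric."""
    groups = defaultdict(lambda: {"count": 0, metric: 0})
    for e in events:
        key = (e["timestamp"],) + tuple(e[d] for d in dimensions)
        groups[key]["count"] += 1
        groups[key][metric] += e[metric]
    return dict(groups)

summary = rollup(events, dimensions=["page"], metric="added")
```

Trillions of raw events collapse into far fewer summarized rows this way, which is where much of the storage savings at ingestion time comes from.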


[Figure: roll-up pseudocode]


Sharding the Data

Druid shards are called segments, and Druid always first shards data by time. In our compacted data set, we can create two segments, one for each hour of data.
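Time-based sharding can be sketched as partitioning rows into one bucket per hour. A minimal sketch, with hypothetical rows:

```python
from collections import defaultdict

# Toy rows already rolled up; timestamps are illustrative.
rows = [
    {"timestamp": "2015-09-12T01:30:00Z", "page": "A", "added": 15},
    {"timestamp": "2015-09-12T01:45:00Z", "page": "B", "added": 7},
    {"timestamp": "2015-09-12T02:10:00Z", "page": "A", "added": 3},
]

def shard_by_hour(rows):
    """Partition rows into one segment per hour (Druid-style time sharding)."""
    segments = defaultdict(list)
    for row in rows:
        hour = row["timestamp"][:13]  # "YYYY-MM-DDTHH" prefix
        segments[hour].append(row)
    return dict(segments)

segments = shard_by_hour(rows)
# Rows spanning two hours yield two segments.
```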

[Figure: segments sharded by hour]


Indexing the Data

Druid gets its speed in part from how it stores data. Borrowing ideas from search infrastructure, Druid creates immutable snapshots of data, stored in data structures highly optimized for analytic queries.


Druid is a column store, which means each individual column is stored separately. Only the columns that pertain to a query are used in that query, and Druid is pretty good about only scanning exactly what it needs for a query. Different columns can also employ different compression methods. Different columns can also have different indexes associated with them.


Druid indexes data on a per shard (segment) level.



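To illustrate the idea of per-column indexes, here is a toy inverted index over one dimension column. This shows the general technique, not Druid's actual segment format:

```python
# One value per row of a single dimension column.
column = ["A", "B", "A", "C", "B"]

def build_inverted_index(column):
    """Map each distinct dimension value to the set of row ids holding it."""
    index = {}
    for row_id, value in enumerate(column):
        index.setdefault(value, set()).add(row_id)
    return index

index = build_inverted_index(column)
# A filter like page = "A" now touches only the matching rows instead of
# scanning the whole column; boolean filters combine via set operations.
matching = index["A"] | index["C"]  # page = "A" OR page = "C"
```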



Loading the Data



Druid has two means of ingestion, real-time and batch. Real-time ingestion in Druid is best effort. Exactly once semantics are not guaranteed with real-time ingestion in Druid, although we have it on our roadmap to support this. Batch ingestion provides exactly once guarantees and segments created via batch processing will accurately reflect the ingested data. One common approach to operating Druid is to have a real-time pipeline for recent insights, and a batch pipeline for the accurate copy of the data.


Querying the Data

Druid's native query language is JSON over HTTP, although the community has contributed query libraries in numerous languages, including SQL.

Druid is designed to perform single table operations and does not currently support joins. Many production setups do joins at ETL because data must be denormalized before loading into Druid.


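A sketch of issuing a native JSON-over-HTTP query from Python. The query shape and the `/druid/v2/` endpoint follow the Druid docs; the datasource name (`wikipedia`), broker address, and field names are assumptions for illustration:

```python
import json
import urllib.request

# A minimal native timeseries query; datasource and field names are
# hypothetical.
query = {
    "queryType": "timeseries",
    "dataSource": "wikipedia",
    "granularity": "hour",
    "intervals": ["2015-09-12/2015-09-13"],
    "aggregations": [
        {"type": "longSum", "name": "added", "fieldName": "added"}
    ],
}

body = json.dumps(query).encode("utf-8")

def post_query(broker_url="http://localhost:8082/druid/v2/"):
    """POST the JSON query to a broker node; the broker scatters it to
    historical/realtime nodes and merges the results."""
    req = urllib.request.Request(
        broker_url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```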



The Druid Cluster



A Druid Cluster is composed of several different types of nodes. Each node is designed to do a small set of things very well.


  • Historical Nodes Historical nodes commonly form the backbone of a Druid cluster. Historical nodes download immutable segments locally and serve queries over those segments. The nodes have a shared nothing architecture and know how to load segments, drop segments, and serve queries on segments.
  • Broker Nodes Broker nodes are what clients and applications query to get data from Druid. Broker nodes are responsible for scattering queries and gathering and merging results. Broker nodes know what segments live where.
  • Coordinator Nodes Coordinator nodes manage segments on historical nodes in a cluster. Coordinator nodes tell historical nodes to load new segments, drop old segments, and move segments to load balance.
  • Real-time Processing Real-time processing in Druid can currently be done using standalone realtime nodes or using the indexing service. The real-time logic is common between these two services. Real-time processing involves ingesting data, indexing the data (creating segments), and handing segments off to historical nodes. Data is queryable as soon as it is ingested by the realtime processing logic. The hand-off process is also lossless; data remains queryable throughout the entire process.



External Dependencies



Druid has a couple of external dependencies for cluster operations.


  • Zookeeper Druid relies on Zookeeper for intra-cluster communication.
  • Metadata Storage Druid relies on a metadata storage to store metadata about segments and configuration. Services that create segments write new entries to the metadata store and the coordinator nodes monitor the metadata store to know when new data needs to be loaded or old data needs to be dropped. The metadata store is not involved in the query path. MySQL and PostgreSQL are popular metadata stores for production, but Derby can be used for experimentation when you are running all druid nodes on a single machine.
  • Deep Storage Deep storage acts as a permanent backup of segments. Services that create segments upload segments to deep storage and historical nodes download segments from deep storage. Deep storage is not involved in the query path. S3 and HDFS are popular deep storages.
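These dependencies are typically wired together in Druid's common runtime properties. A sketch with hypothetical hosts and paths (property names as in the Druid 0.9 configuration docs):

```properties
# Zookeeper, for intra-cluster communication
druid.zk.service.host=zk.example.com

# Metadata storage (MySQL here; Derby works for single-machine experiments)
druid.metadata.storage.type=mysql
druid.metadata.storage.connector.connectURI=jdbc:mysql://db.example.com:3306/druid

# Deep storage (HDFS here; S3 is another popular choice)
druid.storage.type=hdfs
druid.storage.storageDirectory=/druid/segments
```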



High Availability Characteristics



Druid is designed to have no single point of failure. Different node types are able to fail without impacting the services of the other node types. To run a highly available Druid cluster, you should have at least 2 nodes of every node type running.




Comprehensive Architecture



For a comprehensive look at Druid architecture, please read our white paper.


