How we built a vectorized execution engine
Written by Alfonso Subiotto Marques and Rafi Shamim on October 31, 2019
CockroachDB is an OLTP database, specialized for serving high-throughput queries that read or write a small number of rows. As we gained more usage, we found that customers weren’t getting the performance they expected from analytic queries that read a lot of rows, like large scans, joins, or aggregations. In April 2018, we started to seriously investigate how to improve the performance of these types of queries in CockroachDB, and began working on a new SQL execution engine. In this blog post, we use example code to discuss how we built the new engine and why it results in up to a 4x speed improvement on an industry-standard benchmark.
OLTP databases, including CockroachDB, store data in contiguous rows on disk and process queries a row of data at a time. This pattern is optimal for serving small queries with high throughput and low latency, since the data in the rows are stored contiguously, making it more efficient to access multiple columns from the same row. Modern OLAP databases, on the other hand, typically are better at serving large queries, and tend to store data in contiguous columns and operate on these columns using a concept called vectorized execution. Using vectorized processing in an execution engine makes more efficient use of modern CPUs by changing the data orientation (from rows to columns) and operating on batches of data at a time, which gets more out of the CPU cache and deep instruction pipelines.
In our research into vectorized execution, we came across MonetDB/X100: Hyper-Pipelining Query Execution, a paper that outlines the performance deficiencies of the row-at-a-time Volcano execution model that CockroachDB’s original execution engine was built on. When executing queries on a large number of rows, the row-oriented execution engine pays a high cost in interpretation and evaluation overhead per tuple and doesn’t take full advantage of the efficiencies of modern CPUs. Given the key-value storage architecture of CockroachDB, we knew we couldn’t store data in columnar format, but we wondered if converting rows to batches of columnar data after reading them from disk, and then feeding those batches into a vectorized execution engine, would improve performance enough to justify building and maintaining a new execution engine.
To quantify the performance improvements, and to test the ideas laid out in the paper, we built a vectorized execution engine prototype, which yielded some impressive results. In this tutorial-style blog post, we take a closer look at what these performance improvements look like in practice. We also demonstrate why and how we use code generation to ease the maintenance burden of the vectorized execution engine. We take an example query, analyze its performance in a toy, row-at-a-time execution engine, and then explore and implement improvements inspired by the ideas proposed in the MonetDB/x100 paper. The code referenced in this post resides in https://github.com/asubiotto/vecdeepdive, so feel free to look at, modify, and/or run the code and benchmarks while you follow along.
What’s in a SQL operator?
To provide some context, let’s look at how CockroachDB executes a simple query, SELECT price * 0.8 FROM inventory, issued by a fictional retail customer that wants to compute a discounted price for each item in her inventory. Regardless of which execution engine is used, this query is parsed, converted into an abstract syntax tree (AST), optimized, and then executed. The execution, whether distributed amongst all nodes in a cluster, or executed locally, can be thought of as a chain of data manipulations that each have a specific role, which we call operators. In this example query, the execution flow would look like this:
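inventory table -> TableReader (scan) -> render: price * 0.8 -> output

(A simplified schematic; the EXPLAIN output described below shows the actual physical plan.)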
You can generate a diagram of the physical plan by executing EXPLAIN (DISTSQL) on the query. As you can see, the execution flow for this query is relatively simple. The TableReader operator reads rows from the inventory table and then executes a post-processing render expression, in this case the multiplication by a constant float. Let’s focus on the render expression, since it’s the part of the flow that is doing the most work.
Here’s the code that executes this render expression in the original, row-oriented execution engine used in CockroachDB (some code is omitted here for simplicity):
func (expr *BinaryExpr) Eval(ctx *EvalContext) (Datum, error) {
    left, err := expr.Left.(TypedExpr).Eval(ctx)
    if err != nil {
        return nil, err
    }
    right, err := expr.Right.(TypedExpr).Eval(ctx)
    if err != nil {
        return nil, err
    }
    return expr.fn.Fn(ctx, left, right)
}
The left and right side of the binary expression (BinaryExpr) are both values wrapped in a Datum interface. The BinaryExpr calls expr.fn.Fn with both of these as arguments. In our example, the inventory table has a FLOAT price column, so the Fn is:
Fn: func(_ *EvalContext, left Datum, right Datum) (Datum, error) {
    return NewDFloat(*left.(*DFloat) * *right.(*DFloat)), nil
}
In order to perform the multiplication, the Datum values need to be converted to the expected type. If, instead, we created a price column of type DECIMAL, we would cast 0.8 to a DECIMAL and then construct a BinaryExpr with a different Fn specialized for multiplying DECIMALs.
We now have specialized code for multiplying each type, but the TableReader doesn’t need to worry about it. Before executing the query, the database creates a query plan that specifies the correct Fn for the type that we are working with. This simplifies the code, since we only need to write specialized code as an implementation of an interface. It also makes the code less efficient, as each time we multiply two values together, we need to dynamically resolve which Fn to call, cast the interface values to concrete type values that we can work with, and then convert the result back to an interface value.
Benchmarking a simple operator
How expensive is this casting, really? To find the answer to this question, let’s take a similar but simpler toy example:
type Datum interface{}

// Int implements the Datum interface.
type Int struct {
    int64
}

func mulIntDatums(a Datum, b Datum) Datum {
    aInt := a.(Int).int64
    bInt := b.(Int).int64
    return Int{int64: aInt * bInt}
}
// ...
func (m mulOperator) next() []Datum {
    row := m.input.next()
    if row == nil {
        return nil
    }
    for _, c := range m.columnsToMultiply {
        row[c] = m.fn(row[c], m.arg)
    }
    return row
}
This is a type-agnostic single operator that can handle multiplication of an arbitrary number of columns by a constant argument. Think of the input as returning the rows from the table. To add support for DECIMALs, we can simply add another function that multiplies DECIMALs with a mulFn signature.
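For reference, here’s a minimal sketch of the definitions elided above with “// ...” (the interface, struct, and reader names here are assumptions; the real code lives in the vecdeepdive repo):

// operator is the interface every operator in the toy engine implements.
type operator interface {
    next() []Datum
}

// mulFn is the signature shared by all type-specialized multiplication
// functions, such as mulIntDatums above.
type mulFn func(a Datum, b Datum) Datum

type mulOperator struct {
    input             operator
    arg               Datum
    columnsToMultiply []int
    fn                mulFn
}

// tableReader feeds the operator chain one row per call to next.
type tableReader struct {
    curIdx int
    rows   [][]Datum
}

func (t *tableReader) next() []Datum {
    if t.curIdx >= len(t.rows) {
        return nil
    }
    row := t.rows[t.curIdx]
    t.curIdx++
    return row
}

// A hypothetical Float datum shows how support for another type (such as
// DECIMAL) would slot in: just another function with the mulFn signature.
type Float struct {
    float64
}

func mulFloatDatums(a Datum, b Datum) Datum {
    return Float{float64: a.(Float).float64 * b.(Float).float64}
}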
We can measure the performance of this code by writing a benchmark (see it in our repo). This will give us an idea of how fast we can multiply a large number of Int rows by a constant argument.
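A sketch of what such a benchmark could look like, reusing the types sketched above and the standard testing package (the checked-in benchmark differs in its details):

func BenchmarkRowBasedInterface(b *testing.B) {
    const numRows = 65536
    rows := make([][]Datum, numRows)
    for i := range rows {
        rows[i] = []Datum{Int{int64: int64(i)}}
    }
    op := mulOperator{
        arg:               Int{int64: 2},
        columnsToMultiply: []int{0},
        fn:                mulIntDatums,
    }
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        op.input = &tableReader{rows: rows}
        // Drain the operator, multiplying every row by the constant.
        for op.next() != nil {
        }
    }
}

The benchstat tool tells us that it takes around 760 microseconds to do this: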
$ go test -bench "BenchmarkRowBasedInterface$" -count 10 > tmp && benchstat tmp && rm tmp
name time/op
RowBasedInterface-12 760µs ±15%
Because we have nothing to compare the performance against at this point, we don’t know if this is slow or not.
We’ll use a “speed of light” benchmark to get a better relative sense of this program’s speed. A “speed of light” benchmark measures the performance of the minimum necessary work to perform an operation. In this case, what we really are doing is multiplying 65,536 int64s by 2.
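A minimal sketch of that measurement (again assuming the standard testing package; the repo’s version may differ):

func BenchmarkSpeedOfLight(b *testing.B) {
    const numRows = 65536
    input := make([]int64, numRows)
    for i := range input {
        input[i] = int64(i)
    }
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        // The bare minimum work: multiply each int64 by 2, nothing else.
        for j := range input {
            input[j] *= 2
        }
    }
}

The result of running this benchmark is: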
$ go test -bench "SpeedOfLight" -count 10 > tmp && benchstat tmp && rm tmp
name time/op
SpeedOfLight-12 19.0µs ± 6%
This simple implementation is about 40x faster than our earlier operator!
To try to figure out what’s going on, let’s run a CPU profile on BenchmarkRowBasedInterface and focus on the mulOperator. We can use the -o option to obtain an executable, which will let us disassemble the function with the disasm command in pprof. As we will see below, this command will give us the assembly that our Go source code compiles into, along with approximate CPU times for each instruction. First, let’s use the top and list commands to find the slow parts of the code.
$ go test -bench "BenchmarkRowBasedInterface$" -cpuprofile cpu.out -o row_based_interface
…
$ go tool pprof ./row_based_interface cpu.out
(pprof) focus=mulOperator
(pprof) top
Active filters:
focus=mulOperator
Showing nodes accounting for 1.99s, 88.05% of 2.26s total
Dropped 15 nodes (cum <= 0.01s)
Showing top 10 nodes out of 12
flat flat% sum% cum cum%
0.93s 41.15% 41.15% 2.03s 89.82% _~/scratch/vecdeepdive.mulOperator.next
0.47s 20.80% 61.95% 0.73s 32.30% _~/scratch/vecdeepdive.mulIntDatums
0.36s 15.93% 77.88% 0.36s 15.93% _~/scratch/vecdeepdive.(*tableReader).next
0.16s 7.08% 84.96% 0.26s 11.50% runtime.convT64
0.07s 3.10% 88.05% 0.10s 4.42% runtime.mallocgc
0 0% 88.05% 2.03s 89.82% _~/scratch/vecdeepdive.BenchmarkRowBasedInterface
0 0% 88.05% 0.03s 1.33% runtime.(*mcache).nextFree
0 0% 88.05% 0.02s 0.88% runtime.(*mcache).refill
0 0% 88.05% 0.02s 0.88% runtime.(*mcentral).cacheSpan
0 0% 88.05% 0.02s 0.88% runtime.(*mcentral).grow
(pprof) list next
ROUTINE ======================== _~/scratch/vecdeepdive.mulOperator.next in ~/scratch/vecdeepdive/row_based_interface.go
930ms 2.03s (flat, cum) 89.82% of Total
. . 39:
60ms 60ms 40:func (m mulOperator) next() []Datum {
120ms 480ms 41: row := m.input.next()
50ms 50ms 42: if row == nil {
. . 43: return nil
. . 44: }
250ms 250ms 45: for _, c := range m.columnsToMultiply {
420ms 1.16s 46: row[c] = m.fn(row[c], m.arg)
. . 47: }
30ms 30ms 48: return row
. . 49:}
. . 50:
We can see that out of 2030ms, the mulOperator spends 480ms getting rows from the input, and 1160ms performing the multiplication. 420ms of those are spent in next before even calling m.fn (the left column is the flat time, i.e., time spent on that line, while the right column is the cumulative time, which also includes the time spent in the function called on that line). Since it seems like the majority of time is spent multiplying arguments, let’s take a closer look at mulIntDatums:
(pprof) list mulIntDatums
Total: 2.26s
ROUTINE ======================== _~/scratch/vecdeepdive.mulIntDatums in ~/scratch/vecdeepdive/row_based_interface.go
470ms 730ms (flat, cum) 32.30% of Total
. . 10:
70ms 70ms 11:func mulIntDatums(a Datum, b Datum) Datum {
20ms 20ms 12: aInt := a.(Int).int64
90ms 90ms 13: bInt := b.(Int).int64
290ms 550ms 14: return Int{int64: aInt * bInt}
. . 15:}
As expected, the majority of the time spent in mulIntDatums is on the multiplication line. Let’s take a closer look at what’s going on under the hood here by using the disasm (disassemble) command (some instructions are omitted):
(pprof) disasm mulIntDatums
. . 1173491: MOVQ 0x28(SP), AX ;row_based_interface.go:12
20ms 20ms 1173496: LEAQ type.*+228800(SB), CX ;_~/scratch/vecdeepdive.mulIntDatums row_based_interface.go:12
. . 117349d: CMPQ CX, AX ;row_based_interface.go:12
. . 11734a0: JNE 0x1173505
. . 11734a2: MOVQ 0x30(SP), AX
. . 11734a7: MOVQ 0(AX), AX
90ms 90ms 11734aa: MOVQ 0x38(SP), DX ;_~/scratch/vecdeepdive.mulIntDatums row_based_interface.go:13
. . 11734af: CMPQ CX, DX ;row_based_interface.go:13
. . 11734b2: JNE 0x11734e9
. . 11734b4: MOVQ 0x40(SP), CX
. . 11734b9: MOVQ 0(CX), CX
70ms 70ms 11734bc: IMULQ CX, AX ;_~/scratch/vecdeepdive.mulIntDatums row_based_interface.go:14
60ms 60ms 11734c0: MOVQ AX, 0(SP)
90ms 350ms 11734c4: CALL runtime.convT64(SB)
Surprisingly, only 70ms is spent executing the IMULQ instruction, which is the instruction that ultimately performs the multiplication. The majority of the time is spent calling convT64, which is a Go runtime package function that is used (in this case) to convert the Int type to the Datum interface.
The disassembled view of the function suggests that most of the time spent “multiplying” values actually goes to converting the arguments from Datums to Ints and the result from an Int back to a Datum.
Using concrete types
To avoid the overhead of these conversions, we would need to work with concrete types. This is a tough spot to be in, since the execution engine we’ve been discussing uses interfaces to be type-agnostic. Without using interfaces, each operator would need to have knowledge about the type it is working with. In other words, we would need to implement an operator for each type.
Luckily, we have the prior research of the MonetDB team to guide us. Given their work, we knew that the pain caused by removing the interfaces would be justified by huge potential performance improvements.
Later, we will take a look at how we got away with using concretely-typed operators to avoid typecasts for performance reasons, without sacrificing all of the maintainability that comes from using Go’s type-agnostic interfaces. First, let’s look at what will replace the Datum interface:
type T int

const (
    // Int64Type is a value of type int64
    Int64Type T = iota
    // Float64Type is a value of type float64
    Float64Type
)

type TypedDatum struct {
    t       T
    int64   int64
    float64 float64
}

type TypedOperator interface {
    next() []TypedDatum
}
A datum now has a field for each possible type it may contain, rather than having separate interface implementations for each type. There is an additional enum field that serves as a type marker, so that when we do need to, we can inspect the type of a Datum without doing any expensive type assertions. This type uses extra memory due to having a field for each type, even though only one of them will be used at a time. This could lead to CPU cache inefficiencies, but for this section we will skip over those concerns and focus on dealing with the interface interpretation overhead. In a later section, we’ll discuss the inefficiency more and address it.
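For illustration, accessing a value under this scheme looks roughly like this (a sketch, not code from the repo):

// doubleIfInt is a hypothetical helper showing the access pattern: a cheap
// tag check and a direct field read, with no interface type assertion.
func doubleIfInt(d TypedDatum) TypedDatum {
    if d.t == Int64Type {
        d.int64 *= 2
    }
    return d
}

// doubleIfInt(TypedDatum{t: Int64Type, int64: 21}) returns a datum holding 42.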
The mulInt64Operator will now look like this:
func (m mulInt64Operator) next() []TypedDatum {
    row := m.input.next()
    if row == nil {
        return nil
    }
    for _, c := range m.columnsToMultiply {
        row[c].int64 *= m.arg
    }
    return row
}
Note that the multiplication is now in place. Running the benchmark against this new version shows almost a 2x speed up.
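For reference, the typed input feeding this operator can be sketched as follows (a minimal version; the repo’s typedTableReader is along these lines):

type typedTableReader struct {
    curIdx int
    rows   [][]TypedDatum
}

func (t *typedTableReader) next() []TypedDatum {
    if t.curIdx >= len(t.rows) {
        return nil
    }
    row := t.rows[t.curIdx]
    t.curIdx++
    return row
}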
$ go test -bench "BenchmarkRowBasedTyped$" -count 10 > tmp && benchstat tmp && rm tmp
name time/op
RowBasedTyped-12 390µs ± 8%
However, now that we are writing specialized operators for each type, the amount of code we have to write has nearly doubled, and even worse, the code violates the maintainability principle of staying DRY (Don’t Repeat Yourself). The situation seems even worse if we consider that in a real database engine, there would be far more than two types to support. If someone were to slightly change the multiplication functionality (for example, adding overflow handling), they would have to rewrite every single operator, which is tedious and error-prone. The more types, the more work one has to do to update code.
Generating code with templates
Thankfully, there is a tool we can use to reduce this burden and keep the good performance characteristics of working with concrete types. The Go templating engine allows us to write a code template that, with a bit of work, we can trick our editor into treating it as a regular Go file. We have to use the templating engine because the version of Go we are currently using does not have support for generic types. Templating the multiplication operators would look like this (full template code is in row_based_typed_tmpl.go):
// {{/*
type _GOTYPE interface{}

// _MULFN assigns the result of the multiplication of the first and second
// operand to the first operand.
func _MULFN(_ TypedDatum, _ interface{}) {
    panic("do not call from non-templated code")
}

// */}}

// {{ range .}}
type mul_TYPEOperator struct {
    input             TypedOperator
    arg               _GOTYPE
    columnsToMultiply []int
}

func (m mul_TYPEOperator) next() []TypedDatum {
    row := m.input.next()
    if row == nil {
        return nil
    }
    for _, c := range m.columnsToMultiply {
        _MULFN(row[c], m.arg)
    }
    return row
}

// {{ end }}
The accompanying code that generates the full row_based_typed.gen.go file is located in row_based_typed_gen.go. That code is executed by running go run ., which invokes the main() function in generate.go (omitted here for conciseness). The generator iterates over a slice of per-type information and fills in the template for each type. Note that a prior step is necessary to make the row_based_typed_tmpl.go file valid Go: in the template, we use tokens that are valid Go (e.g. _GOTYPE and _MULFN). These tokens’ declarations are wrapped in template comments and removed in the final generated file.
For example, the multiplication function (_MULFN) is converted to a method call with the same arguments:
// Replace all functions.
mulFnRe := regexp.MustCompile(`_MULFN\((.*),(.*)\)`)
s = mulFnRe.ReplaceAllString(s, `{{ .MulFn "$1" "$2" }}`)
MulFn is called when executing the template, and then returns the Go code to perform the multiplication according to type-specific information. Take a look at the final generated code in row_based_typed.gen.go.
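To give a flavor of how this could work, here’s a hedged sketch of the kind of method the template data might expose (typeInfo, its fields, and this exact signature are assumptions, and fmt and strings imports are assumed; the real generator lives in row_based_typed_gen.go):

// typeInfo drives one instantiation of the template.
type typeInfo struct {
    TYPE   string // e.g. "Int64"; expands mul_TYPEOperator to mulInt64Operator
    GOTYPE string // e.g. "int64"; expands the _GOTYPE token
}

// MulFn renders a type-specific, in-place multiplication, turning
// _MULFN(row[c], m.arg) into code like `row[c].int64 *= m.arg`.
func (t typeInfo) MulFn(target, arg string) string {
    return fmt.Sprintf("%s.%s *= %s",
        strings.TrimSpace(target), t.GOTYPE, strings.TrimSpace(arg))
}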
The templating approach we took has some rough edges, and certainly is not a very flexible implementation. Nonetheless, it is a critical part of the real vectorized execution engine that we built in CockroachDB, and it was simple enough to build without getting sidetracked by creating a robust domain-specific language. Now, if we want to add functionality or fix a bug, we can modify the template once and regenerate the code for all operators. Now that the code is a little more manageable and extensible, let’s try to improve the performance further.
NOTE: To make the code in the rest of this blog post easier to read, we won’t use code generation for the following operator rewrites.
Batching expensive calls
Repeating our benchmarking process from before shows us some useful next steps.
$ go test -bench "BenchmarkRowBasedTyped$" -cpuprofile cpu.out -o row_typed_bench
$ go tool pprof ./row_typed_bench cpu.out
(pprof) list next
ROUTINE ======================== _~/scratch/vecdeepdive.mulInt64Operator.next in ~/scratch/vecdeepdive/row_based_typed.gen.go
1.26s 1.92s (flat, cum) 85.71% of Total
. . 8: input TypedOperator
. . 9: arg int64
. . 10: columnsToMultiply []int
. . 11:}
. . 12:
180ms 180ms 13:func (m mulInt64Operator) next() []TypedDatum {
170ms 830ms 14: row := m.input.next()
. . 15: if row == nil {
. . 16: return nil
. . 17: }
330ms 330ms 18: for _, c := range m.columnsToMultiply {
500ms 500ms 19: row[c].int64*= m.arg
. . 20: }
80ms 80ms 21: return row
. . 22:}
This part of the profile shows that approximately half of the time spent in the mulInt64Operator.next function goes to calling m.input.next() (see line 14 in the listing above). This isn’t surprising if we look at our implementation of (*typedTableReader).next(): it’s a lot of code just to advance to the next element in a slice. We can’t optimize the typedTableReader too much, since we need to preserve its ability to be chained to any other SQL operator that we may implement. But there is another important optimization we can do: instead of calling the next function once for each row, we can get back a batch of rows and operate on all of them at once, without changing too much about (*typedTableReader).next. We can’t just get all the rows at once, because some queries might result in a huge dataset that won’t fit in memory, but we can pick a reasonably large batch size.
With this optimization, we have operators like the ones below. Once again, the full code for this new version is omitted, since there are a lot of boilerplate changes. Full code examples can be found in row_based_typed_batch.go.
type mulInt64BatchOperator struct {
    input             TypedBatchOperator
    arg               int64
    columnsToMultiply []int
}

func (m mulInt64BatchOperator) next() [][]TypedDatum {
    rows := m.input.next()
    if rows == nil {
        return nil
    }
    for _, row := range rows {
        for _, c := range m.columnsToMultiply {
            row[c] = TypedDatum{t: Int64Type, int64: row[c].int64 * m.arg}
        }
    }
    return rows
}
// batchSize is the number of rows returned per call to next().
const batchSize = 1024

type typedBatchTableReader struct {
    curIdx int
    rows   [][]TypedDatum
}

func (t *typedBatchTableReader) next() [][]TypedDatum {
    if t.curIdx >= len(t.rows) {
        return nil
    }
    endIdx := t.curIdx + batchSize
    if endIdx > len(t.rows) {
        endIdx = len(t.rows)
    }
    retRows := t.rows[t.curIdx:endIdx]
    t.curIdx = endIdx
    return retRows
}
With this batching change, the benchmarks run nearly 3x faster (and 5.5x faster than the original implementation):
$ go test -bench "BenchmarkRowBasedTypedBatch$" -count 10 > tmp && benchstat tmp && rm tmp
name time/op
RowBasedTypedBatch-12 137µs ±77%
Column-oriented Data
But we are still a long way from our “speed of light” performance of 19 microseconds per operation. Does the new profile give us more clues?
$ go test -bench "BenchmarkRowBasedTypedBatch" -cpuprofile cpu.out -o row_typed_batch_bench
$ go tool pprof ./row_typed_batch_bench cpu.out
(pprof) list next
Total: 990ms
ROUTINE ======================== _~/scratch/vecdeepdive.mulInt64BatchOperator.next in ~/scratch/vecdeepdive/row_based_typed_batch.go
950ms 950ms (flat, cum) 95.96% of Total
. . 15:func (m mulInt64BatchOperator) next() [][]TypedDatum {
. . 16: rows := m.input.next()
. . 17: if rows == nil {
. . 18: return nil
. . 19: }
210ms 210ms 20: for _, row := range rows {
300ms 300ms 21: for _, c := range m.columnsToMultiply {
440ms 440ms 22: row[c] = TypedDatum{t: Int64Type, int64: row[c].int64 * m.arg}
. . 23: }
. . 24: }
. . 25: return rows
. . 26:}
Now the time spent calling (*typedBatchTableReader).next barely registers in the profile! That is much better. The profile shows that lines 20-22 are probably the best place to focus our efforts next: well above 95% of the time is spent there. That is partially a good sign, because these lines implement the core logic of our operator.
However, there is certainly still room for improvement. Approximately half of the time spent in these three lines goes to iterating through the loops, not to the loop body itself. If we think about the sizes of the loops, this becomes clearer. The length of the rows batch is 1,024, but the length of columnsToMultiply is just 1. Since the rows loop is the outer loop, this means that we are setting up this tiny inner loop – initializing a counter, incrementing it, and checking the boundary condition – 1,024 times! We could avoid all that repeated work simply by changing the order of the two loops.
Although we won’t go into a full exploration of CPU architecture in this post, there are two important concepts that come into play when changing the loop order: branch prediction and pipelining. In order to speed up execution, CPUs use a technique called pipelining to begin executing the next instruction before the preceding one is completed. This works well in the case of sequential code, but whenever there are conditional branches, the CPU cannot identify with certainty what the next instruction after the branch will be. However, it can make a guess as to which branch will be followed. If the CPU guesses incorrectly, the work that the CPU has already performed to begin evaluating the next instruction will go to waste. Modern CPUs are able to make predictions based on static code analysis, and even the results of previous evaluations of the same branch.
Changing the order of the loops comes with another benefit. Since the outer loop will now tell us which column to operate on, we can load all the data for that column at once, and store it in memory in one contiguous slice. A critical component of modern CPU architecture is the cache subsystem. In order to avoid loading data from main memory too often, which is a relatively slow operation, CPUs have layers of caches that provide fast access to frequently used pieces of data, and they can also prefetch data into these caches if the access pattern is predictable. In the row-based example, we would load all the data for each row, which would include columns that were not at all affected by the operator, so not as much relevant data would fit into the CPU cache. Orienting the data we are going to operate on by column provides a CPU with exactly the predictability and dense memory-packing that it needs to make ideal use of its caches.
For a fuller treatment of pipelining, branch prediction, and CPU caches see Dan Luu’s branch prediction talk notes, his CPU cache blog post, or Dave Cheney’s notes from his High Performance Go Workshop.
The code below shows how we could make the loop and data orientation changes described above, and also define a few new types at the same time to make the code easier to work with.
type vector interface {
    // Type returns the type of data stored in this vector.
    Type() T
    // Int64 returns an int64 slice.
    Int64() []int64
    // Float64 returns a float64 slice.
    Float64() []float64
}

type colBatch struct {
    size int
    vecs []vector
}

func (m mulInt64ColOperator) next() colBatch {
    batch := m.input.next()
    if batch.size == 0 {
        return batch
    }
    for _, c := range m.columnsToMultiply {
        vec := batch.vecs[c].Int64()
        for i := range vec {
            vec[i] = vec[i] * m.arg
        }
    }
    return batch
}
The reason we introduced the new vector type is so that we could have one struct that can represent a batch of data of any type. The struct has a slice field for each type, but only one of these slices will ever be non-nil. You may have noticed that we have now re-introduced some interface conversion, but the performance price we pay for it is amortized, thanks to batching, over every value in the column.
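As one concrete (hypothetical) implementation of that interface, each column could be backed by a struct like this:

// memColumn stores one column's worth of data; only the slice matching t
// is ever non-nil.
type memColumn struct {
    t        T
    int64s   []int64
    float64s []float64
}

func (m *memColumn) Type() T            { return m.t }
func (m *memColumn) Int64() []int64     { return m.int64s }
func (m *memColumn) Float64() []float64 { return m.float64s }

Let’s take a look at the benchmark now.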
$ go test -bench "BenchmarkColBasedTyped" -count 10 > tmp && benchstat tmp && rm tmp
name time/op
ColBasedTyped-12 38.2µs ±24%
This is another ~3.5x improvement, and a ~20x improvement over the original row-at-a-time version! Our speed of light benchmark is still about 2x faster than this latest version, since there is overhead in reading each batch and navigating to the columns on which to operate. For the purposes of this post, we will stop our optimization efforts here, but we are always looking for ways to make our real vectorized engine faster.
Conclusion
By analyzing the profiles of our toy execution engine’s code and employing the ideas proposed in the MonetDB/x100 paper, we were able to identify performance problems and implement solutions that improved the performance of multiplying 65,536 rows by a factor of 20x. We also used code generation to write templated code that is then generated into specific implementations for each concrete type.
In CockroachDB, we incorporated all of the changes presented in this blog post into our vectorized execution engine. This resulted in improving the CPU time of our own microbenchmarks by up to 70x, and the end-to-end latency of some queries in the industry-standard TPC-H benchmark by as much as 4x. The end-to-end latency improvement we achieved is a lot smaller than the improvement achieved in our toy example, but note that we only focused on improving the in-memory execution of a query in this blog post. When running TPC-H queries on CockroachDB, data needs to be read from disk in its original row-oriented format before processing, which will account for the lion’s share of the query’s execution latency. Nevertheless, this is a great improvement.
In CockroachDB 19.2, you will be able to enjoy these performance benefits on many common scan, join and aggregation queries. Here’s a demonstration of the original sample query from this blog post, which runs nearly 2 times as fast with our new vectorized engine:
root@127.0.0.1:64128/defaultdb> CREATE TABLE inventory (id INT PRIMARY KEY, price FLOAT);
CREATE TABLE
Time: 2.78ms
root@127.0.0.1:64128/defaultdb> INSERT INTO inventory SELECT id, random()*10 FROM generate_series(1,10000000) g(id);
INSERT 10000000
Time: 521.757ms
root@127.0.0.1:64128/defaultdb> EXPLAIN SELECT count(*) FROM inventory WHERE price * 0.8 > 3;
      tree      |    field    |     description
+---------------+-------------+----------------------+
                | distributed | true
                | vectorized  | true
  group         |             |
  │             | aggregate 0 | count_rows()
  │             | scalar      |
  └── render    |             |
       └── scan |             |
                | table       | inventory@primary
                | spans       | ALL
                | filter      | (price * 0.8) > 3.0
(10 rows)
Time: 3.076ms
The EXPLAIN plan for this query shows that the vectorized field is true, which means that the query will be run with the vectorized engine by default. And, sure enough, running this query with the engine on and off shows a modest performance difference:
root@127.0.0.1:64128/defaultdb> SELECT count(*) FROM inventory WHERE price * 0.8 > 3;
count
+---------+
6252335
(1 row)
Time: 3.587261s
root@127.0.0.1:64128/defaultdb> set vectorize=off;
SET
Time: 283µs
root@127.0.0.1:64128/defaultdb> SELECT count(*) FROM inventory WHERE price * 0.8 > 3;
count
+---------+
6252335
(1 row)
Time: 5.847703s
In CockroachDB 19.2, the new vectorized engine is automatically enabled for supported queries that are likely to read more rows than the vectorize_row_count_threshold setting (which defaults to 1,024). Queries with buffering operators that could potentially use an unbounded amount of memory (like global sorts, hash joins, and unordered aggregations) are implemented but not yet enabled by default. For full details of what is and isn’t on by default, check out the vectorized execution engine docs. And to learn more about how we built more complicated vectorized operators, check out our blog posts on the vectorized hash joiner and the vectorized merge joiner.