Comparing serialization technologies such as protobuf, Hessian, and MsgPack: an analysis of how protobuf serialization works

Reposted

mob64ca140ee96c 2024-05-07 09:38:13

How it works

How is serialization implemented?

`message sku_feature { int64 sku_id = 1; int32 cid1 = 2; float price = 3; int32 cid2 = 4; int32 cid3 = 5; }`

Tag - Length - Value (TLV) encoding

Each field is represented as tag - length - value, and all fields are concatenated into a single byte stream; that byte stream is the encoded message.

tag

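A tag is a uint32 computed as (field_number << 3) | wire_type, and is itself written as a Varint.

Example: int64 sku_id = 1;

int64 uses wire type 0, so the tag is (1 << 3) | 0 = 8.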
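The tag formula above can be sketched as a pair of small helpers. This is a minimal illustration, not protobuf's own API: `MakeTag`, `TagFieldNumber`, and `TagWireType` are hypothetical names, and the upper bound on field numbers (2^29 - 1) is taken from the protobuf language guide.

```cpp
#include <cassert>
#include <cstdint>

// Wire types used in this article.
enum WireType : std::uint32_t {
    kVarint = 0,           // int32, int64, uint32, uint64, sint32, sint64, bool, enum
    kFixed64 = 1,          // fixed64, sfixed64, double
    kLengthDelimited = 2,  // string, bytes, nested messages, packed repeated fields
    kFixed32 = 5           // fixed32, sfixed32, float
};

// Hypothetical helper: builds the tag value from field number and wire type.
inline std::uint32_t MakeTag(std::uint32_t field_number, WireType type) {
    assert(field_number >= 1 && field_number <= 536870911u);  // 2^29 - 1
    return (field_number << 3) | static_cast<std::uint32_t>(type);
}

// Inverse operations: split a tag back into its two parts.
inline std::uint32_t TagFieldNumber(std::uint32_t tag) { return tag >> 3; }
inline std::uint32_t TagWireType(std::uint32_t tag) { return tag & 0x7; }
```

For the `sku_feature` message, `MakeTag(1, kVarint)` gives 8 for `sku_id`, and `MakeTag(3, kFixed32)` gives 29 for the float field `price`.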

length

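Optional. Currently only wire type 2 (length-delimited) carries a length; for a string, for example, it stores the length of the string in bytes.

value

Each wire type encodes its value differently. The sections below cover them one by one.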

1 Encoding when Wire Type = 0

Two encodings are involved: Varint and Zigzag.

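1.1 Introduction to Varint encoding

Definition: a variable-length encoding for integers.
Principle: split the value into 7-bit groups, least significant group first, and prepend to each group 1 bit that indicates whether another group follows.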

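Small numbers therefore take fewer bytes, which is where the compression comes from.

With Varint, a small int32 value (0 to 127) fits in 1 byte.
A large int32 value needs 5 bytes, one more than a fixed 4-byte representation, but most messages carry mostly small numbers, so Varint usually ends up using fewer bytes overall.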
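To make the 7-bit grouping concrete, here is a minimal standalone sketch of the encoder. `AppendVarint` and `EncodeVarint` are hypothetical helper names chosen for this example; they are not protobuf functions.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: appends the Varint encoding of `value` to `out`.
// The low 7 bits go first; the top bit of each byte means "more bytes follow".
inline void AppendVarint(std::uint64_t value, std::vector<std::uint8_t>* out) {
    assert(out != nullptr);
    while (value >= 0x80) {
        out->push_back(static_cast<std::uint8_t>((value & 0x7F) | 0x80));
        value >>= 7;
    }
    out->push_back(static_cast<std::uint8_t>(value));  // last byte, top bit clear
}

// Convenience wrapper that returns the encoded bytes.
inline std::vector<std::uint8_t> EncodeVarint(std::uint64_t value) {
    std::vector<std::uint8_t> bytes;
    AppendVarint(value, &bytes);
    return bytes;
}
```

With this sketch, 1 encodes to the single byte 0x01, 300 encodes to 0xAC 0x02, and the largest 32-bit value 0xFFFFFFFF takes 5 bytes.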

How to parse Varint-encoded bytes

Encoding
inline uint8* CodedOutputStream::WriteVarint64ToArray(uint64 value,
                                                      uint8* target) {
    while (value >= 0x80) {
        *target = static_cast<uint8>(value | 0x80);
        value >>= 7;
        ++target;
    }
    *target = static_cast<uint8>(value);
    return target + 1;
}
Decoding
bool CodedInputStream::ReadVarint64Slow(uint64* value) {
    // Slow path:  This read might cross the end of the buffer, so we
    // need to check and refresh the buffer if and when it does.

    uint64 result = 0;
    int count = 0;
    uint32 b;

    do {
        if (count == kMaxVarintBytes) {
            *value = 0;
            return false;
        }
        while (buffer_ == buffer_end_) {
            if (!Refresh()) {
                *value = 0;
                return false;
            }
        }
        b = *buffer_;
        result |= static_cast<uint64>(b & 0x7F) << (7 * count);
        Advance(1);
        ++count;
    } while (b & 0x80);

    *value = result;
    return true;
}
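The decoder above depends on the stream's internal buffer state. As a rough standalone counterpart, the sketch below decodes from a plain byte vector. `DecodeVarint` is a hypothetical helper, and the 10-byte limit is assumed to match `kMaxVarintBytes` for 64-bit values.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: decodes one Varint starting at buf[*pos].
// On success it stores the value, advances *pos past it, and returns true.
// It returns false on truncated input or when more than 10 bytes are chained.
inline bool DecodeVarint(const std::vector<std::uint8_t>& buf, std::size_t* pos,
                         std::uint64_t* value) {
    assert(pos != nullptr && value != nullptr);
    const int kMaxBytes = 10;  // ceil(64 / 7)
    std::uint64_t result = 0;
    std::size_t p = *pos;
    for (int count = 0; count < kMaxBytes; ++count) {
        if (p >= buf.size()) return false;  // ran out of input
        const std::uint8_t b = buf[p++];
        result |= static_cast<std::uint64_t>(b & 0x7F) << (7 * count);
        if ((b & 0x80) == 0) {  // continuation bit clear: this was the last byte
            *value = result;
            *pos = p;
            return true;
        }
    }
    return false;  // continuation bit still set after 10 bytes
}
```

Feeding it the bytes 0xAC 0x02 yields 300 and leaves the position at 2, ready for the next field.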

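Shortcomings of Varint encoding

Problem: a negative number encoded directly as a Varint always takes the maximum number of bytes, which in protobuf is 10.

The reason is that in two's complement the most significant bit of a negative number is 1. For example, the int32 value -1 is 11111111 11111111 11111111 11111111. Those 32 bits alone would need ceil(32 / 7) = 5 bytes, and protobuf sign-extends negative int32 values to 64 bits before encoding, so the actual cost is ceil(64 / 7) = 10 bytes.

Solution: for the sint32 / sint64 types, protobuf first applies Zigzag encoding and then Varint encoding.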
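The cost of negative values can be checked with a small size calculation. The sketch below assumes the behaviour of protobuf's int32 type, where a negative value is sign-extended to 64 bits before Varint encoding; `VarintSize` and `Int32VarintSize` are hypothetical helper names.

```cpp
#include <cstdint>

// Hypothetical helper: number of bytes a Varint needs for `value`.
inline int VarintSize(std::uint64_t value) {
    int size = 1;
    while (value >= 0x80) {
        value >>= 7;
        ++size;
    }
    return size;
}

// Encoded size of a plain int32 value: the value is sign-extended to
// 64 bits first, so every negative number has its top bit set.
inline int Int32VarintSize(std::int32_t n) {
    return VarintSize(static_cast<std::uint64_t>(static_cast<std::int64_t>(n)));
}
```

Under these assumptions `Int32VarintSize(1)` is 1, while `Int32VarintSize(-1)` is 10.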

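1.2 Zigzag encoding in detail

Definition: a mapping from signed integers to unsigned integers, applied before Varint encoding.
Principle: signed numbers are represented as unsigned numbers so that values with a small absolute value map to small unsigned values: 0 → 0, -1 → 1, 1 → 2, -2 → 3, and so on.
Effect: any number with a small absolute value can be encoded in few bytes.
Example: Zigzag-encoding the int32 value -2 gives (-2 << 1) ^ (-2 >> 31) = 3, which then fits in a single Varint byte.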

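Zigzag encoding makes up for Varint's weakness with negative numbers and so helps Protocol Buffers keep the data compact.
So if you know in advance that a field may hold negative values, remember to declare it as sint32 / sint64.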

inline uint32 WireFormatLite::ZigZagEncode32(int32 n) {
    // Note:  the right-shift must be arithmetic
    // Note:  left shift must be unsigned because of overflow
    return (static_cast<uint32>(n) << 1) ^ static_cast<uint32>(n >> 31);
}

inline int32 WireFormatLite::ZigZagDecode32(uint32 n) {
    // Note:  Using unsigned types prevents undefined behavior
    return static_cast<int32>((n >> 1) ^ (~(n & 1) + 1));
}
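The mapping can be tried out with standalone copies of the two functions. This sketch uses the same expressions as the code above with standard fixed-width types, and adds a hypothetical `VarintSize` helper to show the effect on encoded size. It assumes the compiler performs an arithmetic right shift on negative values, as the original comment requires.

```cpp
#include <cstdint>

// Standalone copy of the Zigzag mapping: 0, -1, 1, -2, 2 ... -> 0, 1, 2, 3, 4 ...
inline std::uint32_t ZigZagEncode32(std::int32_t n) {
    // n >> 31 is all ones for negative n (arithmetic shift assumed), else zero.
    return (static_cast<std::uint32_t>(n) << 1) ^ static_cast<std::uint32_t>(n >> 31);
}

inline std::int32_t ZigZagDecode32(std::uint32_t n) {
    // ~(n & 1) + 1 is all ones when the low bit is set, else zero.
    return static_cast<std::int32_t>((n >> 1) ^ (~(n & 1) + 1));
}

// Hypothetical helper: number of bytes a Varint needs for `value`.
inline int VarintSize(std::uint64_t value) {
    int size = 1;
    while (value >= 0x80) {
        value >>= 7;
        ++size;
    }
    return size;
}
```

Here `ZigZagEncode32(-2)` is 3 and `VarintSize(ZigZagEncode32(-1))` is 1, so a sint32 holding -1 costs a single value byte.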

Summary: with Varint and Zigzag encoding, Protocol Buffers substantially reduces the number of bytes that integer field values occupy.

2 Encoding and storage when Wire Type = 1 or 5

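Values are stored in a fixed number of bytes, in little-endian order: 8 bytes for wire type 1 (fixed64, sfixed64, double) and 4 bytes for wire type 5 (fixed32, sfixed32, float).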

inline uint64 WireFormatLite::EncodeDouble(double value) {
    union {double f; uint64 i;};
    f = value;
    return i;
}

inline double WireFormatLite::DecodeDouble(uint64 value) {
    union {double f; uint64 i;};
    i = value;
    return f;
}
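As an illustration of the fixed-width layout, the sketch below encodes the `float price = 3;` field from the `sku_feature` message. It swaps the union for `std::memcpy`, which is the portable way to reinterpret bits in standard C++, and assumes a 32-bit IEEE 754 `float`. `FloatBits`, `AppendFixed32`, and `EncodePriceField` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: reads the bit pattern of a float without a union.
inline std::uint32_t FloatBits(float value) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits");
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof(bits));
    return bits;
}

// Appends a 32-bit value in little-endian byte order (wire type 5).
inline void AppendFixed32(std::uint32_t bits, std::vector<std::uint8_t>* out) {
    assert(out != nullptr);
    for (int shift = 0; shift < 32; shift += 8) {
        out->push_back(static_cast<std::uint8_t>((bits >> shift) & 0xFF));
    }
}

// Encodes `float price = 3;`: one tag byte followed by 4 value bytes.
inline std::vector<std::uint8_t> EncodePriceField(float price) {
    std::vector<std::uint8_t> out;
    out.push_back(static_cast<std::uint8_t>((3 << 3) | 5));  // tag = 29
    AppendFixed32(FloatBits(price), &out);
    return out;
}
```

For `price = 1.0f` (bit pattern 0x3F800000) this produces the bytes 29, 0x00, 0x00, 0x80, 0x3F; the field always takes 5 bytes regardless of the value.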

3 Encoding and storage when Wire Type = 2

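Three kinds of data are covered:

String type
Nested message type (Message)
repeated fields marked with packed (packed repeated fields)

3.1 String type

The field value (the V) is the UTF-8 encoding of the string, and the length (the L) is its size in bytes.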

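Example:

message Test2
{
    required string str = 2;
}

// Set str to "testing" (pseudocode)
Test2.setStr("testing")

// After protobuf serialization, print the encoded bytes
// Output: 18, 7, 116, 101, 115, 116, 105, 110, 103
// 18 is the tag, (2 << 3) | 2; 7 is the length; the rest is "testing" in UTF-8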
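The byte sequence in this example can be reproduced with a short sketch. `AppendVarint` and `EncodeStringField` are hypothetical helpers, and the input string is assumed to be UTF-8 already.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: appends the Varint encoding of `value` to `out`.
inline void AppendVarint(std::uint64_t value, std::vector<std::uint8_t>* out) {
    assert(out != nullptr);
    while (value >= 0x80) {
        out->push_back(static_cast<std::uint8_t>((value & 0x7F) | 0x80));
        value >>= 7;
    }
    out->push_back(static_cast<std::uint8_t>(value));
}

// Encodes a string field as Tag - Length - Value (wire type 2).
inline std::vector<std::uint8_t> EncodeStringField(std::uint32_t field_number,
                                                   const std::string& utf8_text) {
    std::vector<std::uint8_t> out;
    AppendVarint((static_cast<std::uint64_t>(field_number) << 3) | 2, &out);  // tag
    AppendVarint(utf8_text.size(), &out);                                     // length in bytes
    out.insert(out.end(), utf8_text.begin(), utf8_text.end());                // value
    return out;
}
```

`EncodeStringField(2, "testing")` returns 18, 7, 116, 101, 115, 116, 105, 110, 103, matching the output listed above.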

3.2 Nested message type (Message)

Storage: T - L - V

The complete T - L - V encoding of the inner message becomes the V of the outer message.

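Example
Define the following nested messages:

message Test2
{
    required string str = 1;
    required int32 id1 = 2;
}

message Test3 {
  required Test2 c = 1;
}

// Set field str of Test2 to "testing"
// Set field id1 of Test2 to 296
// Encoded bytes: 10, 12, 10, 7, 116, 101, 115, 116, 105, 110, 103, 16, -88, 2
// 10, 12: tag of c, (1 << 3) | 2, and the length of the inner message
// 10, 7, ...: tag of str, (1 << 3) | 2, length 7, then "testing"
// 16, -88, 2: tag of id1, (2 << 3) | 0, then 296 as a Varint (0xA8 0x02; -88 is 0xA8 printed as a signed byte)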
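The nesting can be sketched by encoding the inner message first and then wrapping its bytes as the value of the outer field. The helpers below are hypothetical and hard-code the field numbers from the definitions above. Note that with `str = 1` the inner string tag works out to (1 << 3) | 2 = 10, and that 0xA8 (168) is the byte that prints as -88 when viewed as a signed byte.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Hypothetical helper: appends the Varint encoding of `value` to `out`.
inline void AppendVarint(std::uint64_t value, Bytes* out) {
    assert(out != nullptr);
    while (value >= 0x80) {
        out->push_back(static_cast<std::uint8_t>((value & 0x7F) | 0x80));
        value >>= 7;
    }
    out->push_back(static_cast<std::uint8_t>(value));
}

// Inner message: string str = 1; int32 id1 = 2;
inline Bytes EncodeTest2(const std::string& str, std::int32_t id1) {
    Bytes out;
    AppendVarint((1u << 3) | 2, &out);  // tag of str (length-delimited)
    AppendVarint(str.size(), &out);
    out.insert(out.end(), str.begin(), str.end());
    AppendVarint((2u << 3) | 0, &out);  // tag of id1 (varint)
    AppendVarint(static_cast<std::uint64_t>(static_cast<std::int64_t>(id1)), &out);
    return out;
}

// Outer message: Test2 c = 1; the inner bytes become the value.
inline Bytes EncodeTest3(const Bytes& inner) {
    Bytes out;
    AppendVarint((1u << 3) | 2, &out);  // tag of c (length-delimited)
    AppendVarint(inner.size(), &out);   // length of the inner message
    out.insert(out.end(), inner.begin(), inner.end());
    return out;
}
```

`EncodeTest3(EncodeTest2("testing", 296))` yields 10, 12, 10, 7, 116, 101, 115, 116, 105, 110, 103, 16, 168, 2.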

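3.3 repeated fields marked with packed

A repeated field can be declared in two ways (use one or the other; the same name and field number cannot be declared twice in one message):

message Test
{
    repeated int32 Car = 4 ;
    // Form 1: without packed=true

    repeated int32 Car = 4 [packed=true];
    // Form 2: with packed=true
    // available since protobuf 2.1
}


// In code, give `repeated int32 Car` three values: 3, 270, 86942 (pseudocode)

Test.addCar(3);
Test.addCar(270);
Test.addCar(86942);

Background: all values of one repeated field belong to the same field, i.e. they share the same data type and field number.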

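Problem: the values of one repeated field all carry the same Tag, so without packing that Tag is stored once per value, which is redundant.
Solution: use the packed=true form, which stores the Tag once, followed by one Length field and then the values back to back: Tag - Length - Value - Value - Value.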

Storing repeated fields with packed=true therefore shortens the serialized data.
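The two layouts can be compared side by side for the values 3, 270, and 86942 in field 4. This is a minimal sketch with hypothetical helper names; it covers only int32 values and is not how the protobuf library itself is structured.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Hypothetical helper: appends the Varint encoding of `value` to `out`.
inline void AppendVarint(std::uint64_t value, Bytes* out) {
    assert(out != nullptr);
    while (value >= 0x80) {
        out->push_back(static_cast<std::uint8_t>((value & 0x7F) | 0x80));
        value >>= 7;
    }
    out->push_back(static_cast<std::uint8_t>(value));
}

// int32 values are sign-extended to 64 bits before Varint encoding.
inline std::uint64_t Int32ToWire(std::int32_t v) {
    return static_cast<std::uint64_t>(static_cast<std::int64_t>(v));
}

// Form 1: every value carries its own tag (wire type 0).
inline Bytes EncodeUnpacked(std::uint32_t field_number,
                            const std::vector<std::int32_t>& values) {
    Bytes out;
    for (std::int32_t v : values) {
        AppendVarint((static_cast<std::uint64_t>(field_number) << 3) | 0, &out);
        AppendVarint(Int32ToWire(v), &out);
    }
    return out;
}

// Form 2: one tag (wire type 2), one length, then the values back to back.
inline Bytes EncodePacked(std::uint32_t field_number,
                          const std::vector<std::int32_t>& values) {
    Bytes payload;
    for (std::int32_t v : values) AppendVarint(Int32ToWire(v), &payload);
    Bytes out;
    AppendVarint((static_cast<std::uint64_t>(field_number) << 3) | 2, &out);
    AppendVarint(payload.size(), &out);
    out.insert(out.end(), payload.begin(), payload.end());
    return out;
}
```

For these three values the unpacked form is 20 03 20 8E 02 20 9E A7 05 (9 bytes) and the packed form is 22 06 03 8E 02 9E A7 05 (8 bytes). The saving grows with the number of elements, since the unpacked form repeats the tag for every value.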

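Note

packed applies only to repeated fields of primitive numeric types (not to string, bytes, or message fields).
Using it on any other field makes compilation of the .proto file fail.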

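Summary

protobuf encoding and decoding are simple, needing only basic arithmetic and bit shifts, so serialization and deserialization are fast.
protobuf combines compact encodings such as Varint and Zigzag with T - L - V storage, so the serialized data is compact.

Recommendations

Based on the analysis above:

Recommendation 1: keep field numbers (Field_Number) within 1-15 where possible, and avoid skipping numbers.
The Field_Number is part of the Tag and takes up space. Once Field_Number reaches 16 or more, the Tag no longer fits in 1 byte and takes 2 or more. Defining field numbers as consecutive, increasing values is also said to give better encoding and decoding performance.
Recommendation 2: if a field may hold negative values, use sint32 / sint64 rather than int32 / int64.
sint32 / sint64 values are Zigzag-encoded before Varint encoding, so negative numbers are stored far more compactly.
Recommendation 3: for repeated fields of numeric types, add packed=true where possible.
With packed=true the values are stored contiguously as T - L - V - V - V. (In proto3, repeated fields of scalar numeric types are packed by default.)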
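The threshold in recommendation 1 follows directly from the tag formula: the field number is shifted left by 3 bits before Varint encoding, so only 4 bits of field number fit in a one-byte tag. A small sketch with hypothetical helper names makes the boundaries visible.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: number of bytes a Varint needs for `value`.
inline int VarintSize(std::uint64_t value) {
    int size = 1;
    while (value >= 0x80) {
        value >>= 7;
        ++size;
    }
    return size;
}

// Bytes taken by the tag of a field. The wire type occupies the low 3 bits
// and never changes the size, so it is left as 0 here.
inline int TagSize(std::uint32_t field_number) {
    assert(field_number >= 1 && field_number <= 536870911u);  // 2^29 - 1
    return VarintSize(static_cast<std::uint64_t>(field_number) << 3);
}
```

`TagSize(15)` is 1 and `TagSize(16)` is 2; the tag grows to 3 bytes at field number 2048.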