The error message

ERROR: Row of size 488.29 KB could not be materialized in plan node with id 1. Increase the max_row_size query option (currently 500.00 B) to process larger rows.
More commonly it looks like this:

ERROR: Row of size xxxx MB could not be materialized in plan node with id 1. Increase the max_row_size query option (currently 524288.00 B) to process larger rows.

The fix

Increase MAX_ROW_SIZE:

set MAX_ROW_SIZE=5242880;
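In impala-shell you can list the current query options before changing anything, and the option also accepts a size suffix (the official example further down uses set max_row_size=7mb;). A small sketch:

-- List all query options and their current values (impala-shell)
set;

-- Raise the limit for this session only; the same 5 MB written with a size suffix
set MAX_ROW_SIZE=5mb;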

The cause

Copying the fix is easy, but we should understand the root cause; otherwise we will forever be stuck searching for other people's answers.

The official documentation is linked below; let's work through it.

https://impala.apache.org/docs/build/html/topics/impala_max_row_size.html

What does the MAX_ROW_SIZE option mean?

Ensures that Impala can process rows of at least the specified size. (Larger rows might be successfully processed, but that is not guaranteed.) Applies when constructing intermediate or final rows in the result set. This setting prevents out-of-control memory use when accessing columns containing huge strings.

In short, it is the row size Impala guarantees it can process; rows somewhat larger than this value might still succeed, but that is not guaranteed.

Default value

524288 (512 KB)

Why 512 KB? My take is that this is already a generous bound, because a single row of 512 KB is very large. To get a feel for it: strings dominate most tables; an ASCII character is 1 byte, and a wide character such as a Chinese character takes roughly 3 to 4 bytes in UTF-8, so 1 KB = 1024 bytes holds about 256 such characters. That means a table with 512 columns, each holding 256 multi-byte characters, would land exactly on this limit; or a handful of large desc/comment columns adding up to roughly 130,000 multi-byte characters would do it. That sounds nearly impossible, but sometimes it happens anyway, and then what?
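A quick sanity check of that back-of-the-envelope arithmetic, runnable directly in impala-shell (the 4 bytes per character is just the rough assumption above):

-- 512 columns x 256 characters x ~4 bytes per character
select 512 * 256 * 4 as bytes_per_row;   -- 524288, exactly the 512 KB default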

Things to note when using it

If a query fails because it involves rows with long strings and/or many columns, causing the total row size to exceed MAX_ROW_SIZE bytes, increase the MAX_ROW_SIZE setting to accommodate the total bytes stored in the largest row. Examine the error messages for any failed queries to see the size of the row that caused the problem.

Impala attempts to handle rows that exceed the MAX_ROW_SIZE value where practical, so in many cases, queries succeed despite having rows that are larger than this setting.

Specifying a value that is substantially higher than actually needed can cause Impala to reserve more memory than is necessary to execute the query.

In a Hadoop cluster with highly concurrent workloads and queries that process high volumes of data, traditional SQL tuning advice about minimizing wasted memory is worth remembering. For example, if a table has STRING columns where a single value might be multiple megabytes, make sure that the SELECT lists in queries only refer to columns that are actually needed in the result set, instead of using the SELECT * shorthand.

If a query fails because its rows contain long strings and/or many columns, so that the total row size exceeds MAX_ROW_SIZE bytes, raise MAX_ROW_SIZE to fit the largest row in your table; the error message tells you the row size that caused the failure.

Rows that exceed the setting may still succeed.

Setting the value much higher than actually needed can make Impala reserve more memory than the query requires.

Avoid select *; list only the columns you actually need, e.g. select id, name.
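A sketch of that last point, using a hypothetical table t with a multi-megabyte comment column (t, id, name and comment are made-up names for illustration):

-- Drags the huge comment column through the scan even though nobody reads it
select * from t;

-- Touches only the columns the result actually needs
select id, name from t;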

Hands-on test

Practice is the sole criterion for testing truth! Talk is cheap, so let's try it.

create table test.big_strings (s1 string, s2 string, s3 string) stored as parquet;
set compression_codec=none;
insert into test.big_strings values ('one', 'two', 'three');
insert into test.big_strings values (repeat('12345',100000), 'short', repeat('123',10000));
insert into test.big_strings values (repeat('12345',100000), 'short', repeat('12345',100000));
compute stats test.big_strings;


The first insert is tiny: 3 + 3 + 5 bytes.

The second inserts 500,000 + 5 + 30,000 bytes.

The third inserts 500,000 + 5 + 500,000 bytes.
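You can verify those sizes with length():

select length(s1) as s1_len, length(s2) as s2_len, length(s3) as s3_len from test.big_strings;
-- One row per insert above: 3/3/5, 500000/5/30000, 500000/5/500000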

Now let's set max_row_size to a very small value and see whether it errors out.

set max_row_size=500;

(Screenshot from the original post: count(distinct s1) fails with the 488.29 KB / 500.00 B error quoted at the top.)

488.29 KB exceeds 500 B.

The 500 B is the value we just set with set max_row_size=500;.

Where does the 488.29 KB come from? Note that the query was count(distinct s1), and the largest s1 value is 500,000 characters, i.e.

500000 / 1024 = 488.28125 KB. A character like the digits in '12345' takes 1 byte; a Chinese character such as 陈 or 池 takes 3 to 4 bytes in UTF-8.

So 500,000 digits are 500,000 bytes, roughly 488.28 KB (the extra 0.01 KB in the message is presumably a few bytes of per-row overhead), which is far above the 500 B we set, so count(distinct s1) fails.
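The same arithmetic as a query:

select 500000 / 1024 as kib;   -- 488.28125 KiB of pure string data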

So how large does it have to be before the error goes away? Can we just set max_row_size=500000?


It turns out that at roughly 300,000 the query already stops failing, which confirms what the documentation says: queries can succeed even when a row exceeds MAX_ROW_SIZE. With repeated testing, at around 260,000 the query succeeds only some of the time, not on every run.
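The probing is just re-running the same query under different settings; the exact thresholds will vary with your environment:

set max_row_size=500000;   -- comfortably above the ~488 KB row: always succeeds
select count(distinct s1) from test.big_strings;

set max_row_size=300000;   -- below the row size, but still succeeded in my tests
select count(distinct s1) from test.big_strings;

set max_row_size=260000;   -- around here it only succeeds part of the time
select count(distinct s1) from test.big_strings;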

That raises a question: why does the demo use count(distinct s1)? Wouldn't a plain select * from the table do?

After all, select * also has to read every column. Let's see which queries actually fail:

select distinct s1 from test.big_strings;   -- fails!!

select s1 from test.big_strings group by s1;   -- fails!!!

select concat('aa', s1) from test.big_strings;   -- no error

select s1.s1, s2.s1
from test.big_strings s1 join test.big_strings s2
on s1.s2 = s2.s2;   -- fails!!!

With max_row_size set to 200000, the error message is:

Row of size 488.31 KB could not be materialized in plan node with id 2. Increase the max_row_size query option (currently 195.31 KB) to process larger rows
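As an aside, the "plan node with id 2" in that message maps onto the numbered operators in the query plan; running EXPLAIN on the failing query shows which node (the hash join here) is the one trying to materialize the big strings. A sketch (the exact plan on your cluster may differ):

explain select s1.s1, s2.s1
from test.big_strings s1 join test.big_strings s2
on s1.s2 = s2.s2;
-- The plan lists numbered operators such as 00:SCAN HDFS, 01:SCAN HDFS, 02:HASH JOIN;
-- the id in the error message refers to one of these numbers.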

Borrowing the official example

-- A very small row. This insert always succeeds because it is tiny.
insert into big_strings values ('one', 'two', 'three');
-- A row right around the default MAX_ROW_SIZE limit: a 500 KiB string and a 30 KiB string.
-- s1 is 500,000 bytes, just under the default 512 KiB; s3 is 30,000 bytes, roughly 30 KiB.
insert into big_strings values (repeat('12345',100000), 'short', repeat('123',10000));
-- A row that is too big if the query has to materialize both S1 and S3.
-- s1 and s3 are each 500,000 bytes, close to 500 KB; materializing both at once clearly exceeds 512 KB.
insert into big_strings values (repeat('12345',100000), 'short', repeat('12345',100000));

Note!!! My wording above is loose. The point is not whether a column is merely queried; the key word in the error is materialize, i.e. whether the value has to be built into an in-memory row.

With the default MAX_ROW_SIZE setting, different queries succeed or fail based on which column values have to be materialized during query processing:

-- All the S1 values can be materialized within the 512 KB MAX_ROW_SIZE buffer.
select count(distinct s1) from big_strings;
-- This succeeds because each s1 value is at most about 500 KB < 512 KB.
+--------------------+
| count(distinct s1) |
+--------------------+
| 2                  |
+--------------------+

-- A row where even the S1 value is too large to materialize within MAX_ROW_SIZE.
-- Now insert a really big one: 5,000,000 characters, roughly 5 MB > 512 KB.
insert into big_strings values (repeat('12345',1000000), 'short', repeat('12345',1000000));

-- The 5 MiB string is too large to materialize. The message explains the size of the result
-- set row the query is attempting to materialize.
-- (The 4.77 MB in the warning below is exactly those 5,000,000 bytes: 5000000 / 1024 / 1024 ≈ 4.77.)
select count(distinct(s1)) from big_strings;
WARNINGS: Row of size 4.77 MB could not be materialized in plan node with id 1.
  Increase the max_row_size query option (currently 512.00 KB) to process larger rows.

-- The previous query failed, so this one certainly does too, because even more has to be materialized:
-- roughly 5,000,000 + 5,000,000 bytes (plus a few bytes of s2) ≈ 9.54 MB.
-- If more columns are involved, the result set row being materialized is bigger.
select count(distinct s1, s2, s3) from big_strings;
WARNINGS: Row of size 9.54 MB could not be materialized in plan node with id 1.
  Increase the max_row_size query option (currently 512.00 KB) to process larger rows.

-- Querying s2 on its own is fine because all of its values are tiny.
-- Column S2, containing only short strings, can still be examined.
select count(distinct(s2)) from big_strings;
+----------------------+
| count(distinct (s2)) |
+----------------------+
| 2                    |
+----------------------+

-- No error here either: count(*) never needs to materialize s1, s2 or s3, so no large values are involved.
-- Queries that do not materialize the big column values are OK.
select count(*) from big_strings;
+----------+
| count(*) |
+----------+
| 4        |
+----------+

The following examples show how adjusting MAX_ROW_SIZE upward allows queries involving the long string columns to succeed:

-- As computed above, the largest s1 value is 5,000,000 bytes, about 4.77 MB.
-- Boosting MAX_ROW_SIZE moderately allows all S1 values to be materialized.
set max_row_size=7mb; -- 7 MB comfortably covers 4.77 MB, so this query now succeeds

select count(distinct s1) from big_strings;
+--------------------+
| count(distinct s1) |
+--------------------+
| 3                  |
+--------------------+

-- But s1 + s3 together are still about 9.5 MB, so this still fails.
-- But the combination of S1 + S3 strings is still too large.
select count(distinct s1, s2, s3) from big_strings;
WARNINGS: Row of size 9.54 MB could not be materialized in plan node with id 1. Increase the max_row_size query option (currently 7.00 MB) to process larger rows.

-- So the answer is simply to keep raising it.
-- Boosting MAX_ROW_SIZE to larger than the largest row in the table allows
-- all queries to complete successfully.
set max_row_size=12mb;

select count(distinct s1, s2, s3) from big_strings;
+----------------------------+
| count(distinct s1, s2, s3) |
+----------------------------+
| 4                          |
+----------------------------+

The following examples show how to reason about appropriate values for MAX_ROW_SIZE, based on the characteristics of the columns containing the long values:

-- How do we decide how large MAX_ROW_SIZE actually needs to be?
-- With a large MAX_ROW_SIZE in place, we can examine the columns to
-- understand the practical lower limit for MAX_ROW_SIZE based on the
-- table structure and column values.
-- This finds the row with the largest combined string length. INT/DECIMAL columns, at only a few bytes each,
-- can safely be ignored. Note the division is by 1e6 rather than 1024*1024, so the result is in decimal
-- megabytes, i.e. an approximation.
select max(length(s1) + length(s2) + length(s3)) / 1e6 as megabytes from big_strings;
+-----------+
| megabytes |
+-----------+
| 10.000005 |
+-----------+
The result is about 10 (decimal) MB, which matches the 9.54 MiB reported in the earlier warning (10,000,005 / 1024 / 1024 ≈ 9.54).
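If you prefer the figure in the same binary units the error message uses, divide by 1024*1024 instead:

select max(length(s1) + length(s2) + length(s3)) / (1024 * 1024) as mebibytes from big_strings;
-- about 9.54, matching the 9.54 MB reported in the earlier warning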

-- We can also examine the 'Max Size' for each column after computing stats.
compute stats big_strings;
show column stats big_strings;
+--------+--------+------------------+--------+----------+-----------+
| Column | Type   | #Distinct Values | #Nulls | Max Size | Avg Size  |
+--------+--------+------------------+--------+----------+-----------+
| s1     | STRING | 2                | -1     | 5000000  | 2500002.5 |
| s2     | STRING | 2                | -1     | 10       | 7.5       |
| s3     | STRING | 2                | -1     | 5000000  | 2500005   |
+--------+--------+------------------+--------+----------+-----------+
This view also shows the maximum value size of each column, which makes it easy to spot which column has been stuffed with hundreds of thousands of characters.
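Adding up the Max Size of every column a query has to materialize gives a lower bound for MAX_ROW_SIZE; round it up with some headroom, which is exactly what the official example did:

-- s1 + s2 + s3: 5000000 + 10 + 5000000 bytes ≈ 9.54 MiB, so 12 MB leaves headroom
set max_row_size=12mb;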

Related information:

BUFFER_POOL_LIMIT Query Option, DEFAULT_SPILLABLE_BUFFER_SIZE Query Option, MIN_SPILLABLE_BUFFER_SIZE Query Option, Scalability Considerations for Impala
