I am running a Hadoop map-reduce job to do some text processing. The job reaches 99.2% completion and then gets stuck on the last map task.

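The last few lines of the map output are shown below. The last time this happened, I printed the keys emitted by the map and noticed that one key had a very large number of values associated with it; I think the job got stuck while sorting those values. I then stopped emitting that key from the map job, and the job ran fine.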

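I think the same problem is happening again, and printing the key-value pairs is tedious because the job takes so long. Is there a better option? For example, can Hadoop be configured to skip a few keys if it spends too much time sorting them? Does anything like that exist?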

2010-10-20 14:43:32,274 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full= true
2010-10-20 14:43:32,274 INFO org.apache.hadoop.mapred.MapTask: bufstart = 0; bufend = 79698262; bufvoid = 99614720
2010-10-20 14:43:32,274 INFO org.apache.hadoop.mapred.MapTask: kvstart = 0; kvend = 6601; length = 327680
2010-10-20 14:43:33,272 INFO org.apache.hadoop.mapred.MapTask: Finished spill 0
2010-10-20 14:50:44,113 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full= true
2010-10-20 14:50:44,113 INFO org.apache.hadoop.mapred.MapTask: bufstart = 79698262; bufend = 59800449; bufvoid = 99614720
2010-10-20 14:50:44,113 INFO org.apache.hadoop.mapred.MapTask: kvstart = 6601; kvend = 9039; length = 327680
2010-10-20 14:50:44,864 INFO org.apache.hadoop.mapred.MapTask: Finished spill 1
2010-10-20 14:58:33,105 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full= true
2010-10-20 14:58:33,105 INFO org.apache.hadoop.mapred.MapTask: bufstart = 59800449; bufend = 39893455; bufvoid = 99614720
2010-10-20 14:58:33,105 INFO org.apache.hadoop.mapred.MapTask: kvstart = 9039; kvend = 11228; length = 327680
2010-10-20 14:58:33,817 INFO org.apache.hadoop.mapred.MapTask: Finished spill 2
2010-10-20 15:06:48,675 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full= true
2010-10-20 15:06:48,675 INFO org.apache.hadoop.mapred.MapTask: bufstart = 39893455; bufend = 20000988; bufvoid = 99614720
2010-10-20 15:06:48,675 INFO org.apache.hadoop.mapred.MapTask: kvstart = 11228; kvend = 13286; length = 327680
2010-10-20 15:06:49,395 INFO org.apache.hadoop.mapred.MapTask: Finished spill 3
2010-10-20 15:15:23,514 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full= true
2010-10-20 15:15:23,514 INFO org.apache.hadoop.mapred.MapTask: bufstart = 20000988; bufend = 78879; bufvoid = 99614720
2010-10-20 15:15:23,514 INFO org.apache.hadoop.mapred.MapTask: kvstart = 13286; kvend = 15265; length = 327680
2010-10-20 15:15:24,230 INFO org.apache.hadoop.mapred.MapTask: Finished spill 4
2010-10-20 15:24:35,797 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full= true
2010-10-20 15:24:35,797 INFO org.apache.hadoop.mapred.MapTask: bufstart = 78879; bufend = 79807573; bufvoid = 99614720
2010-10-20 15:24:35,797 INFO org.apache.hadoop.mapred.MapTask: kvstart = 15265; kvend = 17188; length = 327680
2010-10-20 15:24:36,500 INFO org.apache.hadoop.mapred.MapTask: Finished spill 5
2010-10-20 15:33:33,391 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full= true
2010-10-20 15:33:33,391 INFO org.apache.hadoop.mapred.MapTask: bufstart = 79807573; bufend = 59907680; bufvoid = 99614720
2010-10-20 15:33:33,391 INFO org.apache.hadoop.mapred.MapTask: kvstart = 17188; kvend = 19074; length = 327680
2010-10-20 15:33:34,114 INFO org.apache.hadoop.mapred.MapTask: Finished spill 6
2010-10-20 15:42:39,913 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full= true
2010-10-20 15:42:39,913 INFO org.apache.hadoop.mapred.MapTask: bufstart = 59907680; bufend = 40011208; bufvoid = 99614720
2010-10-20 15:42:39,913 INFO org.apache.hadoop.mapred.MapTask: kvstart = 19074; kvend = 20926; length = 327680
2010-10-20 15:42:40,597 INFO org.apache.hadoop.mapred.MapTask: Finished spill 7
2010-10-20 15:51:49,668 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full= true
2010-10-20 15:51:49,668 INFO org.apache.hadoop.mapred.MapTask: bufstart = 40011208; bufend = 20111383; bufvoid = 99614720
2010-10-20 15:51:49,668 INFO org.apache.hadoop.mapred.MapTask: kvstart = 20926; kvend = 22759; length = 327680
2010-10-20 15:51:50,378 INFO org.apache.hadoop.mapred.MapTask: Finished spill 8
2010-10-20 16:01:05,893 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full= true
2010-10-20 16:01:05,893 INFO org.apache.hadoop.mapred.MapTask: bufstart = 20111383; bufend = 196929; bufvoid = 99614720
2010-10-20 16:01:05,894 INFO org.apache.hadoop.mapred.MapTask: kvstart = 22759; kvend = 24572; length = 327680
2010-10-20 16:01:06,634 INFO org.apache.hadoop.mapred.MapTask: Finished spill 9
2010-10-20 16:10:25,000 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full= true
2010-10-20 16:10:25,000 INFO org.apache.hadoop.mapred.MapTask: bufstart = 196929; bufend = 79900267; bufvoid = 99614720
2010-10-20 16:10:25,000 INFO org.apache.hadoop.mapred.MapTask: kvstart = 24572; kvend = 26370; length = 327680
2010-10-20 16:10:25,776 INFO org.apache.hadoop.mapred.MapTask: Finished spill 10
2010-10-20 16:19:48,283 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full= true
2010-10-20 16:19:48,283 INFO org.apache.hadoop.mapred.MapTask: bufstart = 79900267; bufend = 59993676; bufvoid = 99614720
2010-10-20 16:19:48,284 INFO org.apache.hadoop.mapred.MapTask: kvstart = 26370; kvend = 28152; length = 327680
2010-10-20 16:19:49,042 INFO org.apache.hadoop.mapred.MapTask: Finished spill 11


Best answer

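Nothing in Hadoop knows in advance that a particular call to map() will emit a huge number of key-value pairs. I'm guessing that your map() function contains some kind of loop that emits these pairs. You could simply code that loop to short-circuit once it has emitted more than N pairs.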
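To make the short-circuit idea concrete, here is a minimal sketch in Python. It is not your actual mapper: `generate_pairs`, `capped_map`, and `MAX_PAIRS_PER_CALL` are hypothetical names, and the token-splitting loop is only a stand-in for whatever your map() really does. If your job is written in Java, the same cap would go inside the loop of your map() method.

```python
from itertools import islice
from typing import Iterator, Tuple

# Hypothetical cap N on pairs emitted per map() call; tune it for your data.
MAX_PAIRS_PER_CALL = 1000


def generate_pairs(line: str) -> Iterator[Tuple[str, str]]:
    """Hypothetical stand-in for the loop inside map(): one pair per token."""
    for token in line.split():
        yield token.lower(), "1"


def capped_map(line: str, limit: int = MAX_PAIRS_PER_CALL) -> Iterator[Tuple[str, str]]:
    """Emit at most `limit` pairs for one input record, then stop."""
    return islice(generate_pairs(line), limit)
```

Keep in mind that this cap applies to each map() call separately and silently drops everything past the limit, so the job's output becomes approximate. It also does not bound how many values a single key collects across many calls.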

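Another option is to find a way to partition the input values so that the mappers process finer-grained chunks and all of them end up doing roughly the same amount of work.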
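One way to read that suggestion is as a pre-processing step that breaks oversized input records into smaller pieces before the job runs. The sketch below assumes a record can be treated as a sequence of tokens that may be processed independently, which may not hold for your data; `split_record` and `chunk_size` are hypothetical names.

```python
from typing import Iterator, List, Sequence


def split_record(tokens: Sequence[str], chunk_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `tokens`, each at most `chunk_size` long."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(tokens), chunk_size):
        yield list(tokens[start:start + chunk_size])
```

Writing each slice out as its own input record means no single map() call has to handle the whole oversized record, which should spread the work more evenly across mappers.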

I'm not sure what you're trying to do, so these suggestions may not apply. Hope this helps.