How to Write MapReduce Code

Reposted

mob604756fcd161 2011-12-01 04:16:00


For background on MapReduce, see: http://en.wikipedia.org/wiki/MapReduce

This article assumes you already have some Hadoop programming experience.

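The Mapper takes raw input, such as website logs, parses it, and emits intermediate results. These are sorted and grouped to become the Reducer's input; the Reducer aggregates them and writes the result. Of course, a job may chain several such passes.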
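
As a rough illustration of the Mapper side, the sketch below assumes a hypothetical whitespace-separated access log with the requested path in the seventh field (as in Common Log Format) and emits one tab-separated "path, 1" record per valid line. The names `map_line` and `run_mapper`, the field layout, and the sample lines are assumptions made for this example, not something from the original article.

```perl
use strict;
use warnings;

# Turn one raw log line into at most one "key\tvalue" record.
# Returns undef for lines that should be filtered out.
# Assumed (hypothetical) layout: whitespace-separated fields with the
# requested path in the seventh field, as in Common Log Format.
sub map_line {
    my ($line) = @_;
    chomp $line;
    my @fields = split ' ', $line;
    return if @fields < 7;                 # malformed line: skip it
    return join "\t", $fields[6], 1;       # key = path, value = a count of 1
}

# Read lines from a filehandle and collect the mapper output.
sub run_mapper {
    my ($in) = @_;
    my @out;
    while ( my $line = <$in> ) {
        my $record = map_line($line);
        push @out, $record if defined $record;
    }
    return @out;
}

# Demo on in-memory data; under Hadoop Streaming you would pass \*STDIN.
my $sample = join "\n",
    '10.0.0.1 - - [01/Dec/2011:04:16:00 +0000] "GET /home HTTP/1.1" 200 512',
    'garbage line',
    '10.0.0.2 - - [01/Dec/2011:04:16:01 +0000] "GET /about HTTP/1.1" 200 128';
open my $fh, '<', \$sample or die "cannot open in-memory handle: $!";
print "$_\n" for run_mapper($fh);
close $fh;
```

Each input line produces at most one output record, and the malformed line is dropped in the mapper instead of being shipped to the reducers.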

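The Mapper is relatively simple, but it requires a deep understanding of the input: not only its format but also its meaning. A few points to keep in mind: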

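Try not to expand one input record into multiple output records, because this increases network traffic.
Choose the partition key carefully: it determines how records are spread across reducers (and how many reducers actually receive work), so make sure the keys are distributed as evenly as possible.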
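
One way to sanity-check a candidate partition key before running a job is to count how many sample records would land on each reducer. The sketch below is only an approximation: it uses a byte-sum checksum (Perl's `unpack '%32C*'`) modulo the reducer count as a stand-in for the real partitioner, so the exact assignment on a cluster will differ. The function name `reducer_load` and the sample keys are hypothetical.

```perl
use strict;
use warnings;

# Count how many records each of $n reducers would receive.
# NOTE: the checksum is only a stand-in for the real partitioner.
sub reducer_load {
    my ( $n, @keys ) = @_;
    my @load = (0) x $n;
    for my $key (@keys) {
        my $bucket = unpack( '%32C*', $key ) % $n;   # sum of byte values mod n
        $load[$bucket]++;
    }
    return @load;
}

# A skewed key set: most records share the key "home".
my @keys = ( ('home') x 8, 'about', 'faq' );
my @load = reducer_load( 2, @keys );
printf "reducer %d: %d records\n", $_, $load[$_] for 0 .. $#load;
```

Because identical keys always go to the same reducer, a key that dominates the input overloads one reducer no matter which hash function is used; the fix is a finer-grained key, not a different partitioner.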

Reducers, in fact, follow a ready-made template, and that is what I want to focus on. The examples below are all written in Perl.

For simple input, the template looks like this:

# read configuration
# initialize global variables
# initialize key-level counters
# initialize group-level counters
# initialize final counters

### reset all key-level counters and record the new key in $cur_key
sub onBeginKey {}

### aggregate counts for the current key
sub onSameKey {}

### print out the counters for the key that just ended
sub onEndKey {}

### main loop
while (<STDIN>) {
  chomp($_);

  # step 1:filter input

  # step 2: split input

  # step 3: get group and key

  # main logic
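  if ( defined $cur_key ) {
    if ( $key ne $cur_key ) {
      &onEndKey();
      &onBeginKey();
    }
    &onSameKey();
  }
  else {
    &onBeginKey();
    &onSameKey();
  }
}
# flush the last key; test with defined so that a key of "0" or "" still counts
if ( defined $cur_key ) {
  &onEndKey();
}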
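
To make the template concrete, here is a minimal sketch that fills in the three callbacks for a per-key sum. It assumes, hypothetically, that every input line is a tab-separated key and count, already sorted by key the way Hadoop Streaming delivers it. The `reduce_stream` wrapper, the `@output` buffer, and the sample data are additions for this example so that it can run without a cluster.

```perl
use strict;
use warnings;

my ( $cur_key, $key, $value );   # current key, and fields of the current line
my $key_count = 0;               # key-level counter
my @output;                      # collected "key\ttotal" lines

sub onBeginKey { $cur_key = $key; $key_count = 0; }       # new key: reset
sub onSameKey  { $key_count += $value; }                  # aggregate
sub onEndKey   { push @output, "$cur_key\t$key_count"; }  # emit finished key

sub reduce_stream {
    my ($in) = @_;
    undef $cur_key;
    @output = ();
    while ( my $line = <$in> ) {
        chomp $line;
        next if $line eq '';                     # step 1: filter input
        ( $key, $value ) = split /\t/, $line;    # steps 2 and 3: split, get key
        next unless defined $value;              # skip malformed records
        if ( !defined $cur_key ) {               # very first record
            onBeginKey();
        }
        elsif ( $key ne $cur_key ) {             # key changed
            onEndKey();
            onBeginKey();
        }
        onSameKey();
    }
    onEndKey() if defined $cur_key;              # flush the last key
    return @output;
}

# Demo on in-memory data; under Hadoop Streaming you would pass \*STDIN.
my $sample = "0\t4\nabout\t1\nhome\t2\nhome\t3\n";
open my $fh, '<', \$sample or die "cannot open in-memory handle: $!";
print "$_\n" for reduce_stream($fh);
close $fh;
```

The sample deliberately includes the key "0": a plain truth test on the current key would treat it as "no key yet", which is why the sketch tests with `defined`.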

For more complex input, where keys are nested inside groups, the template looks like this:

# read configuration
# initialize global variables
# initialize key-level counters
# initialize group-level counters
# initialize final counters

### reset all group-level counters and record the new group in $cur_group
sub onBeginGroup {}

### reset all key-level counters and record the new key in $cur_key
sub onBeginKey {}

### add counts at the key level
sub onSameKey {}

### aggregate counts from the key level to the group level
sub onEndKey {}

### aggregate counts from the group level to the final result
sub onEndGroup {}

### main loop
while (<STDIN>) {
  chomp($_);

  # step 1:filter input

  # step 2: split input

  # step 3: get group and key

  # main logic
  if ( defined $cur_group ) {
    if ( $group ne $cur_group ) {
      &onEndKey();
      &onEndGroup();
      &onBeginGroup();
      &onBeginKey();
    }
    else {
      if ( $key ne $cur_key ) {
        &onEndKey();
        &onBeginKey();
      }    # else: still the same key
    }
    &onSameKey();
  }
  else {
    &onBeginGroup();
    &onBeginKey();
    &onSameKey();
  }
}
# flush the last key and group; test with defined so that "0" or "" still counts
if ( defined $cur_group ) {
  &onEndKey();
  &onEndGroup();
}

### print out the final counter
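
As a sketch of how the two-level template might be filled in, the example below assumes (hypothetically) that each input line holds a tab-separated group, key, and hit count, sorted by group and then by key. For each group it reports the number of distinct keys and the total hits, then a grand total. The field layout, the `reduce_stream` wrapper, the `TOTAL` label, and the sample data are all assumptions made for illustration.

```perl
use strict;
use warnings;

my ( $cur_group, $cur_key, $group, $key, $value );
my ( $key_hits, $group_hits, $group_keys, $final_hits ) = ( 0, 0, 0, 0 );
my @output;

sub onBeginGroup { $cur_group = $group; $group_hits = 0; $group_keys = 0; }
sub onBeginKey   { $cur_key = $key; $key_hits = 0; }
sub onSameKey    { $key_hits += $value; }
sub onEndKey     { $group_hits += $key_hits; $group_keys++; }
sub onEndGroup {
    push @output, join "\t", $cur_group, $group_keys, $group_hits;
    $final_hits += $group_hits;
}

sub reduce_stream {
    my ($in) = @_;
    undef $cur_group;
    undef $cur_key;
    $final_hits = 0;
    @output     = ();
    while ( my $line = <$in> ) {
        chomp $line;
        ( $group, $key, $value ) = split /\t/, $line;
        next unless defined $value;          # skip malformed records
        if ( !defined $cur_group ) {         # very first record
            onBeginGroup();
            onBeginKey();
        }
        elsif ( $group ne $cur_group ) {     # group changed
            onEndKey();
            onEndGroup();
            onBeginGroup();
            onBeginKey();
        }
        elsif ( $key ne $cur_key ) {         # same group, new key
            onEndKey();
            onBeginKey();
        }
        onSameKey();
    }
    if ( defined $cur_group ) {              # flush the last key and group
        onEndKey();
        onEndGroup();
    }
    push @output, "TOTAL\t$final_hits";      # print out the final counter
    return @output;
}

# Demo on in-memory data; under Hadoop Streaming you would pass \*STDIN.
my $sample = join "\n",
    "siteA\talice\t1", "siteA\talice\t2", "siteA\tbob\t1", "siteB\talice\t4";
open my $fh, '<', \$sample or die "cannot open in-memory handle: $!";
print "$_\n" for reduce_stream($fh);
close $fh;
```

The key-level callbacks are what make the distinct-key count possible: `onEndKey` fires exactly once per key within a group, regardless of how many input lines that key had.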