Hadoop Source Code Analysis: the Partitioner Class


说文科技, 2022-01-26 10:53:58

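© Copyright belongs to the author: this is an original work by 51CTO blog author 说文科技. Please contact the author for permission to reprint; otherwise legal liability will be pursued.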


1. Class definition

public abstract class Partitioner<KEY,VALUE>
extends Object

An abstract class
Extends Object

2. Class description

Partitions the key space.
Partitioner controls the partitioning of the keys of the intermediate map-outputs. The key (or a subset of the key) is used to derive the partition, typically by a hash function. The total number of partitions is the same as the number of reduce tasks for the job. Hence this controls which of the m reduce tasks the intermediate key (and hence the record) is sent for reduction.
Note: If you require your Partitioner class to obtain the Job’s configuration object, implement the Configurable interface.

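In short, the partitioner divides up the key space.

Partitioner controls how the keys of the map output are partitioned. The key (or a subset of the key) is used to derive the partition, typically through a hash function. The total number of partitions equals the number of reducers for the job. […]

Note: if your Partitioner class needs to obtain the Job's configuration object, implement the Configurable interface.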
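
To make "the key (or a subset of the key) is used to derive the partition" concrete, here is a minimal sketch in plain Java with no Hadoop dependency. The class name `KeySubsetPartitionSketch`, the method `partitionByPrefix`, and the `country:city` key layout are all hypothetical and exist only for illustration.

```java
// Sketch only: a plain-Java illustration, not Hadoop code.
final class KeySubsetPartitionSketch {

    // Derives the partition from the part of the key before ':' (a hypothetical
    // composite-key layout such as "country:city"), so that all records sharing
    // that prefix end up in the same partition.
    static int partitionByPrefix(String key, int numPartitions) {
        int sep = key.indexOf(':');
        String subset = (sep >= 0) ? key.substring(0, sep) : key;
        // Clear the sign bit so the result is never negative.
        return (subset.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 4; // in a real job this equals the number of reduce tasks
        for (String key : new String[] {"fr:paris", "fr:lyon", "jp:tokyo"}) {
            System.out.println(key + " -> partition " + partitionByPrefix(key, numPartitions));
        }
    }
}
```

Because only the prefix is hashed, `fr:paris` and `fr:lyon` always land in the same partition, whatever the city part is.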

3. Method details

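abstract int getPartition(KEY key, VALUE value, int numPartitions)
Get the partition number for a given key (hence record) given the total number of partitions i.e. number of reduce-tasks for the job.

In other words, given the total number of partitions, the method returns a partition number for the given key (and therefore for the record).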

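Note that getPartition(KEY key, VALUE value, int numPartitions) takes a KEY and a VALUE. These types must stay in sync with the mapper's output: whatever types the mapper emits as <key, value>, KEY and VALUE here must be the same types.

key is the intermediate key emitted by the mapper; value is the corresponding entry value; numPartitions is the number of partitions, which equals the number of Reducers. The number of Reducers defaults to 1, so for most MapReduce jobs that do not set it explicitly, numPartitions is 1, which guarantees a single output. To set it explicitly, use the following code: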

job.setNumReduceTasks(3); // set the number of reducers to 3
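
The two points above (the type parameters follow the mapper output, and numPartitions follows the number of reducers) can be sketched without Hadoop on the classpath. `SketchPartitioner` below is a local stand-in that only mirrors the shape of the abstract class; `WordLengthPartitioner` and its routing rule are hypothetical. A real job would use Writable types such as Text and IntWritable instead of String and Integer.

```java
// Sketch only: SketchPartitioner is a local stand-in that mirrors the shape of
// Hadoop's Partitioner<KEY, VALUE>; it is not the Hadoop class itself.
final class TypedPartitionerSketch {

    abstract static class SketchPartitioner<KEY, VALUE> {
        abstract int getPartition(KEY key, VALUE value, int numPartitions);
    }

    // Assumes a mapper that emits <String, Integer> pairs, so the partitioner
    // is declared with the same two type parameters.
    static final class WordLengthPartitioner extends SketchPartitioner<String, Integer> {
        @Override
        int getPartition(String key, Integer value, int numPartitions) {
            // Hypothetical rule: route words by their length.
            return key.length() % numPartitions;
        }
    }

    public static void main(String[] args) {
        SketchPartitioner<String, Integer> partitioner = new WordLengthPartitioner();
        String[] words = {"hadoop", "hive", "spark"};
        // 1 is the default number of reducers; 3 matches job.setNumReduceTasks(3).
        for (int numPartitions : new int[] {1, 3}) {
            for (String word : words) {
                System.out.println("numPartitions=" + numPartitions + ", " + word
                        + " -> " + partitioner.getPartition(word, 1, numPartitions));
            }
        }
    }
}
```

With numPartitions equal to 1 the only possible partition number is 0, which is why a job left at the default produces a single output.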

4. Implementing classes

Partitioner is an abstract class. Its commonly used implementations include HashPartitioner. The implementing classes are described one by one below.

4.1 `HashPartitioner`

4.1.1 Class description

Partition keys by their Object.hashCode().
In other words, each key's partition is derived from the value returned by its hashCode() method.

4.1.2 Class methods

HashPartitioner() constructor

HashPartitioner()

getPartition()

int getPartition(K key, V value, int numReduceTasks)
Use Object.hashCode() to partition.

Parameters:
key - the key to be partitioned.
value - the entry value.
numReduceTasks - the total number of partitions.
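
As a rough sketch of what "use Object.hashCode() to partition" amounts to: to the best of my knowledge the Hadoop source computes `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`, but treat that formula as an assumption and check it against the version you run. The class and method names below (`HashPartitionSketch`, `hashPartition`, `naivePartition`) are hypothetical, and the code is plain Java.

```java
// Sketch only: plain Java, no Hadoop dependency. The masked formula is assumed
// to match HashPartitioner; verify it against your Hadoop version.
final class HashPartitionSketch {

    // Clears the sign bit of the hash, then takes it modulo the number of
    // reduce tasks, so the result always lies in [0, numReduceTasks).
    static <K> int hashPartition(K key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Naive variant, shown only for contrast: a negative hash code yields a
    // negative (invalid) partition number.
    static <K> int naivePartition(K key, int numReduceTasks) {
        return key.hashCode() % numReduceTasks;
    }

    public static void main(String[] args) {
        Integer key = -7; // Integer.hashCode() returns the int value itself
        System.out.println("masked: " + hashPartition(key, 3));
        System.out.println("naive:  " + naivePartition(key, 3));
    }
}
```

The mask matters because hashCode() may be negative and Java's `%` keeps the sign of the dividend.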

5. Usage details

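In a MapReduce computation you often need to split the final output data into separate files. The final output is written by the Reducer tasks, so getting several output files means running the same number of Reducer tasks. The Reducers take their input from the Mappers, so the Mapper output has to be partitioned and then sent to the Reducers. The default implementation that performs this partitioning is HashPartitioner.
Note that the role of the partitioner is to send the KEY/VALUE pairs emitted by the mapper to different reducers according to the developer's intent. That intent is expressed in the concrete implementation of getPartition(...).
A MapReduce job usually runs with only one reducer task, so even when a partitioning algorithm is in use, every record is assigned to the same partition. To send different <KEY, VALUE> pairs to different partitions, you must also configure a matching number of reducer tasks.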
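
The effect of the reducer count can be simulated in plain Java. The sketch below is not Hadoop code: `ShuffleSketch`, `partitionOf`, and `shuffle` are hypothetical names, and the hash-based rule is assumed to behave like the default partitioner. It groups a handful of mapper output keys by partition number, where each group stands for the input of one reducer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch only: simulates how partition numbers group map output per reducer.
final class ShuffleSketch {

    // Hash-based rule, assumed to behave like the default partitioner.
    static int partitionOf(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Groups keys by partition number; each map entry stands for the input
    // of one reduce task.
    static Map<Integer, List<String>> shuffle(List<String> keys, int numReduceTasks) {
        Map<Integer, List<String>> byPartition = new TreeMap<>();
        for (String key : keys) {
            byPartition
                    .computeIfAbsent(partitionOf(key, numReduceTasks), p -> new ArrayList<>())
                    .add(key);
        }
        return byPartition;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("apple", "banana", "cherry", "apple", "date");
        for (int reducers : new int[] {1, 3}) {
            System.out.println(reducers + " reducer(s): " + shuffle(keys, reducers));
        }
    }
}
```

With one reducer every key falls into partition 0. With three reducers the keys are spread over at most three groups, and both occurrences of `apple` always reach the same one.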