Data Partitioning Explained
Five common ways to partition data:
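1. Random partitioning
Advantage: data is spread evenly across partitions (in expectation)
Disadvantage: records that share the same key are not guaranteed to land in the same partition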
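To make the trade-off concrete, here is a minimal sketch that sends many copies of one and the same record to random partitions. The class name RandomPartitionSketch, the fixed seed, and the record count are assumptions made only for this illustration.

```java
import java.util.Arrays;
import java.util.Random;

/** Sketch: random partitioning balances sizes but scatters identical records. */
final class RandomPartitionSketch {

    /** Sends `copies` identical records to random partitions and returns the per-partition counts. */
    static int[] scatter(int copies, int numPartitions, long seed) {
        Random random = new Random(seed); // fixed seed only so the demo is repeatable
        int[] counts = new int[numPartitions];
        for (int i = 0; i < copies; i++) {
            counts[random.nextInt(numPartitions)]++;
        }
        return counts;
    }

    /** Counts how many partitions received at least one record. */
    static int partitionsUsed(int[] counts) {
        int used = 0;
        for (int count : counts) {
            if (count > 0) {
                used++;
            }
        }
        return used;
    }

    public static void main(String[] args) {
        int[] counts = scatter(10_000, 5, 42L);
        System.out.println("records per partition: " + Arrays.toString(counts));
        System.out.println("partitions holding the same record: " + partitionsUsed(counts));
    }
}
```

The counts come out close to each other, but the same record shows up in every partition, so any per-key aggregation would need another shuffle afterwards.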
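2. Hash partitioning
Advantage: records that share the same key are guaranteed to land in the same partition
Disadvantage: prone to data skew when a few keys occur far more often than the rest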
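The skew is easy to reproduce with a tiny data set in which one key dominates. The sketch below is an illustration under assumptions: the class name HashSkewSketch and the sample records are made up for this example.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch: hash partitioning keeps equal keys together, so a hot key overloads one partition. */
final class HashSkewSketch {

    /** Masks off the sign bit of the hash code, then takes the remainder. */
    static int hashPartition(String key, int numPartitions) {
        // hashCode() may be negative; the mask keeps the index in [0, numPartitions).
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    /** Returns how many records each partition receives. */
    static int[] partitionSizes(List<String> records, int numPartitions) {
        int[] sizes = new int[numPartitions];
        for (String record : records) {
            sizes[hashPartition(record, numPartitions)]++;
        }
        return sizes;
    }

    /** Hypothetical sample: the key "a" is eight times as frequent as any other key. */
    static List<String> sampleRecords() {
        return Arrays.asList("a", "a", "a", "a", "a", "a", "a", "a", "b", "c", "d", "e");
    }

    public static void main(String[] args) {
        // prints [1, 1, 8, 1, 1]: every "a" lands in partition 2
        System.out.println(Arrays.toString(partitionSizes(sampleRecords(), 5)));
    }
}
```

Adding more partitions does not help here, because all copies of the hot key still hash to the same index.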
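3. Range partitioning
Advantage: speeds up queries, range queries in particular, because adjacent keys sit in the same partition
Disadvantage: some partitions can grow much larger than others and then have to be split to keep partition sizes even. If the data inside each partition is not sorted, splitting becomes very difficult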
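The following sketch shows why sorted partitions make splitting easy: the cut point is simply the middle of the list. The class name RangeSplitSketch and the rule of moving the cut forward so that equal keys stay together are assumptions of this example, not something prescribed above.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: splitting an oversized, already sorted range partition into two. */
final class RangeSplitSketch {

    /** Returns the index of the first element that goes into the right half. */
    static int splitPoint(List<String> sortedPartition) {
        int cut = sortedPartition.size() / 2;
        // Move the cut past duplicates of the boundary key so that one key never spans two ranges.
        while (cut > 0 && cut < sortedPartition.size()
                && sortedPartition.get(cut).equals(sortedPartition.get(cut - 1))) {
            cut++;
        }
        return cut;
    }

    /** Splits the sorted partition at the split point into a left and a right half. */
    static List<List<String>> split(List<String> sortedPartition) {
        int cut = splitPoint(sortedPartition);
        List<List<String>> halves = new ArrayList<>();
        halves.add(new ArrayList<>(sortedPartition.subList(0, cut)));
        halves.add(new ArrayList<>(sortedPartition.subList(cut, sortedPartition.size())));
        return halves;
    }

    /** Hypothetical oversized partition: 13 x "a", 5 x "b", then c, d, e, f. */
    static List<String> samplePartition() {
        List<String> partition = new ArrayList<>();
        for (int i = 0; i < 13; i++) partition.add("a");
        for (int i = 0; i < 5; i++) partition.add("b");
        partition.add("c");
        partition.add("d");
        partition.add("e");
        partition.add("f");
        return partition;
    }

    public static void main(String[] args) {
        List<List<String>> halves = split(samplePartition());
        System.out.println("left size:  " + halves.get(0).size());
        System.out.println("right size: " + halves.get(1).size());
    }
}
```

With unsorted data you would first have to sort the partition or run a selection algorithm to find the boundary key. Note also that a single very frequent key cannot be split this way: in the sample, the left half still holds all 13 copies of "a".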
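4. Round-robin partitioning
A kind of load-balancing algorithm
Advantage: guarantees that there is no data skew; partition sizes differ by at most one record
Disadvantage: cannot distribute storage/compute load according to the storage/compute capacity of each node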
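One way to see the limitation is to compare plain round-robin with a weighted variant. Weighted round-robin is not one of the five methods listed here; the sketch below is only an illustration, and the class name WeightedRoundRobinSketch and the weights are hypothetical.

```java
import java.util.Arrays;

/** Sketch: weighted round-robin, where a partition with weight w gets w slots per cycle. */
final class WeightedRoundRobinSketch {

    private final int[] schedule; // one entry per slot, holding the partition index
    private int cursor = 0;

    WeightedRoundRobinSketch(int[] weights) {
        int totalSlots = 0;
        for (int weight : weights) {
            if (weight < 0) {
                throw new IllegalArgumentException("weights must not be negative");
            }
            totalSlots += weight;
        }
        if (totalSlots == 0) {
            throw new IllegalArgumentException("at least one weight must be positive");
        }
        schedule = new int[totalSlots];
        int slot = 0;
        for (int partition = 0; partition < weights.length; partition++) {
            for (int k = 0; k < weights[partition]; k++) {
                schedule[slot++] = partition;
            }
        }
    }

    /** Returns the partition for the next record and advances the cursor. */
    int nextPartition() {
        int partition = schedule[cursor];
        cursor = (cursor + 1) % schedule.length;
        return partition;
    }

    /** Distributes `records` records and returns the per-partition counts. */
    static int[] distribute(int records, int[] weights) {
        WeightedRoundRobinSketch partitioner = new WeightedRoundRobinSketch(weights);
        int[] counts = new int[weights.length];
        for (int i = 0; i < records; i++) {
            counts[partitioner.nextPartition()]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Equal weights behave like plain round-robin: prints [4, 3, 3]
        System.out.println(Arrays.toString(distribute(10, new int[]{1, 1, 1})));
        // A node assumed to be three times as powerful gets three times the records: prints [6, 2, 2]
        System.out.println(Arrays.toString(distribute(10, new int[]{3, 1, 1})));
    }
}
```

Plain round-robin is the special case in which every weight is 1, which is exactly why it cannot favor a stronger node.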
5. Custom partitioning
For reference, these are the partition methods that Flink defines:
public static enum PartitionMethod {
    REBALANCE, // round-robin partitioning
    HASH,      // hash partitioning
    RANGE,     // range partitioning
    CUSTOM;    // custom partitioning
}
Here is the definition of the Partitioner abstract class that MapReduce uses for custom partitioning:
/**
 * Partitions the key space.
 *
 * <p><code>Partitioner</code> controls the partitioning of the keys of the
 * intermediate map-outputs. The key (or a subset of the key) is used to derive
 * the partition, typically by a hash function. The total number of partitions
 * is the same as the number of reduce tasks for the job. Hence this controls
 * which of the <code>m</code> reduce tasks the intermediate key (and hence the
 * record) is sent for reduction.</p>
 *
 * Note: If you require your Partitioner class to obtain the Job's configuration
 * object, implement the {@link Configurable} interface.
 *
 * @see Reducer
 */
@InterfaceAudience.Public
@InterfaceStability.Stable
public abstract class Partitioner<KEY, VALUE> {
    /**
     * Get the partition number for a given key (hence record) given the total
     * number of partitions i.e. number of reduce-tasks for the job.
     *
     * <p>Typically a hash function on a all or a subset of the key.</p>
     *
     * @param key the key to be partioned.
     * @param value the entry value.
     * @param numPartitions the total number of partitions.
     * @return the partition number for the <code>key</code>.
     */
    public abstract int getPartition(KEY key, VALUE value, int numPartitions);
}
Here is the definition of the Partitioner interface that Flink uses for custom partitioning:
/**
 * Function to implement a custom partition assignment for keys.
 *
 * @param <K> The type of the key to be partitioned.
 */
@Public
@FunctionalInterface
public interface Partitioner<K> extends java.io.Serializable, Function {
    /**
     * Computes the partition for the given key.
     *
     * @param key The key.
     * @param numPartitions The number of partitions to partition into.
     * @return The partition index.
     */
    int partition(K key, int numPartitions);
}
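The two have one thing in common:
you hand an element to the partitioner, and the logic of a single method decides which partition that element is sent to.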
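Because the whole contract is a single method, a partitioner can be written as a lambda. The sketch below uses a hypothetical local interface, KeyPartitioner, that only mirrors the shape of the two definitions quoted above; it is neither the Hadoop class nor the Flink interface, and the length-based rule is just a made-up custom rule.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch: a partitioner is just a function (key, numPartitions) -> partition index. */
final class PartitionerShapeSketch {

    /** Hypothetical local interface with the same shape as the Partitioner types quoted above. */
    @FunctionalInterface
    interface KeyPartitioner<K> {
        int partition(K key, int numPartitions);
    }

    /** Routes every record to the partition chosen by the partitioner. */
    static <K> List<List<K>> route(List<K> records, KeyPartitioner<K> partitioner, int numPartitions) {
        List<List<K>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (K record : records) {
            int index = partitioner.partition(record, numPartitions);
            if (index < 0 || index >= numPartitions) {
                throw new IllegalStateException("partition index out of range: " + index);
            }
            partitions.get(index).add(record);
        }
        return partitions;
    }

    /** Demo with a made-up custom rule: partition by string length. */
    static List<List<String>> demo() {
        KeyPartitioner<String> byLength = (key, n) -> key.length() % n;
        return route(Arrays.asList("a", "bb", "ccc", "dd"), byLength, 2);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [[bb, dd], [a, ccc]]
    }
}
```

The range check in route is a design choice of this sketch: a custom partitioner is user code, so an out-of-range index is reported right away instead of surfacing later as an IndexOutOfBoundsException.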
6. Test code
package com.aura.funny.partition;

import java.util.*;

/**
 * Author: 马中华
 * Date: 2019/6/27 14:02
 * Description: test code for data partitioning
 */
public class PartitionTest02 {
    public static void main(String[] args) {
        // The data set to partition
        List<String> data = Arrays.asList(
                "a", "b", "c", "d", "e", "f", "g",
                "h", "i", "j", "k", "l", "m", "n",
                "o", "p", "q", "r", "s", "t",
                "u", "v", "w", "x", "y", "z",
                "a", "a", "a", "a",
                "a", "a", "a", "a",
                "a", "a", "a", "a",
                "b", "b", "b", "b");

        // Number of partitions
        int partitionNumber = 5;
        // Method 1: hash partitioning
        System.out.println("\n---------Method 1: Hash partitioning------------");
        List<List<String>> partitionList1 = partitionData(data, new Partitioner() {
            @Override
            public int getPartition(String item, int numPartitions) {
                // Mask off the sign bit so that the index is never negative
                return (item.hashCode() & Integer.MAX_VALUE) % numPartitions;
            }
        }, partitionNumber);
        printPartitionedData(partitionList1);
        // Method 2: random partitioning
        System.out.println("\n---------Method 2: Random partitioning------------");
        List<List<String>> partitionList2 = partitionData(data, new Partitioner() {
            private final Random random = new Random();

            @Override
            public int getPartition(String item, int numPartitions) {
                // The item itself is ignored; any partition is equally likely
                return random.nextInt(numPartitions);
            }
        }, partitionNumber);
        printPartitionedData(partitionList2, false);
        // Method 3: round-robin partitioning
        System.out.println("\n---------Method 3: Round-robin partitioning------------");
        List<List<String>> partitionList3 = partitionData(data, new Partitioner() {
            private int counter = 0;

            @Override
            public int getPartition(String item, int numPartitions) {
                // Hand out partition indexes 0, 1, ..., numPartitions - 1 and then start over
                int partitionIndex = counter;
                counter = (counter + 1) % numPartitions;
                return partitionIndex;
            }
        }, partitionNumber);
        printPartitionedData(partitionList3, false);
        // Method 4: range partitioning
        System.out.println("\n---------Method 4: Range partitioning------------");
        List<List<String>> partitionList4 = partitionData(data, new Partitioner() {
            // Sorted distinct keys, computed once; their positions define the range boundaries
            private final List<String> distinctItemList = new ArrayList<String>(new TreeSet<String>(data));

            @Override
            public int getPartition(String item, int numPartitions) {
                // Each partition covers `step` consecutive distinct keys
                int step = distinctItemList.size() / numPartitions + 1;
                int index = distinctItemList.indexOf(item);
                return index / step;
            }
        }, partitionNumber);
        printPartitionedData(partitionList4);
        // Method 5: custom partitioning
        System.out.println("\n---------Method 5: Custom partitioning------------");
        List<List<String>> partitionList5 = partitionData(data, new Partitioner() {
            @Override
            public int getPartition(String item, int numPartitions) {
                // Put your own partitioning logic here to decide which partition
                // the item goes to. This placeholder sends everything to partition 0.
                return 0;
            }
        }, partitionNumber);
        printPartitionedData(partitionList5, false);
    }
    /**
     * Partitions the data with the given partitioner.
     */
    public static List<List<String>> partitionData(List<String> data, Partitioner partitioner, int numPartitions) {
        List<List<String>> partitionList = initPartitionContext(numPartitions);
        for (String item : data) {
            // Ask the partitioner which partition this element belongs to
            int partitionNum = partitioner.getPartition(item, numPartitions);
            partitionList.get(partitionNum).add(item);
        }
        return partitionList;
    }
    /**
     * Initializes the containers that hold the data of each partition.
     */
    public static List<List<String>> initPartitionContext(int numPartitions) {
        List<List<String>> partitionList = new ArrayList<List<String>>();
        // Create one list per partition up front
        for (int i = 0; i < numPartitions; i++) {
            partitionList.add(new ArrayList<String>());
        }
        return partitionList;
    }
    /**
     * Prints the partitioned data set, sorting the data inside each partition.
     */
    public static void printPartitionedData(List<List<String>> partitionResult) {
        printPartitionedData(partitionResult, true);
    }
    /**
     * Prints the partitioned data set, optionally sorting the data inside each partition.
     */
    public static void printPartitionedData(List<List<String>> partitionResult, boolean sort) {
        if (sort) {
            // Sort each partition so that the output is easier to read
            for (List<String> partition : partitionResult) {
                // Check for null first, before calling any method on the list
                if (partition != null && !partition.isEmpty()) {
                    Collections.sort(partition);
                }
            }
        }
        // Print the data of each partition on its own line
        for (List<String> partition : partitionResult) {
            if (partition != null && !partition.isEmpty()) {
                System.out.println(String.join(",", partition));
            } else {
                System.out.println("This partition is empty");
            }
        }
    }
}
/**
 * An interface that defines the partitioning logic.
 */
interface Partitioner {
    int getPartition(String item, int numPartitions);
}
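You can copy this code and run it as-is to see the results.
7. Results
Here is the output from one run of the code (the random partitioning section will differ from run to run):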
---------Method 1: Hash partitioning------------
d,i,n,s,x
e,j,o,t,y
a,a,a,a,a,a,a,a,a,a,a,a,a,f,k,p,u,z
b,b,b,b,b,g,l,q,v
c,h,m,r,w
---------Method 2: Random partitioning------------
b,c,f,h,j,o,u,y,a,a,a,b
d,v,a,a
a,e,l,n,r,t,w,x,z,a,a,a,b,b
k,p,q,s,a,b
g,i,m,a,a,a
---------Method 3: Round-robin partitioning------------
a,f,k,p,u,z,a,a,b
b,g,l,q,v,a,a,a,b
c,h,m,r,w,a,a,a
d,i,n,s,x,a,a,b
e,j,o,t,y,a,a,b
---------Method 4: Range partitioning------------
a,a,a,a,a,a,a,a,a,a,a,a,a,b,b,b,b,b,c,d,e,f
g,h,i,j,k,l
m,n,o,p,q,r
s,t,u,v,w,x
y,z
---------Method 5: Custom partitioning------------
a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,a,a,a,a,a,a,a,a,a,a,a,a,b,b,b,b
This partition is empty
This partition is empty
This partition is empty
This partition is empty
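To compare the five results at a glance, the partition sizes can be reduced to a single number. The sketch below computes the size of the largest partition divided by the average partition size; this metric and the class name SkewReportSketch are assumptions of this example, while the sizes are counted from the output listed above.

```java
import java.util.Locale;

/** Sketch: summarizing how uneven a partitioning result is. */
final class SkewReportSketch {

    /** Largest partition size divided by the mean size; 1.0 means perfectly even. */
    static double skewFactor(int[] sizes) {
        int max = 0;
        long total = 0;
        for (int size : sizes) {
            max = Math.max(max, size);
            total += size;
        }
        if (total == 0) {
            return 0.0; // no data at all, so there is nothing to be skewed
        }
        double mean = (double) total / sizes.length;
        return max / mean;
    }

    private static void report(String name, int[] sizes) {
        System.out.println(String.format(Locale.ROOT, "%-12s %.2f", name, skewFactor(sizes)));
    }

    public static void main(String[] args) {
        // Partition sizes counted from the output listed above (42 records, 5 partitions)
        report("hash", new int[]{5, 5, 18, 9, 5});
        report("random", new int[]{12, 4, 14, 6, 6});
        report("round-robin", new int[]{9, 9, 8, 8, 8});
        report("range", new int[]{22, 6, 6, 6, 2});
        report("custom", new int[]{42, 0, 0, 0, 0});
    }
}
```

Round-robin stays close to 1.0, hash and range end up above 2 because of the repeated "a" and "b" records, and the placeholder custom partitioner reaches 5.0 because it sends everything to partition 0.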