Partitioning-based clustering methods mainly include K-means and K-medoids. This article summarizes the K-means algorithm and its extensions, together with a complete program.
1. The K-Means Algorithm
The algorithm works as follows: choose K initial cluster centers, assign every object to its nearest center, recompute each cluster's mean as its new center, and repeat until the assignments stabilize.
For example: given a dataset D, arbitrarily take K = 2 objects as the initial cluster centers. Compute the distance from each object to the K centers (e.g., the Euclidean distance) and assign each object to its nearest center. Then update the cluster means, i.e., recompute the mean of the objects in each cluster as that cluster's center. Because the centers have moved, reassign every object to its nearest center and repartition the clusters, continuing until the clusters stabilize.
Typically, the squared-error criterion is adopted as the convergence function, defined as

E = sum_{i=1}^{K} sum_{p in C_i} |p - m_i|^2

where m_i is the mean of cluster C_i.
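As a quick numeric check of this criterion, the following sketch computes E for two tiny one-dimensional clusters (the article's code is MATLAB; Python/NumPy is used here purely for illustration):

```python
import numpy as np

# Two tiny 1-D clusters: C1 = {1, 3}, C2 = {10, 14}
c1 = np.array([1.0, 3.0])
c2 = np.array([10.0, 14.0])

m1, m2 = c1.mean(), c2.mean()  # cluster means m_1 = 2.0, m_2 = 12.0

# Squared-error criterion: sum of squared distances to each cluster's mean
E = np.sum((c1 - m1) ** 2) + np.sum((c2 - m2) ** 2)
print(E)  # (1-2)^2 + (3-2)^2 + (10-12)^2 + (14-12)^2 = 1 + 1 + 4 + 4 = 10.0
```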
2. K-Means Example Program
The following routine is written in MATLAB, adapted from open-source code on GitHub, and has been tested; it is provided for reference. The full source is available at:
Routine link: https://github.com/ZhouKanglei/K_means
The main.m program is as follows:
clear all;
close all;
%=======================================
% Initialising Data
%=======================================
file = '../Data/Aggregation.mat';
k = 7;
iterations = 20;
%=======================================
% Clustering
%=======================================
[clustered_data, error] = k_means(file, k, iterations);
%=======================================
% Plot
%=======================================
plot_data(clustered_data, error, k);
The k_means.m program is as follows:
function [clustered_data, error] = k_means(file, k, iterations)
%=======================================
% Initialising Data
%=======================================
load(file);
rows = size(data, 1);
cols = size(data, 2);
data = data(1:rows, 1:cols-1); % drop the class-label column
clusters = randi([1 k], rows, 1); % random initial cluster assignment
clustered_data = [data clusters];
mean_matrix = zeros(k, cols-1);
error = [];
%=======================================
% Calculating Mean
%=======================================
for i = 1:k
index = clustered_data(:, end) == i;
indexed_data = clustered_data(index, 1:end-1);
mean_matrix(i, :) = mean(indexed_data); % NaN row if cluster i is empty
end
%=======================================
% Computing Error
%=======================================
error = get_error(clustered_data, mean_matrix);
fprintf('After initialization: error = %.4f \n', error);
%=======================================
% Starting Iterations
%=======================================
for p = 1:iterations
for q = 1:rows
%=======================================
% Deciding which Cluster data belongs
%=======================================
dist = get_euclidean(data(q, :), mean_matrix);
[~, min_index] = min(dist); % index of the nearest center
clusters(q) = min_index;
end
clustered_data = [data, clusters];
%=======================================
% Calculating Mean
%=======================================
for i = 1:k
index = clustered_data(:, end) == i;
indexed_data = clustered_data(index, 1:end-1);
mean_matrix(i, :) = mean(indexed_data); % NaN row if cluster i is empty
end
%=======================================
% Computing Error
%=======================================
error = [error get_error(clustered_data, mean_matrix)];
fprintf('After iteration %d: error = %.4f \n', p, error(end));
end
end
function [error] = get_error(data, mean_matrix)
%=======================================
% Computing Error
%=======================================
error = 0;
for j = 1:size(data, 1)
c = data(j, end); % cluster label of object j
% Accumulate the Euclidean distance of object j to its cluster center
dist = get_euclidean(data(j, 1:end-1), mean_matrix(c, :));
error = error + dist;
end
end
function [distance_matrix] = get_euclidean(data, mean_matrix)
%=======================================
% Calculating Euclidean
%=======================================
% Distance from the 1-by-d point to each of the k centers
% (implicit expansion, available since MATLAB R2016b)
distance_matrix = sqrt(sum((data - mean_matrix).^2, 2));
end
The plot_data.m program is as follows:
function [] = plot_data(clustered_data, error, k)
%=======================================
% Plot 2D data
%=======================================
figure(1);
dim = size(clustered_data, 2) - 1;
color = ['r', 'g', 'b', 'y', 'm', 'k', 'c']; % one color per cluster (supports k <= 7)
if dim == 2
for i_for = 1:k
idx = clustered_data(:, end) == i_for;
plot(clustered_data(idx, 1), clustered_data(idx, 2),...
'wo', 'MarkerSize', 10, 'MarkerFaceColor', color(i_for), 'LineWidth', 1.5);
hold on;
end
hold off;
title('Plot');
xlabel('\it x');
ylabel('\it y');
set(gca, 'FontSize', 16, 'FontName', 'Times', 'LineWidth', 1.5);
end
%=======================================
% Plot 2D error
%=======================================
figure(2);
plot(error,...
'b-s', 'MarkerSize', 10, 'MarkerFaceColor', 'b', 'LineWidth', 1.5);
xlim([1 length(error)]);
title('Error');
xlabel('Iteration No.');
ylabel('Error');
set(gca, 'FontSize', 16, 'FontName', 'Times', 'LineWidth', 1.5);
end
3. Results of the Example Program
The UCI Aggregation dataset is used as the experimental data. All experiments were run in MATLAB R2016b on a PC equipped with an Intel(R) Core(TM) i7-7500U 2.90 GHz CPU and 8 GB of RAM.
Program output:
>> main
After initialization: error = 9489.2095
After iteration 1: error = 3593.0203
After iteration 2: error = 3365.6874
After iteration 3: error = 3311.8756
After iteration 4: error = 3308.9829
After iteration 5: error = 3308.9829
After iteration 6: error = 3308.9829
After iteration 7: error = 3308.9829
After iteration 8: error = 3308.9829
After iteration 9: error = 3308.9829
After iteration 10: error = 3308.9829
After iteration 11: error = 3308.9829
After iteration 12: error = 3308.9829
After iteration 13: error = 3308.9829
After iteration 14: error = 3308.9829
After iteration 15: error = 3308.9829
After iteration 16: error = 3308.9829
After iteration 17: error = 3308.9829
After iteration 18: error = 3308.9829
After iteration 19: error = 3308.9829
After iteration 20: error = 3308.9829
Scatter plot of the clustered two-dimensional data:
Error curve over the iterations:
4. Complexity Analysis of K-Means
Let t be the number of iterations. Each iteration computes the distance between each of the n objects and each of the k centers, so the time complexity of K-means is O(nkt).
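As a sanity check on this count, assuming the run above (k = 7, t = 20, and n = 788 points for the Aggregation dataset; the point count is an assumption stated here, not printed by the program):

```python
# Each iteration computes the distance from every one of the n objects
# to each of the k centers, so the total number of distance evaluations
# after t iterations is n * k * t.
n, k, t = 788, 7, 20
distance_evals = n * k * t
print(distance_evals)  # 110320
```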
5. Extensions of K-Means
K-means has many variants, which differ mainly in:
- how the initial k means are selected
- how dissimilarity is computed
- the strategy used to compute cluster means
In practice, a hierarchical agglomerative algorithm is often applied first to decide the number of clusters and produce an initial clustering, after which iterative relocation refines the result.
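A minimal sketch of this two-stage idea, with a from-scratch centroid-linkage agglomeration supplying the initial means (not the article's code; Python is used only for illustration):

```python
import numpy as np

def agglomerative_init(X, k):
    """Merge the closest pair of clusters (centroid linkage) until k
    clusters remain; their means seed the iterative-relocation phase."""
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > k:
        means = [X[c].mean(axis=0) for c in clusters]
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = np.linalg.norm(means[a] - means[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]
    return np.array([X[c].mean(axis=0) for c in clusters])

# Two well-separated 1-D groups; the surviving centroids land near 2 and 10
X = np.array([[1.0], [2.0], [3.0], [9.0], [10.0], [11.0]])
print(agglomerative_init(X, 2))
```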
The k-modes method is a variant of K-means that extends the K-means paradigm to nominal data by replacing cluster means with cluster modes. It uses a new dissimilarity measure for nominal objects and updates cluster modes with a frequency-based method.
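A sketch of these two ingredients on a toy nominal cluster (the attribute values are hypothetical and only illustrative):

```python
from collections import Counter

# k-modes dissimilarity for nominal data: the count of mismatched attributes
def mismatch(x, y):
    return sum(a != b for a, b in zip(x, y))

cluster = [('red', 'S'), ('red', 'M'), ('blue', 'M')]

# The per-attribute mode (most frequent value) replaces the mean as the center
mode = tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))
print(mode)                            # ('red', 'M')
print(mismatch(('blue', 'S'), mode))   # 2
```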
The k-prototypes method integrates k-means and k-modes to cluster data with mixed numeric and nominal attributes.
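A sketch of the combined dissimilarity, with a hypothetical weight gamma balancing the numeric and nominal parts:

```python
# k-prototypes dissimilarity: squared Euclidean distance on the numeric
# attributes plus gamma times the mismatch count on the nominal attributes.
# The prototype holds the mean of the numeric part and the mode of the
# nominal part; gamma = 0.5 below is an arbitrary illustrative choice.
def proto_dist(x_num, x_cat, p_num, p_cat, gamma):
    num_part = sum((a - b) ** 2 for a, b in zip(x_num, p_num))
    cat_part = sum(a != b for a, b in zip(x_cat, p_cat))
    return num_part + gamma * cat_part

d = proto_dist((1.0, 2.0), ('red',), (0.0, 2.0), ('blue',), gamma=0.5)
print(d)  # (1-0)^2 + (2-2)^2 + 0.5 * 1 = 1.5
```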
References
- Jiawei Han, Micheline Kamber & Jian Pei. Data Mining: Concepts and Techniques (Third Edition).
- Fan Ming, Meng Xiaofeng (trans.). Data Mining: Concepts and Techniques (Third Edition), Chinese translation.
- Reference code: https://github.com/jayshah19949596/Machine-Learning-Models/tree/master/K-Mean%20Clustering.