前言
kmeans是最简单的聚类算法之一,但是运用十分广泛。最近在工作中也经常遇到这个算法。kmeans一般在数据分析前期使用,选取适当的k,将数据分类后,然后分类研究不同聚类下数据的特点。
本文记录学习kmeans算法相关的内容,包括算法原理,收敛性,效果评估聚,最后带上R语言的例子,作为备忘。
算法原理
kmeans的计算方法如下:
1 随机选取k个中心点
2 遍历所有数据,将每个数据划分到最近的中心点中
3 计算每个聚类的平均值,并作为新的中心点
4 重复2-3,直到这k个中线点不再变化(收敛了),或执行了足够多的迭代
时间复杂度:O(I*n*k*m)
空间复杂度:O(n*m)
其中m为每个元素字段个数,n为数据量,I为跌打个数。一般I,k,m均可认为是常量,所以时间和空间复杂度可以简化为O(n),即线性的。
算法收敛
从kmeans的算法可以发现,SSE其实是一个严格的坐标下降(Coordinate Decendet)过程。设目标函数SSE如下:
SSE(
,
,…,
) =
采用欧式距离作为变量之间的聚类函数。每次朝一个变量
的方向找到最优解,也就是求偏倒数,然后等于0,可得
c_i=
其中m是c_i所在的簇的元素的个数
也就是当前聚类的均值就是当前方向的最优解(最小值),这与kmeans的每一次迭代过程一样。所以,这样保证SSE每一次迭代时,都会减小,最终使SSE收敛。
由于SSE是一个非凸函数(non-convex function),所以SSE不能保证找到全局最优解,只能确保局部最优解。但是可以重复执行几次kmeans,选取SSE最小的一次作为最终的聚类结果。
0-1规格化
由于数据之间量纲的不相同,不方便比较。举个例子,比如游戏用户的在线时长和活跃天数,前者单位是秒,数值一般都是几千,而后者单位是天,数值一般在个位或十位,如果用这两个变量来表征用户的活跃情况,显然活跃天数的作用基本上可以忽略。所以,需要将数据统一放到0~1的范围,将其转化为无量纲的纯数值,便于不同单位或量级的指标能够进行比较和加权。具体计算方法如下:
其中
属于A。
轮廓系数
轮廓系数(Silhouette Coefficient)结合了聚类的凝聚度(Cohesion)和分离度(Separation),用于评估聚类的效果。该值处于-1~1之间,值越大,表示聚类效果越好。具体计算方法如下:
- 对于第i个元素x_i,计算x_i与其同一个簇内的所有其他元素距离的平均值,记作a_i,用于量化簇内的凝聚度。
- 选取x_i外的一个簇b,计算x_i与b中所有点的平均距离,遍历所有其他簇,找到最近的这个平均距离,记作b_i,用于量化簇之间分离度。
- 对于元素x_i,轮廓系数s_i = (b_i – a_i)/max(a_i,b_i)
- 计算所有x的轮廓系数,求出平均值即为当前聚类的整体轮廓系数
从上面的公式,不难发现若s_i小于0,说明x_i与其簇内元素的平均距离小于最近的其他簇,表示聚类效果不好。如果a_i趋于0,或者b_i足够大,那么s_i趋近与1,说明聚类效果比较好。
K值选取
在实际应用中,由于Kmean一般作为数据预处理,或者用于辅助分类贴标签。所以k一般不会设置很大。可以通过枚举,令k从2到一个固定值如10,在每个k值上重复运行数次kmeans(避免局部最优解),并计算当前k的平均轮廓系数,最后选取轮廓系数最大的值对应的k作为最终的集群数目。
%%%%%%%%%%%Add cell breathing
%The UAV coverage radius is R_m = 100m
%The area is 100*100m
clear;clc;
lengthOfArea =400;
R_m = 100;
n_users = 50;
users_dist = rand(n_users,2) * lengthOfArea; % The coordinate of every user
min_users_inaUAV = 5;
max_users_in_aUAV = 10;
power_Trans = 40;
R_cov_all = 0;
iteration = 10;
rate = zeros(iteration,1);
SINR_threshold = -5;% The threshold is dm
Rate2_low = [];
powers_collect = [];
%plot(users_dist(:,1),users_dist(:,2)3'r*');
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%===================Kmeans===========
stop=0;
k_index_old = 0;
powers_all_old = 0;
%%%%%%% Kmeans tryting
for k_cluster = 3:40;
change = 1;
while change == 1;
height_all = zeros(k_cluster,1);
[k_index,k_center] = kmeans(users_dist,k_cluster);
if k_index == k_index_old
change = 0;
end
k_index_old = k_index;
%===================================
%%%%%%%%%%%%%%%%%%%
%----------In order to avoid too much overlapped area, I modify Kmeans by
%-----------------selecting the centroid from the middle of farthest users in
%-----------------one cluster
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%=========find the farthest distance in one cluster
max_k_all = [];
k_means_center = [];
rate = [];
users_data=[];
for k_k = 1:k_cluster
max_all = [];
users_in_acluster = zeros(size(users_dist(k_index==k_k,:),1),2);
users_in_acluster = users_dist(k_index==k_k,:);
for user_index = 1:size(users_in_acluster,1)
[max_dista, max_index]= max((users_in_acluster(user_index,1) - users_in_acluster(:,1)).^2+(users_in_acluster(user_index,2) - users_in_acluster(:,2)).^2);
max_all = [max_all;[sqrt(max_dista), user_index, max_index]];
end
[max_k,max_index]= max(max_all(:,1));
max_index_in_acluster = max_all(max_index,2:3);
coor_center = sum(users_in_acluster(max_index_in_acluster,:),1)/2;
radiusOfCenter = norm(users_in_acluster(max_index_in_acluster(1),:)-users_in_acluster(max_index_in_acluster(2),:))/2;
k_means_center = [k_means_center; coor_center radiusOfCenter];%store the locations and radius of each cluster
end
%---------------------------------The locations of clusters have been fixed----------------------
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%------------------The overlapped area---------------------------------
all_overlap_area = 0;
all_area = 0;
for k = 1:length(k_means_center)-1
for k_left = k+1:length(k_means_center)
radius_two_cluster = sqrt(sum((k_means_center(k,1:2)-k_means_center(k_left,1:2)).^2));
if radius_two_cluster < (k_means_center(k,3)+k_means_center(k_left,3))
r_1 = k_means_center(k,3); r_2 = k_means_center(k_left,3); D = radius_two_cluster;
theta_1 = acos((r_1^2+D^2-r_2^2)/(2*r_1*D));
theta_2 = acos((r_2^2+D^2-r_1^2)/(2*r_2*D));
overlap_area = theta_1*r_1^2 + theta_2*r_2^2-r_1*D*sin(theta_1);
all_overlap_area = all_overlap_area+overlap_area;
end
end
end
%---------------------------------------------------------------------
%r_distance = zeros(length(users_dist),length(k_means_center));
users_sum = 0;
%%%%%%%%%%%%%%%%%%%%%%
%--------Cell breathing to constrain the number of users.
users_overlap = 0;
for k = 1:k_cluster
r_distance = 0;
users_in_cell = users_dist(k_index==k,:);
for k_users = 1: size(users_in_cell,1)
r_distance = norm(users_in_cell(k_users,:)-k_means_center(k,1:2));
if r_distance>k_means_center(k,3)
k_means_center(k,3) = r_distance;
end
end
end
%--------------cell breathing
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%计算所有重叠部分的用户数
for k = 1:k_cluster
users_in_cell = users_dist(k_index==k,:);
for k_users = 1: size(users_in_cell,1)
for k_cluster_else = 1:k_cluster
if k_cluster_else~=k && norm(users_in_cell(k_users,:)-k_means_center(k_cluster_else,1:2))<k_means_center(k_cluster_else,3)
users_overlap = users_overlap+1;
end
end
end
end
% if length(users_in_cluster)>= max_users_in_aUAV
% [~,dis_label] = sort(r_distance(:,k));
% k_means_center(k,3) = r_distance(dis_label(max_users_in_aUAV),k);
% end
% users_sum = users_sum + sum(r_distance(:,k)<=k_means_center(k,3));
%--------------This function has been closed.
%%%%%%%%%%%%%%%%%%%%%%
%--------------For Power Allocation
index_users = users_data(:,1);
for cluster_for_p = 1:k_cluster
users_PL_cluster_p = users_data(index_users==cluster_for_p,2:end);
SINR_cluster = [];
for user_in_c_p = 1:size(users_PL_cluster_p,1)
SINR_interf = 0;
SINR_signal = powers_all(cluster_for_p) - users_PL_cluster_p(user_in_c_p,1);
count = 1;
for user_intf = 1:k_cluster
if user_intf~=cluster_for_p
count = count+1;
SINR_interf = SINR_interf + 10.^((powers_all(cluster_for_p)-users_PL_cluster_p(user_in_c_p,count))/10);
end
end
SINR_user = SINR_signal- 10*log10(SINR_interf);
SINR_cluster = [SINR_cluster; SINR_user 10*log10(SINR_interf)];
end
[min_SINR, min_user] = min(SINR_cluster(:,1));
powers_all(cluster_for_p) = SINR_threshold + SINR_cluster(min_user,2) + users_PL_cluster_p(min_user,1);
end
powers_collect = [powers_collect; powers_all'];
SINR_iter = powerAllocation(k_cluster, users_dist,k_index,powers_all,k_means_center,height_all);
end
%--------------End
%%%%%%%%%%%%%%%%
break
end
end
end %While end
if stop == 1
break
end
end
%rate(iter) = rate(iter)/k_cluster;
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%---------------Calculate the new SINR
Rate2_low = powerAllocation(k_cluster, users_dist,k_index,powers_collect(size(powers_collect,1)-1,:),k_means_center,height_all);
figure; hold on; plot(rate,'r*');plot(Rate2_low,'ko');
figure; hold on;
for power_iter = 1:size(powers_collect)
plot(powers_collect(power_iter,:),'.','MarkerSize',18)
end