前言

kmeans是最简单的聚类算法之一,但是运用十分广泛。最近在工作中也经常遇到这个算法。kmeans一般在数据分析前期使用,选取适当的k,将数据分类后,然后分类研究不同聚类下数据的特点。

本文记录学习kmeans算法相关的内容,包括算法原理,收敛性,效果评估聚,最后带上R语言的例子,作为备忘。

算法原理

kmeans的计算方法如下:

1 随机选取k个中心点

2 遍历所有数据,将每个数据划分到最近的中心点中

3 计算每个聚类的平均值,并作为新的中心点

4 重复2-3,直到这k个中线点不再变化(收敛了),或执行了足够多的迭代

时间复杂度:O(I*n*k*m)

空间复杂度:O(n*m)

其中m为每个元素字段个数,n为数据量,I为跌打个数。一般I,k,m均可认为是常量,所以时间和空间复杂度可以简化为O(n),即线性的。

算法收敛

从kmeans的算法可以发现,SSE其实是一个严格的坐标下降(Coordinate Decendet)过程。设目标函数SSE如下:

SSE(

【无人机布局优化】基于k-mean聚类的无人机布局优化matlab源码_布局优化

,

【无人机布局优化】基于k-mean聚类的无人机布局优化matlab源码_matlab_02

,…,

【无人机布局优化】基于k-mean聚类的无人机布局优化matlab源码_布局优化_03

) =

【无人机布局优化】基于k-mean聚类的无人机布局优化matlab源码_布局优化_04

采用欧式距离作为变量之间的聚类函数。每次朝一个变量

 

的方向找到最优解,也就是求偏倒数,然后等于0,可得

c_i=

【无人机布局优化】基于k-mean聚类的无人机布局优化matlab源码_matlab_05

其中m是c_i所在的簇的元素的个数

也就是当前聚类的均值就是当前方向的最优解(最小值),这与kmeans的每一次迭代过程一样。所以,这样保证SSE每一次迭代时,都会减小,最终使SSE收敛。

由于SSE是一个非凸函数(non-convex function),所以SSE不能保证找到全局最优解,只能确保局部最优解。但是可以重复执行几次kmeans,选取SSE最小的一次作为最终的聚类结果。

0-1规格化

由于数据之间量纲的不相同,不方便比较。举个例子,比如游戏用户的在线时长和活跃天数,前者单位是秒,数值一般都是几千,而后者单位是天,数值一般在个位或十位,如果用这两个变量来表征用户的活跃情况,显然活跃天数的作用基本上可以忽略。所以,需要将数据统一放到0~1的范围,将其转化为无量纲的纯数值,便于不同单位或量级的指标能够进行比较和加权。具体计算方法如下:

其中【无人机布局优化】基于k-mean聚类的无人机布局优化matlab源码_matlab_06

【无人机布局优化】基于k-mean聚类的无人机布局优化matlab源码_布局优化_07

属于A。

轮廓系数

轮廓系数(Silhouette Coefficient)结合了聚类的凝聚度(Cohesion)和分离度(Separation),用于评估聚类的效果。该值处于-1~1之间,值越大,表示聚类效果越好。具体计算方法如下:

  1. 对于第i个元素x_i,计算x_i与其同一个簇内的所有其他元素距离的平均值,记作a_i,用于量化簇内的凝聚度。
  2. 选取x_i外的一个簇b,计算x_i与b中所有点的平均距离,遍历所有其他簇,找到最近的这个平均距离,记作b_i,用于量化簇之间分离度。
  3. 对于元素x_i,轮廓系数s_i = (b_i – a_i)/max(a_i,b_i)
  4. 计算所有x的轮廓系数,求出平均值即为当前聚类的整体轮廓系数

从上面的公式,不难发现若s_i小于0,说明x_i与其簇内元素的平均距离小于最近的其他簇,表示聚类效果不好。如果a_i趋于0,或者b_i足够大,那么s_i趋近与1,说明聚类效果比较好。

K值选取

在实际应用中,由于Kmean一般作为数据预处理,或者用于辅助分类贴标签。所以k一般不会设置很大。可以通过枚举,令k从2到一个固定值如10,在每个k值上重复运行数次kmeans(避免局部最优解),并计算当前k的平均轮廓系数,最后选取轮廓系数最大的值对应的k作为最终的集群数目。

 
%%%%%%%%%%%Add cell breathing

%The UAV coverage radius is R_m = 100m

%The area is 100*100m

clear;clc;

lengthOfArea =400;

R_m = 100;

n_users = 50;

users_dist = rand(n_users,2) * lengthOfArea; % The coordinate of every user

min_users_inaUAV = 5;

max_users_in_aUAV = 10;

power_Trans = 40;

R_cov_all = 0;

iteration = 10;

rate = zeros(iteration,1);

SINR_threshold = -5;% The threshold is dm

Rate2_low = [];

powers_collect = [];

%plot(users_dist(:,1),users_dist(:,2)3'r*');

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%===================Kmeans===========

stop=0;

k_index_old = 0;

powers_all_old = 0;

%%%%%%% Kmeans tryting

for k_cluster = 3:40;

change = 1;

while change == 1;

height_all = zeros(k_cluster,1);

[k_index,k_center] = kmeans(users_dist,k_cluster);

if k_index == k_index_old

change = 0;

end

k_index_old = k_index;





%===================================

%%%%%%%%%%%%%%%%%%%

%----------In order to avoid too much overlapped area, I modify Kmeans by

%-----------------selecting the centroid from the middle of farthest users in

%-----------------one cluster

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%=========find the farthest distance in one cluster

max_k_all = [];

k_means_center = [];

rate = [];

users_data=[];

for k_k = 1:k_cluster

max_all = [];

users_in_acluster = zeros(size(users_dist(k_index==k_k,:),1),2);

users_in_acluster = users_dist(k_index==k_k,:);

for user_index = 1:size(users_in_acluster,1)

[max_dista, max_index]= max((users_in_acluster(user_index,1) - users_in_acluster(:,1)).^2+(users_in_acluster(user_index,2) - users_in_acluster(:,2)).^2);

max_all = [max_all;[sqrt(max_dista), user_index, max_index]];

end

[max_k,max_index]= max(max_all(:,1));

max_index_in_acluster = max_all(max_index,2:3);

coor_center = sum(users_in_acluster(max_index_in_acluster,:),1)/2;

radiusOfCenter = norm(users_in_acluster(max_index_in_acluster(1),:)-users_in_acluster(max_index_in_acluster(2),:))/2;

k_means_center = [k_means_center; coor_center radiusOfCenter];%store the locations and radius of each cluster

end

%---------------------------------The locations of clusters have been fixed----------------------

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%------------------The overlapped area---------------------------------

all_overlap_area = 0;

all_area = 0;

for k = 1:length(k_means_center)-1

for k_left = k+1:length(k_means_center)

radius_two_cluster = sqrt(sum((k_means_center(k,1:2)-k_means_center(k_left,1:2)).^2));

if radius_two_cluster < (k_means_center(k,3)+k_means_center(k_left,3))

r_1 = k_means_center(k,3); r_2 = k_means_center(k_left,3); D = radius_two_cluster;

theta_1 = acos((r_1^2+D^2-r_2^2)/(2*r_1*D));

theta_2 = acos((r_2^2+D^2-r_1^2)/(2*r_2*D));

overlap_area = theta_1*r_1^2 + theta_2*r_2^2-r_1*D*sin(theta_1);

all_overlap_area = all_overlap_area+overlap_area;

end

end

end

%---------------------------------------------------------------------

%r_distance = zeros(length(users_dist),length(k_means_center));

users_sum = 0;

%%%%%%%%%%%%%%%%%%%%%%

%--------Cell breathing to constrain the number of users.

users_overlap = 0;

for k = 1:k_cluster

r_distance = 0;

users_in_cell = users_dist(k_index==k,:);

for k_users = 1: size(users_in_cell,1)

r_distance = norm(users_in_cell(k_users,:)-k_means_center(k,1:2));

if r_distance>k_means_center(k,3)

k_means_center(k,3) = r_distance;

end


end

end

%--------------cell breathing

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%%计算所有重叠部分的用户数

for k = 1:k_cluster

users_in_cell = users_dist(k_index==k,:);

for k_users = 1: size(users_in_cell,1)

for k_cluster_else = 1:k_cluster

if k_cluster_else~=k && norm(users_in_cell(k_users,:)-k_means_center(k_cluster_else,1:2))<k_means_center(k_cluster_else,3)

users_overlap = users_overlap+1;

end

end

end

end


% if length(users_in_cluster)>= max_users_in_aUAV

% [~,dis_label] = sort(r_distance(:,k));

% k_means_center(k,3) = r_distance(dis_label(max_users_in_aUAV),k);

% end

% users_sum = users_sum + sum(r_distance(:,k)<=k_means_center(k,3));


%--------------This function has been closed.

%%%%%%%%%%%%%%%%%%%%%%



%--------------For Power Allocation

index_users = users_data(:,1);

for cluster_for_p = 1:k_cluster

users_PL_cluster_p = users_data(index_users==cluster_for_p,2:end);

SINR_cluster = [];

for user_in_c_p = 1:size(users_PL_cluster_p,1)

SINR_interf = 0;

SINR_signal = powers_all(cluster_for_p) - users_PL_cluster_p(user_in_c_p,1);

count = 1;

for user_intf = 1:k_cluster

if user_intf~=cluster_for_p

count = count+1;

SINR_interf = SINR_interf + 10.^((powers_all(cluster_for_p)-users_PL_cluster_p(user_in_c_p,count))/10);

end

end

SINR_user = SINR_signal- 10*log10(SINR_interf);

SINR_cluster = [SINR_cluster; SINR_user 10*log10(SINR_interf)];


end

[min_SINR, min_user] = min(SINR_cluster(:,1));

powers_all(cluster_for_p) = SINR_threshold + SINR_cluster(min_user,2) + users_PL_cluster_p(min_user,1);

end

powers_collect = [powers_collect; powers_all'];

SINR_iter = powerAllocation(k_cluster, users_dist,k_index,powers_all,k_means_center,height_all);

end

%--------------End

%%%%%%%%%%%%%%%%

break

end

end

end %While end


if stop == 1

break

end

end

%rate(iter) = rate(iter)/k_cluster;

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%---------------Calculate the new SINR

Rate2_low = powerAllocation(k_cluster, users_dist,k_index,powers_collect(size(powers_collect,1)-1,:),k_means_center,height_all);

figure; hold on; plot(rate,'r*');plot(Rate2_low,'ko');

figure; hold on;

for power_iter = 1:size(powers_collect)

plot(powers_collect(power_iter,:),'.','MarkerSize',18)

end

【无人机布局优化】基于k-mean聚类的无人机布局优化matlab源码_布局优化_08