第十章 多元分析

第一节 聚类分析

介绍
这里是司守奎教授的《数学建模算法与应用》全书案例代码python实现,欢迎加入此项目将其案例代码用python实现
作者:STL_CC
由于作者还是大一学生,才疏学浅,难免会有错误,欢迎指正
同时作者精力有限,希望更多大佬加入此项目,一来可以提高建模水平,二来可以分享建模经验

系统聚类法

例10.1
设有5个销售员python数学建模算法与应用电子版 数学建模python教程_多元分析,python数学建模算法与应用电子版 数学建模python教程_python_02,python数学建模算法与应用电子版 数学建模python教程_python数学建模算法与应用电子版_03,python数学建模算法与应用电子版 数学建模python教程_多元分析_04,python数学建模算法与应用电子版 数学建模python教程_美国大学生数学建模竞赛_05,他们的销售业绩由二维变量(python数学建模算法与应用电子版 数学建模python教程_多元分析_06,python数学建模算法与应用电子版 数学建模python教程_多元分析_07)描述,见表1 销售员业绩表

销售员

(销售量)百件

(回收款项)万元

1

0

1

1

3

2

4

3

2

5

方法一 系统聚类法中的最长距离法

import numpy as np
import numpy.matlib
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist,squareform
from scipy.cluster.hierarchy import dendrogram, linkage,fcluster
a = np.array([[1,0],[1,1],[3,2],[4,3],[2,5]])
(m,n)=a.shape
d = np.zeros((m,m))
for i in range(m):
    for j in range(i+1,m):
        d[i,j]=sum(map(abs,(a[i,:]-a[j,:])))
nd=d.ravel()[np.flatnonzero(d)]
nd=np.unique(nd)
l=len(nd)
for i in range(0,l):
    nd_min=min(nd)
    (row,col)=np.where(d==nd_min)
    tm=np.union1d(row,col)
    print('第{:d}次合成,平台高度为{}时的分类结果为:{}'.format(i+1,nd_min,tm))
    nd=np.delete(nd,np.where(nd==min(nd)))
    if(not len(nd)):
        break
第1次合成,平台高度为1.0时的分类结果为:[0 1]
第2次合成,平台高度为2.0时的分类结果为:[2 3]
第3次合成,平台高度为3.0时的分类结果为:[1 2]
第4次合成,平台高度为4.0时的分类结果为:[0 2 3 4]
第5次合成,平台高度为5.0时的分类结果为:[1 3 4]
第6次合成,平台高度为6.0时的分类结果为:[0 3 4]

方法二 使用python的scipy库

import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist,squareform
from scipy.cluster.hierarchy import dendrogram, linkage,fcluster
a = np.array([[1,0],[1,1],[3,2],[4,3],[2,5]])
y=pdist(a,'cityblock')
yc=squareform(y)
z=linkage(y)
dendrogram(z)
{'icoord': [[15.0, 15.0, 25.0, 25.0],
  [35.0, 35.0, 45.0, 45.0],
  [20.0, 20.0, 40.0, 40.0],
  [5.0, 5.0, 30.0, 30.0]],
 'dcoord': [[0.0, 1.0, 1.0, 0.0],
  [0.0, 2.0, 2.0, 0.0],
  [1.0, 3.0, 3.0, 2.0],
  [0.0, 4.0, 4.0, 3.0]],
 'ivl': ['4', '0', '1', '2', '3'],
 'leaves': [4, 0, 1, 2, 3],
 'color_list': ['g', 'r', 'b', 'b']}

python数学建模算法与应用电子版 数学建模python教程_数学建模_15

T = fcluster(z,3,'maxclust')
for i in range(1,3+1):
    tm=np.where(T==i)
    print('第{}类的有'.format(i),tm)
第1类的有 (array([0, 1], dtype=int64),)
第2类的有 (array([2, 3], dtype=int64),)
第3类的有 (array([4], dtype=int64),)

变量聚类法

例10.2 服装标准制定中的变量聚类法。
在服装标准制定中,对某地成年女子的各部位尺寸进行了统计,通过14个部位的测量资料,获得各因素之间的相关系数表(见表5)。
注:具体资料见教材,部分数据将在程序中给出

import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist,squareform
from scipy.cluster.hierarchy import dendrogram, linkage,fcluster
a=pd.read_csv('ch.txt',header=None,sep='	')
a



0

1

2

3

4

5

6

7

8

9

10

11

12

13

0

1.000

NaN

NaN

NaN

NaN

NaN

NaN

NaN

NaN

NaN

NaN

NaN

NaN

NaN

1

0.366

1.000

NaN

NaN

NaN

NaN

NaN

NaN

NaN

NaN

NaN

NaN

NaN

NaN

2

0.242

0.233

1.000

NaN

NaN

NaN

NaN

NaN

NaN

NaN

NaN

NaN

NaN

NaN

3

0.280

0.194

0.590

1.000

NaN

NaN

NaN

NaN

NaN

NaN

NaN

NaN

NaN

NaN

4

0.360

0.324

0.476

0.435

1.000

NaN

NaN

NaN

NaN

NaN

NaN

NaN

NaN

NaN

5

0.282

0.262

0.483

0.470

0.452

1.000

NaN

NaN

NaN

NaN

NaN

NaN

NaN

NaN

6

0.245

0.265

0.540

0.478

0.535

0.663

1.000

NaN

NaN

NaN

NaN

NaN

NaN

NaN

7

0.448

0.345

0.452

0.404

0.431

0.322

0.266

1.000

NaN

NaN

NaN

NaN

NaN

NaN

8

0.486

0.367

0.365

0.357

0.429

0.283

0.287

0.820

1.000

NaN

NaN

NaN

NaN

NaN

9

0.648

0.662

0.216

0.032

0.429

0.283

0.263

0.527

0.547

1.000

NaN

NaN

NaN

NaN

10

0.689

0.671

0.243

0.313

0.430

0.302

0.294

0.520

0.558

0.957

1.000

NaN

NaN

NaN

11

0.486

0.636

0.174

0.243

0.375

0.296

0.255

0.403

0.417

0.857

0.852

1.000

NaN

NaN

12

0.133

0.153

0.732

0.477

0.339

0.392

0.446

0.266

0.241

0.054

0.099

0.055

1.000

NaN

13

0.376

0.252

0.676

0.581

0.441

0.447

0.440

0.424

0.372

0.363

0.376

0.321

0.627

1.0

d=1-abs(a)
for i in range(len(d)):
    for j in range(i+1,len(d)):
        d.iloc[i,j]=d.iloc[j,i]
z=linkage(d,'complete')
y=fcluster(z,2,'maxclust')
dendrogram(z)
{'icoord': [[25.0, 25.0, 35.0, 35.0],
  [15.0, 15.0, 30.0, 30.0],
  [5.0, 5.0, 22.5, 22.5],
  [55.0, 55.0, 65.0, 65.0],
  [45.0, 45.0, 60.0, 60.0],
  [13.75, 13.75, 52.5, 52.5],
  [85.0, 85.0, 95.0, 95.0],
  [75.0, 75.0, 90.0, 90.0],
  [125.0, 125.0, 135.0, 135.0],
  [115.0, 115.0, 130.0, 130.0],
  [105.0, 105.0, 122.5, 122.5],
  [82.5, 82.5, 113.75, 113.75],
  [33.125, 33.125, 98.125, 98.125]],
 'dcoord': [[0.0, 0.29809729955167313, 0.29809729955167313, 0.0],
  [0.0, 0.4028870809544531, 0.4028870809544531, 0.29809729955167313],
  [0.0, 0.7535436284648688, 0.7535436284648688, 0.4028870809544531],
  [0.0, 0.28955310393777517, 0.28955310393777517, 0.0],
  [0.0, 0.9263082640244553, 0.9263082640244553, 0.28955310393777517],
  [0.7535436284648688,
   1.158768311613672,
   1.158768311613672,
   0.9263082640244553],
  [0.0, 0.49697082409332655, 0.49697082409332655, 0.0],
  [0.0, 0.8415325305655154, 0.8415325305655154, 0.49697082409332655],
  [0.0, 0.5600321419347287, 0.5600321419347287, 0.0],
  [0.0, 0.7182241989796779, 0.7182241989796779, 0.5600321419347287],
  [0.0, 0.850338756026091, 0.850338756026091, 0.7182241989796779],
  [0.8415325305655154,
   1.2178604189314965,
   1.2178604189314965,
   0.850338756026091],
  [1.158768311613672,
   2.1077737070188536,
   2.1077737070188536,
   1.2178604189314965]],
 'ivl': ['1',
  '11',
  '9',
  '10',
  '0',
  '7',
  '8',
  '4',
  '5',
  '6',
  '12',
  '3',
  '2',
  '13'],
 'leaves': [1, 11, 9, 10, 0, 7, 8, 4, 5, 6, 12, 3, 2, 13],
 'color_list': ['g',
  'g',
  'g',
  'g',
  'g',
  'g',
  'r',
  'r',
  'r',
  'r',
  'r',
  'r',
  'b']}

python数学建模算法与应用电子版 数学建模python教程_多元分析_16

说明
1.在教材原来的matlab程序的数据预处理中,原始数据被去掉了0,向量长度不一。然而这样长度不一的向量是不能在python中被处理的。笔者将其转换为了一个对称矩阵再行操作,更能适应python操作。
2.在matlab中,相关系数矩阵计算使用的corrcoef列指标是变量,然而np.corrcoef行指标是向量,所以要先转置
参考文献
1.python中scipy库实现计算各种距离 2.Python之向量(Vector)距离矩阵计算 3.scipy.spatial.distance.pdist官方文档 4.使用Python进行层次聚类(一)——基本使用+主成分分析绘图观察结果+绘制热图 5.python的scipy层次聚类参数详解 6.皮尔逊积矩相关系数–NumPy–corrcoef()

案例研究

聚类分析案例—我国各地区普通高等教育发展状况分析
具体资料见教材,部分数据将在程序中给出

R型聚类分析

import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist,squareform
from scipy.cluster.hierarchy import dendrogram, linkage,fcluster
gj=pd.read_csv('gj.txt',header=None,sep='	').T
r=np.corrcoef(gj)
d=1-r
d=d.T
z=linkage(d,'average')
dendrogram(z)
D:\Program Files\anaconda\lib\site-packages\ipykernel_launcher.py:4: ClusterWarning: scipy.cluster: The symmetric non-negative hollow observation matrix looks suspiciously like an uncondensed distance matrix
  after removing the cwd from sys.path.





{'icoord': [[5.0, 5.0, 15.0, 15.0],
  [55.0, 55.0, 65.0, 65.0],
  [85.0, 85.0, 95.0, 95.0],
  [75.0, 75.0, 90.0, 90.0],
  [60.0, 60.0, 82.5, 82.5],
  [45.0, 45.0, 71.25, 71.25],
  [35.0, 35.0, 58.125, 58.125],
  [25.0, 25.0, 46.5625, 46.5625],
  [10.0, 10.0, 35.78125, 35.78125]],
 'dcoord': [[0.0, 0.8525484864006131, 0.8525484864006131, 0.0],
  [0.0, 0.024863305626884467, 0.024863305626884467, 0.0],
  [0.0, 0.02757546216575882, 0.02757546216575882, 0.0],
  [0.0, 0.04009972160627508, 0.04009972160627508, 0.02757546216575882],
  [0.024863305626884467,
   0.14222971620143948,
   0.14222971620143948,
   0.04009972160627508],
  [0.0, 0.3129510501346711, 0.3129510501346711, 0.14222971620143948],
  [0.0, 0.4732207840475411, 0.4732207840475411, 0.3129510501346711],
  [0.0, 1.0379457746449512, 1.0379457746449512, 0.4732207840475411],
  [0.8525484864006131,
   1.7688040008982475,
   1.7688040008982475,
   1.0379457746449512]],
 'ivl': ['6', '7', '9', '8', '0', '4', '5', '1', '2', '3'],
 'leaves': [6, 7, 9, 8, 0, 4, 5, 1, 2, 3],
 'color_list': ['g', 'r', 'r', 'r', 'r', 'r', 'r', 'r', 'b']}

python数学建模算法与应用电子版 数学建模python教程_python_17

T=fcluster(z,6,'maxclust')
for i in range(1,6+1):
    tm=np.where(T==i)
    print('第{}类的有'.format(i),tm)

Q型聚类分析

第1类的有 (array([ 1,  2,  5,  6,  9, 14, 16, 20, 22, 27, 28], dtype=int64),)
第2类的有 (array([11, 17, 19], dtype=int64),)
第3类的有 (array([ 3,  4,  7,  8, 10, 12, 13, 15, 18, 23, 25, 26], dtype=int64),)
第4类的有 (array([ 0, 29], dtype=int64),)
第5类的有 (array([24], dtype=int64),)
第6类的有 (array([21], dtype=int64),)
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist,squareform
from scipy.cluster.hierarchy import dendrogram, linkage,fcluster
from scipy.stats import zscore
gj=pd.read_csv('gj.txt',header=None,sep='	')
gj.drop([2,3,4,5],axis=1,inplace=True)
gj=zscore(gj)
y=pdist(gj)
z=linkage(y,'average')
dendrogram(z)
{'icoord': [[25.0, 25.0, 35.0, 35.0],
  [55.0, 55.0, 65.0, 65.0],
  [45.0, 45.0, 60.0, 60.0],
  [115.0, 115.0, 125.0, 125.0],
  [105.0, 105.0, 120.0, 120.0],
  [95.0, 95.0, 112.5, 112.5],
  [135.0, 135.0, 145.0, 145.0],
  [175.0, 175.0, 185.0, 185.0],
  [165.0, 165.0, 180.0, 180.0],
  [155.0, 155.0, 172.5, 172.5],
  [140.0, 140.0, 163.75, 163.75],
  [205.0, 205.0, 215.0, 215.0],
  [195.0, 195.0, 210.0, 210.0],
  [151.875, 151.875, 202.5, 202.5],
  [103.75, 103.75, 177.1875, 177.1875],
  [245.0, 245.0, 255.0, 255.0],
  [235.0, 235.0, 250.0, 250.0],
  [225.0, 225.0, 242.5, 242.5],
  [285.0, 285.0, 295.0, 295.0],
  [275.0, 275.0, 290.0, 290.0],
  [265.0, 265.0, 282.5, 282.5],
  [233.75, 233.75, 273.75, 273.75],
  [140.46875, 140.46875, 253.75, 253.75],
  [85.0, 85.0, 197.109375, 197.109375],
  [75.0, 75.0, 141.0546875, 141.0546875],
  [52.5, 52.5, 108.02734375, 108.02734375],
  [30.0, 30.0, 80.263671875, 80.263671875],
  [15.0, 15.0, 55.1318359375, 55.1318359375],
  [5.0, 5.0, 35.06591796875, 35.06591796875]],
 'dcoord': [[0.0, 2.165069616081611, 2.165069616081611, 0.0],
  [0.0, 1.2583609996700658, 1.2583609996700658, 0.0],
  [0.0, 1.8417771551214015, 1.8417771551214015, 1.2583609996700658],
  [0.0, 0.5102264872107092, 0.5102264872107092, 0.0],
  [0.0, 0.6832038743184952, 0.6832038743184952, 0.5102264872107092],
  [0.0, 1.0534720455396414, 1.0534720455396414, 0.6832038743184952],
  [0.0, 0.5781192315086323, 0.5781192315086323, 0.0],
  [0.0, 0.39279767533924836, 0.39279767533924836, 0.0],
  [0.0, 0.6920728341799272, 0.6920728341799272, 0.39279767533924836],
  [0.0, 0.8587702247881565, 0.8587702247881565, 0.6920728341799272],
  [0.5781192315086323,
   0.951351143405573,
   0.951351143405573,
   0.8587702247881565],
  [0.0, 0.5422575852550412, 0.5422575852550412, 0.0],
  [0.0, 0.9931260197301512, 0.9931260197301512, 0.5422575852550412],
  [0.951351143405573,
   1.1337053131831187,
   1.1337053131831187,
   0.9931260197301512],
  [1.0534720455396414,
   1.3091848224406866,
   1.3091848224406866,
   1.1337053131831187],
  [0.0, 0.7003307018910611, 0.7003307018910611, 0.0],
  [0.0, 0.8455264592324916, 0.8455264592324916, 0.7003307018910611],
  [0.0, 1.183614496022609, 1.183614496022609, 0.8455264592324916],
  [0.0, 0.7408663414479949, 0.7408663414479949, 0.0],
  [0.0, 0.7928025480682006, 0.7928025480682006, 0.7408663414479949],
  [0.0, 1.43244903784588, 1.43244903784588, 0.7928025480682006],
  [1.183614496022609,
   1.4564970761510314,
   1.4564970761510314,
   1.43244903784588],
  [1.3091848224406866,
   1.7995913531962198,
   1.7995913531962198,
   1.4564970761510314],
  [0.0, 2.389167635724778, 2.389167635724778, 1.7995913531962198],
  [0.0, 2.4533056587669138, 2.4533056587669138, 2.389167635724778],
  [1.8417771551214015,
   2.801906683718659,
   2.801906683718659,
   2.4533056587669138],
  [2.165069616081611, 4.23103702123038, 4.23103702123038, 2.801906683718659],
  [0.0, 5.540472787063805, 5.540472787063805, 4.23103702123038],
  [0.0, 8.531300289191663, 8.531300289191663, 5.540472787063805]],
 'ivl': ['0',
  '24',
  '1',
  '2',
  '29',
  '27',
  '28',
  '3',
  '9',
  '20',
  '16',
  '14',
  '22',
  '13',
  '19',
  '21',
  '26',
  '18',
  '25',
  '17',
  '15',
  '23',
  '5',
  '6',
  '4',
  '7',
  '12',
  '10',
  '8',
  '11'],
 'leaves': [0,
  24,
  1,
  2,
  29,
  27,
  28,
  3,
  9,
  20,
  16,
  14,
  22,
  13,
  19,
  21,
  26,
  18,
  25,
  17,
  15,
  23,
  5,
  6,
  4,
  7,
  12,
  10,
  8,
  11],
 'color_list': ['g',
  'g',
  'g',
  'g',
  'g',
  'g',
  'g',
  'g',
  'g',
  'g',
  'g',
  'g',
  'g',
  'g',
  'g',
  'g',
  'g',
  'g',
  'g',
  'g',
  'g',
  'g',
  'g',
  'g',
  'g',
  'g',
  'g',
  'g',
  'b']}

python数学建模算法与应用电子版 数学建模python教程_python_18

for k in range(3,6):
    print('划分成{}类的结果如下'.format(k))
    T=fcluster(z,k,'maxclust')
    for i in range(1,k+1):
        tm=np.where(T==i)
        print('第{}类的有{}'.format(i,tm))
划分成3类的结果如下
第1类的有(array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 25, 26, 27, 28, 29], dtype=int64),)
第2类的有(array([24], dtype=int64),)
第3类的有(array([0], dtype=int64),)
划分成4类的结果如下
第1类的有(array([1, 2], dtype=int64),)
第2类的有(array([ 3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
       20, 21, 22, 23, 25, 26, 27, 28, 29], dtype=int64),)
第3类的有(array([24], dtype=int64),)
第4类的有(array([0], dtype=int64),)
划分成5类的结果如下
第1类的有(array([1, 2], dtype=int64),)
第2类的有(array([27, 28, 29], dtype=int64),)
第3类的有(array([ 3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
       20, 21, 22, 23, 25, 26], dtype=int64),)
第4类的有(array([24], dtype=int64),)
第5类的有(array([0], dtype=int64),)