CiteSpace Clustering LLR Algorithm Analysis Table

Reposted

mob6454cc6f4a4e 2024-09-06 22:12:52

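1. The goal of cluster analysis is to group elements by similarity. The paper proposes an approach built around cluster centers, which are characterized by a density higher than that of their neighbors and by a relatively large distance from any point of higher density.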

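2. The classic k-means algorithm starts from a specified set of cluster centers and updates them iteratively. Because every point is assigned to its nearest center, k-means cannot detect clusters with non-spherical distributions. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) can cluster data of arbitrary shape, but it requires a density threshold to be specified, and points whose density falls below that threshold are discarded as noise.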
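
To make the nearest-center assignment concrete, here is a minimal sketch of the plain k-means iteration. The function name `kmeans`, its parameters, and the toy data are illustrative assumptions, not code from the article.

```python
import numpy as np


def kmeans(points, centers, n_iter=10):
    """Illustrative k-means loop: assign to nearest center, then recompute centers."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float).copy()
    for _ in range(n_iter):
        # Distance from every point to every center, shape (n_points, n_centers).
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        # Each point goes to its nearest center; this step is what biases
        # k-means toward compact, roughly spherical clusters.
        labels = dists.argmin(axis=1)
        for k in range(len(centers)):
            members = points[labels == k]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[k] = members.mean(axis=0)
    return labels, centers


points = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
labels, centers = kmeans(points, centers=[[1.0, 0.0], [9.0, 0.0]])
```

On this toy input the two left points end up in one cluster and the two right points in the other, with each center moving to the mean of its members.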

3. Methods based on the local density of data points can easily detect clusters of arbitrary shape.

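4. The algorithm rests on the assumption that cluster centers are surrounded by neighbors of lower local density and lie at a relatively large distance from any point of higher local density. For each data point we compute two quantities: its local density and its distance from points of higher density. Both depend only on the distances between data points. Assuming the distance $d_{ij}$ satisfies the triangle inequality, the local density $\rho_i$ of data point $i$ is defined as:

$\rho_i = \sum_{j} \chi(d_{ij} - d_c)$
where $\chi(x) = 1$ if $x < 0$, otherwise $\chi(x) = 0$, and $d_c$ is the local density threshold (a cutoff distance).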
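
As a small worked example of the cutoff definition, the sketch below counts, for each point, how many other points lie closer than $d_c$. The helper name `cutoff_density` and the sample points are assumptions made for illustration, and the point itself is excluded from its own count (the article's `local_density` code likewise only visits pairs with i < j).

```python
import numpy as np


def cutoff_density(points, dc):
    """Illustrative cutoff-kernel density: rho_i = number of j != i with d_ij < dc."""
    points = np.asarray(points, dtype=float)
    # Full pairwise Euclidean distance matrix, shape (n, n).
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    chi = d < dc                  # chi(d_ij - dc) = 1 exactly when d_ij < dc
    np.fill_diagonal(chi, False)  # do not count a point as its own neighbor
    return chi.sum(axis=1)


points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0]]
rho = cutoff_density(points, dc=1.2)
```

With `dc=1.2`, the origin has two neighbors, the two points at distance 1 from it have one neighbor each (they are about 1.41 apart from each other), and the isolated point has none.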

$d_c$ is computed as follows:

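import logging

logger = logging.getLogger(__name__)


def select_dc(max_id, max_dis, min_dis, distances, auto=False):
    '''
    Select the local density threshold dc. By default this uses the percentile
    rule from the paper; with auto=True it delegates to `autoselect_dc`.

    Args:
        max_id    : largest point id (ids are continuous, starting from 1)
        max_dis   : maximum distance over all point pairs
        min_dis   : minimum distance over all point pairs
        distances : dict mapping (i, j) to the distance between points i and j
        auto      : whether to use `autoselect_dc`

    Returns:
        dc, the local density threshold
    '''
    logger.info("PROGRESS: select dc")
    if auto:
        return autoselect_dc(max_id, max_dis, min_dis, distances)
    percent = 2.0
    position = int(max_id * (max_id + 1) / 2 * percent / 100)
    # The offset presumably assumes `distances` holds both (i, j) and (j, i)
    # plus max_id zero self-distances, so each pair appears twice after the zeros.
    dc = sorted(distances.values())[position * 2 + max_id]
    logger.info("PROGRESS: dc - %s", dc)
    return dc


def autoselect_dc(max_id, max_dis, min_dis, distances):
    '''
    Automatically select dc by binary search so that the average number of
    neighbors is 1-2 percent of all points.

    Args:
        max_id    : largest point id (ids are continuous, starting from 1)
        max_dis   : maximum distance over all point pairs
        min_dis   : minimum distance over all point pairs
        distances : dict mapping (i, j) to the distance between points i and j

    Returns:
        dc, the local density threshold
    '''
    dc = (max_dis + min_dis) / 2.0

    while True:
        # Fraction of stored distances below dc; float() avoids integer
        # division (always 0) when run on Python 2.
        nneighs = sum(1 for v in distances.values() if v < dc) / float(max_id ** 2)
        if 0.01 <= nneighs <= 0.02:
            break
        # binary search: too few neighbors means dc must grow, otherwise shrink
        if nneighs < 0.01:
            min_dis = dc
        else:
            max_dis = dc
        dc = (max_dis + min_dis) / 2.0
        # stop once the search interval has collapsed
        if max_dis - min_dis < 0.0001:
            break
    return dc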
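
The percentile rule can also be sketched on a plain distance matrix, which makes the indexing easier to follow. This is a simplified variant under stated assumptions, not the article's code: `percentile_dc` is a hypothetical helper, it takes a symmetric NumPy matrix instead of the `distances` dict, and it counts each unordered pair once (n(n-1)/2 pairs).

```python
import numpy as np


def percentile_dc(dist_matrix, percent=2.0):
    """Illustrative dc choice: the distance at the given percentile of all pair distances."""
    d = np.asarray(dist_matrix, dtype=float)
    n = d.shape[0]
    # Upper triangle without the diagonal: every unordered pair exactly once.
    pair_dists = np.sort(d[np.triu_indices(n, k=1)])
    position = int(len(pair_dists) * percent / 100)
    position = min(position, len(pair_dists) - 1)  # guard against percent = 100
    return pair_dists[position]


# Five points on a line at 0, 1, 2, 3, 4; distances are absolute differences.
xs = np.arange(5.0)
dist_matrix = np.abs(xs[:, None] - xs[None, :])
dc_small = percentile_dc(dist_matrix, percent=2.0)
dc_median = percentile_dc(dist_matrix, percent=50.0)
```

The ten sorted pair distances here are 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, so a 2 percent position selects the smallest distance and a 50 percent position selects 2.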

The local density is computed as follows:

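import math

import numpy as np

# `logger` is the same module-level logging.Logger that select_dc uses.


def local_density(max_id, distances, dc, guass=True, cutoff=False):
    '''
    Compute the local density of every point.

    Args:
        max_id    : largest point id (ids run from 1 to max_id)
        distances : dict mapping (i, j) to the distance between points i and j
        dc        : local density threshold
        guass     : use the Gaussian kernel (mutually exclusive with cutoff)
        cutoff    : use the hard cutoff kernel (mutually exclusive with guass)

    Returns:
        local density vector indexed by point id, starting from 1
        (index 0 is a placeholder holding -1)
    '''
    assert guass ^ cutoff, "exactly one of guass and cutoff must be set"
    logger.info("PROGRESS: compute local density")
    guass_func = lambda dij, dc: math.exp(-(dij / dc) ** 2)
    cutoff_func = lambda dij, dc: 1 if dij < dc else 0
    func = guass_func if guass else cutoff_func
    rho = [-1] + [0] * max_id
    # Log roughly every 10 percent; an integer step of at least 1 avoids a
    # float modulus on Python 3 and a zero modulus for small max_id.
    step = max(1, max_id // 10)
    for i in range(1, max_id):
        for j in range(i + 1, max_id + 1):
            # each pair (i, j) contributes to the density of both endpoints
            contribution = func(distances[(i, j)], dc)
            rho[i] += contribution
            rho[j] += contribution
        if i % step == 0:
            logger.info("PROGRESS: at index #%i" % (i))
    return np.array(rho, np.float32)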
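
Note that with the default `guass=True` the function above does not apply the hard cutoff but a smooth Gaussian kernel. Read directly off `guass_func`, and writing $d_{ij}$ for the pairwise distance and $d_c$ for the threshold, the density it computes is:

$$\rho_i = \sum_{j \neq i} \exp\left(-\left(\frac{d_{ij}}{d_c}\right)^2\right)$$

Every point then contributes a weight between 0 and 1 that decays with distance, so ties between points with the same neighbor count become unlikely.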

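5. Algorithm workflow
For each data point, two quantities are computed: the point's local density and its distance from points of higher local density. Both depend only on the distances between data points.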
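
The second quantity has no code in the article, so here is a minimal sketch of it. It assumes the usual convention for this method: each point takes the minimum distance to any point of strictly higher density, and a point with no denser point takes its maximum distance to any other point. The function name, the matrix-based input, and the sample densities are illustrative assumptions.

```python
import numpy as np


def min_distance_to_higher_density(dist_matrix, rho):
    """Illustrative delta_i: distance from point i to the nearest point of higher density."""
    d = np.asarray(dist_matrix, dtype=float)
    rho = np.asarray(rho, dtype=float)
    n = len(rho)
    delta = np.zeros(n)
    nearest_higher = np.full(n, -1)  # -1 marks "no denser point exists"
    for i in range(n):
        higher = np.flatnonzero(rho > rho[i])  # indices of strictly denser points
        if higher.size == 0:
            # Densest point(s): use the largest distance to any other point.
            delta[i] = d[i].max()
        else:
            j = higher[np.argmin(d[i, higher])]
            delta[i] = d[i, j]
            nearest_higher[i] = j
    return delta, nearest_higher


# Five points on a line, with hand-picked densities for illustration.
xs = np.array([0.0, 1.0, 2.0, 10.0, 11.0])
dist_matrix = np.abs(xs[:, None] - xs[None, :])
rho = [1, 3, 2, 2, 1]
delta, nearest_higher = min_distance_to_higher_density(dist_matrix, rho)
```

In this example the points at 1.0 and 10.0 combine a comparatively high density with a large distance value (10 and 9), which is the signature of a cluster center described in item 4; all other points sit at distance 1 from a denser neighbor.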