Medical Image Analysis


mb63816ca2ee95f 2022-12-18 14:51:38




Abstract

The availability of large-scale annotated image datasets and recent advances in supervised deep learning methods enable the end-to-end derivation of representative image features that can impact a variety of image analysis problems. Such supervised approaches, however, are difficult to implement in the medical domain where large volumes of labelled data are difficult to obtain due to the complexity of manual annotation and inter- and intra-observer variability in label assignment. We propose a new convolutional sparse kernel network (CSKN), which is a hierarchical unsupervised feature learning framework that addresses the challenge of learning representative visual features in medical image analysis domains where there is a lack of annotated training data. Our framework has three contributions: (i) we extend kernel learning to identify and represent invariant features across image sub-patches in an unsupervised manner. (ii) We initialise our kernel learning with a layer-wise pre-training scheme that leverages the sparsity inherent in medical images to extract initial discriminative features. (iii) We adapt a multi-scale spatial pyramid pooling (SPP) framework to capture subtle geometric differences between learned visual features. We evaluated our framework in medical image retrieval and classification on three public datasets. Our results show that our CSKN had better accuracy when compared to other conventional unsupervised methods and comparable accuracy to methods that used state-of-the-art supervised convolutional neural networks (CNNs). Our findings indicate that our unsupervised CSKN provides an opportunity to leverage unannotated big data in medical imaging repositories. © 2019 Elsevier B.V. All rights reserved.


1. Introduction

Medical imaging is now ubiquitous in modern healthcare because it provides invaluable data for patient diagnosis and management. Most current medical imaging data are digital and stored in vast imaging repositories. These repositories or archives provide new opportunities for evidence-based and computer-aided diagnosis, physician training and biomedical research (Litjens et al., 2017; Kumar et al., 2013). Computer-aided diagnosis systems (CADs) can automatically analyse, categorise, and retrieve images by relating low-level image features to high-level semantic concepts or expert domain knowledge using machine learning approaches. These supervised approaches use prior knowledge derived from labelled training data, and approaches such as convolutional neural networks (CNNs) have produced impressive results in natural (photographic) image classification (Simonyan and Zisserman, 2014; He et al., 2016; Szegedy et al., 2015). CNNs learn image features in a hierarchical fashion. Each deeper layer of the network learns a representation of the image data that is high-level and semantically more meaningful. For example, in image classification, the learned features can be a class-specific representation (Le, 2013) to enable better discrimination between different image classes (Simonyan and Zisserman, 2014; He et al., 2016). These CNNs require a large number of annotated training images, e.g., ImageNet with over 1 million natural images. Such large image datasets are scarce in the medical domain because the images can be difficult to interpret and image labelling / annotation is costly, tedious, slow, and subject to clinician inter- and intra-observer variability (Shin et al., 2013).

Transfer learning was introduced to address the lack of large amounts of labelled medical image data through a model that was pre-trained on a different domain, e.g., natural images as a generic feature extractor, or through using a relatively small dataset of medical images to optimise a pre-trained model from a different domain, i.e., fine-tuning (Kumar et al., 2017; Tajbakhsh et al., 2016; Shin et al., 2016; Bi et al., 2017). Unfortunately, both approaches rely on general image features derived from a different domain and they are unable to capture the high-level semantic features, which are most relevant to a specific dataset. As a result, they have inferior accuracy when compared to approaches that learn image features directly from large, specific annotated data. An alternative approach is to use unsupervised feature learning algorithms to build features from unlabelled data, which then allows unannotated image archives to be used (Lee et al., 2006; Hinton et al., 2006; Nair and Hinton, 2010; Le, 2013; Erhan et al., 2010; Romero et al., 2016). Many of these methods, however, have only shown strong performance in learning low-level features such as 'lines' or 'edges' (Lee et al., 2006; Hinton et al., 2006; Nair and Hinton, 2010; Le, 2013).

Many unsupervised methods were used to pre-train a model that was later coupled to a supervised learning stage, i.e., the unsupervised component was used as a pre-training phase to derive useful priors that acted as an initialisation point for the supervised fine-tuning (Erhan et al., 2010; Romero et al., 2016). Thus the onus was on the supervised phase to learn high-level and semantically meaningful image features. Our aim was to derive a framework that enables learning semantically meaningful image features in a completely unsupervised fashion.


1.1. Related work

Many unsupervised feature learning approaches are based on sparse coding (Lee et al., 2006), sparse auto-encoders (Hinton et al., 2006), and Restricted Boltzmann Machines (RBMs) (Nair and Hinton, 2010) and are limited to learning and extracting low-level features. Only a few methods, such as the stacked sparse autoencoder (SSAE) reported by Le et al., where the SSAE pre-trained model was coupled to supervised deep learning (i.e., fine-tuning), have been able to extract semantic high-level features. Highly non-linear and non-parametric models are crucial to unsupervised feature learning algorithms (Song et al., 2018). Kernel learning is a natural approach to derive non-linear models via a similarity function in a reproducing kernel Hilbert space (RKHS) (Zhuang et al., 2011). Machine learning techniques have been adapted to a RKHS and have improved performance in object recognition and clustering (Thiagarajan et al., 2014). Recently, deep learning architectures have been used for kernel learning (Mairal et al., 2014; Song et al., 2018) with state-of-the-art performance in natural image classification (Mairal et al., 2014) and retrieval (Paulin et al., 2015). These architectures learned data representations in a RKHS and in a non-linear hierarchical manner, but they are prone to overfitting (the learning cost function often gets stuck in local minima) when the training data are small.

The concept of sparsity is widely used in computer vision and has proven effective in image compression (Skodras et al., 2001), denoising (Buades et al., 2005), tomographic reconstruction, segmentation (Ahn et al., 2015, 2017; Zhang et al., 2012), and classification (Ahn et al., 2016; Jiang et al., 2011). Sparsity can be used to derive compact and optimal representations of image data, where trivial information or parameters can be ignored without compromising image quality or characteristics (Leahy and Byrne, 2000).

Recently, sparsity-based CNNs have also been applied to supervised deep learning approaches to reduce the number of parameters in the architecture (Graham et al., 2018; Graham and van der Maaten, 2017; Liu et al., 2015, 2018). Liu et al. (2015) were able to reduce the number of parameters in dense CNNs by 90%, which provided a marked improvement in computational speed. It has been shown that in medical image data, feature representations have an intrinsic sparse structure under certain fixed bases (e.g., Fourier) (Li et al., 2012; Lustig et al., 2007). Lustig et al. (2007) improved the temporal resolution of magnetic resonance (MR) imaging by adding a sparsity constraint; this step then allowed the development of a number of novel CADs in cardiac and brain imaging. This intrinsic sparsity often comes in two complementary forms (Willmore and Tolhurst, 2001): population and lifetime sparsity. Population sparsity refers to the activation of small subsets of the bases (i.e., a sparse set of the population) to encode different information; only a small subset of the coding outputs (feature maps or bases) is active for any given stimulus (input image), and different subsets are active for different stimuli. This ensures that the activation of different bases is a discriminator for different image data. In contrast, lifetime sparsity refers to the low frequency of activation of bases for different inputs (i.e., each base has a sparse lifetime); different bases are active very rarely and each activation has a high response. This ensures that the strong rare activations are indicators for higher degrees of information (the higher the information, the higher the entropy) in the underlying image data. Motivated by these findings, we suggest that incorporating sparsity into layerwise unsupervised pre-training will allow the extraction of more discriminative features for medical image data.

Spatial pyramid pooling (SPP) can represent the spatial layout of image features by partitioning the image into multi-level regions and aggregating local features (Lazebnik et al., 2006). SPP has been successfully applied to image classification (Yang et al., 2009; Wang et al., 2010) and object detection (Van de Sande et al., 2011).


1.2. Contribution

We have designed an unsupervised deep learning framework to learn semantic high-level features from unlabelled medical images, which we refer to as the convolutional sparse kernel network (CSKN), to address the challenge of learning representative visual features in medical image analysis where there is a lack of annotated training data. Our CSKN derives a kernel space for modelling image similarity that is constrained by the inherent image sparsity and the local geometric properties of distinct classes. Since our CSKN represents images in a kernel space it can derive features that are highly non-linear in a non-parametric manner, which is crucial for unsupervised feature learning (Song et al., 2018). Furthermore, the derived kernel space depicts a stronger discriminative local semantic representation of the imaging data by ignoring trivial or redundant parameters (Leahy and Byrne, 2000). The main contributions of our work are:

(1) a new approach to characterise medical images by combining kernel learning and CNNs to learn invariant local features in a hierarchical manner;

(2) an unsupervised convolutional sparse feature learning algorithm that effectively learns initial discriminative features in a RKHS;

(3) initialising the weights of a kernel network that can then be pre-trained in a layer-wise fashion; and

(4) incorporating a SPP framework that provides more discriminative and geometrically invariant local feature representations of medical image data.

The remainder of the paper is organised as follows: (a) the materials used in this paper and the proposed framework are introduced in Section 2; (b) details of the implementation and experimental setup are described in Section 3; (c) an evaluation of the framework in comparison to different methods is provided in Section 4; (d) we discuss our findings, limitations and future work in Section 5; and (e) we summarise the work in Section 6.

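2. Materials and methods

2.1. Datasets

2.1.1. IRMA X-ray dataset

The Image Retrieval in Medical Applications (IRMA) dataset comprises 14,410 grayscale X-ray images in 193 hierarchical categories (Lehmann et al., 2004; Lehmann et al., 2003). The IRMA dataset contains images with irregular contrast, brightness and artefacts, and has high intra-class variability and inter-class similarity. We used the standard pre-defined training set of 12,677 images and test set of 1733 images (Lehmann et al., 2003). The images are annotated along four different axes according to the IRMA coding system (Lehmann et al., 2003): (1) a technical code describing the imaging modality, (2) a directional code for the imaging orientation, (3) an anatomical code for the body region examined, and (4) a biological code for the biological system examined. Fig. 1 shows an X-ray image and the corresponding labels from the IRMA code.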


2.1.2. ImageCLEF dataset

We used the medical Subfigure Classification dataset used in the Image Conference and Labs of the Evaluation Forum (ImageCLEF) 2016 competition (García Seco de Herrera et al., 2016; Villegas et al., 2016). We used the standard pre-defined training set of 6776 images and test set of 4166 images from 30 different image modalities. Ground truth annotations are available for both image datasets. While a multitude of different types of images have been collected to assist in the development of more advanced CADs, the labelling of the collated image data remains problematic (Müller et al., 2007, 2008, 2010, 2012). In cases where appropriate labels are absent, automatic identification of the imaging modality is an important initial step because the semantics and content of an image can vary greatly depending on the modality.


2.1.3. ISIC dataset

We used the skin diseases classification dataset from the International Skin Imaging Collaboration (ISIC) 2017 competition (Codella et al., 2018). The dataset is a clinical dataset and contains 2000 training images and 600 test images with 3 different diagnoses of skin lesions (benign nevus, seborrheic keratosis, and melanoma). Ground truth annotations were obtained from expert clinicians as well as pathology reports. The clinical dermoscopy images in this dataset have complex and diverse image characteristics, posing the important challenge of recognising different skin conditions.

IRMA code: 1121-420-212-700
Technical code: X-ray, plain radiography, overview image
Directional code: other orientation, occipitofrontal
Anatomical code: facial cranium, eye area
Biological code: musculoskeletal system

Fig. 1. A sample X-ray image (face) and the corresponding labels from the IRMA code.


2.2. Methods

2.2.1. Overview of the CSKN framework

Fig. 2 is an overview of our CSKN framework. We first used a kernel map to represent the local features of medical image data. Then, as a pre-training step, we learned convolutional sparse features in a RKHS as a starting point of convolutional kernel learning. We then learned a multi-layer kernel network in a feedforward manner. Finally, we applied SPP to extract a final image representation that captures subtle and discriminative geometric variations.


2.2.2. Background: convolutional neural networks (CNNs)

CNN layers generally have: (1) convolutional layers to learn weights (i.e., filters) that can be used to extract features from the input; (2) a linear operation followed by a pointwise non-linearity such as the sigmoid function or rectified linear units; and (3) pooling layers to aggregate features that are in spatial proximity (down-sampling the data in the process). The output of a single-layer CNN can be represented as:

$$f(O) = \mathrm{pool}_p\big(\sigma(W \ast O + b)\big) \tag{1}$$

where $O$ is the input feature map, $\sigma(\cdot)$ is the pointwise non-linear function, and $\theta = \{W, b\}$ is the set of parameters (i.e., weights and biases). The pool function denotes a down-sampling operation and $p$ is the size of the pooling region. The symbol $\ast$ indicates the linear convolution. When a convolutional layer is dense and unstructured, it is called "fully connected". For example, the well-established AlexNet (Krizhevsky et al., 2012) CNN has 8 trainable layers comprising five convolutional layers followed by three fully connected layers. Training such a CNN, however, is challenging because of the number of hyperparameters that need to be carefully tuned. Some major hyperparameters include the size of the learnable filters, the number of layers, the number of outputs per layer, and the size of the down-sampling factor. Sub-optimal hyperparameter choice leads to overfitting and an inability to derive optimal high-level semantic image features. Some supervised CNNs have exploited unsupervised layerwise pre-training schemes to render better generalisation of image data (Le, 2013; Romero et al., 2016). The pre-training acts as a form of regularisation which minimises variance and restricts the range of the parameter values for subsequent supervised training (Erhan et al., 2010). Layerwise unsupervised pre-training allows all the available unlabelled image data to be used to pre-train the network's local parameters, which potentially provides a good initialisation point for further supervised training.
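To make Eq. (1) concrete, the following is a minimal pure-Python sketch of one CNN layer acting on a single 2-D feature map. It is an illustration rather than the implementation used in the paper: the function names are hypothetical, the non-linearity is assumed to be a rectified linear unit, pooling is assumed to be non-overlapping max pooling, and the filter is applied without flipping (cross-correlation, as most CNN libraries do).

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image without padding (cross-correlation)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + c] * kernel[a][c]
                 for a in range(kh) for c in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


def relu(x):
    """Rectified linear unit, one possible choice for sigma in Eq. (1)."""
    return max(0.0, x)


def max_pool(fmap, p):
    """Non-overlapping p x p max pooling; incomplete border windows are dropped."""
    return [[max(fmap[i + a][j + c] for a in range(p) for c in range(p))
             for j in range(0, len(fmap[0]) - p + 1, p)]
            for i in range(0, len(fmap) - p + 1, p)]


def cnn_layer(image, weights, bias, p, sigma=relu):
    """f(O) = pool_p(sigma(W * O + b)) for a single filter W and scalar bias b."""
    conv = conv2d_valid(image, weights)
    activated = [[sigma(v + bias) for v in row] for row in conv]
    return max_pool(activated, p)


# A 5 x 5 ramp image, a 2 x 2 filter that picks the lower-right pixel, bias -10.
image = [[5 * i + j for j in range(5)] for i in range(5)]
weights = [[0, 0], [0, 1]]
features = cnn_layer(image, weights, bias=-10, p=2)
```

Here the 5 x 5 input shrinks to 4 x 4 after the 2 x 2 filter and to 2 x 2 after pooling, which shows how the filter size and the pooling size $p$ (two of the hyperparameters listed above) determine the output resolution.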


2.2.3. Combining kernel learning with CNNs

Our CSKNs have the classic hierarchical architecture of CNNs but use kernel maps to represent image features. A kernel map is used to understand the local geometry of the image data by modelling invariance (Mairal et al., 2014). We suggest that kernels coupled with a hierarchical architecture allow the effective learning of image features without a reliance on labels. The architecture of a two-layer CSKN is shown in Fig. 3.

Fig. 3. A two-layer CSKN; each layer is a weighted match kernel between all sub-patches of the previous layer.

Let us consider two image patches $O$ and $O'$ of an image of size $m \times m$ ($m = 200$ in this paper), with $\Omega$ being a set of pixel coordinates ($\Omega = \{1, \ldots, m\}^2$). Given the locations $z$ and $z'$ in $\Omega$, let $s_z \in O$ and $s'_{z'} \in O'$ be sub-patches of the image feature map. We define a single-layer convolutional kernel network as follows (Mairal et al., 2014):

$$K(O, O') = \sum_{z, z' \in \Omega} \|s_z\|_H \, \|s'_{z'}\|_H \, e^{-\frac{1}{2\beta^2}\|z - z'\|_2^2} \, e^{-\frac{1}{2\alpha^2}\|\tilde{s}_z - \tilde{s}'_{z'}\|_H^2} \tag{2}$$

where $\|\cdot\|_H$ denotes the Hilbertian norm. The kernel $K$ is a positive definite kernel that consists of a sum of pairwise comparisons between image features of sub-patches. The term $\|s_z\|_H \|s'_{z'}\|_H$ acts to emphasise the spatial and feature similarity (captured by the exponential terms) for patches with non-negligible intensity values. The term $e^{-\frac{1}{2\beta^2}\|z - z'\|_2^2}$ captures the spatial distance between $z$ and $z'$, and the term $e^{-\frac{1}{2\alpha^2}\|\tilde{s}_z - \tilde{s}'_{z'}\|_H^2}$ measures the feature similarity between sub-patches. These two terms work in conjunction with the Hilbertian norm terms to create a kernel that gives larger values for patches that are close in both space and intensity. We used two different types of input feature maps:

(1) Patch map: the sub-patch $s_z$ is an image sub-patch of size $b \times b$ centred at $z$. The sub-patch $s_z$ is simply an element of $\mathbb{R}^{b \times b}$ and $\tilde{s}_z$ denotes a contrast-normalised version of the sub-patch.

(2) Gradient map: the sub-patch $s_z$ is the two-dimensional gradient of the image at pixel $z$, which is computed with first-order differences along each dimension. In this formulation, $\|s_z\|_H$ is the gradient intensity and $\tilde{s}_z$ denotes its orientation, defined by an angle $\theta$ as $[\cos\theta, \sin\theta]$ (Bo et al., 2010). When the input data is in a compact set ($\mathbb{R}^d$, $d \le 2$), Eq. (2) can be approximated by uniform sampling over a large enough set; the term $e^{-\frac{1}{2\beta^2}\|z - z'\|_2^2}$ indicates a spatial kernel and $e^{-\frac{1}{2\alpha^2}\|\tilde{s}_z - \tilde{s}'_{z'}\|_H^2}$ denotes the gradient map.
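As an illustration of the gradient map input described above, the sketch below computes, for each pixel, the gradient from first-order differences, its intensity, and its orientation as a unit vector [cos θ, sin θ]. The function name and the choice of forward differences (which leave out the last row and column) are assumptions made for this example, not details taken from the paper.

```python
import math


def gradient_map(image):
    """Return {(row, col): (intensity, (cos_theta, sin_theta))} from forward differences."""
    grads = {}
    for i in range(len(image) - 1):
        for j in range(len(image[0]) - 1):
            gx = image[i][j + 1] - image[i][j]   # difference along columns
            gy = image[i + 1][j] - image[i][j]   # difference along rows
            intensity = math.hypot(gx, gy)
            if intensity > 0.0:
                orientation = (gx / intensity, gy / intensity)
            else:
                orientation = (0.0, 0.0)         # flat region: no defined orientation
            grads[(i, j)] = (intensity, orientation)
    return grads


# A ramp that rises by 3 per column and 4 per row has the same gradient everywhere.
ramp = [[3 * j + 4 * i for j in range(3)] for i in range(3)]
ramp_grads = gradient_map(ramp)
flat_grads = gradient_map([[7, 7], [7, 7]])
```

In this representation the intensity plays the role of the norm of the sub-patch and the unit vector plays the role of its normalised version, so the same kernel form can be reused for both input types.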

The coefficients $\beta$ and $\alpha$ are smoothing Gaussian kernel parameters that control the spatial distance between $z$ and $z'$ and the feature closeness between $\tilde{s}_z$ and $\tilde{s}'_{z'}$ in the Hilbert space, respectively. The corresponding kernel map is formalised as a weighted match kernel between all sub-patches from training samples that defines a feature representation of the image.
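The match kernel of Eq. (2) can be evaluated directly on small images, which may help to see what each factor contributes. The sketch below is a brute-force illustration for the patch map under several assumptions: the Euclidean norm stands in for the Hilbertian norm, contrast normalisation is taken to be division by the sub-patch norm, sub-patches are indexed by their top-left corner rather than their centre, and all function names are hypothetical.

```python
import math


def sub_patches(image, b):
    """List ((row, col), flattened b x b sub-patch) for every valid location."""
    rows, cols = len(image), len(image[0])
    return [((i, j), [image[i + a][j + c] for a in range(b) for c in range(b)])
            for i in range(rows - b + 1) for j in range(cols - b + 1)]


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def prepare(image, b):
    """Keep (location, norm, normalised sub-patch); zero patches contribute nothing."""
    prepared = []
    for z, s in sub_patches(image, b):
        norm = math.sqrt(sum(x * x for x in s))
        if norm > 0.0:
            prepared.append((z, norm, [x / norm for x in s]))
    return prepared


def match_kernel(image_a, image_b, b=2, beta=1.0, alpha=1.0):
    """Sum over all sub-patch pairs of norm * norm * spatial term * feature term."""
    patches_a = prepare(image_a, b)
    patches_b = prepare(image_b, b)
    total = 0.0
    for z, norm, s in patches_a:
        for z2, norm2, s2 in patches_b:
            spatial = math.exp(-sq_dist(z, z2) / (2.0 * beta ** 2))
            feature = math.exp(-sq_dist(s, s2) / (2.0 * alpha ** 2))
            total += norm * norm2 * spatial * feature
    return total


img_a = [[1, 2, 0], [0, 1, 3], [2, 0, 1]]
img_b = [[0, 1, 1], [2, 2, 0], [1, 0, 3]]
img_zero = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
k_ab = match_kernel(img_a, img_b)
```

The double loop visits every pair of sub-patches, so the cost grows quadratically with the number of locations; this is the expense that motivates a finite-dimensional approximation.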


2.2.4. Unsupervised feature learning via CSKNs

Match kernels are expensive to compute when the input data has high dimensionality ($\mathbb{R}^d$, $d > 2$). The computational complexity also grows quadratically with increasing sample sizes. To prevent the curse of dimensionality, we used a fast approximation approach with finite-dimensional embedding proposed by Mairal et al. (2014). For all $u \in \Omega_1$ and $z \in \Omega$:

$$K(O, O') \approx \sum_{u \in \Omega_1} g(u; O)^{T} g(u; O') \tag{3}$$

$$g(u; O) := \sum_{z \in \Omega} e^{-\frac{1}{2\beta^2}\|u - z\|_2^2} \, h(z; O) \tag{4}$$

$$h(z; O) := \|s_z\|_2 \left[ \sqrt{b_i} \, e^{-\frac{1}{\alpha^2}\|W_i - \tilde{s}_z\|_2^2} \right]_{i=1}^{n_1} \tag{5}$$

where $\Omega_1$ is a subset of $\Omega$, $n_1$ denotes the number of filters, and $b$ and $W$ are learned parameters. This operation can be considered to be similar to a spatial convolution of the feature map followed by a pointwise non-linearity. Since $K(O, O')$ is a sum of the match kernel terms, we can learn to approximate the kernel using training data. The parameters $b$ and $W$ are learned at the sub-patch level by solving an optimisation problem:

$$\min_{W_i, b_i} \sum_{c=1}^{n} \left( e^{-\frac{\|\tilde{s}_c - \tilde{s}'_c\|_2^2}{2\alpha^2}} - \sum_{i=1}^{p} b_i \, e^{-\frac{\|W_i - \tilde{s}_c\|_2^2}{\alpha^2}} \, e^{-\frac{\|W_i - \tilde{s}'_c\|_2^2}{\alpha^2}} \right)^2 \tag{6}$$

We randomly selected 400,000 pairs of sub-patches from the training data and used the standard limited-memory Broyden-Fletcher-Goldfarb-Shanno with bounds (L-BFGS-B) (Byrd et al., 1995) optimiser to solve Eq. (6) (Mairal et al., 2014). L-BFGS-B requires fewer parameters and can be superior to conjugate gradient (CG) or stochastic gradient descent (SGD) in many applications such as image classification (Ngiam et al., 2011).
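The idea behind Eq. (6) is that a weighted sum of products of Gaussians centred on the filters $W_i$ can reproduce the Gaussian kernel between two normalised sub-patches. The sketch below illustrates this in one dimension only. Instead of running L-BFGS-B, it places the filters on a uniform grid and uses closed-form weights, which is the uniform-sampling shortcut available in low dimensions; the grid range, the step, and all names are assumptions chosen for the example. The weight follows from the identity $e^{-(s-s')^2/(2\alpha^2)} = \sqrt{2/(\pi\alpha^2)} \int e^{-(w-s)^2/\alpha^2} e^{-(w-s')^2/\alpha^2}\,dw$.

```python
import math

ALPHA = 1.0
STEP = 0.25
# Filters W_i on a uniform grid from -6 to 6 (49 points); an assumption for this demo.
FILTERS = [-6.0 + STEP * i for i in range(49)]
# Closed-form weight b_i, identical for every filter on a uniform grid.
WEIGHT = math.sqrt(2.0 / (math.pi * ALPHA ** 2)) * STEP


def gaussian_kernel(s, s2):
    """Target kernel: exp(-(s - s')^2 / (2 alpha^2))."""
    return math.exp(-(s - s2) ** 2 / (2.0 * ALPHA ** 2))


def embed(s):
    """Finite-dimensional embedding: sqrt(b_i) * exp(-(W_i - s)^2 / alpha^2)."""
    root = math.sqrt(WEIGHT)
    return [root * math.exp(-(w - s) ** 2 / ALPHA ** 2) for w in FILTERS]


def approx_kernel(s, s2):
    """Inner product of two embeddings, i.e. the inner sum of Eq. (6)."""
    return sum(a * c for a, c in zip(embed(s), embed(s2)))


def objective(pairs):
    """Sum of squared approximation errors over sample pairs, as in Eq. (6)."""
    return sum((gaussian_kernel(s, s2) - approx_kernel(s, s2)) ** 2
               for s, s2 in pairs)


pairs = [(-1.0, 1.0), (0.0, 0.5), (0.3, 0.3)]
loss = objective(pairs)
```

In higher dimensions a uniform grid would need far too many filters, which is why the filters and weights are fitted to sampled sub-patch pairs with an optimiser instead.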


2.2.5. Initialisation of CSKN via layerwise unsupervised pre-training with sparsity

We formulated a layerwise unsupervised feature learning algorithm that efficiently enforces population and lifetime sparsity (EPLS) in a RKHS. Our approach learns convolutional sparse features in a RKHS, in contrast to Romero et al.'s (2015) original EPLS algorithm that learns sparse features from decomposed raw image patches. The convolutional sparse features learned in the unified feature space are often more discriminative and therefore allow us to build more class-specific representations (Thiagarajan et al., 2014). Furthermore, the convolutional features learned by our method preserve the relationships between neighbourhood pixels so as to learn local structures and reduce redundancy in the parameters (Ranzato et al., 2007; Romero et al., 2015). The learned parameters are used as initialisation points in CSKN learning (i.e., the initial value of $\theta = \{W, b\}$ of each layer). The algorithm iteratively creates a layer-specific sparse target of the input data and optimises the dictionary by minimising the error between the output of the layer and the sparse target. The degree of sparsity is therefore controlled and learned differently at each layer. The parameters of the layer are then calculated as follows:

$$\theta_l = \arg\min_{\theta_l} \|O_l - T_l\|_H^2, \tag{7}$$

where $O_l \in \mathbb{R}^{N_b \times N_h}$ are the data vectors in the RKHS, which are represented as a weighted combination of the training samples used to construct the kernel matrix at layer $l$, and $T_l$ denotes the sparse target of the layer that addresses population and lifetime sparsity.

Algorithm 1 is the pseudocode of the single-layer EPLS derivation. Let us define $O_j$ as an element of row vector $O$ and denote $O_l$ as $N_b$ output vectors of dimensionality $N_h$, where $N_b$ is the size of the mini-batch. Starting with no activation in $T_l$ (line 1), the input patches of $O_l$ are normalised between 0 and 1 (line 2). The algorithm iteratively processes a row $O_j$ of $O_l$ by selecting the $k$-th element of the $n$-th row of $O_l$ that has the maximal activation value $O_k$ minus an inhibitor $c_j$ (line 5). Here, the inhibitor is an accumulator that counts the number of times an output $j$ has been selected, increasing its inhibitor by $N_h/N$ until reaching maximal inhibition, where $N$ is the total number of training patches. This enforces the lifetime sparsity and prevents the selection of an output that has already been activated $N/N_h$ times.

The $k$-th element of the $n$-th row of the target matrix $T_l$ is then activated as in line 6 (i.e., by assigning 1), considering population sparsity. The inhibitor is progressively updated and finally the output target is remapped to the active and inactive values of the corresponding non-linearity. The optimisation in relation to Eq. (7) is performed using standard stochastic gradient descent (SGD) with adaptive learning rates (Schaul et al., 2013).
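The target construction described above can be sketched in a few lines. The following is an illustrative reading of that description, not the authors' Algorithm 1: the function name is hypothetical, normalisation is assumed to be a global min-max over the mini-batch, the inhibitor is not carried over between mini-batches, and the final remapping of the target to the active and inactive values of the non-linearity is omitted (targets stay 0 or 1).

```python
def epls_targets(outputs, total_patches):
    """Build a one-hot sparse target per row, with an inhibitor for lifetime sparsity."""
    n_b = len(outputs)        # mini-batch size N_b
    n_h = len(outputs[0])     # number of outputs N_h

    # Normalise all activations to the range [0, 1].
    flat = [v for row in outputs for v in row]
    lowest, highest = min(flat), max(flat)
    scale = (highest - lowest) or 1.0
    normed = [[(v - lowest) / scale for v in row] for row in outputs]

    targets = [[0] * n_h for _ in range(n_b)]   # start with no activation
    inhibitor = [0.0] * n_h                     # one accumulator per output

    for n, row in enumerate(normed):
        # Pick the output with the largest activation minus its inhibitor.
        k = max(range(n_h), key=lambda j: row[j] - inhibitor[j])
        targets[n][k] = 1                       # exactly one active output per row
        inhibitor[k] += n_h / total_patches     # discourage re-selecting output k
    return targets


# Output 0 responds most strongly to every input, yet each output is used once.
batch = [[1.0, 0.6, 0.0],
         [1.0, 0.6, 0.0],
         [1.0, 0.6, 0.1]]
sparse_targets = epls_targets(batch, total_patches=3)
```

Each row of the target has a single active entry (population sparsity), while the inhibitor spreads the activations across outputs so that no output dominates (lifetime sparsity), even though the raw activations alone would always favour the first output.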


2.2.6. Multi-layer convolutional sparse kernel networks (CSKNs)

A CSKN kernel can be learned in a hierarchical fashion for a deeper and potentially improved high-level semantic feature representation. Essentially: (1) the input feature map of layer $l + 1$ can be computed by applying the convolution operation, learned weights and biases to kernel maps from layer $l$; (2) EPLS is then used to learn initial sparse features that are used as a starting point of CSKN learning; (3) a multi-layer CSKN is learned in a feedforward manner, using a given input sub-patch of size Sz, and kernel parameters $\alpha$ and $\beta$ for each layer.


2.2.7. Capturing subtle geometric variations with SPP layer We added SPP as the last feature pooling layer to extract a fi- nal image representation that also captures subtle geometric variations. The outputs of the SPP layer are p · M dimensional vectors with M multi-level spatial bins (p is the filter size). We determined the window size of each pyramid level (n) based on the last feature maps (x × x) generated from CSKN, as win = x/n. We then pooled and aggregated the responses of each filter by selecting the maximum values (max pooling) across different locations and over different spatial scales of the kernel map. This provides invariant image representations that are more robust to local transformations. Figs. 2 and 4 show a SPP layer combined with our CSKN.
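As a rough illustration of the pooling step, the sketch below max-pools p square feature maps over an n × n grid for each pyramid level and concatenates the results into a p · M vector. The function name spp_max_pool, the nested-list input format, and the integer bin edges (which reduce to win = x/n when n divides x) are assumptions for the example; it also assumes x is at least as large as the finest pyramid level.

```python
def spp_max_pool(feature_maps, levels=(1, 2, 3, 6)):
    """Spatial pyramid max pooling over p square maps (hypothetical helper).

    feature_maps : list of p maps, each an x-by-x nested list of numbers
    levels       : grid sizes n of the pyramid; M = sum(n * n for n in levels)
    Returns a flat list of length p * M.
    """
    pooled = []
    for fmap in feature_maps:
        x = len(fmap)
        for n in levels:
            # Bin boundaries along one axis; each bin spans about x / n cells.
            edges = [(i * x) // n for i in range(n + 1)]
            for r in range(n):
                for c in range(n):
                    # Keep the strongest response inside this spatial bin.
                    pooled.append(max(
                        fmap[i][j]
                        for i in range(edges[r], edges[r + 1])
                        for j in range(edges[c], edges[c + 1])
                    ))
    return pooled
```

With the default levels the number of bins per map is 1 + 4 + 9 + 36 = 50, so two 6 × 6 maps give a vector of length 100.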

3. Experimental

3.1. Evaluation

To evaluate our framework we compared it to other unsupervised and supervised learning methods:

(a) Conventional unsupervised feature learning methods: SIFT + BoVW, Independent Component Analysis (ICA), and sparse coding (Lee et al., 2006). We implemented the SIFT descriptor together with the BoVW model (SIFT + BoVW). We used a patch size of 16 × 16 pixels with a spacing of 8 pixels in the extraction of SIFT descriptors. We used the standard codebook size of 1000 (Avni et al., 2011). The number of filters (i.e., weights) for the first layer of the ICA and of sparse coding was set to 1600 in both cases (Romero et al., 2015).

(b) State-of-the-art unsupervised learning methods: SSAE (Hinton et al., 2006; Shin et al., 2013) and CKN (Mairal et al., 2014). The number of filters for the first layer of the SSAE was set to 1600, consistent with the conventional baselines above; we set the number of filters for the second layer to 1024. As the CKN is also a kernel learning method, for the purpose of comparison we trained the CKN using the same parameters as our proposed CSKN (see Section 3.2).

(c) State-of-the-art supervised pre-trained CNNs (with natural images). We used AlexNet (Krizhevsky et al., 2012), VGG (Simonyan and Zisserman, 2014), GoogLeNet (Szegedy et al., 2015), and ResNet (He et al., 2016), which have achieved high rankings in object recognition and localisation in the ImageNet Challenge. For all pre-trained CNN models, the final fully-connected layers were used as the feature extractors.

(d) A recent representative sparsity-based pre-trained CNN (with natural images). We used a model in which 65% of the dense parameters of ResNet were removed (Liu et al., 2018). The accuracy of this pruned model was the closest to that of the original pre-trained ResNet. The final fully-connected layer was used as the feature extractor.

(e) State-of-the-art supervised fine-tuned CNNs. We used the same models as in the pre-trained baselines above: AlexNet (Krizhevsky et al., 2012), VGG (Simonyan and Zisserman, 2014), GoogLeNet (Szegedy et al., 2015), and ResNet (He et al., 2016). For medical image analysis, these fine-tuned CNNs have been shown to perform as well as fully trained CNNs, or even to outperform them when training data are limited (Kumar et al., 2017; Tajbakhsh et al., 2016; Shin et al., 2016). All of the models were trained for 60 epochs with the IRMA dataset. We used a batch size of 128 and an initial learning rate of 10^-4. We used learning rate annealing, decaying the rate by a factor of 10 when the error plateaued.

3.2. Implementation details

CSKNs have four parameters that need to be determined for each layer: the size of the sub-patch, the coefficients α and β, and the pooling factor or filter size p. The parameters α and β of our Gaussian kernel are determined automatically for each layer: β was set to the pooling factor divided by √2; α was set to the 0.1 quantile of the distribution of pair-wise distances between sub-patches, consistent with the work reported by Mairal et al. (2014). In our experimentation, the final results were insensitive to the use of smaller quantiles such as 0.01 and 0.001. This is also consistent with other research studies, e.g., Paulin et al. (2015).
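The automatic choice of α and β described above can be illustrated with a short sketch. The function name kernel_parameters, the representation of sub-patches as flat tuples, and the simple index-based quantile are assumptions made for this example; a real pipeline would most likely estimate the quantile from a random subset of patch pairs.

```python
import math
from itertools import combinations


def kernel_parameters(sub_patches, pooling_factor, quantile=0.1):
    """Return (alpha, beta) for one layer (hypothetical helper).

    alpha : the given quantile of pair-wise Euclidean distances
    beta  : the pooling factor divided by sqrt(2)
    """
    # All pair-wise distances between sub-patches, sorted ascending.
    dists = sorted(math.dist(a, b) for a, b in combinations(sub_patches, 2))
    # Simple index-based quantile (assumption: no interpolation).
    idx = min(len(dists) - 1, int(quantile * len(dists)))
    alpha = dists[idx]
    beta = pooling_factor / math.sqrt(2)
    return alpha, beta
```

Passing quantile=0.01 or quantile=0.001 reproduces the smaller settings mentioned in the text.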

X-ray image retrieval (IRMA dataset): We adopted a two-layer architecture, which has been shown to perform better on grey-scale images (Paulin et al., 2015). We used the gradient map (defined in Section 2.2.3) as the input of the initial layer of our architecture; the gradient map as input has been shown to perform better than raw patches (Mairal et al., 2014). Our parameter selection process searched within a restricted space to find the optimal values of the parameters. We used values in the range 2–8 for sub-patch sizes and pooling factors of 100, 256, 512, 800 and 1024. For the SPP layer, we used a 4-level spatial pyramid (1 × 1, 2 × 2, 3 × 3, 6 × 6) with 50 spatial bins in all of our experiments.

Medical image modality classification (ImageCLEF dataset) and skin disease classification (ISIC dataset): We used the same settings as for X-ray image retrieval (described above) but used raw patches instead of gradient maps as the input, because raw patches performed better when working with RGB images. We then empirically chose the remaining parameters, as shown in Table 1. For all learned features, we used the multi-class linear SVM setup introduced by Yang et al. (2009), who used a differentiable quadratic hinge loss so that training could be done with simple gradient-based optimisation methods. We used LBFGS with a learning rate of 0.1 and a regularisation parameter of 1, consistent with the parameters specified by Yang et al. (2009).
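To show why a quadratic hinge loss suits gradient-based optimisers such as LBFGS, the sketch below evaluates a one-vs-rest objective of the general form ||w||² + C · Σ max(0, 1 − y · w·x)² together with its gradient. The function name squared_hinge_objective, the exact scaling of the two terms, and the absence of a bias term are assumptions for the example; the optimiser itself is not reproduced here.

```python
def squared_hinge_objective(w, xs, ys, c=1.0):
    """Loss and gradient of a linear SVM with quadratic hinge (hypothetical helper).

    w  : weight vector for one class (list of floats)
    xs : feature vectors
    ys : labels in {+1, -1} (one-vs-rest encoding)
    c  : weight of the data term
    """
    loss = sum(wi * wi for wi in w)          # regularisation term ||w||^2
    grad = [2.0 * wi for wi in w]
    for x, y in zip(xs, ys):
        margin = 1.0 - y * sum(wi * xi for wi, xi in zip(w, x))
        if margin > 0:                       # only margin violations contribute
            loss += c * margin * margin
            for d, xi in enumerate(x):
                grad[d] -= 2.0 * c * margin * y * xi
    return loss, grad
```

Because the hinge is squared, the gradient is continuous where the margin crosses zero, which is what makes quasi-Newton methods applicable.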

3.3. Computation

All the neural networks (CSKN, SSAE, CKN, and the fine-tuned CNNs) were trained with a GeForce GTX 1080 Ti GPU (11 GB memory). Training our CSKN took 8 h with this GPU on a machine with an Intel Core i7-6800K 3.40 GHz (6 cores) processor.

Table 1

For each layer, the sub-patch size, sub-sampling factor, and pooling factor are shown. For the initial gradient map, the value 16 indicates the number of orientations.

Table 2

Average image retrieval precision estimates (%) at Q = 1, 5, 10, and 30 (based on the IRMA dataset). The best and second-best precisions are in bold and italics, respectively.

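3.4.

We conducted medical image retrieval experiments on the IRMA dataset (Avni et al., 2011), and classification experiments on the ImageCLEF dataset (Villegas et al., 2016) and the ISIC dataset (Codella et al., 2018). For the medical image retrieval experiments, we used the ground truth annotations (i.e., IRMA codes) as the measure of relevance for similarity. Each test image was then used as a query image, and the training images were ranked by their Euclidean distance to the query image. For quantitative comparison, we used the precision estimates at Q = 1, 5, 10, and 30, defined as follows:

\[
\mathrm{Precision@}Q = \frac{\#\,\text{relevant images in the top } Q \text{ images retrieved}}{\#\,\text{images of } Q \text{ retrieved images}}
\]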
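A minimal sketch of the Precision@Q measure is given below. It ranks the training images by Euclidean distance to the query and counts how many of the top Q share the query's label. The function name precision_at_q and the treatment of relevance as exact label equality are assumptions for the example; the paper only states that IRMA codes were used to judge relevance.

```python
import math


def precision_at_q(query_feat, query_label, train_feats, train_labels, q):
    """Fraction of relevant images among the top-q retrieved (hypothetical helper)."""
    # Rank training images by Euclidean distance to the query, nearest first.
    order = sorted(range(len(train_feats)),
                   key=lambda i: math.dist(query_feat, train_feats[i]))
    top = order[:q]
    # Assumption: an image is relevant when its label equals the query label.
    relevant = sum(train_labels[i] == query_label for i in top)
    return relevant / len(top)
```

Averaging this value over all test queries gives the kind of estimate that is reported per value of Q.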

3.5. Medical image modality classification

For the classification experiments, we used the top 1 accuracy (the correctness of the predicted label), which is the standard performance measure adopted in recent CNN studies for the classification of medical image modalities (Kumar et al., 2017). For the supervised CNN models on the ImageCLEF dataset, we used the results reported in their respective papers.

3.6. Skin diseases classification

We used the area under the curve (AUC) of the receiver operating characteristic (ROC) curve, which was the main evaluation metric in the ISIC 2017 competition (Codella et al., 2018). For the supervised CNN models on the ISIC dataset, we used the results reported in their respective papers.
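For readers unfamiliar with the metric, the AUC of a binary classifier can be computed directly from its scores as the probability that a randomly chosen positive case is scored above a randomly chosen negative case. The sketch below uses this pair-counting form; the function name roc_auc and the 0/1 label encoding are assumptions for the example, and the quadratic loop is meant for illustration rather than for large datasets.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve by pair counting (hypothetical helper).

    scores : classifier scores, higher means more likely positive
    labels : 1 for positive cases, 0 for negative cases
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # A correctly ordered pair counts 1, a tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A mean AUC would then be the plain average of this value over the binary tasks being evaluated.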

4. Results

The results of the image retrieval experiments are shown in Table 2. We show sample results of the query and retrieval of varying structures in Fig. 5. The query images are the shoulder of the scapulo-humeral joint (top row), the shoulder of the acromio-clavicular joint (middle row), and the forearm (bottom row), with artifacts including plates, screws and wires. The retrieved images are ranked in order of similarity from left to right (top 1–3). Our framework had greater accuracy than the other unsupervised feature learning algorithms as well as the pre-trained CNN models. Furthermore, it outperformed all the fine-tuned CNNs at top 1, achieving a top 1 precision of 52.97%. The fine-tuned GoogLeNet method achieved the best precision when considering the top 5, 10, and 30 retrieved images.

The results of the image modality classification experiments are shown in Table 3. We compared our approach with several conventional unsupervised feature learning methods as well as the supervised image-based methods presented in the competition held in 2016. Our CSKN had greater accuracy than the other unsupervised approaches, achieving a top 1 accuracy of 70.99%. The second best unsupervised method was SSAE, with an accuracy of 65.17%. The best performing supervised method was the fine-tuned ResNet-152, with an accuracy of 85.38% (Koitka and Friedrich, 2016).

Table 4 shows the results of the skin diseases classification experiments. Consistent with the modality classification experiments, we compared our approach with other unsupervised feature learning methods as well as the supervised methods presented in the competition held in 2017. Our framework had greater accuracy than the other unsupervised approaches, with over 10% improvement over the second best unsupervised method (SSAE), achieving a mean AUC of 76.11%. It also had a higher accuracy than the pre-trained ResNet (72.35%) and the fine-tuned Inception V3 (75.00%). The fine-tuned ResNet method had the best mean AUC of 91.10%.

Fig. 6 shows how our initialisation with sparsity-based pre-training improves the feature representation of medical images compared to other standard initialisation methods, including random initialisation and the K-means algorithm. We also show the improvement made by the SPP. The visualisation of the learned weights from the first layer of our CSKN is shown in Fig. 7. This shows that our CSKN learnt common structures such as lines and edges and also identified spatial patterns and sparse regions in the medical images. We used 400,000 image patches of size 12 × 12 and learned 256 filters (Olshausen and Field, 1996). The results from deeper networks are shown in Fig. 8. Our experiments using 3 and 4 layer CSKN architectures did not improve performance.

Fig. 5. Sample results of query and retrieval of X-ray images using CSKN.

5. Discussion

Our results show that our CSKN outperformed other conventional unsupervised approaches; it had comparable accuracy to state-of-the-art supervised CNNs in X-ray image retrieval and comparable accuracy to other supervised counterparts in medical image modality classification and the classification of some skin conditions. Further, we showed that sparsity-based pre-training improves the feature representation of medical images, and we attribute this to our robust pre-training scheme, which provided good initialisation points for subsequent convolutional kernel learning. It acts as a form of regularisation that restricts parameters to certain spaces that are more discriminative for medical image data (Erhan et al., 2010; Mishkin and Matas, 2015). The SPP framework also improves feature representation in medical images (see Fig. 6) through a multi-level spatial feature pooling technique that effectively characterises the local geometry information in the image data. To the best of our knowledge, this is the first work that couples unsupervised pre-training with an unsupervised learning framework, in contrast to the conventional approach, which combines unsupervised pre-training with subsequent supervised learning (Erhan et al., 2010; Salimans and Kingma, 2016).

Fig. 6. Top 1 average precision, accuracy and mean AUC of CKN with random and K-means initialisation, and of our improved CSKN with SPP.

Fig. 7. The visualisation of the weights learned by the first layer of the CSKN using the ImageCLEF dataset (grey-scale).

For X-ray image retrieval, our unsupervised CSKN achieved the highest accuracy (52.97%) when Q = 1 (see Table 2), suggesting that the CSKN was able to learn and extract data-specific features. The features obtained with conventional unsupervised and hand-crafted methods, such as SIFT coupled with the BoVW model, sparse coding, and ICA, were not as robust as those of the SSAE. The accuracy of the pre-trained CNNs was lower than that of our method, as these approaches extracted features that were not tuned to a particular dataset or application and, as such, have limited capacity to extract the most meaningful or discriminative features. Deeper pre-trained CNNs had higher accuracy (e.g., from VGG-16 to ResNet-152), and the fine-tuned GoogLeNet had the highest accuracy in the top 5, 10, and 30. We attribute this to its network architecture, which exploits the local sparse structure of a convolutional network (Szegedy et al., 2015). Our method was designed to learn class-specific image features for better discrimination in an unsupervised fashion, but this means it can be sensitive to subtle inter-class variations, which is why its accuracy drops at a higher rate than that of its supervised counterparts as more subtly similar images are retrieved. For medical image retrieval applications, the five most similar images (i.e., top 5) for a query are commonly used for comparative analysis (Quellec et al., 2010). Our CSKN achieved a competitive top 5 accuracy (44.18%), which was the second best after the fine-tuned GoogLeNet (44.61%).

In medical image modality classification, our unsupervised CSKN outperformed all other unsupervised approaches and achieved comparable accuracy to the supervised CNNs that were part of the ImageCLEF 2016 challenge. Similar to the X-ray image results, the image features extracted using the conventional unsupervised approaches, sparse coding and ICA, were not as robust as those of the SSAE. Unlike sparse coding and ICA, SSAE learned image features in a hierarchical manner and hence was the closest method to our approach. The top performing methods were all based on well-established supervised CNNs, including AlexNet (Kumar et al., 2016), VGG (Semedo and Magalhães, 2016), GoogLeNet (Koitka and Friedrich, 2016), and ResNet (Koitka and Friedrich, 2016). These CNNs were trained from scratch or fine-tuned with medical images to derive high-level data-specific features. As expected, deeper CNNs also had higher accuracy than shallower CNNs (see Table 3). Our unsupervised CSKN (accuracy of 70.99%) performed better than the supervised VGG-like CNNs (65.31%) (Semedo and Magalhães, 2016), an improvement of over 5% in modality classification. While most of the referenced methods used the same training data, the method reported by Koitka and Friedrich (2016), which had the best performance in the competition, added extra data from additional sources, which contributed to its overall accuracy.

The ImageCLEF dataset also contains different generic biomedical illustrations such as gene sequences or chemical structures, and so, in comparison to the X-ray IRMA dataset, there were more diverse and complex variations in image characteristics. As a consequence, the performance of our CSKN relative to the supervised approaches was poorer on the ImageCLEF dataset than on the IRMA dataset. Nevertheless, our method was able to derive discriminative medical image features from a variety of image modalities without reliance on labels, and its accuracy was better than that of supervised VGG-like CNNs (Semedo and Magalhães, 2016).

In the classification of skin lesions, our unsupervised CSKN outperformed all other unsupervised feature learning methods and achieved a higher accuracy (76.11%) than the pre-trained ResNet (72.35%) and the fine-tuned Inception V3 (75.00%). Consistent with the results from the other datasets, SSAE was the next best approach among the unsupervised methods. The top performing approaches reported in the competition also used fine-tuned CNNs, e.g., AlexNet, VGG, Inception V3, and ResNet. The best performing methods (Matsunaga et al., 2017; Bi et al., 2017), however, added extra data from additional sources. This indicates that fine-tuned CNNs are still dependent on the availability of labelled data. Our CSKN, on the other hand, was able to learn meaningful medical image features in a completely unsupervised fashion.

5.1. Limitations and future work

Although our approach learned medical image feature representations without supervision and labels, some of the parameters for each layer, including the sub-patch size, the sub-sampling factor, and the pooling factor (i.e., filter size), must be derived empirically (see Section 3.2). Generally, smaller sub-sampling factors and larger pooling factors led to better performance at the cost of increased computational complexity. Nevertheless, our results show that sparsity-based pre-training and SPP pooling consistently improved the overall feature representation even when different parameters were used. We used an integral form of the Gaussian Radial Basis Function (RBF) kernel to approximate the kernel map (the image feature representation in an RKHS). A multi-kernel approach (Song et al., 2018) describing diverse properties of medical images could potentially provide more meaningful feature representations, and we will explore such approaches in the future.

We suggest that our unsupervised initialisation will benefit supervised learning approaches when there are limited labelled training data. When the CSKN is used to initialise a CNN for supervised fine-tuning, it could potentially enable the derivation of semantically more meaningful representations of the image data than traditional CNN fine-tuning approaches that are initialised with natural images. The investigation of the impact on fine-tuning is a substantial research study in itself, and so we will pursue this in future work. Since our CSKN is completely unsupervised, we suggest that it can be considered an important first step towards accessing the large volume of unannotated data in medical imaging repositories. We note that, compared to supervised CNNs, our CSKN requires learning fewer parameters across fewer layers (two layers in this paper) and can therefore be coupled efficiently with subsequent supervised learning approaches without a large computational cost.

6. Conclusion

We have proposed a new unsupervised sparsity-based feature learning framework for the characterisation of medical image data. Our layer-wise pre-training, using convolutional sparse features, improved the learning outcomes and feature representations in image retrieval and classification. We compared our approach to other unsupervised and supervised methods on three large public datasets and showed that it was competitive with state-of-the-art supervised CNNs. Our approach demonstrated the feasibility of using large collections of unlabelled medical data to characterise medical image features and offers the opportunity to access the large volume of unannotated data that are available in medical imaging repositories.