Dynamic Routing Between Capsules

1 Introduction

Human vision ignores irrelevant details by using a carefully determined sequence of fixation points to ensure that only a tiny fraction of the optic array is ever processed at the highest resolution. Introspection is a poor guide to understanding how much of our knowledge of a scene comes from the sequence of fixations and how much we glean from a single fixation, but in this paper we will assume that a single fixation gives us much more than just a single identified object and its properties. We assume that our multi-layer visual system creates a parse tree-like structure on each fixation, and we ignore the issue of how these single-fixation parse trees are coordinated over multiple fixations.

Parse trees are generally constructed on the fly by dynamically allocating memory. Following Hinton et al. [2000], however, we shall assume that, for a single fixation, a parse tree is carved out of a fixed multilayer neural network like a sculpture is carved from a rock. Each layer will be divided into many small groups of neurons called “capsules” (Hinton et al. [2011]) and each node in the parse tree will correspond to an active capsule. Using an iterative routing process, each active capsule will choose a capsule in the layer above to be its parent in the tree. For the higher levels of a visual system, this iterative process will be solving the problem of assigning parts to wholes.

The activities of the neurons within an active capsule represent the various properties of a particular entity that is present in the image. These properties can include many different types of instantiation parameter such as pose (position, size, orientation), deformation, velocity, albedo, hue, texture, etc. One very special property is the existence of the instantiated entity in the image. An obvious way to represent existence is by using a separate logistic unit whose output is the probability that the entity exists. In this paper we explore an interesting alternative, which is to use the overall length of the vector of instantiation parameters to represent the existence of the entity and to force the orientation of the vector to represent the properties of the entity^[1]. We ensure that the length of the vector output of a capsule cannot exceed 1 by applying a non-linearity that leaves the orientation of the vector unchanged but scales down its magnitude.

The fact that the output of a capsule is a vector makes it possible to use a powerful dynamic routing mechanism to ensure that the output of the capsule gets sent to an appropriate parent in the layer above. Initially, the output is routed to all possible parents but is scaled down by coupling coefficients that sum to 1. For each possible parent, the capsule computes a “prediction vector” by multiplying its own output by a weight matrix. If this prediction vector has a large scalar product with the output of a possible parent, there is top-down feedback which increases the coupling coefficient for that parent and decreases it for other parents. This increases the contribution that the capsule makes to that parent, thus further increasing the scalar product of the capsule’s prediction with the parent’s output. This type of “routing-by-agreement” should be far more effective than the very primitive form of routing implemented by max-pooling, which allows neurons in one layer to ignore all but the most active feature detector in a local pool in the layer below. We demonstrate that our dynamic routing mechanism is an effective way to implement the “explaining away” that is needed for segmenting highly overlapping objects.

Convolutional neural networks (CNNs) use translated replicas of learned feature detectors. This allows them to translate knowledge about good weight values acquired at one position in an image to other positions. This has proven extremely helpful in image interpretation. Even though we are replacing the scalar-output feature detectors of CNNs with vector-output capsules and max-pooling with routing-by-agreement, we would still like to replicate learned knowledge across space. To achieve this, we make all but the last layer of capsules be convolutional. As with CNNs, we make higher-level capsules cover larger regions of the image. Unlike max-pooling however, we do not throw away information about the precise position of the entity within the region. For low level capsules, location information is “place-coded” by which capsule is active. As we ascend the hierarchy, more and more of the positional information is “rate-coded” in the real-valued components of the output vector of a capsule. This shift from place-coding to rate-coding combined with the fact that higher-level capsules represent more complex entities with more degrees of freedom suggests that the dimensionality of capsules should increase as we ascend the hierarchy.

2 How the vector inputs and outputs of a capsule are computed

There are many possible ways to implement the general idea of capsules. The aim of this paper is not to explore this whole space but simply to show that one fairly straightforward implementation works well and that dynamic routing helps.

We want the length of the output vector of a capsule to represent the probability that the entity represented by the capsule is present in the current input. We therefore use a non-linear “squashing” function to ensure that short vectors get shrunk to almost zero length and long vectors get shrunk to a length slightly below 1. We leave it to discriminative learning to make good use of this non-linearity.

$$ v_j = \frac{\|s_j\|^2}{1 + \|s_j\|^2} \, \frac{s_j}{\|s_j\|} \tag{1} $$

where v_j is the vector output of capsule j and s_j is its total input. For all but the first layer of capsules, the total input to a capsule s_j is a weighted sum over all “prediction vectors” û_{j|i} from the capsules in the layer below, each produced by multiplying the output u_i of a capsule in the layer below by a weight matrix W_ij.

$$ s_j = \sum_i c_{ij} \, \hat{u}_{j|i}, \qquad \hat{u}_{j|i} = W_{ij} u_i \tag{2} $$
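To make the squashing non-linearity and the weighted sum of prediction vectors concrete, here is a minimal sketch in plain Python. The function names (`squash`, `predict`, `total_input`, `length`) and the list-of-rows layout for W_ij are illustrative assumptions, not the authors' implementation; the zero-vector guard is also an added assumption to avoid dividing by zero.

```python
import math

def length(v):
    """Euclidean length of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def squash(s):
    """Squashing non-linearity: keep the direction of s, map its length into [0, 1)."""
    sq_norm = sum(x * x for x in s)
    if sq_norm == 0.0:
        return [0.0 for _ in s]  # assumption: a zero input stays zero (avoids 0/0)
    scale = sq_norm / (1.0 + sq_norm) / math.sqrt(sq_norm)
    return [scale * x for x in s]

def predict(W_ij, u_i):
    """Prediction vector u_hat_{j|i} = W_ij u_i, with W_ij stored as a list of rows."""
    return [sum(w * u for w, u in zip(row, u_i)) for row in W_ij]

def total_input(c_j, u_hat_j):
    """Total input s_j = sum_i c_ij * u_hat_{j|i} for a single parent capsule j."""
    dim = len(u_hat_j[0])
    return [sum(c * u_hat[d] for c, u_hat in zip(c_j, u_hat_j)) for d in range(dim)]
```

For example, `squash([3.0, 4.0])` has length 25/26 (about 0.96), while `squash([0.03, 0.04])` has length of roughly 0.0025, which matches the description that long vectors end up slightly below 1 and short vectors shrink to almost zero.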

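where the c_ij are coupling coefficients that are determined by the iterative dynamic routing process. The coupling coefficients between capsule i and all the capsules in the layer above sum to 1 and are determined by a “routing softmax” whose initial logits b_ij are the log prior probabilities that capsule i should be coupled to capsule j.

$$ c_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})} \tag{3} $$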
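The routing softmax can be sketched as follows. The function name `routing_softmax` is an assumption, and subtracting the maximum logit is a standard numerical-stability step that is not mentioned in the text; it does not change the result.

```python
import math

def routing_softmax(b_i):
    """Coupling coefficients c_ij of one capsule i over all parent capsules j."""
    m = max(b_i)  # assumption: shift by the max logit for numerical stability
    exps = [math.exp(b - m) for b in b_i]
    total = sum(exps)
    return [e / total for e in exps]

# Equal logits give a uniform distribution over the parents.
uniform = routing_softmax([0.0] * 10)
```

With ten parents and equal logits, every coefficient in `uniform` is 0.1, so the capsule's output is spread evenly until routing changes the logits.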

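The log priors can be learned at the same time as all the other weights. They depend on the location and type of the two capsules but not on the current input image^[2]. The initial coupling coefficients are then iteratively refined by measuring the agreement between the current output v_j of each capsule j in the layer above and the prediction û_{j|i} made by capsule i.

The agreement is simply the scalar product a_ij = v_j · û_{j|i}. This agreement is treated as if it were a log likelihood and is added to the initial logit b_ij before computing the new values for all the coupling coefficients linking capsule i to higher-level capsules.

In convolutional capsule layers, each capsule outputs a local grid of vectors to each type of capsule in the layer above, using different transformation matrices for each member of the grid as well as for each type of capsule.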
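The iterative refinement described in this section can be sketched as a short loop for a fully connected pair of capsule layers. This is a plain-Python illustration under stated assumptions, not the authors' code: the name `route`, the nested-list layout `u_hat[i][j]`, the zero initial logits, and the default of 3 iterations are choices made here for the example.

```python
import math

def squash(s):
    """Keep the direction of s and map its length into [0, 1)."""
    sq = sum(x * x for x in s)
    if sq == 0.0:
        return [0.0] * len(s)  # assumption: zero input stays zero
    scale = sq / (1.0 + sq) / math.sqrt(sq)
    return [scale * x for x in s]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(b - m) for b in logits]
    z = sum(exps)
    return [e / z for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def route(u_hat, num_iterations=3):
    """u_hat[i][j] is the prediction vector of child capsule i for parent j.

    Returns (v, c): the parent outputs v[j] and the coupling coefficients
    c[i][j] that were used to compute them in the last iteration.
    """
    n_child, n_parent = len(u_hat), len(u_hat[0])
    dim = len(u_hat[0][0])
    b = [[0.0] * n_parent for _ in range(n_child)]  # routing logits b_ij
    for _ in range(num_iterations):
        c = [softmax(b_i) for b_i in b]  # routing softmax per child
        v = []
        for j in range(n_parent):
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_child))
                   for d in range(dim)]  # weighted sum of predictions
            v.append(squash(s_j))
        for i in range(n_child):
            for j in range(n_parent):
                b[i][j] += dot(v[j], u_hat[i][j])  # add agreement a_ij
    return v, c

# Two children agree on parent 0 and contradict each other on parent 1.
u_hat = [[[1.0, 0.0], [1.0, 0.0]],
         [[1.0, 0.0], [-1.0, 0.0]]]
v, c = route(u_hat)
```

In this toy case the predictions for parent 1 cancel, so its output stays at zero length, while both children shift their coupling toward parent 0 (each `c[i][0]` ends up above 0.5).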

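3 Margin loss for digit existence

We use the length of the instantiation vector to represent the probability that a capsule’s entity exists. We would like the top-level capsule for digit class k to have a long instantiation vector if and only if that digit is present in the image. To allow for multiple digits, we use a separate margin loss, L_k, for each digit capsule k:

$$ L_k = T_k \max(0, m^+ - \|v_k\|)^2 + \lambda (1 - T_k) \max(0, \|v_k\| - m^-)^2 \tag{4} $$

where T_k = 1 if and only if a digit of class k is present^[3], and m^+ = 0.9 and m^- = 0.1. The λ down-weighting of the loss for absent digit classes stops the initial learning from shrinking the lengths of the activity vectors of all the digit capsules. We use λ = 0.5. The total loss is simply the sum of the losses of all digit capsules.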
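As a worked illustration of the margin loss, the sketch below sums the per-class losses for one image. The function name `margin_loss` and its argument layout (a list of capsule lengths and a list of 0/1 targets) are assumptions made for the example; the constants are the ones given in the text.

```python
def margin_loss(lengths, targets, m_plus=0.9, m_minus=0.1, lam=0.5):
    """Sum over digit classes of the per-class margin loss L_k.

    lengths[k] is the length of the activity vector of digit capsule k,
    targets[k] is 1 if a digit of class k is present and 0 otherwise.
    """
    total = 0.0
    for v_len, t in zip(lengths, targets):
        present_term = max(0.0, m_plus - v_len) ** 2   # penalise short vectors for present digits
        absent_term = max(0.0, v_len - m_minus) ** 2   # penalise long vectors for absent digits
        total += t * present_term + lam * (1 - t) * absent_term
    return total

# Class 0 is present and confidently detected; class 2 is absent but has length 0.3.
example = margin_loss([0.95, 0.05, 0.3], [1, 0, 0])
```

Here only class 2 contributes: 0.5 × (0.3 − 0.1)² = 0.02. A present digit whose capsule is longer than 0.9, or an absent digit whose capsule is shorter than 0.1, adds nothing to the loss.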

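4 CapsNet architecture

A simple CapsNet architecture is shown in Fig. 1. The architecture is shallow, with only two convolutional layers and one fully connected layer. Conv1 has 256 convolution kernels of size 9×9 with a stride of 1 and ReLU activation. This layer converts pixel intensities to the activities of local feature detectors, which are then used as inputs to the primary capsules.

The primary capsules are the lowest level of multi-dimensional entities and, from an inverse graphics perspective, activating the primary capsules corresponds to inverting the rendering process. This is a very different type of computation from piecing instantiated parts together to make familiar wholes, which is what capsules are designed to be good at.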

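Figure 1: A simple CapsNet with 3 layers. This model gives results comparable to deep convolutional networks (such as Chang and Chen [2015]). The length of the activity vector of each capsule in the DigitCaps layer indicates the presence of an instance of each class and is used to calculate the classification loss. W_ij is a weight matrix between each u_i, i ∈ (1, 32×6×6), in PrimaryCapsules and v_j, j ∈ (1, 10).

Figure 2: Decoder structure to reconstruct a digit from the DigitCaps layer representation. The Euclidean distance between the image and the output of the sigmoid layer is minimized during training. We use the true label as the reconstruction target during training.

The second layer (PrimaryCapsules) is a convolutional capsule layer with 32 channels of convolutional 8D capsules (i.e. each primary capsule contains 8 convolutional units with a 9×9 kernel and a stride of 2). Each primary capsule output sees the outputs of all 256×81 Conv1 units whose receptive fields overlap with the location of the center of the capsule. In total, PrimaryCapsules has [32×6×6] capsule outputs (each output is an 8D vector), and the capsules in the [6×6] grid share their weights with each other. PrimaryCapsules can be seen as a convolutional layer with Eq. 1 as its block non-linearity. The final layer (DigitCaps) has one 16D capsule per digit class, and each of these capsules receives input from all the capsules in the layer below.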
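The 6×6 grid and the 32×6×6 capsule count can be checked with a little arithmetic. The sketch below assumes a 28×28 input (the MNIST image size used in the experiments) and convolutions without padding; the text does not state the padding mode, so that part is an assumption that happens to reproduce the 6×6 grid. The helper name `conv_output_size` is hypothetical.

```python
def conv_output_size(input_size, kernel_size, stride):
    """Spatial output size of a convolution without padding (assumed here)."""
    return (input_size - kernel_size) // stride + 1

conv1_size = conv_output_size(28, 9, 1)             # 20
primary_size = conv_output_size(conv1_size, 9, 2)   # 6
num_primary_capsules = 32 * primary_size * primary_size  # 1152
inputs_per_primary_unit = 256 * 9 * 9               # 20736 Conv1 outputs seen by each unit
```

So the DigitCaps layer receives 1152 eight-dimensional vectors, and each of the 10 digit capsules has one weight matrix W_ij per primary capsule.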

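We have routing only between two consecutive capsule layers (e.g. PrimaryCapsules and DigitCaps). Since the Conv1 output is 1D, there is no orientation in its space to agree on. Therefore, no routing is used between Conv1 and PrimaryCapsules. All the routing logits (b_ij) are initialized to zero. Therefore, initially a capsule output (u_i) is sent to all parent capsules (v_0 … v_9) with equal probability (c_ij).

Our implementation is in TensorFlow (Abadi et al. [2016]), and we use the Adam optimizer (Kingma and Ba [2014]) with its TensorFlow default parameters, including the exponentially decaying learning rate, to minimize the sum of the margin losses in Eq. 4.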

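4.1 Reconstruction as a regularization method

We use an additional reconstruction loss to encourage the digit capsules to encode the instantiation parameters of the input digit. During training, we mask out all but the activity vector of the correct digit capsule. Then we use this activity vector to reconstruct the input image. The output of the digit capsule is fed into a decoder consisting of 3 fully connected layers that model the pixel intensities, as described in Fig. 2. We minimize the sum of squared differences between the outputs of the logistic units and the pixel intensities. We scale down this reconstruction loss by 0.0005 so that it does not dominate the margin loss during training. As illustrated in Fig. 3, the reconstructions from the 16D output of the CapsNet are robust while keeping only important details.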
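The masking step and the scaled reconstruction loss can be sketched as below. The decoder itself is omitted. The function names, and the choice to flatten the masked capsules into a single vector before decoding, are assumptions for illustration; only the masking rule, the sum of squared differences, and the 0.0005 scale come from the text.

```python
def mask_digit_caps(digit_caps, label):
    """Zero every digit capsule except the one for the true label, then flatten.

    digit_caps is a list of activity vectors, one per digit class.
    Flattening into one decoder input vector is an assumption of this sketch.
    """
    flat = []
    for k, v in enumerate(digit_caps):
        flat.extend(v if k == label else [0.0] * len(v))
    return flat

def reconstruction_loss(reconstruction, image, scale=0.0005):
    """Scaled sum of squared differences between decoder output and pixel intensities."""
    sse = sum((r - x) ** 2 for r, x in zip(reconstruction, image))
    return scale * sse

def total_loss(margin, reconstruction, image):
    """Margin loss plus the down-weighted reconstruction loss."""
    return margin + reconstruction_loss(reconstruction, image)
```

For a 784-pixel image the unscaled sum of squared errors can be in the tens, so the 0.0005 factor keeps the reconstruction term small relative to the margin loss.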

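Figure 3: Sample MNIST test reconstructions of a CapsNet with 3 routing iterations. (l, p, r) represents the label, the prediction and the reconstruction target respectively. The two rightmost columns show two reconstructions of a failure example and explain how the model confuses a 5 and a 3 in this image. The other columns are from correct classifications and show that the model preserves many of the details while smoothing the noise.

Table 1: CapsNet classification test accuracy, reported as test error rates. The MNIST results are the mean and standard deviation over 3 trials.

Method	Routing	Reconstruction	MNIST (%)	MultiMNIST (%)
Baseline	-	-	0.39	8.1
CapsNet	1	No	0.34±0.032	-
CapsNet	1	Yes	0.29±0.011	7.5
CapsNet	3	No	0.35±0.036	-
CapsNet	3	Yes	0.25±0.005	5.2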

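5 Capsules on MNIST

Training is performed on 28×28 MNIST (LeCun et al. [1998]) images that have been shifted by up to 2 pixels in each direction with zero padding. No other data augmentation/deformation is used. The dataset has 60K and 10K images for training and testing respectively.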
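The shift augmentation can be sketched in a few lines. The text only says that images are shifted by up to 2 pixels in each direction with zero padding; drawing the horizontal and vertical offsets independently and uniformly from {-2, …, 2} is an assumption of this sketch, as are the function names.

```python
import random

def shift_image(image, dx, dy):
    """Shift a 2D list image by (dx, dy) pixels, filling vacated pixels with 0."""
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sy, sx = y - dy, x - dx  # source pixel that lands on (y, x)
            if 0 <= sy < h and 0 <= sx < w:
                out[y][x] = image[sy][sx]
    return out

def random_shift(image, max_shift=2, rng=random):
    """Apply a random shift of up to max_shift pixels along each axis (assumed uniform)."""
    dx = rng.randint(-max_shift, max_shift)
    dy = rng.randint(-max_shift, max_shift)
    return shift_image(image, dx, dy)
```

Positive `dx` moves the content to the right and positive `dy` moves it down; the image size stays the same, so pixels pushed past the border are dropped.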

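We test using a single model without any model averaging. Wan et al. [2013] achieve 0.21% test error with ensembling and with augmenting the data by rotation and scaling. Without these, they achieve 0.39%. We get a low test error (0.25%) on a 3-layer network, a result previously achieved only by deeper networks. Tab. 1 reports the test error rate on MNIST for different CapsNet setups and shows the importance of routing and of the reconstruction regularizer. Adding the reconstruction regularizer boosts the routing performance by enforcing the pose encoding in the capsule vector.

The baseline is a standard CNN with three convolutional layers of 256, 256 and 128 channels. Each has 5×5 kernels and a stride of 1. The last convolutional layer is followed by two fully connected layers of size 328 and 192. The last fully connected layer is connected to a 10-class softmax layer with cross-entropy loss. The baseline is also trained on 2-pixel shifted MNIST with the Adam optimizer. The baseline is designed to achieve the best performance on MNIST while keeping the computation cost as close as possible to that of CapsNet. In terms of number of parameters, the baseline has 35.4M, while CapsNet has 8.2M parameters, or 6.8M parameters without the reconstruction subnetwork.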
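The 6.8M figure for CapsNet without the reconstruction subnetwork can be roughly reproduced from the layer sizes given in the architecture section. The count below is a back-of-the-envelope sketch under assumptions that the text does not state: a single grayscale input channel, one bias per convolutional output channel, no bias on the W_ij transformation matrices, and routing logits not counted as parameters.

```python
def conv_params(kernel, in_channels, out_channels):
    """Convolution weights plus one bias per output channel (bias is an assumption)."""
    return kernel * kernel * in_channels * out_channels + out_channels

conv1 = conv_params(9, 1, 256)                # 20,992
primary_caps = conv_params(9, 256, 32 * 8)    # 5,308,672 (32 channels of 8D capsules)
digit_caps = (32 * 6 * 6) * 10 * (8 * 16)     # 1,474,560 (one 8x16 matrix per capsule pair)
total_without_decoder = conv1 + primary_caps + digit_caps  # 6,804,224
```

Under these assumptions the total is about 6.8M, consistent with the number quoted above. Most of the parameters sit in the PrimaryCapsules convolution.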

5.1 What the individual dimensions of a capsule represent

Since we are passing the encoding of only one digit …
