[CV Paper Notes]

Note: the start of each original paragraph is marked by the author (in green in the original post); for readability, each sentence is placed on its own line.

0. Abstract

The ability of learning networks to generalize can be greatly enhanced by providing constraints from the task domain.


This paper demonstrates how such constraints can be integrated into a backpropagation network through the architecture of the network.


The approach has been successfully applied to the recognition of handwritten zip code digits provided by the U.S. Postal Service.


A single network learns the entire recognition operation, going from the normalized image of the character to the final classification.



1. Introduction

Previous work performed on recognizing simple digit images (LeCun 1989) showed that good generalization on complex tasks can be obtained by designing a network architecture that contains a certain amount of a priori knowledge about the task.


The basic design principle is to reduce the number of free parameters in the network as much as possible without overly reducing its computational power.
Application of this principle increases the probability of correct generalization because it results in a specialized network architecture that has a reduced entropy (Denker et al. 1987; Patarnello and Carnevali 1987; Tishby et al. 1989; LeCun 1989) and a reduced Vapnik-Chervonenkis dimensionality (Baum and Haussler 1989).
In this paper, we apply the backpropagation algorithm (Rumelhart et al. 1986) to a real-world problem in recognizing handwritten digits taken from the U.S. Mail.
Unlike previous results reported by our group on the problem (Denker et al. 1989), the learning network is directly fed with images, rather than feature vectors, thus demonstrating the ability of backpropagation networks to deal with large amounts of low-level information.

2. Zip Codes

2.1 Data Base

The data base used to train and test the network consists of 9298 segmented numerals digitized from handwritten zip codes that appeared on U.S. mail passing through the Buffalo, NY post office.
Examples of such images are shown in Figure 1. The digits were written by many different people, using a great variety of sizes, writing styles, and instruments, with widely varying amounts of care; 7291 examples are used for training the network and 2007 are used for testing the generalization performance.

[Figure 1: examples of segmented zip code digits from the data base]


One important feature of this data base is that both the training set and the testing set contain numerous examples that are ambiguous, unclassifiable, or even misclassified.

2.2 Preprocessing

Locating the zip code on the envelope and separating each digit from its neighbours, a very hard task in itself, was performed by Postal Service contractors(Wang and Srihari 1988).

At this point, the size of a digit image varies but is typically around 40 by 60 pixels.
A linear transformation is then applied to make the image fit in a 16 by 16 pixel image.
This transformation preserves the aspect ratio of the character, and is performed after extraneous marks in the image have been removed.
Because of the linear transformation, the resulting image is not binary but has multiple gray levels, since a variable number of pixels in the original image can fall into a given pixel in the target image. [Gist: the linear transformation maps a variable-size original image onto the 16 by 16 target image, which turns the originally binary image into one with multiple gray levels.]
The gray levels of each image are scaled and translated to fall within the range -1 to 1.
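As a rough illustration of this preprocessing, here is a minimal sketch in Python/NumPy. The box-filter resampling and the centered placement of the rescaled character are assumptions for the sake of the example; the Postal Service contractors' actual pipeline is not described in this excerpt.

```python
import numpy as np

def normalize_digit(img, out_size=16):
    """Sketch of Section 2.2: aspect-preserving rescale of a segmented digit
    to a 16 by 16 image, then scale gray levels into the range [-1, 1].
    `img` is assumed to be a 2-D float array with 0 = background, 1 = ink."""
    h, w = img.shape
    scale = out_size / max(h, w)          # a single scale factor preserves the aspect ratio
    new_h = max(1, round(h * scale))
    new_w = max(1, round(w * scale))

    # Box-filter resampling: each target pixel averages the source pixels that map
    # onto it, so the result has multiple gray levels even if the input is binary.
    small = np.zeros((new_h, new_w))
    for i in range(new_h):
        for j in range(new_w):
            y0, y1 = i * h // new_h, max((i + 1) * h // new_h, i * h // new_h + 1)
            x0, x1 = j * w // new_w, max((j + 1) * w // new_w, j * w // new_w + 1)
            small[i, j] = img[y0:y1, x0:x1].mean()

    # Place the rescaled character in a 16 by 16 frame (centered here, as an assumption).
    out = np.zeros((out_size, out_size))
    top, left = (out_size - new_h) // 2, (out_size - new_w) // 2
    out[top:top + new_h, left:left + new_w] = small

    # Scale and translate gray levels from [0, 1] to [-1, 1] (background becomes -1).
    return 2.0 * out - 1.0
```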
3. Network Design

3.1 Input and Output

The remainder of the recognition is entirely performed by a multilayer network.
All of the connections in the network are adaptive, although heavily constrained, and are trained using backpropagation.
This is in contrast with earlier work (Denker et al. 1989) where the first few layers of connections were hand-chosen constants implemented on a neural-network chip.
The input of the network is a 16 by 16 normalized image.
The output is composed of 10 units (one per class) and uses place coding.
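Place coding here means one output unit per class. A minimal sketch of the corresponding target vector follows, assuming (this excerpt does not say so) a desired output of +1 on the correct unit and -1 on the others, matching the [-1, 1] range used for the input gray levels.

```python
import numpy as np

def place_coded_target(digit_class, n_classes=10):
    """Target vector for the 10 place-coded output units.
    The +1 / -1 target values are an assumption, not stated in this excerpt."""
    target = -np.ones(n_classes)
    target[digit_class] = 1.0
    return target

print(place_coded_target(3))   # -1 everywhere except +1 at index 3 (the digit "3")
```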
3.2 Feature Maps and Weight Sharing

Classical work in visual pattern recognition has demonstrated the advantage of extracting local features and combining them to form higher order features.
Such knowledge can be easily built into the network by forcing the hidden units to combine only local sources of information.
Distinctive features of an object can appear at various locations on the input image.
Therefore it seems judicious to have a set of feature detectors that can detect a particular instance of a feature anywhere on the input plane.
Since the precise location of a feature is not relevant to the classification, we can afford to lose some position information in the process.
Nevertheless, approximate position information must be preserved, to allow the next levels to detect higher order, more complex features (Fukushima 1980; Mozer 1987). [Gist: the exact location of a feature is irrelevant to the classification, so some positional information can be discarded; still, some position information must be kept so that the next level can detect higher order, more complex features.]
The detection of a particular feature at any location on the input can be easily done using the "weight sharing" technique.
Weight sharing was described in Rumelhart et al. (1989) for the so-called T-C problem and consists in having several connections (links) controlled by a single parameter (weight).
It can be interpreted as imposing equality constraints among the connection strengths.
This technique can be implemented with very little computational overhead.
Weight sharing not only greatly reduces the number of free parameters in the network but also can express information about the geometry and topology of the task.
In our case, the first hidden layer is composed of several planes that we call feature maps. [feature maps: can be understood as the result of a convolution]
All units in a plane share the same set of weights, thereby detecting the same feature at different locations.
Since the exact position of the feature is not important, the feature maps need not have as many units as the input.
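As a small illustration of the equality-constraint view, the sketch below (illustrative names, not from the paper) shows the only extra bookkeeping weight sharing requires during backpropagation: the gradient of a shared weight is the sum of the gradients of all connections it controls.

```python
import numpy as np

def shared_weight_gradients(per_connection_grads, tying):
    """Accumulate gradients for shared weights.
    per_connection_grads[c] : dLoss/d(connection c)
    tying[c]                : index of the shared weight controlling connection c"""
    grads = np.zeros(max(tying) + 1)
    for c, w in enumerate(tying):
        grads[w] += per_connection_grads[c]   # chain rule summed over tied connections
    return grads
```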

3.3 Network Architecture

The network is represented in Figure 2.


Its architecture is a direct extension of the one proposed in LeCun (1989).
The network has three hidden layers named H1, H2, and H3, respectively.
Connections entering H1 and H2 are local and are heavily constrained.

[Figure 2: network architecture]

H1 is composed of 12 groups of 64 units arranged as 12 independent 8 by 8 feature maps.

These 12 feature maps will be designated by H1.1, H1.2, ..., H1.12.
Each unit in a feature map takes input on a 5 by 5 neighbourhood on the input plane.
For units in layer H1 that are one unit apart, their receptive fields (in the input layer) are two pixels apart.
Thus, the input image is undersampled and some position information is eliminated.
A similar two-to-one undersampling occurs going from layer H1 to H2.
The motivation is that high resolution may be needed to detect the presence of a feature, while its exact position need not be determined with equally high precision. [Gist: high resolution is needed to detect that a feature is present, but its exact position need not be determined with the same precision.]



It is also known that the kinds of features that are important at one place in the image are likely to be important in other places.



Therefore, corresponding connections on each unit in a given feature map are constrained to have the same weights.



Each unit performs the same operation on corresponding parts of the image.


The function performed by a feature map can thus be interpreted as a nonlinear subsampled convolution with a 5 by 5 kernel.


Of course, units in another map (say H1.4) share another set of 25 weights.


Units do not share their biases (thresholds).


Each unit thus has 25 input lines plus a bias.


Connections extending past the boundaries of the input plane take their input from a virtual background plane whose state is equal to a constant, predetermined background level, in our case -1.
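Putting the description of H1 together, here is a minimal sketch of the forward pass of one feature map: a 5 by 5 kernel shared by all 64 units, an unshared bias per unit, receptive fields two input pixels apart, and out-of-bounds connections reading the constant background level -1. The tanh squashing function is an assumption; this excerpt does not name the nonlinearity.

```python
import numpy as np

def h1_feature_map(image, kernel, biases):
    """One H1 feature map: 16x16 input -> 8x8 output.
    image  : 16 x 16 normalized input, values in [-1, 1]
    kernel : 5 x 5 weights shared by all 64 units of this map
    biases : 8 x 8 per-unit biases (biases are NOT shared)"""
    # Virtual background plane: pad with the constant level -1 so that 5x5
    # neighborhoods centered near the border are well defined.
    padded = np.full((20, 20), -1.0)
    padded[2:18, 2:18] = image

    out = np.zeros((8, 8))
    for i in range(8):                                          # units one apart ...
        for j in range(8):
            window = padded[2 * i:2 * i + 5, 2 * j:2 * j + 5]   # ... look two pixels apart
            out[i, j] = np.tanh(np.sum(window * kernel) + biases[i, j])
    return out
```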


Thus, layer H1 comprises 768 units (8 by 8 times 12), 19968 connections (768 times 26), but only 1068 free parameters (768 biases plus 25 times 12 feature kernels) since many connections share the same weight.


Layer H2 is also composed of 12 feature maps.


Each feature map contains 16 units arranged in a 4 by 4 plane.


As before, these feature maps will be designated as H2.1, H2.2, ..., H2.12.


The connection scheme between H1 and H2 is quite similar to the one between the input and H1, but slightly more complicated because H1 has multiple two-dimensional maps.


Each unit in H2 combines local information coming from 8 of the 12 different feature maps in H1.


Its receptive field is composed of eight 5 by 5 neighborhoods centered around units that are at identical positions within each of the eight maps.


Thus, a unit in H2 has 200 inputs, 200 weights, and a bias.
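A unit of H2 can be sketched in the same style as an H1 unit. Which 8 of the 12 H1 maps feed a given H2 map is not described in the paper, so the index list below is a placeholder; the tanh nonlinearity is again an assumption.

```python
import numpy as np

def h2_unit(h1_maps, map_indices, kernels, bias, position):
    """One H2 unit: combines eight 5x5 neighborhoods taken at the same position
    in eight of the twelve 8x8 H1 maps (200 inputs, 200 shared weights, 1 bias).
    h1_maps     : list of twelve 8 x 8 arrays (outputs of H1)
    map_indices : the 8 chosen H1 maps (placeholder; selection scheme not given)
    kernels     : 8 arrays of 5 x 5 shared weights
    bias        : this unit's own (unshared) bias
    position    : (i, j) of the unit in its 4 x 4 H2 map"""
    i, j = position
    total = bias
    for k, m in enumerate(map_indices):
        padded = np.full((12, 12), -1.0)                        # out-of-bounds treated as in H1
        padded[2:10, 2:10] = h1_maps[m]
        window = padded[2 * i:2 * i + 5, 2 * j:2 * j + 5]       # two-to-one undersampling
        total += np.sum(window * kernels[k])
    return np.tanh(total)
```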


Once again, all units in a given map share the same set of weights. The eight maps in H1 on which a map in H2 takes its inputs are chosen according to a scheme that will not be described here.


Connections falling off the boundaries are treated as in H1.


To summarize, layer H2 contains 192 units (12 times 4 by 4) and there is a total of 38592 connections between layers H1 and H2 (192 units times 201 input lines). All these connections are controlled by only 2592 free parameters (12 feature maps times 200 weights plus 192 biases).


Layer H3 has 30 units, and is fully connected to H2.


The number of connections between H2 and H3 is thus 5790 (30 times 192 plus 30 biases).


The output layer has 10 units and is also fully connected to H3, adding another 310 weights.


In summary, the network has 1256 units, 64660 connections, and 9760 independent parameters.
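The counts quoted above can be reproduced with a few lines of arithmetic; note that the 1256 units include the 256 input units.

```python
# Unit, connection, and free-parameter counts from Section 3.3.
input_units = 16 * 16                              # 256
h1_units, h2_units, h3_units, out_units = 12 * 8 * 8, 12 * 4 * 4, 30, 10

h1_connections = h1_units * 26                     # 25 inputs + 1 bias per unit -> 19968
h1_parameters  = h1_units + 12 * 25                # 768 biases + 12 shared kernels -> 1068

h2_connections = h2_units * 201                    # 200 inputs + 1 bias per unit -> 38592
h2_parameters  = 12 * 200 + h2_units               # 12 shared kernels + 192 biases -> 2592

h3_connections = h3_parameters = h3_units * h2_units + h3_units        # fully connected -> 5790
out_connections = out_parameters = out_units * h3_units + out_units    # fully connected -> 310

print(input_units + h1_units + h2_units + h3_units + out_units)            # 1256 units
print(h1_connections + h2_connections + h3_connections + out_connections)  # 64660 connections
print(h1_parameters + h2_parameters + h3_parameters + out_parameters)      # 9760 free parameters
```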
