Adaptive Elastic Models for Hand-Printed Character Recognition

Geoffrey E. Hinton, Christopher K. I. Williams and Michael Revow
Department of Computer Science, University of Toronto
Toronto, Ontario, Canada M5S 1A4
Abstract

Hand-printed digits can be modeled as splines that are governed by about 8 control points. For each known digit, the control points have preferred "home" locations, and deformations of the digit are generated by moving the control points away from their home locations. Images of digits can be produced by placing Gaussian ink generators uniformly along the spline. Real images can be recognized by finding the digit model most likely to have generated the data. For each digit model we use an elastic matching algorithm to minimize an energy function that includes both the deformation energy of the digit model and the log probability that the model would generate the inked pixels in the image. The model with the lowest total energy wins. If a uniform noise process is included in the model of image generation, some of the inked pixels can be rejected as noise as a digit model is fitting a poorly segmented image. The digit models learn by modifying the home locations of the control points.
1 Introduction

Given good bottom-up segmentation and normalization, feedforward neural networks are an efficient way to recognize digits in zip codes (?). However, in some cases it is not possible to correctly segment and normalize the digits without using knowledge of their shapes, so to achieve close to human performance on images of whole zip codes it will be necessary to use models of shapes to influence the segmentation and normalization of the digits. One way of doing this is to use a large cooperative network that simultaneously segments, normalizes and recognizes all of the digits in a zip code. A first step in this direction is to take a poorly segmented image of a single digit and to explain the image properly in terms of an appropriately normalized, deformed digit model plus noise. The ability of the model to reject some parts of the image as noise is the first step towards model-driven segmentation.
2 Elastic models
One technique for recognizing a digit is to perform an elastic match with many different exemplars of each known digit-class and to pick the class of the nearest neighbor. Unfortunately this requires a large number of elastic matches, each of which is expensive. By using one elastic model to capture all the variations of a given digit we greatly reduce the number of elastic matches required. Burr (1981a, 1981b) has investigated several types of elastic model and elastic matching procedure. We describe a different kind of elastic model that is based on splines. Each elastic model contains parameters that define an ideal shape and also define a deformation energy for departures from this ideal. These parameters are initially set by hand but can be improved by learning. They are an efficient way to represent the many possible instances of a given digit.

Each digit is modelled by a deformable spline whose shape is determined by the positions of 8 control points. Every point on the spline is a weighted average of four control points, with the weighting coefficients changing smoothly as we move along the spline.[1] To generate an ideal example of a digit we put the 8 control points at their home locations for that model. To deform the digit we move the control points away from their home locations. Currently we assume that, for each model, the control points have independent, radial Gaussian distributions about their home locations. So the negative log probability of a deformation (its energy) is proportional to the sum of the squares of the departures of the control points from their home locations.

The deformation energy function only penalizes shape deformations. Translation, rotation, dilation, elongation, and shear do not change the shape of an object, so we want the deformation energy to be invariant under these affine transformations. We achieve this by giving each model its own "object-based frame". Its deformation energy is computed relative to this frame, not in image coordinates. When we fit the model to data, we repeatedly recompute the best affine transformation between the object-based frame and the image (see section 4). The repeated recomputation of the affine transform during the model fit means that the shape of the digit is influencing the normalization.

Although we will use our digit models for recognizing images, it helps to start by considering how we would use them for generating images. The generative model is an elaboration of the probabilistic interpretation of the elastic net given by Durbin, Szeliski and Yuille (1989). Given a particular spline, we space a number of "beads" uniformly along the spline. Each bead defines the center of a Gaussian ink generator. The number of beads on the spline and the variance of the ink generators can easily be changed without changing the spline itself. To generate a noisy image of a particular digit class, run the following procedure:
Pick an affine transformation from the model's intrinsic reference frame to the image frame (i.e. pick a size, position, orientation, slant and elongation for the digit).

Pick a deformation of the model (i.e. move the control points away from their home locations). The probability of picking a deformation is \frac{1}{Z} e^{-E_{deform}}.

Repeat many times: Either (with probability \pi_{noise}) add a randomly positioned noise pixel, or pick a bead at random and generate a pixel from the Gaussian distribution defined by the bead.

[1] In computing the weighting coefficients we use a cubic B-spline and we treat the first and last control points as if they were doubled.
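To make this generative procedure concrete, the following Python/NumPy sketch places beads uniformly (by arc length) along a cubic B-spline with doubled end control points, as in footnote [1], and then samples inked pixels from the bead/noise mixture. It is only an illustration of the process described above: the function names, the particular affine map, and all numerical values (sigma_deform, sigma_ink, pi_noise, the home locations) are our own assumptions, not the authors' code or parameters.

    import numpy as np

    def bspline_beads(ctrl, n_beads, n_dense=420):
        """Place beads uniformly (by arc length) along a uniform cubic B-spline.
        ctrl: (8, 2) control points. The first and last points are doubled, so
        every spline point is a blend of four control points (footnote [1])."""
        pts = np.vstack([ctrl[:1], ctrl, ctrl[-1:]])          # doubled end points
        n_seg = len(pts) - 3                                  # 7 cubic segments
        dense = []
        for s in range(n_seg):
            t = np.linspace(0.0, 1.0, n_dense // n_seg, endpoint=False)
            b = np.stack([(1 - t)**3,                         # cubic B-spline
                          3*t**3 - 6*t**2 + 4,                # blending functions
                          -3*t**3 + 3*t**2 + 3*t + 1,
                          t**3]) / 6.0
            dense.append(b.T @ pts[s:s + 4])                  # (steps, 2) curve points
        dense = np.vstack(dense)
        # resample the densely evaluated curve at equal arc-length spacing
        d = np.r_[0.0, np.cumsum(np.linalg.norm(np.diff(dense, axis=0), axis=1))]
        u = np.linspace(0.0, d[-1], n_beads)
        return np.stack([np.interp(u, d, dense[:, k]) for k in (0, 1)], axis=1)

    def generate_pixels(home, n_pixels, rng, sigma_deform=0.05,
                        sigma_ink=0.03, pi_noise=0.3, n_beads=60):
        """Sample inked pixel positions in the unit square from one digit model."""
        ctrl = home + rng.normal(0.0, sigma_deform, home.shape)  # deformation
        A = np.array([[0.9, 0.1], [0.0, 1.1]])                   # an illustrative
        t_vec = np.array([0.05, 0.02])                           # affine transform
        beads = bspline_beads(ctrl, n_beads) @ A.T + t_vec
        pix = np.empty((n_pixels, 2))
        for i in range(n_pixels):
            if rng.random() < pi_noise:
                pix[i] = rng.random(2)                   # uniformly placed noise pixel
            else:
                b = rng.integers(n_beads)                # ink from a random bead
                pix[i] = rng.normal(beads[b], sigma_ink)
        return pix

    rng = np.random.default_rng(0)
    home = np.array([[0.30, 0.90], [0.50, 0.95], [0.70, 0.85], [0.55, 0.60],
                     [0.70, 0.50], [0.65, 0.25], [0.45, 0.15], [0.30, 0.25]])
    pixels = generate_pixels(home, n_pixels=200, rng=rng)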
3 Recognizing isolated digits
We recognize an image by finding which model is most likely to have generated it. Each possible model is fitted to the image and the one that has the lowest cost fit is the winner. The cost of a fit is the negative log probability of generating the image given the model.
E_{ideal} = -\log \int_{I \,\in\, \text{model instances}} P(I)\, P(\text{image} \mid I)\, dI    (1)
We can approximate this by just considering the best fitting model instance[2] and ignoring the fact that the model should not generate ink where there is no ink in the image:[3]
E = E_{deform} - \sum_{\text{inked pixels}} \log P(\text{pixel} \mid \text{best model instance})    (2)
The probability of an inked pixel is the sum of the probabilities of all the possible ways of generating it from the mixture of Gaussian beads or the uniform noise field:
P(i) = \frac{\pi_{noise}}{N} + \frac{1 - \pi_{noise}}{B} \sum_{\text{beads } b} P_b(i)    (3)
where N is the total number of pixels, B is the number of beads, \pi_{noise} is a mixing proportion, and P_b(i) is the probability density of pixel i under Gaussian bead b.
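As a check on the bookkeeping, here is how equations (2) and (3) combine in code. The sketch assumes pixel and bead coordinates as (n, 2) arrays and isotropic beads of standard deviation sigma; the names are ours, not the authors'.

    import numpy as np

    def fit_cost(pixels, beads, sigma, E_deform, pi_noise, N):
        """E = E_deform - sum_i log P(i), with P(i) given by equation (3)."""
        B = len(beads)
        d2 = ((pixels[:, None, :] - beads[None, :, :])**2).sum(axis=-1)  # (P, B)
        Pb = np.exp(-d2 / (2 * sigma**2)) / (2 * np.pi * sigma**2)       # bead densities
        P_i = pi_noise / N + (1 - pi_noise) / B * Pb.sum(axis=1)         # equation (3)
        return E_deform - np.log(P_i).sum()                              # equation (2)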
4 The search procedure for fitting a model to an image
Every Gaussian bead in a model has the same variance. When fitting data, we start with a big variance and gradually reduce it, as in the elastic net algorithm of Durbin and Willshaw (1987). Each iteration of the elastic matching algorithm involves three steps (a sketch of one iteration follows the list):
Given the current locations of the Gaussians, compute the responsibility that each Gaussian has for each inked pixel. This is just the probability of generating the pixel from that Gaussian, normalized by the total probability of generating the pixel.

Assuming that the responsibilities remain fixed, as in the EM algorithm of Dempster, Laird and Rubin (1977), we invert a 16 × 16 matrix to find the image locations for the 8 control points at which the forces pulling the control points towards their home locations are balanced by the forces exerted on the control points by the inked pixels. These forces come via the forces that the inked pixels exert on the Gaussian beads.

Given the new image locations of the control points, we recompute the affine transformation from the object-based frame to the image frame. We choose the affine transformation that minimizes the sum of the squared distances, in object-based coordinates, between the control points and their home locations. The residual squared differences determine the deformation energy.

[2] In effect, we are assuming that the integral in equation 1 can be approximated by the height of the highest peak, and so we are ignoring variations between models in the width of the peak or the number of peaks.
[3] If the inked pixels are rare, poor models sin mainly by not inking those pixels that should be inked rather than by inking those pixels that should not be inked.
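A minimal sketch of one iteration, under the same assumptions as the earlier snippets. The first function computes the responsibilities of the first step; the second refits the affine map by ordinary least squares and reads off the deformation energy. The 16 × 16 force-balance solve for the control points is specific to the authors' formulation and is only indicated by a comment, and the least-squares affine fit is a simplification of the minimization described in the third step.

    import numpy as np

    def responsibilities(pixels, beads, sigma, pi_noise, N):
        """First step: how responsible each bead is for each inked pixel.
        The control-point update (not shown) solves the 16 x 16 linear system
        balancing home-restoring forces against the pixel forces."""
        B = len(beads)
        d2 = ((pixels[:, None, :] - beads[None, :, :])**2).sum(axis=-1)
        Pb = np.exp(-d2 / (2 * sigma**2)) / (2 * np.pi * sigma**2)
        mix = (1 - pi_noise) / B * Pb                  # bead components of P(i)
        total = pi_noise / N + mix.sum(axis=1, keepdims=True)
        return mix / total      # (P, B); rows sum to < 1, the rest is noise's share

    def refit_affine(home, ctrl_img):
        """Least-squares affine map, object-based frame -> image frame."""
        H = np.hstack([home, np.ones((len(home), 1))])    # (8, 3) homogeneous coords
        M, *_ = np.linalg.lstsq(H, ctrl_img, rcond=None)  # rows [A; t], image = H @ M
        back = (ctrl_img - M[2]) @ np.linalg.inv(M[:2])   # control pts, object frame
        E_deform = ((back - home)**2).sum()               # residual squared differences
        return M, E_deform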
Some stages in the fitting of a model to data are shown in Fig. 1. This search technique avoids nearly all local minima when fitting models to isolated digits. But if we get a high deformation energy in the best fitting model, we can try alternative starting configurations for the models.
5 Learning the digit models
We can do discriminative learning by adjusting the home positions and variances of the control points to minimize the objective function
C = -\sum_{\text{training cases}} \log p(\text{correct digit}), \qquad p(\text{correct digit}) = \frac{e^{-E_{correct}}}{\sum_{\text{all digits}} e^{-E_{digit}}}    (4)
For a model parameter such as the x coordinate of the home location of one of the control points we need \partial C / \partial x in order to do gradient descent learning. Equation 4 allows us to express \partial C / \partial x in terms of \partial E / \partial x, but there is a subtle problem: changing a parameter of an elastic model causes a simple change in the energy of the configuration that the model previously settled to, but the model no longer settles to that configuration. So it appears that we need to consider how the energy is affected by the change in the configuration. Fortunately, derivatives are simple at an energy minimum because small changes in the configuration make no change in the energy (to first order). Thus the inner loop settling leads to simple derivatives for the outer loop learning, as in the Boltzmann machine (Hinton, 1989).
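Differentiating equation 4 makes this concrete (a standard softmax gradient, written in the paper's notation; our addition):

\frac{\partial C}{\partial \theta} = \sum_{\text{training cases}} \left( \frac{\partial E_{correct}}{\partial \theta} - \sum_{\text{all digits}} p(\text{digit}) \, \frac{\partial E_{digit}}{\partial \theta} \right)

For a parameter \theta of the correct model the coefficient of \partial E / \partial \theta is (1 - p(\text{correct digit})), while for a parameter of a competing model it is -p(\text{digit}); and at an energy minimum each \partial E / \partial \theta involves only the explicit dependence of the energy on \theta.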
6 Results on the hand-filtered dataset
We are trying the scheme out on a relatively simple task: we have a model of a two and a model of a three, and we want the two model to win on "two" images and the three model to win on "three" images. We have tried many variations of the character models, the preprocessing, the initial affine transformations of the models, the annealing schedule for the variances, the mixing proportion of the noise, and the relative importance of deformation energy versus data-fit energy. Our current best performance is 10 errors (1.6%) on a test set of 304 two's and 304 three's. We reject cases if the best-fitting model is highly deformed, but on this test set the deformation energy never reached the rejection criterion. The training set has 418 cases, and we have a validation set of 200 cases to tell us when to stop learning. Figure 2 shows the effect of learning on the models. The initial affine transform is defined by the minimal vertical rectangle around the data.

The images are preprocessed to eliminate variations due to stroke-width and paper and ink intensities. First, we use a standard local thresholding algorithm to make a binary decision for each pixel. Then we pick out the five largest connected components (hopefully digits). We put a box around each component, then thin all the data in the box. If we ourselves cannot recognize the resulting image we eliminate it from the data set.
Figure 1: The sequence (a) to (d) shows some stages of fitting a model 3 to some data. The grey circles represent the beads on the spline, and the radius of the circle represents the standard deviation of the Gaussian. (a) shows the initial configuration, with eight beads equally spaced along the spline. In (b) and (c) the variance is progressively decreased and the number of beads is increased. The final fit using 60 beads is shown in (d). We use about three iterations at each of five variances on our "annealing schedule". In this example, we used \pi_{noise} = 0.3, which makes it cheaper to explain the extraneous noise pixels and the flourishes on the ends of the 3 as noise rather than deforming the model to bring Gaussian beads close to these pixels.
Figure 2: The two and three models before and after learning. The control points are labelled 1 through 8. We used maximum likelihood learning in which each digit model is trained only on instances of that digit. After each pass through all those instances, the home location of each control point (in the object-based frame) is redefined to be the average location of the control point in the final fits of the model of the digit to the instances of the digit. Most of the improvement in performance occurred after the first pass, and after five updates of the home locations of the control points, performance on the validation set started to decrease. Similar results were obtained with discriminative training. We could also update the variance of each control point to be its variance in the final fits, though we did not adapt the variances in this simulation.
The training, validation and test data are all from the training portion of the United States Postal Service Handwritten ZIP Code Database (1987), which was made available by the USPS Office of Advanced Technology.
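The preprocessing just described can be sketched with standard image-processing primitives. This version uses scikit-image and is only meant to indicate the pipeline; the block size, the dark-ink assumption, and the use of thin() are our guesses, not the authors' exact algorithms.

    import numpy as np
    from skimage.filters import threshold_local
    from skimage.measure import label, regionprops
    from skimage.morphology import thin

    def preprocess(gray):
        """Binarize, keep the five largest connected components, box and thin them."""
        binary = gray < threshold_local(gray, block_size=25)  # dark ink on light paper
        regions = sorted(regionprops(label(binary)),
                         key=lambda r: r.area, reverse=True)[:5]
        # r.image is the binary patch inside the box around one component
        return [thin(r.image) for r in regions]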
7 Discussion
Before we tried using splines to model digits, we used models that consisted of a fixed number of Gaussian beads with elastic energy constraints operating between neighboring beads. To constrain the curvature we used energy terms that involved triples of beads. With this type of energy function, we had great difficulty using a single model to capture topologically different instances of a digit. For example, when the central loop of a 3 changes to a cusp and then to an open bend, the sign of the curvature reverses. With a spline model it is easy to model these topological variants by small changes in the relative vertical locations of the central two control points (see figure 2). This advantage of spline models is pointed out by Edelman, Ullman and Flash (1990), who use a different kind of spline that they fit to character data by directly locating candidate knot points in the image.

Spline models also make it easy to increase the number of Gaussian beads as their variance is decreased. This coarse-to-fine strategy is much more efficient than using a large number of beads at all variances, but it is much harder to implement if the deformation energy explicitly depends on particular bead locations, since changing the number of beads then requires a new function for the deformation energy.

In determining where on the spline to place the Gaussian beads, we initially used a fixed set of blending coefficients for each bead. These coefficients are the weights used to specify the bead location as a weighted center of gravity of the locations of 4 control points. Unfortunately this yields too few beads in portions of a digit, such as the long tail of a 2, which are governed by just a few control points. Performance was much improved by spacing the beads uniformly along the curve.

By using spline models, we build in a lot of prior knowledge about what characters look like, so we can describe the shape of a character using only a small number of parameters (16 coordinates and 8 variances). This means that the learning is exploring a much smaller space than a conventional feedforward network. Also, because the parameters are easy to interpret, we can start with fairly good initial models of the characters. So learning only requires a few updates of the parameters.

Obvious extensions of the deformation energy function include using elliptical Gaussians for the distributions of the control points, or using full covariance matrices for neighboring pairs of control points. Another obvious modification is to use elliptical rather than circular Gaussians for the beads. If strokes curve gently relative to their thickness, the distribution of ink can be modelled much better using elliptical Gaussians. However, an ellipse takes about twice as many operations to fit and is not helpful in regions of sharp curvature. Our simulations suggest that, on average, two circular beads are more flexible than one elliptical bead.

Currently we do not impose any penalty on extremely sheared or elongated affine transformations, though this would probably improve performance. Having an explicit representation of the affine transformation of each digit should prove very helpful for recognizing multiple digits, since it will allow us to impose a penalty on differences in the affine transformations of neighboring digits.

Presegmented images of single digits contain many different kinds of noise that cannot be eliminated by simple bottom-up operations.
These include descenders, underlines, and bits of other digits; corrections; dirt in recycled paper; smudges and misplaced postal franks. To really understand the image we probably need to model a wide variety of structured noise. We are currently experimenting with one simple way of incorporating noise models. After each digit model has been used to segment a noisy image into one digit instance plus noise, we try to fit more complicated noise models to the residual noise. A good fit greatly decreases the cost of that noise and hence improves this interpretation of the image. We intend to handle flourishes on the ends of characters in this way rather than using more elaborate digit models that include optional flourishes.

One of our main motivations in developing elastic models is the belief that a strong prior model should make learning easier, should reduce confident errors, and should allow top-down segmentation. Although we have shown that elastic spline models can be quite effective, we have not yet demonstrated that they are superior to feedforward nets, and there is a serious weakness of our approach: elastic matching is slow. Fitting the models to the data takes much more computation than a feedforward net. So in the same number of cycles, a feedforward net can try many alternative bottom-up segmentations and normalizations and select the overall segmentation that leads to the most recognizable digit string.
Acknowledgements
This research was funded by Apple and by the Ontario Information Technology Research Centre. We thank Allan Jepson and Richard Durbin for suggesting spline models.
References
Burr, D. J. (1981a). A dynamic model for image registration. Computer Graphics and Image Processing, 15:102-112.

Burr, D. J. (1981b). Elastic matching of line drawings. IEEE Transactions on Pattern Analysis and Machine Intelligence, 3(6):708-713.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1-38.

Durbin, R., Szeliski, R., and Yuille, A. L. (1989). An analysis of the elastic net approach to the travelling salesman problem. Neural Computation, 1:348-358.

Durbin, R. and Willshaw, D. (1987). An analogue approach to the travelling salesman problem. Nature, 326:689-691.

Edelman, S., Ullman, S., and Flash, T. (1990). Reading cursive handwriting by alignment of letter prototypes. International Journal of Computer Vision, 5(3):303-331.

Hinton, G. E. (1989). Deterministic Boltzmann learning performs steepest descent in weight-space. Neural Computation, 1:143-150.