
Visual Cognition

These essays tackle some of the central issues in visual cognition, presenting experimental techniques from cognitive psychology, new ways of modeling cognitive processes on computers from artificial intelligence, and new ways of studying brain organization from neuropsychology, to address such questions as: How do we recognize objects in front of us? How do we reason about objects when they are absent and only in memory? How do we conceptualize the three dimensions of space? Do different people do these things in different ways? And where are these abilities located in the brain?

While this research, which appeared as a special issue of the journal Cognition, is at the cutting edge of cognitive science, it does not assume a highly technical background on the part of readers. The book begins with a tutorial introduction by the editor, making it suitable for specialists and nonspecialists alike.

Special Issue


Visual Cognition



First published as a special issue of Cognition: International Journal of Cognitive Science (Jacques Mehler, Editor)
Visual Cognition, Steven Pinker, Guest Editor

Visual Cognition

edited by
Steven Pinker

A Bradford Book
The MIT Press
Cambridge, Massachusetts
London, England

Third printing, 1988
First MIT Press edition, 1985
Copyright © 1984 by Elsevier Science Publishers B.V., Amsterdam, The Netherlands
All rights reserved. No part of this publication may be reproduced, stored in a retrieval
system, or transmitted, in any form or by any means, electronic, mechanical,
photocopying, recording, or otherwise, without the prior permission of the copyright owner.
Reprinted from Cognition: International Journal of Cognitive Psychology, volume 18
(ISSN: 0010-0277). The MIT Press has exclusive license to sell this English-language
book edition throughout the world.
Printed and bound in the United States of America
Library of Congress Cataloging in Publication Data
Main entry under title:
Visual cognition.
(Computational models of cognition and perception)
“A Bradford book.”
“Reprinted from Cognition: international journal of cognitive psychology, volume
18”—T.p. verso.
Bibliography: p.
Includes index.
1. Visual perception. 2. Cognition. I. Pinker, Steven, 1954–. II. Series.
BF241.V564 1985
ISBN 0-262-16103-6




Visual Cognition: An Introduction
Steven Pinker


Parts of Recognition
Donald D. Hoffman and Whitman A. Richards


Visual Routines
Shimon Ullman


Upward Direction, Mental Rotation, and Discrimination of Left and
Right Turns in Maps
Roger N. Shepard and Shelley Hurwitz
Individual Differences in Mental Imagery Ability: A Computational Analysis
Stephen M. Kosslyn, Jennifer Brunn, Kyle R. Cave, and Roger W. Wallach



The Neurological Basis of Mental Imagery: A Componential Analysis
Martha J. Farah






This collection of original research papers on visual cognition first appeared as a
special issue of Cognition: International Journal of Cognitive Science. The study
of visual cognition has seen enormous progress in the past decade, bringing impor­
tant advances in our understanding of shape perception, visual imagery, and
mental maps. Many of these discoveries are the result of converging investiga­
tions in different areas, such as cognitive and perceptual psychology, artificial
intelligence, and neuropsychology. This volume is intended to highlight a sample
of work at the cutting edge of this research area for the benefit of students and
researchers in a variety of disciplines. The tutorial introduction that begins the
volume is designed to help the nonspecialist reader bridge the gap between the
contemporary research reported here and earlier textbook introductions or litera­
ture reviews.
Many people deserve thanks for their roles in putting together this volume:
Jacques Mehler, Editor of Cognition; Susana Franck, Editorial Associate of
Cognition; Henry Stanton and Elizabeth Stanton, Editors of Bradford Books; Kath-
leen Murphy, Administrative Secretary, Department of Psychology, MIT; Loren
Ann Frost, who compiled the index; and the ad hoc Cognition referees who
reviewed manuscripts for the special issue. I am also grateful to Nancy Etcoff,
Stephen Kosslyn, and Laurence Parsons for their advice and encouragement.
Preparation of this volume was supported by NSF grants BNS 82-16546 and
BNS 82-19450, by NIH grant 1R01 HD 18381, and by the MIT Center for Cogni­
tive Science under a grant from the A. P. Sloan Foundation.

Visual Cognition


Visual cognition: An introduction*
Steven Pinker
Massachusetts Institute of Technology

This article is a tutorial overview of a sample of central issues in visual cognition,
focusing on the recognition of shapes and the representation of objects
and spatial relations in perception and imagery. Brief reviews of the state of
the art are presented, followed by more extensive presentations of contemporary
theories, findings, and open issues. I discuss various theories of shape recognition,
such as template, feature, Fourier, structural description, Marr-Nishihara,
and massively parallel models, and issues such as the reference frames,
primitives, top-down processing, and computational architectures used in spatial
cognition. This is followed by a discussion of mental imagery, including
conceptual issues in imagery research, theories of imagery, imagery and perception,
image transformations, computational complexities of image processing,
neuropsychological issues, and possible functions of imagery. Connections
between theories of recognition and of imagery, and the relevance of the
papers contained in this issue to the topics discussed, are emphasized throughout.

Recognizing and reasoning about the visual environment is something that
people do extraordinarily well; it is often said that in these abilities an average
three-year old makes the most sophisticated computer vision system look
embarrassingly inept. Our hominid ancestors fabricated and used tools for
millions of years before our species emerged, and the selection pressures
brought about by tool use may have resulted in the development of sophisti­
cated faculties allowing us to recognize objects and their physical properties,
to bring complex knowledge to bear on familiar objects and scenes, to
*Preparation of this paper was supported by NSF grants BNS 82-16546 and 82-09540, by NIH grant
1R01HD18381-01, and by a grant from the Sloan Foundation awarded to the MIT Center for Cognitive Science.
I thank Donald Hoffman, Stephen Kosslyn, Jacques Mehler, Larry Parsons, Whitman Richards, and Ed
Smith for their detailed comments on an earlier draft, and Kathleen Murphy and Rosemary Krawczyk for
assistance in preparing the manuscript. Reprint requests should be sent to Steven Pinker, Psychology
Department, M.I.T., E10-018, Cambridge, MA 02139, U.S.A.


S. Pinker

negotiate environments skillfully, and to reason about the possible physical
interactions among objects present and absent. Thus visual cognition, no less
than language or logic, may be a talent that is central to our understanding
of human intelligence (Jackendoff, 1983; Johnson-Laird, 1983; Shepard and
Cooper, 1982).
Within the last 10 years there has been a great increase in our understand­
ing of visual cognitive abilities. We have seen not only new empirical de­
monstrations, but also genuinely new theoretical proposals and a new degree
of explicitness and sophistication brought about by the use of computational
modeling of visual and memory processes. Visual cognition, however, oc­
cupies a curious place within cognitive psychology and within the cognitive
psychology curriculum. Virtually without exception, the material on shape
recognition found in introductory textbooks in cognitive psychology would
be entirely familiar to a researcher or graduate student of 20 or 25 years ago.
Moreover, the theoretical discussions of visual imagery are cast in the same
loose metaphorical vocabulary that had earned the concept a bad name in
psychology and philosophy for much of this century. I also have the impres­
sion that much of the writing pertaining to visual cognition among researchers
who are not directly in this area, for example, in neuropsychology, individual
differences research, developmental psychology, psychophysics, and informa­
tion processing psychology, is informed by the somewhat antiquated and
imprecise discussions of visual cognition found in the textbooks.
The purpose of this special issue of Cognition is to highlight a sample of
theoretical and empirical work that is on the cutting edge of research on
visual cognition. The papers in this issue, though by no means a representa­
tive sample, illustrate some of the questions, techniques, and types of theory
that characterize the modern study of visual cognition. The purpose of this
introductory paper is to introduce students and researchers in neighboring
disciplines to a selection of issues and theories in the study of visual cognition
that provide a backdrop to the particular papers contained herein. It is meant
to bridge the gap between the discussions of visual cognition found in
textbooks and the level of discussion found in contemporary work.
Visual cognition can be conveniently divided into two subtopics. The first
is the representation of information concerning the visual world currently
before a person. When we behave in certain ways or change our knowledge
about the world in response to visual input, what guides our behavior or
thought is rarely some simple physical property of the input such as overall
brightness or contrast. Rather, vision guides us because it lets us know that
we are in the presence of a particular configuration of three-dimensional
shapes and particular objects and scenes that we know to have predictable
properties. ‘Visual recognition’ is the process that allows us to determine on

Visual cognition


the basis of retinal input that particular shapes, configurations of shapes,
objects, scenes, and their properties are before us.
The second subtopic is the process of remembering or reasoning about
shapes or objects that are not currently before us but must be retrieved from
memory or constructed from a description. This is usually associated with the
topic of ‘visual imagery’. This tutorial paper is divided into two major sec­
tions, devoted to the representation and recognition of shape, and to visual
imagery. Each section is in turn subdivided into sections discussing the
background to each topic, some theories on the relevant processes, and some
of the more important open issues that will be foci of research during the
coming years.
Visual recognition

Shape recognition is a difficult problem because the immediate input to the
visual system (the spatial distribution of intensity and wavelength across the
retinas—hereafter, the “retinal array”) is related to particular objects in
highly variable ways. The retinal image projected by an object—say, a
notebook—is displaced, dilated or contracted, or rotated on the retina when
we move our eyes, ourselves, or the book; if the motion has a component in
depth, then the retinal shape of the image changes and parts disappear and
emerge as well. If we are not focusing on the book or looking directly at it,
the edges of the retinal image become blurred and many of its finer details
are lost. If the book is in a complex visual context, parts may be occluded,
and the edges of the book may not be physically distinguishable from the
edges and surface details of surrounding objects, nor from the scratches,
surface markings, shadows, and reflections on the book itself.
Most theories of shape recognition deal with the indirect and ambiguous
mapping between object and retinal image in the following way. In long-term
memory there is a set of representations of objects that have associated with
them information about their shapes. The information does not consist of a
replica of a pattern of retinal stimulation, but a canonical representation of
the object’s shape that captures some invariant properties of the object in all
its guises. During recognition, the retinal image is converted into the same
format as is used in long-term memory, and the memory representation that
matches the input the closest is selected. Different theories of shape recogni­
tion make different assumptions about the long-term memory representations
involved, in particular, how many representations a single object will have,
which class of objects will be mapped onto a single representation, and what
the format of the representation is (i.e. which primitive symbols can be found


S. Pinker

in a representation, and what kinds of relations among them can be
specified). They differ with regard to which sorts of preprocessing are
done to the retinal image (e.g., filtering, contrast enhancement, detection of
edges) prior to matching, and in terms of how the retinal input or memory
representations are transformed to bring them into closer correspondence.
And they differ in terms of the metric of goodness of fit that determines
which memory representation fits the input best when none of them fits it exactly.
Traditional theories of shape recognition
Cognitive psychology textbooks almost invariably describe the same three or
so models in their chapters on pattern recognition. Each of these models is
fundamentally inadequate. However, they are not always inadequate in the
ways the textbooks describe, and at times they are inadequate in ways that
the textbooks do not point out. An excellent introduction to three of these
models—templates, features, and structural descriptions—can be found in
Lindsay and Norman (1977); introductions to Fourier analysis in vision, which
forms the basis of the fourth model, can be found in Cornsweet (1970) and
Weisstein (1980). In this section I will review these models extremely briefly,
and concentrate on exactly why they do not work, because a catalogue of
their deficits sets the stage for a discussion of contemporary theories and
issues in shape recognition.
Template matching
This is the simplest class of models for pattern recognition. The long-term
memory representation of a shape is a replica of a pattern of retinal stimula­
tion projected by that shape. The input array would be simultaneously
superimposed with all the templates in memory, and the one with the closest
above-threshold match (e.g., the largest ratio of matching to nonmatching
points in corresponding locations in the input array) would indicate the pat­
tern that is present.
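As a toy sketch of this matching scheme (the 5×5 letter patterns and the agreement-ratio score below are illustrative inventions, not part of any serious model), template matching over binary arrays might look like:

```python
# Toy template matching: stored replicas are compared point-by-point
# against the input array, and the best above-threshold match wins.
# The 5x5 "letter" patterns here are hypothetical.

TEMPLATES = {
    "T": ["#####",
          "..#..",
          "..#..",
          "..#..",
          "..#.."],
    "L": ["#....",
          "#....",
          "#....",
          "#....",
          "#####"],
}

def score(input_grid, template):
    """Fraction of corresponding locations where input and template agree."""
    matches = sum(a == b
                  for row_in, row_t in zip(input_grid, template)
                  for a, b in zip(row_in, row_t))
    return matches / (len(template) * len(template[0]))

def recognize(input_grid):
    """Return the stored template with the closest match to the input."""
    return max(TEMPLATES, key=lambda name: score(input_grid, TEMPLATES[name]))

noisy_t = ["#####",
           "..#..",
           "..#..",
           ".##..",   # one spurious point of noise
           "..#.."]
print(recognize(noisy_t))  # -> T
```

Even this toy version shows the fragility discussed below: shift the input one column to the side and every score changes, since the comparison is tied to absolute retinal locations.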
Usually this model is presented not as a serious theory of shape recogni­
tion, but as a straw man whose destruction illustrates the inherent difficulty
of the shape recognition process. The problems are legion: partial matches
could yield false alarms (e.g., a ‘P’ in an ‘R’ template); changes in distance,
location, and orientation of a familiar object will cause this model to fail to
detect it, as will occlusion of part of the pattern, a depiction of it with wiggly
or cross-hatched lines instead of straight ones, strong shadows, and many
other distortions that we as perceivers take in stride.
There are, nonetheless, ways of patching template models. For example,

Visual cognition


multiple templates of a pattern, corresponding to each of its possible displace­
ments, rotations, sizes, and combinations thereof, could be stored. Or, the
input pattern could be rotated, displaced, and scaled to a canonical set of
values before matching against the templates. The textbooks usually dismiss
these possibilities: it is said that the product of all combinations of transforma­
tions and shapes would require more templates than the brain could store,
and that in advance of recognizing a pattern, one cannot in general determine
which transformations should be applied to the input. However, it is easy to
show that these dismissals are made too quickly. For example, Arnold Trehub
(1977) has devised a neural model of recognition and imagery, based on
templates, that addresses these problems (this is an example of a ‘massively
parallel’ model of recognition, a class of models I will return to later). Con­
tour extraction preprocesses feed the matching process with an array of sym­
bols indicating the presence of edges, rather than with a raw array of intensity
levels. Each template could be stored in a single cell, rather than in a space­
consuming replica of the entire retina: such a cell would synapse with many
retinal inputs, and the shape would be encoded in the pattern of strengths of
those synapses. The input could be matched in parallel against all the stored
memory templates, which would mutually inhibit one another so that partial
matches such as ‘P’ for ‘R’ would be eliminated by being inhibited by better
matches. Simple neural networks could center the input pattern and quickly
generate rotated and scaled versions of it at a variety of sizes and orientations,
or at a canonical size and orientation (e.g., with the shape’s axis of elongation
vertical); these transformed patterns could be matched in parallel against the
stored templates.
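The centering-and-rescaling idea can be sketched in miniature (a hypothetical illustration on lists of 2-D points; Trehub's actual model operates on neural arrays, not coordinate lists):

```python
# Illustrative normalization of a 2-D point pattern to a canonical
# position (centroid at the origin) and scale (unit maximum extent),
# so one stored template can match translated and rescaled inputs.

def normalize(points):
    n = len(points)
    cx = sum(x for x, y in points) / n
    cy = sum(y for x, y in points) / n
    centered = [(x - cx, y - cy) for x, y in points]
    extent = max(max(abs(x), abs(y)) for x, y in centered) or 1.0
    return [(x / extent, y / extent) for x, y in centered]

def same_shape(a, b, tol=1e-9):
    """Compare two patterns after normalizing away position and scale."""
    return all(abs(ax - bx) < tol and abs(ay - by) < tol
               for (ax, ay), (bx, by) in zip(normalize(a), normalize(b)))

template = [(0, 0), (2, 0), (1, 2)]           # a triangle
shifted_scaled = [(10, 5), (14, 5), (12, 9)]  # same triangle, moved and doubled

print(same_shape(template, shifted_scaled))  # -> True
```

Note what this sketch does not handle: rotation to a canonical orientation (which requires estimating something like an axis of elongation) and, as the next paragraph argues, any rotation in depth.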
Nonetheless, there are reasons to doubt that even the most sophisticated
versions of template models would work when faced with realistic visual
inputs. First, it is unlikely that template models can deal adequately with the
third dimension. Rotations about any axis other than the line of sight cause
distortions in the projected shape of an object that cannot be inverted by any
simple operation on retina-like arrays. For example, an arbitrary edge might
move a large or a small amount across the array depending on the axis and
phase of rotation and the depth from the viewer. 3-D rotation causes some
surfaces to disappear entirely and new ones to come into view. These prob­
lems occur even if one assumes that the arrays are constructed subsequent to
stereopsis and hence are three-dimensional (for example, rear surfaces are
still not represented, there are a bewildering number of possible directions
of translation and axes of rotation, each requiring a different type of retinal
transformation).
Second, template models work only for isolated objects, such as a letter
presented at the center of a blank piece of paper: the process would get


S. Pinker

nowhere if it operated, say, on three-fifths of a book plus a bit of the edge
of the table that it is lying on plus the bookmark in the book plus the end of
the pencil near it, or other collections of contours that might be found in a
circumscribed region of the retina. One could posit some figure-ground
segregation preprocess occurring before template matching, but this has prob­
lems of its own. Not only would such a process be highly complex (for exam­
ple, it would have to distinguish intensity changes in the image resulting from
differences in depth and material from those resulting from differences in
orientation, pigmentation, shadows, surface scratches, and specular (glossy)
reflections), but it probably interacts with the recognition process and hence
could not precede it. For example, the figure-ground segregation process
involves carving up a set of surfaces into parts, each of which can then be
matched against stored templates. This process is unlikely to be distinct from
the process of carving up a single object into its parts. But as Hoffman and
Richards (1984) argue in this issue, a representation of how an object is
decomposed into its parts may be the first representation used in accessing
memory during recognition, and the subsequent matching of particular parts,
template-style or not, may be less important in determining how to classify
a shape.
Feature models
This class of models is based on the early “Pandemonium” model of shape
recognition (Selfridge, 1959; Selfridge and Neisser, 1960). In these models,
there are no templates for entire shapes; rather, there are mini-templates or
‘feature detectors’ for simple geometric features such as vertical and horizon­
tal lines, curves, angles, ‘T’-junctions, etc. There are detectors for every
feature at every location in the input array, and these detectors send out a
graded signal encoding the degree of match between the target feature and
the part of the input array they are ‘looking at’. For every feature (e.g., an
open curve), the levels of activation of all its detectors across the input array
are summed, or the number of occurrences of the feature is counted (see
e.g., Lindsay and Norman, 1977), so the output of this first stage is a set of
numbers, one for each feature.
The stored representation of a shape consists of a list of the features com­
posing the shape, in the form of a vector of weights for the different features,
a list of how many tokens of each feature are present in the shape, or both.
For example, the representation of the shape of the letter ‘A ’ might specify
high weights for (1) a horizontal segment, (2) right-leaning diagonal segment,
(3) a left-leaning diagonal segment, (4) an upward-pointing acute angle, and
so on, and low or negative weights for curved and vertical segments. The
intent is to use feature weights or counts to give each shape a characterization

Visual cognition


that is invariant across transformations of it. For example, since the features
are all independent of location, any feature specification will be invariant
across translations and scale changes; and if features referring to orientation
(e.g. “left-leaning diagonal segment”) are eliminated, and only features dis­
tinguishing straight segments from curves from angles are retained, then the
description will be invariant across frontal plane rotations.
The match between input and memory would consist of some comparison
of the levels of activation of feature detectors in the input with the weights
of the corresponding features in each of the stored shape representations, for
example, the product of those two vectors, or the number of matching fea­
tures minus the number of mismatching features. The shape that exhibits the
highest degree of match to the input is the shape recognized.
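A minimal sketch of this vector-matching scheme (the feature inventory, weights, and activation values are hypothetical; the match metric is the dot product mentioned above):

```python
# Illustrative feature-model matching. Stored shapes are weight vectors
# over a fixed feature inventory; the input is a vector of detector
# activations; the match score is the product (dot product) of the two.

FEATURES = ["horizontal", "left_diag", "right_diag", "acute_angle", "curve"]

STORED = {
    # High weights for features composing the shape,
    # low or negative weights for features it should lack.
    "A": [ 1.0,  1.0,  1.0,  1.0, -1.0],
    "O": [-0.5, -0.5, -0.5, -0.5,  1.0],
}

def match(activations):
    """Return the stored shape whose weight vector best fits the input."""
    scores = {shape: sum(w * a for w, a in zip(weights, activations))
              for shape, weights in STORED.items()}
    return max(scores, key=scores.get)

# Detector activations for an 'A'-like input: strong diagonals,
# a crossbar, an apex angle, and no curves.
print(match([0.9, 1.0, 1.0, 0.8, 0.0]))  # -> A
```

Because the activations are summed over the whole array, the sketch also exhibits the weakness discussed below: it encodes which features are present, but nothing about their spatial arrangement.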
The principal problem with feature analysis models of recognition is that
no one has ever been able to show how a natural shape can be defined in
terms of a vector of feature weights. Consider how one would define the
shape of a horse. Naturally, one could define it by giving high weights to
features like ‘mane’, ‘hooves’, ‘horse’s head’, and so on, but then detecting
these features would be no less difficult than detecting the horse itself. Or,
one could try to define the shape in terms of easily detected features such as
vertical lines and curved segments, but horses and other natural shapes are
composed of so many vertical lines and curved segments (just think of the
nose alone, or the patterns in the horse’s hide) that it is hard to believe that
there is a feature vector for a horse’s shape that would consistently beat out
feature vectors for other shapes across different views of the horse. One
could propose that there is a hierarchy of features, intermediate ones like
‘eye’ being built out of lower ones like ‘line segment’ or ‘circle’, and higher
ones like ‘head’ being built out of intermediate ones like ‘eye’ and ‘ear’
(Selfridge, for example, posited “computational demons” that detect Boolean
combinations of features), but no one has shown how this can be done for
complex natural shapes.
Another, equally serious problem is that in the original feature models the
spatial relationships among features—how they are located and oriented with
respect to one another—are generally not specified; only which ones are
present in a shape and perhaps how many times. This raises serious problems
in distinguishing among shapes consisting of the same features arranged in
different ways, such as an asymmetrical letter and its mirror image. For the
same reason, simple feature models can turn reading into an anagram prob­
lem, and can be shown formally to be incapable of detecting certain pattern
distinctions such as that between open and closed curves (see Minsky and
Papert, 1972).
One of the reasons that these problems are not often raised against feature


S. Pinker

models is that the models are almost always illustrated and referred to in
connection with recognizing letters of the alphabet or schematic line draw­
ings. This can lead to misleading conclusions because the computational prob­
lems posed by the recognition of two-dimensional stimuli composed of a
small number of one-dimensional segments may be different in kind from the
problems posed by the recognition of three-dimensional stimuli composed of
a large number of two-dimensional surfaces (e.g., the latter involves compen­
sating for perspective and occlusion across changes in the viewer’s vantage
point and describing the complex geometry of curved surfaces). Furthermore,
when shapes are chosen from a small finite set, it is possible to choose a
feature inventory that exploits the minimal contrasts among the particular
members of the set and hence successfully discriminates among those members,
but that could be fooled by the addition of new members to the set. Finally,
letters or line drawings consisting of dark figures presented against a blank
background with no other objects occluding or touching them avoids the
many difficult problems concerning the effects on edge detection of occlusion,
illumination, shadows, and so on.
Fourier models
Kabrisky (1966), Ginsburg (1971, 1973), and Persoon and Fu (1974; see
also Ballard and Brown, 1982) have proposed a class of pattern recognition
models that many researchers in psychophysics and visual physiology
adopt implicitly as the most likely candidate for shape recognition in humans.
In these models, the two-dimensional input intensity array is subjected to a
spatial trigonometric Fourier analysis. In such an analysis, the array is decom­
posed into a set of components, each component specific to a sinusoidal
change in intensity along a single orientation at a specific spatial frequency.
That is, one component might specify the degree to which the image gets
brighter and darker and brighter and darker, etc., at intervals of 3° of visual
angle going from top right to bottom left in the image (averaging over changes
in brightness along the orthogonal direction). Each component can be con­
ceived of as a grid consisting of parallel black-and-white stripes of a particular
width oriented in a particular direction, with the black and white stripes
fading gradually into one another. In a full set of such grating-like compo­
nents, there is one component for each stripe width or spatial frequency (in
cycles per degree) at each orientation (more precisely, there would be a
continuum of components across frequencies and orientations).
A Fourier transform of the intensity array would consist of two numbers
for each of these components. The first number would specify the degree of
contrast in the image corresponding to that frequency at that orientation
(that is, the degree of difference in brightness between the bright areas and

Visual cognition


the dark areas of that image for that frequency in that orientation), or,
roughly, the degree to which the image ‘contains’ that set of stripes. The full
set of these numbers is the amplitude spectrum corresponding to the image.
The second number would specify where in the image the peaks and troughs
of the intensity change defined by that component lie. The full set of these
numbers is the phase spectrum corresponding to the image. The amplitude
spectrum and the phase spectrum together define the Fourier transform of
the image, and the transform contains all the information in the original
image. (This is a very crude introduction to the complex subject of Fourier
analysis. See Weisstein (1980) and Cornsweet (1970) for excellent nontechni­
cal tutorials).
One can then imagine pattern recognition working as follows. In long-term
memory, each shape would be stored in terms of its Fourier transform. The
Fourier transform of the image would be matched against the long-term
memory transforms, and the memory transform with the best fit to the image
transform would specify the shape that is recognized.1
How does matching transforms differ from matching templates in the orig­
inal space domain? When there is an exact match between the image and one
of the stored templates, there are neither advantages nor disadvantages to
doing the match in the transform domain, because no information is lost in
the transformation. But when there is no exact match, it is possible to define
metrics of goodness of fit in the transform domain that might capture some
of the invariances in the family of retinal images corresponding to a shape.
For example, to a first approximation the amplitude spectrum corresponding
to a shape is the same regardless of where in the visual field the object is
located. Therefore if the matching process could focus on the amplitude
spectra of shape and input, ignoring the phase spectrum, then a shape could
be recognized across all its possible translations. Furthermore, a shape and
its mirror image have the same amplitude spectrum, affording recognition of
a shape across reflections of it. Changes in orientation and scale of an object
result in corresponding changes in orientation and scale in the transform, but
in some models the transform can easily be normalized so that it is invariant
with rotation and scaling. Periodic patterns and textures, such as a brick wall,
are easily recognized because they give rise to peaks in their transforms
corresponding to the period of repetition of the pattern. But most important,
the Fourier transform segregates information about sharp edges and small
¹In Persoon and Fu’s model (1974), it is not the transform of brightness as a function of visual field position
that is computed and matched, but the transform of the tangent angle of the boundary of an object as a
function of position along the boundary. This model shares many of the advantages and disadvantages of
Fourier analysis of brightness in shape recognition.


S. Pinker

details from information about gross overall shape. The latter is specified
primarily by the lower spatial-frequency components of the transform (i.e.,
fat gratings), the former, by the higher spatial-frequency components (i.e.
thin gratings). Thus if the pattern matcher could selectively ignore the higher
end of the amplitude spectrum when comparing input and memory transforms,
a shape could be recognized even if its boundaries are blurred, encrusted with
junk, or defined by wiggly lines, dots or dashes, thick bands, and so on.
Another advantage of Fourier transforms is that, given certain assumptions
about neural hardware, they can be extracted quickly and matched in parallel
against all the stored templates (see e.g., Pribram, 1971).
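The translation and reflection invariances of the amplitude spectrum can be checked directly with a small discrete Fourier transform (a 1-D pure-Python sketch; the models above operate on 2-D intensity arrays, which behave analogously):

```python
# Sketch: the amplitude spectrum of a signal is unchanged by circular
# translation, because shifting the signal alters only the phase spectrum.
import cmath

def dft_amplitudes(signal):
    """Magnitude of each discrete Fourier component of a 1-D signal."""
    n = len(signal)
    return [abs(sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n)))
            for k in range(n)]

signal  = [0, 0, 1, 2, 1, 0, 0, 0]   # a small intensity "bump"
shifted = signal[3:] + signal[:3]    # the same bump, translated

amp_a = dft_amplitudes(signal)
amp_b = dft_amplitudes(shifted)
print(all(abs(a - b) < 1e-9 for a, b in zip(amp_a, amp_b)))  # -> True
```

The mirror image of the signal (its reversal) also has the same amplitude spectrum, matching the reflection invariance noted above; and note that both invariances hold only for the array as a whole, which is exactly why the scheme breaks down for multi-object scenes, as discussed next.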
Upon closer examination, however, matching in the transform domain
begins to lose some of its appeal. The chief problem is that the invariances
listed above hold only for entire scenes or for objects presented in isolation.
In a scene with more than one object, minor rearrangements such as moving
an object from one end of a desk to another, adding a new object to the desk
top, removing a part, or bending the object, can cause drastic changes in the
transform. Furthermore the transform cannot be partitioned or selectively
processed in such a way that one part of the transform corresponds to one
object in the scene, and another part to another object, nor can this be done
within the transform of a single object to pick out its parts (see Hoffman and
Richards (1984) for arguments that shape representations must explicitly de­
fine the decomposition of an object into its parts). The result of these facts
is that it is difficult or impossible to recognize familiar objects in novel scenes
or backgrounds by matching transforms of the input against transforms of
the familiar objects. Furthermore, there is no straightforward way of linking
the shape information implicit in the amplitude spectrum with the position
information implicit in the phase spectrum so that the perceiver can tell
where objects are as well as what they are. Third, changes in the three-dimensional
orientation of an object do not result in any simple cancelable change
in its transform, even if we assume that the visual system computes three-dimensional transforms (e.g., using components specific to periodic changes in
binocular disparity).
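The first of these failures can be seen in the same kind of toy computation: moving one object relative to another changes the whole-scene amplitude spectrum, even though neither object has changed. (The scene here, a bar plus a block, is my own illustration.)

```python
# Sketch: the amplitude spectrum survives translation of the whole scene,
# but not rearrangement of one object relative to another.
import numpy as np

def amp(image):
    return np.abs(np.fft.fft2(image))

desk = np.zeros((32, 32))
desk[20:22, 2:30] = 1.0          # the "desk top"

book = np.zeros((32, 32))
book[16:20, 4:8] = 1.0           # an object at one end of the desk

scene1 = desk + book
scene2 = desk + np.roll(book, 18, axis=1)   # same object, other end of desk

# The two scenes are *not* translations of one another, and their amplitude
# spectra differ, so a whole-scene transform match fails even though both
# component objects are individually unchanged.
assert not np.allclose(amp(scene1), amp(scene2))
```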
The appeal of Fourier analysis in discussions of shape recognition comes
in part from the body of elegant psychophysical research (e.g., Campbell and
Robson, 1968) suggesting that the visual system partitions the information in
the retinal image into a set of channels each specific to a certain range of
spatial frequencies (this is equivalent to sending the retinal information
through a set of bandpass filters and keeping the outputs of those filters
separate). This gives the impression that early visual processing passes on to
the shape recognition process not the original array but something like a
Fourier transform of the array. However, filtering the image according to its

Visual cognition


spatial frequency components is not the same as transforming the image into
its spectra. The psychophysical evidence for channels is consistent with the
notion that the recognition system operates in the space domain, but rather
than processing a single array, it processes a family of arrays, each one con­
taining information about intensity changes over a different scale (or,
roughly, each one bandpass-filtered at a different center frequency). By pro­
cessing several bandpass-filtered images separately, one obtains some of the
advantages of Fourier analysis (segregation of gross shape from fine detail)
without the disadvantages of processing the Fourier transform itself (i.e. the
utter lack of correspondence between the parts of the representation and the
parts of the scene).
Structural descriptions
A fourth class of theories about the format in which visual input is matched
against memory holds that shapes are represented symbolically, as structural
descriptions (see Minsky, 1975; Palmer, 1975a; Winston, 1975). A structural
description is a data structure that can be thought of as a list of propositions
whose arguments correspond to parts and whose predicates correspond to
properties of the parts and to spatial relationships among them. Often these
propositions are depicted as a graph whose nodes correspond to the parts or
to properties, and whose edges linking the nodes correspond to the spatial
relations (an example of a structural description can be found in the upper
left portion of Fig. 6). The explicit representation of spatial relations is one
aspect of these models that distinguishes them from feature models and allows
them to escape from some of the problems pointed out by Minsky and Papert.
One of the chief advantages of structural descriptions is that they can
factor apart the information in a scene without necessarily losing information
in it. It is not sufficient for the recognition system simply to supply a list of
labels for the objects that are recognized, for we need to know not only what
things are but also how they are oriented and where they are with respect to
us and each other, for example, when we are reaching for an object or
driving. We also need to know about the visibility of objects: whether we
should get closer, turn up the lights, or remove intervening objects in order
to recognize an object with more confidence. Thus the recognition process
in general must not boil away or destroy the information that is not diagnostic
of particular objects (location, size, orientation, visibility, and surface prop­
erties) until it ends up with a residue of invariant information; it must factor
apart or decouple this information from information about shape, so that
different cognitive processes (e.g., shape recognition versus reaching) can
access the information relevant to their particular tasks without becoming



overloaded, distracted, or misled by the irrelevant information that the retina
conflates with the relevant information. Thus one of the advantages of a
structural description is that the shape of an object can be specified by one
set of propositions, and its location in the visual field, orientation, size, and
relation to other objects can be specified in different propositions, each bear­
ing labels that processing operations can use for selective access to the infor­
mation relevant to them.
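A structural description of this kind can be rendered as a list of labeled propositions; the toy example below (the predicate names and the cup are my own illustration) shows how shape propositions and location propositions can be kept apart and accessed selectively:

```python
# Sketch of a structural description: shape propositions are kept separate
# from location/size/orientation propositions, each labeled so that
# different processes (recognition vs. reaching) access only what they need.

description = [
    ("part-of",     "handle", "cup"),
    ("part-of",     "bowl",   "cup"),
    ("attached",    "handle", "bowl"),
    ("shape",       "bowl",   "cylinder"),
    ("shape",       "handle", "curved-tube"),
    ("location",    "cup",    (30, 12)),   # viewer-relative position
    ("size",        "cup",    0.1),        # metres
    ("orientation", "cup",    15.0),       # degrees from vertical
]

def select(predicates):
    """Selective access: pull out only the propositions a process needs."""
    return [p for p in description if p[0] in predicates]

shape_info = select({"part-of", "attached", "shape"})        # for recognition
where_info = select({"location", "size", "orientation"})     # for reaching

assert ("shape", "bowl", "cylinder") in shape_info
assert all(p[0] in {"location", "size", "orientation"} for p in where_info)
```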
Among the other advantages of structural descriptions are the following.
By representing the different parts of an object as separate elements in the
representation, these models break up the recognition process into simpler
subprocesses, and more important, are well-suited to model our visual sys­
tem’s reliance on decomposition into parts during recognition and its ability
to recognize novel rearrangements of parts such as the various configurations
of a hand (see Hoffman and Richards (1984)). Second, by mixing logical and
spatial relational terms in a representation, structural descriptions can dif­
ferentiate among parts that must be present in a shape (e.g., the tail of the
letter ‘Q’), parts that may be present with various probabilities (e.g., the
horizontal cap on the letter T ), and parts that must not be present (e.g., a
tail on the letter ‘O’) (see Winston, 1975). Third, structural descriptions
represent information in a form that is useful for subsequent visual reasoning,
since the units in the representation correspond to objects, parts of objects,
and spatial relations among them. Nonvisual information about objects or
parts (e.g., categories they belong to, their uses, the situations that they are
typically found in) can easily be associated with parts of structural descrip­
tions, especially since many theories hold that nonvisual knowledge is stored
in a propositional format that is similar to structural descriptions (e.g.,
Minsky, 1975; Norman and Rumelhart, 1975). Thus visual recognition can
easily invoke knowledge about what is recognized that may be relevant to
visual cognition in general, and that knowledge in turn can be used to aid in
the recognition process (see the discussion of top-down approaches to recog­
nition below).
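The three-way distinction among obligatory, optional, and forbidden parts can be sketched as a matching rule over sets of parts (a toy version; the part vocabulary is illustrative):

```python
# Sketch of Winston-style matching: a stored description distinguishes
# parts that must, may, and must not be present.

LETTER_MODELS = {
    "Q": {"must": {"oval", "tail"}, "may": set(),              "must_not": set()},
    "O": {"must": {"oval"},         "may": set(),              "must_not": {"tail"}},
    "I": {"must": {"vertical-bar"}, "may": {"horizontal-cap"}, "must_not": set()},
}

def matches(model, observed_parts):
    if not model["must"] <= observed_parts:
        return False                      # an obligatory part is missing
    if model["must_not"] & observed_parts:
        return False                      # a forbidden part is present
    # Optional ("may") parts neither help nor hurt in this toy version.
    return True

def recognize(observed_parts):
    return [c for c, m in LETTER_MODELS.items() if matches(m, observed_parts)]

assert recognize({"oval"}) == ["O"]
assert recognize({"oval", "tail"}) == ["Q"]
assert recognize({"vertical-bar", "horizontal-cap"}) == ["I"]
```

A fuller version would weight the optional parts by their probabilities rather than ignoring them.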
The main problem with the structural description theory is that it is not
really a full theory of shape recognition. It specifies the format of the rep­
resentation used in matching the visual input against memory, but by itself it
does not specify what types of entities and relations each of the units belong­
ing to a structural description corresponds to (e.g., ‘line’ versus ‘eye’ versus
‘sphere’; ‘next-to’ versus ‘to-the-right-of’ versus ‘37-degrees-with-respect-to’),
nor how the units are created in response to the appropriate patterns of
retinal stimulation (see the discussion of feature models above). Although
most researchers in shape recognition would not disagree with the claim that
the matching process deals with something like structural descriptions, a



genuine theory of shape recognition based on structural descriptions must
specify these components and justify why they are appropriate. In the next
section, I discuss a theory proposed by David Marr and H. Keith Nishihara
which makes specific proposals about each of these aspects of structural descriptions.
Two fundamental problems with the traditional approaches
There are two things wrong with the textbook approaches to visual represen­
tation and recognition. First, none of the theories specifies where perception
ends and where cognition begins. This is a problem because there is a natural
factoring apart of the process that extracts information about the geometry of
the visible world and the process that recognizes familiar objects. Take the
recognition of a square. We can recognize a square whether its contours are
defined by straight black lines printed on a white page, by smooth rows and
columns of arbitrary small objects (Kohler, 1947; Koffka, 1935), by differ­
ences in lightness or in hue between the square and its background, by differ­
ences in binocular disparity (in a random-dot stereogram), by differences in
the orientation or size of randomly scattered elements defining visual textures
(Julesz, 1971), by differences in the directions of motion of randomly placed
dots (Ullman, 1982; Marr, 1982), and so on. The square can be recognized
as being a square regardless of how the boundaries are found; for example,
we do not have to learn the shape of a square separately for boundaries
defined by disparity in random-dot stereograms, by strings of asterisks, etc.,
nor must we learn the shapes of other figures separately for each type of edge
once we have learned how to do so for a square. Conversely, it can be
demonstrated that the ultimate recognition of the shape is not necessary for
any of these processes to find the boundaries (the boundaries can be seen
even if the shape they define is an unfamiliar blob, and expecting to see a
square is neither necessary nor sufficient for the perceiver to see the bound­
aries; see Gibson, 1966; Marr, 1982; Julesz, 1971). Thus the process that
recognizes a shape does not care about how its boundaries were found, and
the processes that find the boundaries do not care how they will be used. It
makes sense to separate the process of finding boundaries, degree of curva­
ture, depth, and so on, from the process of recognizing particular shapes (and
from other processes such as reasoning that can take their input from vision).
A failure to separate these processes has tripped up the traditional ap­
proaches in the following ways. First, any theory that derives canonical shape
representations directly from the retinal arrays (e.g., templates, features) will
have to solve all the problems associated with finding edges (see the previous
paragraph) at the same time as solving the problem of recognizing particular



shapes—an unlikely prospect. On the other hand, any theory that simply
assumes that there is some perceptual processing done before the shape
match but does not specify what it is, is in danger of explaining very little, since
the putative preprocessing could solve the most important part of the recog­
nition process that the theory is supposed to address (e.g., a claim that a
feature like ‘head’ is supplied to the recognition process). When assumptions
about perceptual preprocessing are explicit, but are also incorrect or unmoti­
vated, the claims of the recognition theory itself could be seriously under­
mined: the theory could require that some property of the world is supplied
to the recognition process when there is no physical basis for the perceptual
system to extract that property (e.g., Marr (1982) has argued that it is impos­
sible for early visual processes to segment a scene into objects).
The second problem with traditional approaches is that they do not pay
serious attention to what in general the shape recognition process has to do,
or, put another way, what problem it is designed to solve (see Marr, 1982).
This requires examining the input and desired output of the recognition pro­
cess carefully: on the one hand, how the laws of optics, projective geometry,
materials science, and so on, relate the retinal image to the external world,
and on the other, what the recognition process must supply the rest of cogni­
tion with. Ignoring either of these questions results in descriptions of recog­
nition mechanisms that are unrealizable, useless, or both.
The Marr-Nishihara theory
The work of David Marr represents the most concerted effort to examine the
nature of the recognition problem, to separate early vision from recognition
and visual cognition in general, and to outline an explicit theory of three-di­
mensional shape recognition built on such foundations. In this section, I
will briefly describe Marr’s theory. Though Marr’s shape recognition model
is not without its difficulties, there is a consensus that it addresses the most
important problems facing this class of theories, and that its shortcomings
define many of the chief issues that researchers in shape recognition must address.
The 2½-D sketch
The core of Marr’s theory is a claim about the interface between perception
and cognition, about what early, bottom-up visual processes supply to the
recognition process and to visual cognition in general. Marr, in collaboration
with H. Keith Nishihara, proposed that early visual processing culminates in
the construction of a representation called the 2½-D sketch. The 2½-D sketch
is an array of cells, each cell dedicated to a particular line of sight from the



viewer’s vantage point. Each cell in the array is filled with a set of symbols
indicating the depth of the local patch of surface lying on that line of sight,
the orientation of that patch in terms of the degree and direction in which it
dips away from the viewer in depth, and whether an edge (specifically, a
discontinuity in depth) or a ridge (specifically, a discontinuity in orientation)
is present at that line of sight (see Fig. 1). In other words, it is a representation
of the surfaces that are visible when looking in a particular direction from a
single vantage point. The 2½-D sketch is intended to gather together in one
representation the richest information that early visual processes can deliver.
Marr claims that no top-down processing goes into the construction of the
2½-D sketch, and that it does not contain any global information about shape
(e.g., angles between lines, types of shapes, object or part boundaries), only
depths and orientations of local pieces of surface.
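The contents of a 2½-D-sketch cell can be sketched as a data structure, one record per line of sight (a toy rendering; the field names and numbers are illustrative, not Marr and Nishihara's):

```python
# Sketch of a 2½-D-sketch-like structure: an array of cells, one per line
# of sight, each holding local depth, surface orientation, and edge/ridge
# flags, and nothing about global shape or object identity.
from dataclasses import dataclass
import numpy as np

@dataclass
class SurfacePatch:
    depth: float            # distance along this line of sight
    slant: float            # how steeply the patch dips away, in degrees
    tilt: float             # direction of the dip, in degrees
    is_edge: bool = False   # discontinuity in depth
    is_ridge: bool = False  # discontinuity in orientation

def make_sketch(rows, cols):
    return np.array(
        [[SurfacePatch(depth=0.0, slant=0.0, tilt=0.0) for _ in range(cols)]
         for _ in range(rows)],
        dtype=object,
    )

sketch = make_sketch(4, 6)
sketch[2, 3] = SurfacePatch(depth=1.5, slant=40.0, tilt=90.0, is_ridge=True)

# Only local surface properties are represented: nothing in a cell says
# what object or global shape the patch belongs to.
assert sketch[2, 3].is_ridge and not sketch[2, 3].is_edge
```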
The division between the early visual processes that culminate in the 2½-D
sketch and visual recognition has an expository as well as a theoretical advan­
tage: since the early processes are said not to be a part of visual cognition
Figure 1. Schematic drawing of Marr and Nishihara’s 2½-D sketch. Arrows represent
surface orientation of patches relative to the viewer (the heavy dots are
foreshortened arrows). The dotted line represents locations where orienta­
tion changes discontinuously (ridges). The solid line represents locations
where depth changes discontinuously (edges). The depths of patches relative
to the viewer are also specified in the 2½-D sketch but are not shown in this
figure. From Marr (1982).



(i.e., not affected by a person’s knowledge or intentions), I will discuss them
only in bare outline, referring the reader to Marr (1982) and Poggio (1984)
for details. The 2½-D sketch arises from a chain of processing that begins
with mechanisms that convert the intensity array into a representation in
which the locations of edges and other surface details are made explicit. In
this ‘primal sketch’, array cells contain symbols that indicate the presence of
edges, corners, bars, and blobs of various sizes and orientations at that loca­
tion. Many of these elements can remain invariant over changes in overall
illumination, contrast, and focus, and will tend to coincide in a relatively
stable manner with patches of a single surface in the world. Thus they are
useful in subsequent processes that must examine similarities and differences
among neighboring parts of a surface, such as gradients of density, size, or
shape of texture elements, or (possibly) processes that look for corresponding
parts of the world in two images, such as stereopsis and the use of motion to
reconstruct shape.
A crucial property of this representation is that the edges and other fea­
tures are extracted separately at a variety of scales. This is done by looking
for points where intensity changes most rapidly across the image using detec­
tors of different sizes that, in effect, look at replicas of the image filtered at
different ranges of spatial frequencies. By comparing the locations of intensity
changes in each of the (roughly) bandpass-filtered images, one can create
families of edge symbols in the primal sketch, some indicating the boundaries
of the larger blobs in the image, others indicating the boundaries of finer
details. This segregation of edge symbols into classes specific to different
scales preserves some of the advantages of the Fourier models discussed
above: shapes can be represented in an invariant manner across changes in
image clarity and surface detail (e.g., a person wearing tweeds versus polyester).
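A sketch in the spirit of this multi-scale scheme: filter the image at two scales (difference-of-Gaussians) and mark the zero-crossings in each, yielding one family of edge symbols per scale. (A toy implementation; the kernel construction and scales are my own choices.)

```python
# Multi-scale edge finding in the spirit of Marr and Hildreth: locate
# sign changes in bandpass-filtered copies of the image at each scale.
import numpy as np

def gaussian1d(sigma):
    r = int(3 * sigma) + 1
    x = np.arange(-r, r + 1, dtype=float)
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def blur(img, sigma):
    k = gaussian1d(sigma)
    out = np.apply_along_axis(lambda v: np.convolve(v, k, mode="same"), 1, img)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="same"), 0, out)

def zero_crossings(f):
    """Cells where a filtered image changes sign horizontally or vertically."""
    zc = np.zeros_like(f, dtype=bool)
    zc[:, :-1] |= np.signbit(f[:, :-1]) != np.signbit(f[:, 1:])
    zc[:-1, :] |= np.signbit(f[:-1, :]) != np.signbit(f[1:, :])
    return zc

img = np.zeros((40, 40))
img[:, 20:] = 1.0                       # a single vertical step edge

edges_by_scale = {s: zero_crossings(blur(img, s) - blur(img, 2 * s))
                  for s in (1.0, 3.0)}

# The step produces zero-crossings near column 20 at both scales.
assert edges_by_scale[1.0][10, 19] or edges_by_scale[1.0][10, 20]
```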
The primal sketch is still two-dimensional, however, and the next stage of
processing in the Marr and Nishihara model adds the third dimension to
arrive at the 2½-D sketch. The processes involved at this stage compute the
depths and orientations of local patches of surfaces using the binocular dispar­
ity of corresponding features in the retinal images from the two eyes (e.g.,
Marr and Poggio, 1977), the relative degrees of movement of features in
successive views (e.g., Ullman, 1979), changes in shading (e.g., Horn, 1975),
the size and shape of texture elements across the retina (Cutting and Millard,
1984; Stevens, 1981), the shapes of surface contours, and so on. These proces­
ses cannot indicate explicitly the overall three-dimensional shape of an object,
such as whether it is a sphere or a cylinder; their immediate output is simply
a set of values for each patch of a surface indicating its relative distance from
the viewer, orientation with respect to the line of sight, and whether either



depth or orientation changes discontinuously at that patch (i.e., whether an
edge or ridge is present).
The 2½-D sketch itself is ill-suited to matching inputs against stored shape
representations for several reasons. First, only the visible surfaces of shapes
are represented; for obvious reasons, bottom-up processing of the visual input
can provide no information about the back sides of opaque objects. Second,
the 2½-D sketch is viewpoint-specific; the distances and orientations of
patches of surfaces are specified with respect to the perceiver’s viewing pos­
ition and viewing direction, that is, in part of a spherical coordinate system
centered on the viewer’s vantage point. That means that as the viewer or the
object moves with respect to one another, the internal representation of the
object in the 2½-D sketch changes and hence does not allow a successful
match against any single stored replica of a past 2½-D representation of the
object (see Fig. 2a). Furthermore, objects and their parts are not explicitly demarcated in the 2½-D sketch.

Figure 2. The orientation of a hand with respect to the retinal vertical V (a viewer-cen­
tered reference frame), the axis of the body B (a global object-centered
reference frame), and the axis of the lower arm A (a local object-centered
reference frame). The retinal angle of the hand changes with rotation of the
whole body (middle panel); its angle with respect to the body changes with
movement of the elbow and shoulder (right panel). Only its angle with
respect to the arm remains constant across these transformations.
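The caption's point can be sketched numerically by composing 2-D rotations; only the arm-relative angle of the hand survives both transformations. (The angles are illustrative.)

```python
# Sketch of Fig. 2: the hand's orientation is unstable relative to the
# retina or the body, but constant relative to the lower arm. Frames are
# composed by simple addition of planar rotation angles.

def hand_angles(body_in_world, arm_in_body, hand_in_arm):
    """Return the hand's angle in retinal, body, and arm coordinates."""
    retinal = body_in_world + arm_in_body + hand_in_arm
    body = arm_in_body + hand_in_arm
    arm = hand_in_arm
    return retinal, body, arm

# Same hand-on-arm configuration under two transformations:
upright = hand_angles(body_in_world=0, arm_in_body=45, hand_in_arm=10)
whole_body_rotated = hand_angles(body_in_world=30, arm_in_body=45, hand_in_arm=10)
elbow_bent = hand_angles(body_in_world=0, arm_in_body=80, hand_in_arm=10)

# Retinal angle changes with whole-body rotation; body-relative angle
# changes with elbow movement; the arm-relative angle never changes.
assert upright[0] != whole_body_rotated[0]
assert upright[1] != elbow_bent[1]
assert upright[2] == whole_body_rotated[2] == elbow_bent[2] == 10
```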



Shape recognition and 3-D models
Marr and Nishihara (1978) have proposed that the shape recognition pro­
cess (a) defines a coordinate system that is centered on the as-yet unrecog­
nized object, (b) characterizes the arrangement of the object’s parts with
respect to that coordinate system, and (c) matches such characterizations
against canonical characterizations of objects’ shapes stored in a similar for­
mat in memory. The object is described with respect to a coordinate system
that is centered on the object (e.g., its origin lies on some standard point on
the object and one or more of its axes are aligned with standard parts of the
object), rather than with respect to the viewer-centered coordinate system of
the 2½-D sketch, because even though the locations of the object’s parts with
respect to the viewer change as the object as a whole is moved, the locations
of its parts with respect to the object itself do not change (see Fig. 2b). A
structural description representing an object’s shape in terms of the arrange­
ment of its parts, using parameters whose meaning is determined by a coor­
dinate system centered upon that object, is called the 3-D model description
in Marr and Nishihara’s theory.
Centering a coordinate system on the object to be represented solves only
some of the problems inherent in shape recognition. A single object-centered
description of a shape would still fail to match an input object when the
object bends at its joints (see Fig. 2c), when it bears extra small parts (e.g.,
a horse with a bump on its back), or when there is a range of variation among
objects within a class. Marr and Nishihara address this stability problem by
proposing that information about the shape of an object is stored not in a
single model with a global coordinate system but in a hierarchy of models
each representing parts of different sizes and each with its own coordinate
system. Each of these local coordinate systems is centered on a part of the
shape represented in the model, aligned with its axis of elongation, symmetry,
or (for movable parts) rotation.
For example, to represent the shape of a horse, there would be a top-level
model with a coordinate system centered on the horse’s torso. That coordi­
nate system would be used to specify the locations, lengths, and angles of the
main parts of the horse: the head, limbs, and tail. Then subordinate models
are defined for each of those parts: one for the head, one for the front right
leg, etc. Each of those models would contain a coordinate system centered
on the part that the model as a whole represents, or on a part subordinate
to that part (e.g., the thigh for the leg subsystem). The coordinate system for
that model would be used to specify the positions, orientations, and lengths
of the subordinate parts that comprise the part in question. Thus, within the
head model, there would be a specification of the locations and angles of the
neck axis and of the head axis, probably with respect to a coordinate system



centered on the neck axis. Each of these parts would in turn get its own
model, also consisting of a coordinate axis centered on a part, plus a charac­
terization of the parts subordinate to it. An example of a 3-D model for a
human shape is shown in Fig. 3.
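Such a hierarchy of models, with each part's parameters given as ranges in its parent's coordinate system, can be sketched as nested records. (All names and numerical ranges below are my own illustration, not Marr and Nishihara's actual parameters.)

```python
# Sketch of a 3-D model hierarchy: each model lists its subordinate parts,
# with lengths and angles given as ranges rather than exact values, each
# specified relative to the parent part's axis.

def part(name, length_range, angle_range, subparts=()):
    return {"name": name,
            "length": length_range,   # (min, max), in torso units
            "angle": angle_range,     # (min, max), degrees from parent axis
            "subparts": list(subparts)}

human = part("torso", (1.0, 1.0), (0, 0), [
    part("head", (0.2, 0.35), (150, 210)),
    part("arm", (0.4, 0.6), (0, 180), [
        part("upper-arm", (0.45, 0.55), (0, 150)),
        part("lower-arm", (0.45, 0.55), (0, 160), [
            part("hand", (0.2, 0.3), (120, 250)),
        ]),
    ]),
])

def in_range(x, lo_hi):
    lo, hi = lo_hi
    return lo <= x <= hi

def match(model, observed):
    """Accept if each observed part fits its model's ranges, recursively."""
    if not (in_range(observed["length"], model["length"])
            and in_range(observed["angle"], model["angle"])):
        return False
    return all(match(m, o)
               for m, o in zip(model["subparts"], observed.get("subparts", [])))

# A bent arm still matches, because the hand is specified relative to the
# lower arm, not the torso, and all parameters are ranges.
observed_arm = {"length": 0.5, "angle": 95, "subparts": [
    {"length": 0.5, "angle": 40, "subparts": []},
    {"length": 0.5, "angle": 120, "subparts": [
        {"length": 0.25, "angle": 180, "subparts": []}]},
]}
assert match(human["subparts"][1], observed_arm)
```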
Employing a hierarchy of coordinate systems solves the stability problems
alluded to above, because even though the position and orientation of the
hand relative to the torso can change wildly and unsystematically as a person
bends the arm, the position of the hand relative to the arm does not change
(except possibly by rotating within the range of angles permitted by bending
of the wrist). Therefore the description of the shape of the arm remains
constant only when the arrangement of its parts is specified in terms of
angles and positions relative to the arm axis, not relative to the object as a
whole (see Fig. 2). For this to work, of course, positions, lengths, and angles
must be specified in terms of ranges (see Fig. 3d) rather than by precise
values, so as to accommodate the changes resulting from movement or indi­
vidual variation among exemplars of a shape. Note also that the hierarchical
arrangement of 3-D models compensates for individual variation in a second
way: a horse with a swollen or broken knee, for example, will match the 3-D
model defining the positions of a horse’s head, torso, limbs, and tail relative
to the torso axis, even if the subordinate limb model itself does not match
the input limb.
Organization and accessing of shape information in memory
Marr and Nishihara point out that using the 3-D model format, it is pos­
sible to define a set of values at each level of the hierarchy of coordinate
systems that correspond to a central tendency among the members of well-de­
fined classes of shapes organized around a single ‘plan’. For example, at the
top level of the hierarchy defining limbs with respect to the torso, one can
define one set of values that most quadruped shapes cluster around, and a
different set of values that most bird shapes cluster around. At the next level
down one can define values for subclasses of shapes such as songbirds versus
long-legged waders.
This modular organization of shape descriptions, factoring apart the ar­
rangement of parts of a given size from the internal structure of those parts,
and factoring apart shape of an individual type from the shape of the class
of objects it belongs to, allows input descriptions to be matched against mem­
ory in a number of ways. Coarse information about a shape specified in a
top-level coordinate system can be matched against models for general classes
(e.g., quadrupeds) first, constraining the class of shapes that are checked the
next level down, and so on. Thus when recognizing the shape of a person,
there is no need to match it against shape descriptions of particular types of


[Figure 3 appears here: Marr and Nishihara’s 3-D model description of a human shape, panels A–D, with part labels such as ‘upper arm’ and ‘lower arm’ and the parameters ‘origin location’ and ‘part orientation’; see the caption below.]


guppies, parakeets, or beetles once it has been concluded that the gross shape
is that of a primate. (Another advantage of using this scheme is that if a shape
is successfully matched at a higher level but not at any of the lower levels, it
can still be classified as falling into a general class or pattern, such as being
a bird, even if one has never encountered that type of bird before). An
alternative way of searching shape memory is to allow the successful recogni­
tion of a shape in a high-level model to trigger the matching of its subordinate
part-models against as-yet unrecognized parts in the input, or to allow the
successful recognition of individual parts to trigger the matching of their
superordinate models against the as-yet unrecognized whole object in the
input containing that part. (For empirical studies on the order in which shape
representations are matched against inputs, see Jolicoeur et al., 1984a; Rosch
et al., 1976; Smith et al., 1978. These studies suggest that the first index into
shape memory may be at a ‘basic object’ level, rather than the most abstract
level, at least for prototypical exemplars of a shape.)
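This coarse-to-fine access to shape memory can be sketched as a two-level search: a coarse match at the top level restricts which subclasses are checked below, and a shape matched only at the top is still assigned to the general class. (The feature vocabulary and class tree are my own toy illustration.)

```python
# Sketch of coarse-to-fine search over a hierarchy of shape models.

MEMORY = {
    "quadruped": {
        "coarse": {"four-limbs", "horizontal-torso"},
        "subclasses": {
            "horse": {"long-neck", "tail"},
            "dog":   {"short-neck", "tail"},
        },
    },
    "bird": {
        "coarse": {"two-legs", "wings"},
        "subclasses": {
            "songbird": {"short-legs"},
            "wader":    {"long-legs"},
        },
    },
}

def recognize(features):
    for cls, entry in MEMORY.items():
        if entry["coarse"] <= features:          # coarse match at top level
            for sub, marks in entry["subclasses"].items():
                if marks <= features:            # refined match one level down
                    return sub
            return cls     # matched high up only: still "a bird", say
    return None

assert recognize({"two-legs", "wings", "long-legs"}) == "wader"
assert recognize({"two-legs", "wings"}) == "bird"   # an unfamiliar kind of bird
assert recognize({"four-limbs"}) is None
```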
Representing shapes of parts
Once the decomposition of a shape into its component axes is ac­
complished, the shapes of the components that are centered on each axis
must be specified as well. Marr and Nishihara conjecture that shapes of parts
may be described in terms of generalized cones (Binford, 1971). Just as a cone
can be defined as the surface traced out when a circle is moved along a
straight line perpendicular to the circle while its diameter steadily shrinks, a
generalized cone can be defined as the surface traced out when any planar
closed shape is moved along any smooth line with its size smoothly changing
in any way. Thus to specify a particular generalized cone, one must specify
Figure 3. Marr and Nishihara’s 3-D model description for a human shape. A shows
how the whole shape is decomposed into a hierarchy of models, each en­
closed by a rectangle. B shows the information contained in the model
description: the subordinate models contained in each superordinate, and
the location and orientation of the defining axis of each subordinate with
respect to a coordinate system centered on a part of the superordinate. The
meanings of the symbols used in the model are illustrated in C and D: the
endpoint of a subordinate axis is defined by three parameters in a cylindrical
coordinate system centered on a superordinate part (left panel of C); the
orientation and length of the subordinate axis are defined by three paramet­
ers in a spherical coordinate system centered on the endpoint and aligned
with the superordinate part (right panel of C). Angles and lengths are
specified by ranges rather than by exact values (D). From Marr and Nishihara (1978).



the shape of the axis (i.e., how it bends, if at all), the two-dimensional shape
of the generalized cone’s cross-section, and the gradient defining how its area
changes as a function of position along the axis. (Marr and Nishihara point
out that shapes formed by biological growth tend to be well-modeled by
generalized cones, making them good candidates for internal representations
of the shapes of such parts.) In addition, surface primitives such as rectangu­
lar, circular, or bloblike markings can also be specified in terms of their
positions with respect to the axis model.
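A generalized cone can be sketched directly from this definition: sweep a cross-section along an axis while scaling it. An ordinary cone, a cylinder, and a flaring vase differ only in the scaling function. (A toy parameterization with a straight axis; the numbers are illustrative.)

```python
# Sketch of a generalized cone: copy a planar cross-section along an axis,
# scaled by a function of position along that axis.
import math

def generalized_cone(cross_section, axis_length, scale_at, n_slices=5):
    """Points on the surface: the cross-section repeated along the axis,
    scaled by scale_at(t) with t running from 0 to 1."""
    surface = []
    for i in range(n_slices):
        t = i / (n_slices - 1)
        s = scale_at(t)
        z = t * axis_length
        surface.append([(s * x, s * y, z) for (x, y) in cross_section])
    return surface

circle = [(math.cos(a * math.pi / 8), math.sin(a * math.pi / 8))
          for a in range(16)]

cylinder = generalized_cone(circle, 2.0, lambda t: 1.0)        # constant width
cone     = generalized_cone(circle, 2.0, lambda t: 1.0 - t)    # shrinks to a point
vase     = generalized_cone(circle, 2.0, lambda t: 1.0 + 0.5 * t * t)

# The cone's final slice has collapsed onto the axis; the cylinder's has not.
assert all(abs(x) < 1e-9 and abs(y) < 1e-9 for (x, y, _) in cone[-1])
assert any(abs(x) > 0.5 for (x, _, _) in cylinder[-1])
```

A bent axis and a non-circular cross-section would be handled the same way, with the cross-section placed perpendicular to the local axis direction.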
Deriving 3-D descriptions from the 2½-D sketch
Unfortunately, this is an aspect of the Marr and Nishihara model that has
not been developed in much detail. Marr and Nishihara did outline a limited
process for deriving 3-D descriptions from the two-dimensional silhouette of
the object. The process first carves the silhouette into parts at extrema of
curvature, using a scheme related to the one proposed by Hoffman and
Richards (1984). Each part is given an axis coinciding with its direction of
elongation, and lines are created joining endpoints to neighboring axes. The
angles between axes and lines are measured and recorded, the resulting de­
scription is matched against top-level models in memory, and the best-matched model is chosen. At that point, constraints on how a part is situated
and oriented with respect to the superordinate axis in that model can be used
to identify the viewer-relative orientation of the part axis in the 2½-D sketch.
That would be necessary if the orientation of that part cannot be determined
by an examination of the sketch itself, such as when its axis is pointing toward
the viewer and hence is foreshortened. Once the angle of an axis is specified
more precisely, it can be used in selecting subordinate 3-D models for sub­
sequent matching.
The Marr and Nishihara model is the most influential contemporary model
of three-dimensional shape recognition, and it is not afflicted by many of the
problems that afflict the textbook models of shape representation sum­
marized earlier. Nonetheless, the model does have a number of problems,
which largely define the central issues to be addressed in current research on
shape recognition. In the next section, I summarize some of these problems.
Current problems in shape recognition research
Choice o f shape primitives to represent parts
The shape primitives posited by Marr and Nishihara—generalized cones
centered on axes of elongation or symmetry—have two advantages: they can



easily characterize certain important classes of objects, such as living things,
and they can easily be derived from their silhouettes. But Hoffman and
Richards (1984) point out that many classes of shapes cannot be easily de­
scribed in this scheme, such as faces, shoes, clouds, and trees. Hoffman and
Richards take a slightly different approach to the representation of parts in
a shape description. They suggest that the problem of describing parts (i.e.,
assigning them to categories) be separated from the problem of finding parts
(i.e., determining how to carve an object into parts). If parts are only found
by looking for instances of certain part categories (e.g., generalized cones)
then parts that do not belong to any of those categories would never be
found. Hoffman and Richards argue that, on the contrary, there is a
psychologically plausible scheme for finding part boundaries that is ignorant
of the nature of the parts it defines. The parts delineated by these boundaries
at each scale can be categorized in terms of a taxonomy of lobes and blobs
based on the patterns of inflections and extrema of curvature of the lobe’s
surface. (Hoffman (1983) has worked out a taxonomy for primitive shape
descriptors, called ‘codons’, for two-dimensional plane curves). They argue
not only that the decomposition of objects into parts is more basic for the
purposes of recognition than the description of each part, but that the derivation of part boundaries and the classification of parts into sequences of codon-like descriptors might present fewer problems than the derivation of axis-based descriptions, because the projective geometry of extrema and inflections of curvature allows certain reliable indicators of these extrema in the
image to be used as a basis for identifying them (see Hoffman, 1983).
Another alphabet of shape primitives that has proven useful in computer
vision consists of a set of canonical volumetric shapes such as spheres,
parallelepipeds, pyramids, cones, and cylinders, with parameterized sizes and
(possibly) aspect ratios, joined together in various ways to define the shape
of an object (see e.g., Hollerbach, 1975; Badler and Bajcsy, 1978). It is
unlikely that a single class of primitives will be sufficient to characterize all
shapes, from clothes lying in a pile to faces to animals to furniture. That
means that the derivation process must be capable of determining, prior to
describing and recognizing a shape, which type of primitives is appropriate
to it. There are several general schemes for doing this. A shape could be
described in parallel in terms of all the admissible representational schemes,
and descriptions in inappropriate schemes could be rejected because they are
unstable over small changes in viewing position or movement, or because no
single description within a scheme can be chosen over a large set of others
within that scheme. Or there could be a process that uses several coarse
properties of an object, such as its movement, surface texture and color,
dimensionality, or sound to give it an initial classification into broad cate-


S. Pinker

gories such as animal versus plant versus artifact, each with its own scheme of
primitives and their organization (e.g., see Richards (1979, 1982) on “playing
20 questions” with the perceptual input).
Assigning frames of reference to a shape
In a shape representation, size, location, and orientation cannot be
specified in absolute terms but only with respect to some frame of reference.
It is convenient to think of a frame of reference as a coordinate system
centered on or aligned with the reference object, and transformations within
or between reference frames as being effected by an analogue of matrix
multiplication taking the source coordinates as input and deriving the destina­
tion coordinates as output. However, a reference frame need not literally be
a coordinate system. For example, it could be an array of arbitrarily labelled
cells, where each cell represents a fixed position relative to a reference object.
In that case, transformations within or between such reference frames could
be effected by fixed connections among corresponding source and destination
cells (e.g., a network of connections linking each cell with its neighbor to the
immediate right could effect translation when activated iteratively; see e.g.,
Trehub, 1977).
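The two implementations just mentioned can be sketched side by side. This is an illustrative toy, not a model of Trehub's network; all of the numbers and the two-dimensional simplification are invented for the example:

```python
import numpy as np

# Matrix-style reference-frame change: a 2-D point in homogeneous
# coordinates is moved between frames by a single matrix multiplication.
def make_translation(dx, dy):
    return np.array([[1, 0, dx],
                     [0, 1, dy],
                     [0, 0, 1]], dtype=float)

point = np.array([2.0, 3.0, 1.0])          # (x, y) in the source frame
shifted = make_translation(5, -1) @ point  # coordinates in the destination frame

# Cell-array analogue: each cell stands for a fixed position, and a
# hard-wired "copy from left neighbour" connection, applied iteratively,
# effects translation without any coordinate arithmetic.
def shift_right(cells):
    return [0] + cells[:-1]    # each cell takes its left neighbour's activation

cells = [0, 0, 1, 0, 0]        # an object represented at position 2
for _ in range(2):             # two iterations translate it by two cells
    cells = shift_right(cells)
```

Both fragments compute the same displacement; they differ only in whether positions are explicit coordinates or implicit in which cell is active.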
If a shape is represented for the purpose of recognition in terms of a
coordinate system or frame of reference centered on the object itself, the
shape recognition system must have a way of determining what the object-centered frame of reference is prior to recognizing the object. Marr and
Nishihara conjecture that a coordinate system used in recognition may be
aligned with an object’s axes of elongation, bilateral symmetry, radial sym­
metry (for objects that are radially symmetrical in one plane and extended
in an orthogonal direction), rotation (for jointed objects), and possibly linear
movement. Each of these is suitable for aligning a coordinate system with an
object because each is derivable prior to object recognition and each is fairly
invariant for a type of object across changes in viewing position.
This still leaves many problems unsolved. For starters, these methods only
fix the orientation of one axis of the cylindrical coordinate system. The direc­
tion of the cylindrical coordinate system for that axis (i.e., which end is zero),
the orientation of the zero point of its radial scale, and the handedness of the
radial scale (i.e., whether increasing angle corresponds to going clockwise or
counterclockwise around the scale) are left unspecified, as is the direction of
one of the scales used in the spherical coordinate system specified within the
cylindrical one (assuming its axes are aligned with the axis of the cylindrical
system and the line joining it to the cylindrical system) (see Fig. 3c). Further­
more, even the choice of the orientation of the principal axis will be difficult
when an object is not elongated or symmetrical, or when the principal axis



is occluded, foreshortened, or physically compressed. For example, if the
top-level description of a cow shape describes the dispositions of its parts with
respect to the cow’s torso, then when the cow faces the viewer the torso is
not visible, so there is no way for the visual system to describe, say, the
orientations of the leg and head axes relative to its axis.
There is evidence that our assignment of certain aspects of frames of refer­
ence to an object is done independently of its intrinsic geometry. The posi­
tive-negative direction of an intrinsic axis, or the assignment of an axis to an
object when there is no elongation or symmetry, may be done by computing
a global up-down direction. Rock (1973, 1983) presents extensive evidence
showing that objects’ shapes are represented relative to an up-down direc­
tion. For example, a square is ordinarily ‘described’ internally as having a
horizontal edge at the top and bottom; when the square is tilted 45°, it is
described as having vertices at the top and bottom and hence is perceived as
a different shape, namely, a diamond. The top of an object is not, however,
necessarily the topmost part of the object’s projection on the retina: Rock
has shown that when subjects tilt their heads and view a pattern that, un­
known to them, is tilted by the same amount (so that it projects the same
retinal image), they often fail to recognize it. In general, the up-down direc­
tion seems to be assigned by various compromises among the gravitational
upright, the retinal upright, and the prevailing directions of parallelism,
pointing, and bilateral symmetry among the various features in the environ­
ment of the object (Attneave, 1968; Palmer and Bucher, 1981; Rock, 1973).
In certain circumstances, the front-back direction relative to the viewer may
also be used as a frame of reference relative to which the shape is described;
Rock et al. (1981) found that subjects would fail to recognize a previously learned asymmetrical wire form when it was rotated 90° about the vertical axis.
What about the handedness of the angular scale in a cylindrical coordinate
system (e.g., the θ parameter in Fig. 3)? One might propose that the visual
system employs a single arbitrary direction of handedness for a radial scale
that is uniquely determined by the positive-negative direction of the long axis
orthogonal to the scale. For example, we could use something analogous to
the ‘right hand rule’ taught to physics students in connection with the orien­
tation of a magnetic field around a wire (align the extended thumb of your
right hand with the direction of the flow of current, and look which way your
fingers curl). There is evidence, however, that the visual system does not use
any such rule. Shepard and Hurwitz (1984, in this issue; see also Hinton and
Parsons, 1981; Metzler and Shepard, 1975) point out that we do not in general
determine how parts are situated or oriented with respect to the left-right
direction on the basis of the intrinsic geometry of the object (e.g., when we
are viewing left and right hands). Rather, we assign the object a left-right



direction in terms of our own egocentric left and right sides. When an object’s
top and bottom do not correspond to an egocentric or gravitational top-bot­
tom direction, we mentally rotate it into such an orientation, and when two
unfamiliar objects might differ in handedness, we rotate one into the orienta­
tion of the other (taking greater amounts of time for greater angles of rota­
tion. Mental rotation is discussed further later in this paper). Presumably this
failure to assign objects intrinsic left and right directions is an evolutionary
consequence of the fact that aside from human artifacts and body parts,
virtually no class of ecologically significant shapes need be distinguished from
their enantiomorphs (Corballis and Beale, 1976; Gardner, 1967).
To the extent that a shape is described with respect to a reference frame
that depends on how the object is oriented with respect to the viewer or the
environment, shape recognition will fail when the object moves with respect
to the viewer or environment. In cases where we do succeed at recognizing
an object across its different dispositions and where object-centered frames
cannot be assigned, there are several possible reasons for such success. One
is that multiple shape descriptions, corresponding to views of the object with
different major axes occluded, are stored under a single label and correspond­
ing parts in the different descriptions are linked. Another is that the represen­
tation of the object is rotated into a canonical orientation or until the descrip­
tion of the object relative to the frame matches a memorized shape descrip­
tion; alternatively, the reference frame or canonical orientation could be
rotated into the orientation of the object. Interestingly, there is evidence
from Cooper and Shepard (1973) and Shepard and Hurwitz (1984) that the
latter option (rotating an empty reference frame) is difficult or impossible for
humans to do: advance information about the orientation of an upcoming
visual stimulus does not spare the perceiver from having to rotate the stimulus
mentally when it does appear in order to judge its handedness.2 A third
possibility stems from Hoffman and Richards’s (1984) suggestion that part
segmentation may be independent of orientation, and that only the represen­
tations of spatial relations among parts are orientation-sensitive. If so, recog­
nition of an isolated part can be used as an index to find the objects in
memory that contain that part. Finally, in some cases recognition might fail
outright with changes in orientation but the consequences might be innocu-

2Hinton and Parsons (1981) have shown that when the various stimuli to be judged all conform to a single
shape schema (e.g., alphanumeric characters with a vertical spine and strokes attached to the right side of the
spine, such as ‘R’, ‘L’, and ‘F’), advance information about orientation saves the subject from having to rotate
the stimulus. However, it is possible that in their experiment subjects rotated a concrete image of a vertical
spine plus a few strokes, rather than an empty reference frame.



ous. Because of the pervasiveness of gravity, many shapes will rarely be seen
in any position but the upright (e.g., faces, trees), and many of the differences
in precise shape among objects lacking axes of symmetry, movement, rota­
tion, or elongation are not ecologically significant enough for us to distinguish
among them in memory (e.g., differences among bits of gravel or crumpled
newspaper). Naturally, to the extent that any of the suggestions made in this
paragraph are true, the importance of Marr and Nishihara’s argument for
canonical object-centered descriptions lessens.3
Frames of reference for the visual field
We not only represent the shapes of objects internally; we also represent
the locations and orientations of objects and surfaces in the visual field. The
frames of reference that we use to represent this information will determine
the ease with which we can make various spatial judgments. The relevant
issues are the alignment of the frames of reference, and the form of the
frames of reference.
Early visual representations are in a viewer-centered and approximately
spherical frame of reference; that is, our eyes give us information about the
world in terms of the azimuth and elevation of the line of sight at which the
features are found relative to the retina, and their distance from the viewing
position (this is the coordinate system used for the 2½-D sketch). Naturally,
this is a clumsy representation to use in perceiving invariant spatial relations,
since the information will change with eye movements. The system can com­
pensate for eye movements by superimposing a head-centered coordinate
system on top of the retinal system and moving the origin of that coordinate
system in conjunction with eye movement commands. Thus every cell in the
2½-D sketch would be represented by the fixed ‘address’ defined with respect
to the retina, and also by its coordinates with respect to the head, which
would be dynamically adjusted during eye movements so that fixed locations
in the world retain a constant coordinate address within the head-centered
system. A third coordinate system, defined over the same information, could
represent position with respect to the straight ahead direction of the body

3Specifying the origin of the object-centered coordinate system presents a slightly different set of issues
than specifying the orientation of its axes. An origin for an object-centered frame can be determined by finding
its visual center of mass or by assigning it to one end of a principal axis. It is noteworthy that there are no
obvious cases where we fail to recognize an object when it is displaced, where we see a shape as ambiguous
by virtue of assigning different ‘centers’ or ‘reference locations’ to it (analogous to the diamond/tilted square
ambiguity), or where we have to mentally translate an object in order to recognize it or match it against a
comparison object. This indicates either that the procedure that assigns an origin to an object on the basis of
its intrinsic geometry always yields a unique solution for an object, or that, as Hinton (1979a) suggests, we
do not compute an origin at all in shape descriptions, only a set of significant directions.



and it could be updated during head movements to represent the invariant
position of surfaces across those movements. Other coordinate systems could
be defined over these visible surface representations as well, such as coordi­
nate systems aligned with the gravitational upright and horizontal ground
(see Shepard and Hurwitz, 1984), with fixed salient landmarks in the world,
or with the prevailing directions of large surfaces (e.g., the walls in a tilted
room). These coordinate systems for objects’ positions with respect to one’s
body or with respect to the environment could be similar to those used to
represent the parts of an object with respect to the object as a whole. Presum­
ably they are also like coordinate systems for objects’ shapes in being or­
ganized hierarchically, so that a paper clip might be represented by its posi­
tion with respect to the desk tray it is in, whose position is specified with
respect to the desk, whose position is specified with respect to the room.
Beyond the visual world, the origin and orientation of large frames of refer­
ence such as that for a room could be specified in a hierarchy of more schema­
tic frames of reference for entities that cannot be seen in their entirety, such
as those for floor plans, buildings, streets, neighborhoods and so on (see e.g.,
Kuipers, 1978; Lynch, 1960; McDermott, 1980).
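The hierarchical scheme described above can be illustrated by composing frames: each object's position is specified only relative to its parent, and a position in the room frame falls out of the chain of transforms. The coordinates are hypothetical, and 2-D translations stand in for full 3-D transforms for brevity:

```python
import numpy as np

# A frame is represented here as a homogeneous translation matrix
# giving a child's origin relative to its parent frame.
def frame(dx, dy):
    return np.array([[1, 0, dx],
                     [0, 1, dy],
                     [0, 0, 1]], dtype=float)

desk_in_room = frame(4.0, 1.0)   # desk relative to the room
tray_on_desk = frame(0.5, 0.2)   # tray relative to the desk
clip_in_tray = frame(0.1, 0.1)   # paper clip relative to the tray

# The clip's room coordinates are recovered by composing the chain;
# no object stores its absolute position directly.
clip_in_room = desk_in_room @ tray_on_desk @ clip_in_tray
x, y = clip_in_room[0, 2], clip_in_room[1, 2]   # approximately (4.6, 1.3)
```

The same composition would extend upward through the more schematic frames for buildings and streets mentioned above.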
The possible influence of various frames of reference on shape perception
can be illustrated by an unpublished experiment by Laurence Parsons and
Geoff Hinton. They presented subjects with two Shepard-Metzler cube fi­
gures, one situated 45° to the left of the subject, another at 45° to the right.
The task was to turn one of the objects (physically) to whatever orientation
best allowed the subject to judge whether the two were identical or whether
one was a mirror-reversed version of the other (subjects were allowed to
move their heads around the neck axis). If objects were represented in coor­
dinate systems centered upon the objects themselves, subjects would not
have to turn the object at all (we know from the Shepard and Metzler
studies that this is quite unlikely to be true for these stimuli). If objects are
represented in a coordinate system aligned with the retina, subjects should
turn one object until the corresponding parts of the two objects are perpendicular to each other, so that they will have the same orientations with respect
to their respective lines of sight. And if shapes are represented in a coordinate
system aligned with salient environmental directions (e.g., the walls), one
object would be turned until its parts are parallel to those of the other, so
that they will have the same orientations with respect to the room. Parsons
and Hinton found that subjects aligned one object so that it was nearly parallel with the other, with a partial compromise toward keeping the objects’ retinal projections similar (possibly so that corresponding cube faces on the two
objects would be simultaneously visible). This suggests that part orientations
are represented primarily with respect to environmentally-influenced frames.



The choice of a reference object, surface, or body part is closely tied to
the format of the coordinate system aligned with the frame of reference, since
rotatable objects (such as the eyes) and fixed landmarks easily support coor­
dinate systems containing polar scales, whereas reference frames with ortho­
gonal directions (e.g., gravity and the ground, the walls of a room) easily
support Cartesian-like coordinate systems. The type of coordinate system
employed has effects on the ease of making certain spatial judgments. As
mentioned, the 2½-D sketch represents information in a roughly spherical
coordinate system, with the result that the easiest information to extract
concerning the position of an edge or feature is its distance and direction with
respect to the vantage point. As Marr (1982) points out, this representation
conceals many of the geometric properties of surfaces that it would be desir­
able to have easy access to; something closer to a Cartesian coordinate system
centered on the viewer would be much handier for such purposes. For exam­
ple, if two surfaces in different parts of the visual field are parallel, their
orientations as measured in a spherical coordinate system will be different,
but their orientations as measured in a coordinate system with a parallel
component (e.g., Cartesian) will be the same (see Fig. 4). If a surface is flat,
the represented orientations of all the patches composing its surface will be
identical in Cartesian, but not in spherical coordinates. Presumably, size constancy could also be a consequence of such a coordinate system, if a given range of coordinates in the left-right or up-down directions always stood for a constant real-world distance regardless of the depth of the represented surface.

Figure 4. Effects of rectangular versus polar coordinate systems on making spatial judgments. Whether two surfaces are parallel can be assessed by comparing their angles with respect to the straight ahead direction in a rectangular coordinate system (b), but not by comparing their angles with respect to the lines of sight in a polar system (a). From Marr (1982).
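The contrast drawn from Marr (1982) can be made concrete with a small sketch. The positions and surface normal are invented for illustration:

```python
import numpy as np

# Two parallel wall patches at different positions in the visual field,
# both with the same surface normal (facing back toward the viewer plane).
normal = np.array([0.0, 0.0, -1.0])
p1 = np.array([-2.0, 0.0, 5.0])    # patch off to the viewer's left
p2 = np.array([ 3.0, 0.0, 5.0])    # patch off to the viewer's right

# Cartesian-style frame: orientation is recorded as the normal itself,
# so the two parallel patches receive identical descriptions.
cartesian_same = np.allclose(normal, normal)

# Spherical, viewer-centred frame: orientation is recorded relative to
# each patch's own line of sight, which differs between the two patches.
def angle_to_line_of_sight(position, n):
    sight = position / np.linalg.norm(position)
    return np.degrees(np.arccos(np.clip(np.dot(sight, n), -1.0, 1.0)))

a1 = angle_to_line_of_sight(p1, normal)
a2 = angle_to_line_of_sight(p2, normal)   # a1 != a2 despite parallelism
```

Checking whether the two patches are parallel is thus a trivial comparison in the Cartesian frame but requires correcting for each line of sight in the polar one.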
One potentially relevant bit of evidence comes from a phenomenon studied
by Corcoran (1977), Natsoulas (1966), and Kubovy et al. (1984, Reference
note 1). When an asymmetric letter such as ‘d’ is traced with a finger on the
back of a person's head, the person will correctly report what the letter is.
But when the same letter is traced on the person's forehead, the mirror image
of that letter is reported instead (in this case, ‘b’). This would follow if space
(and not just visible space) is represented in a parallel coordinate system
aligned with a straight ahead direction, such as that shown in Fig. 4b. The
handedness of a letter would be determined by whether its spine was situated
to the left or right of the rest of its parts, such that ‘left’ and ‘right’ would be
determined by a direction orthogonal to the straight ahead direction, regard­
less of where on the head the letter is drawn. The phenomenon would not
be expected in an alternative account, where space is represented using spher­
ical coordinates centered on a point at or behind the eyes (e.g., Fig. 4a),
because then the letter would be reported as if ‘seen’ from the inside of a
transparent skull, with letters traced on the back of the head reported as
mirror-reversed, contrary to fact.
In many experiments allowing subjects to choose between environmental,
Cartesian-like reference frames and egocentric, spherical reference frames,
subjects appear to opt for a compromise (e.g., the Parsons and Hinton and
Kubovy et al. studies; see also Attneave, 1972; Gilinsky, 1955; Uhlarik et al.
1980). It is also possible that we have access to both systems, giving rise to
ambiguities when a single object is alternatively represented in the two sys­
tems, for example, when railroad tracks are seen either as parallel or as
converging (Boring, 1952; Gibson, 1950; Pinker, 1980a), or when the corner
formed by two edges of the ceiling of a room can be seen both as a right
angle and as an obtuse angle.
Deriving shape descriptions
One salient problem with the Marr and Nishihara model of shape recogni­
tion in its current version is that there is no general procedure for deriving
an object-centered 3-D shape description from the 2½-D sketch. The algorithm proposed by Marr and Nishihara, which uses the two-dimensional silhouette of a shape to find its intrinsic axes, has trouble deriving shape descriptions when axes are foreshortened or occluded by other parts of the
object (as Marr and Nishihara pointed out). In addition, the procedures it
uses for joining up part boundaries to delineate parts, to find axes of parts
once they are delineated, and to pair axes with one another in adjunct rela­
tions rely on some limited heuristics that have not been demonstrated to
work other than for objects composed of generalized cones—but the per-



ceiver cannot in general know prior to recognition whether he or she is
viewing such an object. Furthermore, there is no explicit procedure for group­
ing together the parts that belong together in a single hierarchical level in the
3-D model description. Marr and Nishihara suggest that all parts lying within
a ‘coarse spatial context’ surrounding an axis can be placed within the scope
of the model specific to that axis, but numerous problems could arise when
unrelated parts are spatially contiguous, such as when a hand is resting on a
knee. Some of these problems perhaps could be resolved using an essentially
similar scheme when information that is richer than an object’s silhouette is
used. For example, the depth, orientation, and discontinuity information in
the 2½-D sketch could assist in the perception of foreshortened axes (though
not when the blunt end of a tapering object faces the viewer squarely), and
information about which spatial frequency bandpass channels an edge came
from could help in the segregation of parts into hierarchical levels in a shape description.
A general problem in deriving shape representations from the input is that,
as mentioned, the choice of the appropriate reference frame and shape primi­
tives depends on what type of shape it is, and shapes are recognized via their
description in terms of primitives relative to a reference frame. In the remain­
der of this section I describe three types of solutions to this chicken-and-egg problem.
Top-down processing
One response to the inherent difficulties of assigning descriptions to objects
on the basis of their retinal images is to propose that some form of ancillary
information based on a person’s knowledge about regularities in the world is
used to choose the most likely description or at least to narrow down the
options (e.g., Gregory, 1970; Lindsay and Norman, 1977; Minsky, 1975;
Neisser, 1967). For example, a cat-owner could recognize her cat upon seeing
only a curved, long, grey stripe extending out from underneath her couch,
based on her knowledge that she has a long-tailed grey cat that enjoys lying
there. In support of top-down, or, more precisely, knowledge-guided percep­
tual analysis, Neisser (1967), Lindsay and Norman (1977), and Gregory
(1970) have presented many interesting demonstrations of possible retinal
ambiguities that may be resolved by knowledge of physical or object-specific
regularities, and Biederman (1972), Weisstein and Harris (1974) and Palmer
(1975b) and others have shown that the recognition of an object, a part of
an object, or a feature can depend on the identity that we attribute to the
context object or scene as a whole.
Despite the popularity of the concept of top-down processing within cog­
nitive science and artificial intelligence during much of the 1960s and 1970s,



there are three reasons to question the extent to which general knowledge
plays a role in describing and recognizing shapes. First, many of the supposed
demonstrations of top-down processing leave it completely unclear what kind
of knowledge is brought to bear on recognition (e.g., regularities about the
geometry of physical objects in general, about particular objects, or about
particular scenes or social situations), and how that knowledge is brought to
bear (e.g., altering the order in which memory representations are matched
against the input, searching for particular features or parts in expected places,
lowering the goodness-of-fit threshold for expected objects, generating and
fitting templates, filling in expected parts). Fodor (1983) points out that these
different versions of the top-down hypothesis paint very different pictures of
how the mind is organized in general: if only a restricted type of knowledge
can influence perception in a top-down manner, and then only in restricted
ways, the mind may be constructed out of independent modules with re­
stricted channels of communication among them. But if all knowledge can
influence perception, the mind could consist of an undifferentiated knowl­
edge base and a set of universal inference procedures which can be combined
indiscriminately in the performance of any task. Exactly which kind of top-down processing is actually supported by the data can make a big difference
in one’s conception of how the mind works; Fodor argues that so far most
putative demonstrations of top-down phenomena are not designed to distin­
guish among possible kinds of top-down processing and so are uninformative
on this important issue.
A second problem with extensive top-down processing is that there is a
great deal of information about the world that is contained in the light array,
even if that information cannot be characterized in simple familiar schemes
such as templates or features (see Gibson, 1966, 1979; Marr, 1982). Given
the enormous selection advantage that would be conferred on an organism
that could respond to what was really in the world as opposed to what it
expected to be in the world whenever these two descriptions were in conflict,
we should seriously consider the possibility that human pattern recognition
has the most sophisticated bottom-up pattern analyses that the light array and
the properties of our receptors allow. And as Ullman (1984, this issue) points
out, we do appear to be extremely accurate perceivers even when we have
no basis for expecting one object or scene to occur rather than another, such
as when watching a slide show composed of arbitrary objects and scenes.
Two-stage analysis of objects
Ullman (1984) suggests that our visual systems may execute a universal set
of ‘routines’ composed of simple processes operating on the 2½-D sketch,
such as tracing along a boundary, filling in a region, marking a part, and



sequentially processing different locations. Once universal routines are exe­
cuted, their outputs could characterize some basic properties of the promi­
nent entities in the scene such as their rough shape and spatial relationships.
This characterization could then trigger the execution of routines specific to
the recognition of particular objects or classes of objects. Because routines
can be composed of arbitrary sequences of very simple but powerful proces­
ses, it might be possible to compile both a core of generally useful routines,
plus a large set of highly specific routines suitable for the recognition of very
different classes of objects, rather than a canonical description scheme that
would have to serve for every object type. (In Ullman’s theory visual routines
would be used not only for the recognition of objects but also for geometric
reasoning about the surrounding visual environment such as determining
whether one object is inside another or counting objects.) Richards (1979,
1982) makes a related proposal concerning descriptions for recognition, spec­
ifically, that one might first identify various broad classes of objects such as
animal, vegetable, or mineral by looking for easily sensed hallmarks of these
classes such as patterns of movement, color, surface texture, dimensionality,
even coarse spectral properties of their sounds. Likely reference frames and
shape primitives could then be hypothesized based on this first-stage categorization.
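One of the elementary processes in Ullman's proposal, filling in a region, can be sketched as an ordinary flood fill over a toy binary image. This illustrates the kind of simple process meant, not Ullman's own implementation; the image and labels are invented:

```python
from collections import deque

def fill_region(image, start, label=2):
    """Flood-fill the connected region of 0-cells containing `start`.

    `image` is a list of lists of ints; 1-cells act as boundaries.
    """
    rows, cols = len(image), len(image[0])
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        if 0 <= r < rows and 0 <= c < cols and image[r][c] == 0:
            image[r][c] = label   # mark this cell as part of the region
            queue.extend([(r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)])
    return image

# A closed boundary of 1s: filling from an interior point marks only the
# inside, which is the basis for inside/outside judgments of the sort
# Ullman discusses.
img = [[0, 1, 1, 1, 0],
       [0, 1, 0, 1, 0],
       [0, 1, 1, 1, 0],
       [0, 0, 0, 0, 0]]
filled = fill_region(img, (1, 2))
```

Composed with boundary tracing and location marking, such a primitive could serve both general-purpose and object-specific routines.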
Massively parallel models
There is an alternative approach that envisions a very different type of
solution from that suggested by Richards, and that advocates very different
types of mechanisms from those described in this issue by Ullman. Attneave
(1982), Hinton (1981) and Hrechanyk and Ballard (1982) have outlined re­
lated proposals for a model of shape recognition using massively parallel
networks of simple interconnected units, rather than sequences of operations
performed by a single powerful processor (see Ballard et al., 1983, Feldman
and Ballard, 1982, and Hinton and Anderson, 1981, for introductions to this
general class of computational architectures).
A favorite analogy for this type of computation (e.g., Attneave, 1982) is
the problem of determining the shape of a film formed when an irregularly
shaped loop of wire is dipped in soapy water (the shape can be characterized
by quantizing the film surface into patches and specifying the height of each
patch). The answer to the problem is constrained by the ‘least action’ princi­
ple ensuring that the height of any patch of the film must be close to the
heights of all its neighboring patches. But how can this information be used
if one does not know beforehand the heights of all the neighbors of any
patch? One can solve the problem iteratively, by assigning every patch an
arbitrary initial height except for those patches touching the wire loop, which



are assigned the same heights as the piece of wire they are attached to. Then
the height of each of the other patches is replaced by the average height of
its neighbors. This is repeated through several iterations; eventually the array
of heights converges on a single set of values corresponding to the shape of
the film, thanks to constraints on height spreading inward from the wire. The
solution is attained without knowing the height of any single interior patch a
priori, and without any central processor.
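The relaxation scheme just described is easy to sketch in code. The following is a minimal illustration, not any published implementation; the grid size, boundary heights, and iteration count are arbitrary assumptions. Patches on the 'wire' keep fixed heights, and every other patch is repeatedly replaced by the average of its neighbors.

```python
# Iterative relaxation for the soap-film analogy: patches attached to the
# "wire loop" keep fixed heights; every interior patch is repeatedly
# replaced by the average of its available neighbors.

def relax_film(heights, fixed, iterations=500):
    """heights: 2D list of floats; fixed: set of (row, col) wire-attached cells."""
    rows, cols = len(heights), len(heights[0])
    for _ in range(iterations):
        new = [row[:] for row in heights]
        for r in range(rows):
            for c in range(cols):
                if (r, c) in fixed:
                    continue  # wire-attached patches never change
                neighbors = [heights[r2][c2]
                             for r2, c2 in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                             if 0 <= r2 < rows and 0 <= c2 < cols]
                new[r][c] = sum(neighbors) / len(neighbors)
        heights = new
    return heights

# Example: a 5x5 film whose left edge is held at height 1.0 and right edge at 0.0.
grid = [[0.0] * 5 for _ in range(5)]
for r in range(5):
    grid[r][0] = 1.0
fixed = {(r, 0) for r in range(5)} | {(r, 4) for r in range(5)}
film = relax_film(grid, fixed)
```

With the left edge held at height 1.0 and the right edge at 0.0, the interior settles into a smooth gradient between them, even though no interior patch's height was known in advance and no central processor coordinated the computation.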
Similarly, it may be possible to solve some perceptual problems using
networks of simple units whose excitatory and inhibitory interconnections
lead the entire network to converge to states corresponding to certain
geometric constraints that must be satisfied when recognition succeeds. Marr
and Poggio (1976) proposed such a ‘cooperative’ model for stereo vision that
simultaneously finds the relative distance from the viewer of each feature in
a pair of stereoscopic images and which feature in one image corresponds with
a given feature in the other. It does so by exploiting the constraints that each
feature must have a single disparity and that neighboring features mostly
have similar disparities.
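A schematic one-dimensional version of such a cooperative network can illustrate how the two constraints interact. This is an illustrative sketch only, not Marr and Poggio's actual algorithm (which operated over two-dimensional images); the weights and threshold here are invented for the example. Units at the same position with different candidate disparities inhibit one another (uniqueness), while units at neighboring positions with the same disparity excite one another (continuity).

```python
# Schematic 1-D cooperative stereo network. A unit C[x][d] represents the
# hypothesis that the feature at position x has disparity d. Uniqueness:
# units at the same position with different disparities inhibit each other.
# Continuity: units at neighboring positions with the same disparity excite
# each other. (All parameters are illustrative.)

def cooperate(C, steps=10, excite=1.0, inhibit=1.0, threshold=1.5):
    positions, disparities = len(C), len(C[0])
    initial = [row[:] for row in C]  # keep the bottom-up input
    for _ in range(steps):
        new = [[0] * disparities for _ in range(positions)]
        for x in range(positions):
            for d in range(disparities):
                support = sum(C[x2][d] for x2 in (x - 1, x + 1)
                              if 0 <= x2 < positions)
                rivalry = sum(C[x][d2] for d2 in range(disparities) if d2 != d)
                score = initial[x][d] + excite * support - inhibit * rivalry
                new[x][d] = 1 if score >= threshold else 0
        C = new
    return C

# Five positions, three candidate disparities. The true surface has
# disparity 1 everywhere; position 2 also carries a spurious match at
# disparity 0, which the network should suppress.
C = [[0, 1, 0], [0, 1, 0], [1, 1, 0], [0, 1, 0], [0, 1, 0]]
C = cooperate(C)
```

After a few iterations the spurious match loses its rivalry with the well-supported disparity-1 unit at the same position, and the network settles on a single smooth surface.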
In the case of three-dimensional shape recognition, Attneave, Hinton, and
Hrechanyk and Ballard point out that there are constraints on how shape
elements and reference frames may be paired that might be exploitable in
parallel networks to arrive at both simultaneously. First, every part of an
object must be described with respect to the same object-centered reference
frame (or at least, every part of an object in a circumscribed region at a given
scale of decomposition; see the discussion of the Marr and Nishihara model).
For example, if one part is described as the front left limb of an animal standing
broadside to the viewer and facing to the left, another part of the same object
cannot simultaneously be described as the rear left limb of that animal facing
to the right. Second, a description of parts relative to an object-centered
frame is to be favored if that description corresponds to an existing object de­
scription in memory. For example, a horizontal part will be described as the
downward-pointing leg of a chair lying on its back rather than as the
forward-facing leg of an unfamiliar upright object.
These constraints, it is argued, can be used to converge on a unique correct
object-centered description in a network of the following sort. There is a
retina-based unit for every possible part at every retinal size, location, and
orientation. There is also an object-based unit for every orientation, location,
and size of a part with respect to an object axis. Of course, these units cannot
be tied to individual retina-based units, but each object-based unit can be
connected to the entire set of retina-based units that are geometrically consis­
tent with it. Every shape description in memory consists of a shape unit that
is connected to its constituent object-based units. Finally, all the pairs of

Visual cognition

object- and retina-based units that correspond to a single orientation of the
object axis relative to the viewer are themselves tied together by a mapping
unit, such that the system contains one such unit for each possible spatial
relation between object and viewer. An example of such a network, taken
from Hinton (1981), is shown in Fig. 5.

Figure 5. A portion of a massively parallel network model for shape recognition,
comprising shape units, object-based units, retina-based units, and mapping
units. Triangular symbols indicate special multiplicative connections: the
product of the activation levels of a retina-based unit and a mapping unit is
transmitted to an object-based unit, and the product of the activation levels
in those retina-based and object-based units is transmitted to the mapping
unit. From Hinton (1981).
The system’s behavior is characterized as follows. The visual input activates
retina-based units. Retina-based units activate all the object-based units they
are connected to (this will include all object-based units that are geometrically
compatible with the retinal features, including units that are inappropriate
for the current object). Object-based units activate their corresponding shape
units (again, both appropriate and inappropriate ones). Joint activity in par­
ticular retina- and object-based units activate the mapping units linking the
two, that is, the mapping units that represent vantage points (relative to an
object axis) for which those object-based features project as those retinabased features. Similarly, joint activity in retina-based and mapping units
activate the corresponding object-based units. Shape units activate their cor­
responding object-based units; and (presumably) shape units inhibit other
shape units and mapping units inhibit other mapping units. Hinton (1981)
and Hrechanyk and Ballard (1982) argue that such networks should enter
into a positive feedback loop converging on a single active shape unit, repre­
senting the recognized shape, and a single active mapping unit, representing
the orientation and position of its axis with respect to the viewer, when a
familiar object is viewed.
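The multiplicative ('triangular') connections at the heart of these proposals can be illustrated with a drastically simplified sketch: a one-dimensional world in which the only viewpoint parameter is translation, so each mapping unit stands for one shift of the object frame across the retina. Shape units proper, orientation, and scale are omitted, and all numbers are invented for the example; this is not Hinton's or Hrechanyk and Ballard's actual model.

```python
# Drastically simplified 1-D, translation-only sketch of multiplicative
# connections: products of retina- and object-based activity drive mapping
# units, and products of retina and mapping activity drive object-based
# units, sharpening toward a single dominant mapping.

N = 8  # positions in both the retinal and the object-centered frame

def normalize(v):
    total = sum(v)
    return [x / total for x in v] if total else v

def settle(retina, stored_shape, steps=5):
    """retina: retinal feature activities; stored_shape: top-down
    object-based expectation. Returns mapping-unit activities, one per
    possible translation of the object frame across the retina."""
    mapping = [1.0 / N] * N            # all viewpoints equally likely at first
    object_based = stored_shape[:]     # top-down support from the stored shape
    for _ in range(steps):
        # retina x object-based products vote for the consistent mapping
        mapping = normalize([
            sum(retina[(j + t) % N] * object_based[j] for j in range(N))
            for t in range(N)])
        # retina x mapping products vote for object-based units
        object_based = normalize([
            stored_shape[j] * sum(mapping[t] * retina[(j + t) % N]
                                  for t in range(N))
            for j in range(N)])
    return mapping

# A "shape" occupying object-centered positions 0-2, seen shifted by 3.
shape = [1, 1, 1, 0, 0, 0, 0, 0]
retina = [0, 0, 0, 1, 1, 1, 0, 0]
mapping = settle(retina, shape)
best = max(range(N), key=lambda t: mapping[t])
```

Each iteration, the positive feedback between the two votes sharpens the mapping vector until the correct translation dominates, in the spirit of the convergence Hinton and Hrechanyk and Ballard describe.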
In general, massively parallel models are effective at avoiding the search
problems that accompany serial computational architectures. In effect, the
models are intended to assess the goodness-of-fit between all the transforma­
tions of an input pattern and all the stored shape descriptions in parallel,
finding the pair with the highest fit at the same time. Since these models are
in their infancy, it is too early to evaluate the claims associated with them.
Among other things, it will be necessary to determine: (a) whether the model
can be interfaced to preprocessing systems that segregate an object from its
background and isolate sets of parts belonging to a single object-centered
frame at a single scale; (b) whether the initial activation of object-based units
by retina-based units is selective enough to head the network down a path
leading toward convergence to unique, correct solutions; (c) whether the
number of units and interconnections among units needed to represent the
necessary combinations of shapes, parts, and their dispositions is neurologically feasible; and (d) whether these networks can overcome their current
difficulty at representing and recognizing relations among parts in complex
objects and scenes, in addition to the parts themselves.

Visual imagery

Visual imagery has always been a central topic in the study of cognition. Not
only is it important to understand our ability to reason about objects and
scenes that are remembered rather than seen, but the study of imagery is tied
to the question of the number and format of mental representations, and of
the interface between perception and cognition. Imagery may also be a
particularly fruitful topic for study among the higher thought processes because
of its intimate connection with shape recognition, benefitting from the prog­
ress made in that area. Finally, the subject of imagery is tied to scientific and
literary creativity, mathematical insight, and the relation between cognition
and emotion (see the papers in Sheikh, 1983); though the scientific study o