Visual Cognition
Visual Cognition
edited by Steven Pinker

A Cognition Special Issue

First published as a special issue of Cognition, Jacques Mehler, Editor; Visual Cognition, Steven Pinker, Guest Editor.

A Bradford Book
The MIT Press
Cambridge, Massachusetts
London, England

Third printing, 1988
First MIT Press edition, 1985

Copyright © 1984 by Elsevier Science Publishers B.V., Amsterdam, The Netherlands

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior permission of the copyright owner.

Reprinted from Cognition: International Journal of Cognitive Psychology, volume 18 (ISSN: 0010-0277). The MIT Press has exclusive license to sell this English-language book edition throughout the world.

Printed and bound in the United States of America

Library of Congress Cataloging in Publication Data
Main entry under title: Visual cognition.
(Computational models of cognition and perception)
“A Bradford book.”
“Reprinted from Cognition: international journal of cognitive psychology, volume 18”--T.p. verso.
Bibliography: p.
Includes index.
1. Visual perception. 2. Cognition. I. Pinker, Steven, 1954- . II. Series.
BF241.V564 1985 153.7 85-24155
ISBN 0-262-16103-6

Contents

Preface
Visual Cognition: An Introduction (Steven Pinker)
Parts of Recognition (Donald D. Hoffman and Whitman A. Richards)
Visual Routines (Shimon Ullman)
Upward Direction, Mental Rotation, and Discrimination of Left and Right Turns in Maps (Roger N. Shepard and Shelley Hurwitz)
Individual Differences in Mental Imagery Ability: A Computational Analysis (Stephen M. Kosslyn, Jennifer Brunn, Kyle R. Cave, and Roger W. Wallach)
The Neurological Basis of Mental Imagery: A Componential Analysis (Martha J. Farah)
Index

Preface

This collection of original research papers on visual cognition first appeared as a special issue of Cognition: International Journal of Cognitive Science. The study of visual cognition has seen enormous progress in the past decade, bringing important advances in our understanding of shape perception, visual imagery, and mental maps. Many of these discoveries are the result of converging investigations in different areas, such as cognitive and perceptual psychology, artificial intelligence, and neuropsychology. This volume is intended to highlight a sample of work at the cutting edge of this research area for the benefit of students and researchers in a variety of disciplines. The tutorial introduction that begins the volume is designed to help the nonspecialist reader bridge the gap between the contemporary research reported here and earlier textbook introductions or literature reviews.

Many people deserve thanks for their roles in putting together this volume: Jacques Mehler, Editor of Cognition; Susana Franck, Editorial Associate of Cognition; Henry Stanton and Elizabeth Stanton, Editors of Bradford Books; Kathleen Murphy, Administrative Secretary, Department of Psychology, MIT; Loren Ann Frost, who compiled the index; and the ad hoc Cognition referees who reviewed manuscripts for the special issue. I am also grateful to Nancy Etcoff, Stephen Kosslyn, and Laurence Parsons for their advice and encouragement. Preparation of this volume was supported by NSF grants BNS 82-16546 and BNS 82-19450, by NIH grant 1R01 HD 18381, and by the MIT Center for Cognitive Science under a grant from the A. P. Sloan Foundation.
Visual cognition: An introduction*

STEVEN PINKER
Massachusetts Institute of Technology

Abstract

This article is a tutorial overview of a sample of central issues in visual cognition, focusing on the recognition of shapes and the representation of objects and spatial relations in perception and imagery. Brief reviews of the state of the art are presented, followed by more extensive presentations of contemporary theories, findings, and open issues. I discuss various theories of shape recognition, such as template, feature, Fourier, structural description, Marr-Nishihara, and massively parallel models, and issues such as the reference frames, primitives, top-down processing, and computational architectures used in spatial cognition. This is followed by a discussion of mental imagery, including conceptual issues in imagery research, theories of imagery, imagery and perception, image transformations, computational complexities of image processing, neuropsychological issues, and possible functions of imagery. Connections between theories of recognition and of imagery, and the relevance of the papers contained in this issue to the topics discussed, are emphasized throughout.

Recognizing and reasoning about the visual environment is something that people do extraordinarily well; it is often said that in these abilities an average three-year-old makes the most sophisticated computer vision system look embarrassingly inept.
Our hominid ancestors fabricated and used tools for millions of years before our species emerged, and the selection pressures brought about by tool use may have resulted in the development of sophisticated faculties allowing us to recognize objects and their physical properties, to bring complex knowledge to bear on familiar objects and scenes, to negotiate environments skillfully, and to reason about the possible physical interactions among objects present and absent. Thus visual cognition, no less than language or logic, may be a talent that is central to our understanding of human intelligence (Jackendoff, 1983; Johnson-Laird, 1983; Shepard and Cooper, 1982).

*Preparation of this paper was supported by NSF grants BNS 82-16546 and 82-09540, by NIH grant 1R01HD18381-01, and by a grant from the Sloan Foundation awarded to the MIT Center for Cognitive Science. I thank Donald Hoffman, Stephen Kosslyn, Jacques Mehler, Larry Parsons, Whitman Richards, and Ed Smith for their detailed comments on an earlier draft, and Kathleen Murphy and Rosemary Krawczyk for assistance in preparing the manuscript. Reprint requests should be sent to Steven Pinker, Psychology Department, M.I.T., E10-018, Cambridge, MA 02139, U.S.A.

Within the last 10 years there has been a great increase in our understanding of visual cognitive abilities. We have seen not only new empirical demonstrations, but also genuinely new theoretical proposals and a new degree of explicitness and sophistication brought about by the use of computational modeling of visual and memory processes. Visual cognition, however, occupies a curious place within cognitive psychology and within the cognitive psychology curriculum. Virtually without exception, the material on shape recognition found in introductory textbooks in cognitive psychology would be entirely familiar to a researcher or graduate student of 20 or 25 years ago.
Moreover, the theoretical discussions of visual imagery are cast in the same loose metaphorical vocabulary that had earned the concept a bad name in psychology and philosophy for much of this century. I also have the impression that much of the writing pertaining to visual cognition among researchers who are not directly in this area, for example, in neuropsychology, individual differences research, developmental psychology, psychophysics, and information processing psychology, is informed by the somewhat antiquated and imprecise discussions of visual cognition found in the textbooks.

The purpose of this special issue of Cognition is to highlight a sample of theoretical and empirical work that is on the cutting edge of research on visual cognition. The papers in this issue, though by no means a representative sample, illustrate some of the questions, techniques, and types of theory that characterize the modern study of visual cognition. The purpose of this introductory paper is to introduce students and researchers in neighboring disciplines to a selection of issues and theories in the study of visual cognition that provide a backdrop to the particular papers contained herein. It is meant to bridge the gap between the discussions of visual cognition found in textbooks and the level of discussion found in contemporary work.

Visual cognition can be conveniently divided into two subtopics. The first is the representation of information concerning the visual world currently before a person. When we behave in certain ways or change our knowledge about the world in response to visual input, what guides our behavior or thought is rarely some simple physical property of the input such as overall brightness or contrast. Rather, vision guides us because it lets us know that we are in the presence of a particular configuration of three-dimensional shapes and particular objects and scenes that we know to have predictable properties.
‘Visual recognition’ is the process that allows us to determine on the basis of retinal input that particular shapes, configurations of shapes, objects, scenes, and their properties are before us.

The second subtopic is the process of remembering or reasoning about shapes or objects that are not currently before us but must be retrieved from memory or constructed from a description. This is usually associated with the topic of ‘visual imagery’. This tutorial paper is divided into two major sections, devoted to the representation and recognition of shape, and to visual imagery. Each section is in turn subdivided into sections discussing the background to each topic, some theories on the relevant processes, and some of the more important open issues that will be foci of research during the coming years.

Visual recognition

Shape recognition is a difficult problem because the immediate input to the visual system (the spatial distribution of intensity and wavelength across the retinas; hereafter, the “retinal array”) is related to particular objects in highly variable ways. The retinal image projected by an object (say, a notebook) is displaced, dilated or contracted, or rotated on the retina when we move our eyes, ourselves, or the book; if the motion has a component in depth, then the retinal shape of the image changes and parts disappear and emerge as well. If we are not focusing on the book or looking directly at it, the edges of the retinal image become blurred and many of its finer details are lost. If the book is in a complex visual context, parts may be occluded, and the edges of the book may not be physically distinguishable from the edges and surface details of surrounding objects, nor from the scratches, surface markings, shadows, and reflections on the book itself.

Most theories of shape recognition deal with the indirect and ambiguous mapping between object and retinal image in the following way.
In long-term memory there is a set of representations of objects that have associated with them information about their shapes. The information does not consist of a replica of a pattern of retinal stimulation, but a canonical representation of the object’s shape that captures some invariant properties of the object in all its guises. During recognition, the retinal image is converted into the same format as is used in long-term memory, and the memory representation that matches the input the closest is selected. Different theories of shape recognition make different assumptions about the long-term memory representations involved, in particular, how many representations a single object will have, which class of objects will be mapped onto a single representation, and what the format of the representation is (i.e., which primitive symbols can be found in a representation, and what kinds of relations among them can be specified). They differ with regard to which sorts of preprocessing are done to the retinal image (e.g., filtering, contrast enhancement, detection of edges) prior to matching, and in terms of how the retinal input or memory representations are transformed to bring them into closer correspondence. And they differ in terms of the metric of goodness of fit that determines which memory representation fits the input best when none of them fits it exactly.

Traditional theories of shape recognition

Cognitive psychology textbooks almost invariably describe the same three or so models in their chapters on pattern recognition. Each of these models is fundamentally inadequate. However, they are not always inadequate in the ways the textbooks describe, and at times they are inadequate in ways that the textbooks do not point out.
An excellent introduction to three of these models (templates, features, and structural descriptions) can be found in Lindsay and Norman (1977); introductions to Fourier analysis in vision, which forms the basis of the fourth model, can be found in Cornsweet (1970) and Weisstein (1980). In this section I will review these models extremely briefly, and concentrate on exactly why they do not work, because a catalogue of their deficits sets the stage for a discussion of contemporary theories and issues in shape recognition.

Template matching

This is the simplest class of models for pattern recognition. The long-term memory representation of a shape is a replica of a pattern of retinal stimulation projected by that shape. The input array would be simultaneously superimposed with all the templates in memory, and the one with the closest above-threshold match (e.g., the largest ratio of matching to nonmatching points in corresponding locations in the input array) would indicate the pattern that is present.

Usually this model is presented not as a serious theory of shape recognition, but as a straw man whose destruction illustrates the inherent difficulty of the shape recognition process. The problems are legion: partial matches could yield false alarms (e.g., a ‘P’ in an ‘R’ template); changes in distance, location, and orientation of a familiar object will cause this model to fail to detect it, as will occlusion of part of the pattern, a depiction of it with wiggly or cross-hatched lines instead of straight ones, strong shadows, and many other distortions that we as perceivers take in stride.

There are, nonetheless, ways of patching template models. For example, multiple templates of a pattern, corresponding to each of its possible displacements, rotations, sizes, and combinations thereof, could be stored. Or, the input pattern could be rotated, displaced, and scaled to a canonical set of values before matching against the templates.
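The template scheme and the last of these patches can be made concrete with a toy sketch. Everything in it is an illustrative assumption rather than part of any published model: the 5×4 letter grids, the function names, and the two scoring rules (`match_score`, a symmetric matching-to-total ratio, and `coverage`, a one-sided score that counts only how much of the template is present in the input). The sketch shows a partial-match false alarm, the failure under displacement, and recovery once the input is moved to a canonical position.

```python
# Toy binary patterns; '#' marks an "on" point. The grids are hypothetical.
P_TEMPLATE = ["###.", "#..#", "###.", "#...", "#..."]
R_TEMPLATE = ["###.", "#..#", "###.", "#.#.", "#..#"]

def on_cells(grid):
    """Return the set of (row, col) positions that are on."""
    return {(r, c) for r, row in enumerate(grid)
            for c, ch in enumerate(row) if ch == "#"}

def match_score(input_cells, template_cells):
    """Matching points divided by matching plus nonmatching points."""
    matching = len(input_cells & template_cells)
    nonmatching = len(input_cells ^ template_cells)
    return matching / (matching + nonmatching)

def coverage(input_cells, template_cells):
    """One-sided score: fraction of the template found in the input."""
    return len(input_cells & template_cells) / len(template_cells)

def shift(cells, d_row, d_col):
    """Displace a pattern on the array."""
    return {(r + d_row, c + d_col) for r, c in cells}

def to_canonical_position(cells):
    """Patch: move the pattern so its bounding box starts at (0, 0)."""
    min_row = min(r for r, _ in cells)
    min_col = min(c for _, c in cells)
    return shift(cells, -min_row, -min_col)

def best_match(input_cells, templates):
    """Name of the stored template with the highest match score."""
    return max(templates, key=lambda name: match_score(input_cells, templates[name]))

TEMPLATES = {"P": on_cells(P_TEMPLATE), "R": on_cells(R_TEMPLATE)}

r_input = on_cells(R_TEMPLATE)
displaced_r = shift(r_input, 1, 2)

# Every point of the 'P' template is present in an 'R': a false alarm
# under the one-sided score.
print(coverage(r_input, TEMPLATES["P"]))                        # 1.0
# A displaced 'R' no longer matches its own template exactly ...
print(match_score(displaced_r, TEMPLATES["R"]))
# ... until it is normalized to the canonical position.
print(best_match(to_canonical_position(displaced_r), TEMPLATES))  # R
```

Normalizing position this way is only possible here because the input contains a single isolated pattern, which is one of the limitations of template models.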
The textbooks usually dismiss these possibilities: it is said that the product of all combinations of transformations and shapes would require more templates than the brain could store, and that in advance of recognizing a pattern, one cannot in general determine which transformations should be applied to the input. However, it is easy to show that these dismissals are made too quickly. For example, Arnold Trehub (1977) has devised a neural model of recognition and imagery, based on templates, that addresses these problems (this is an example of a ‘massively parallel’ model of recognition, a class of models I will return to later). Contour extraction preprocesses feed the matching process with an array of symbols indicating the presence of edges, rather than with a raw array of intensity levels. Each template could be stored in a single cell, rather than in a space-consuming replica of the entire retina: such a cell would synapse with many retinal inputs, and the shape would be encoded in the pattern of strengths of those synapses. The input could be matched in parallel against all the stored memory templates, which would mutually inhibit one another so that partial matches such as ‘P’ for ‘R’ would be eliminated by being inhibited by better matches. Simple neural networks could center the input pattern and quickly generate rotated and scaled versions of it at a variety of sizes and orientations, or at a canonical size and orientation (e.g., with the shape’s axis of elongation vertical); these transformed patterns could be matched in parallel against the stored templates.

Nonetheless, there are reasons to doubt that even the most sophisticated versions of template models would work when faced with realistic visual inputs. First, it is unlikely that template models can deal adequately with the third dimension.
Rotations about any axis other than the line of sight cause distortions in the projected shape of an object that cannot be inverted by any simple operation on retina-like arrays. For example, an arbitrary edge might move a large or a small amount across the array depending on the axis and phase of rotation and the depth from the viewer. 3-D rotation causes some surfaces to disappear entirely and new ones to come into view. These problems occur even if one assumes that the arrays are constructed subsequent to stereopsis and hence are three-dimensional (for example, rear surfaces are still not represented, and there are a bewildering number of possible directions of translation and axes of rotation, each requiring a different type of retinal transformation).

Second, template models work only for isolated objects, such as a letter presented at the center of a blank piece of paper: the process would get nowhere if it operated, say, on three-fifths of a book plus a bit of the edge of the table that it is lying on plus the bookmark in the book plus the end of the pencil near it, or other collections of contours that might be found in a circumscribed region of the retina. One could posit some figure-ground segregation preprocess occurring before template matching, but this has problems of its own. Not only would such a process be highly complex (for example, it would have to distinguish intensity changes in the image resulting from differences in depth and material from those resulting from differences in orientation, pigmentation, shadows, surface scratches, and specular (glossy) reflections), but it probably interacts with the recognition process and hence could not precede it. For example, the figure-ground segregation process involves carving up a set of surfaces into parts, each of which can then be matched against stored templates. This process is unlikely to be distinct from the process of carving up a single object into its parts.
But as Hoffman and Richards (1984) argue in this issue, a representation of how an object is decomposed into its parts may be the first representation used in accessing memory during recognition, and the subsequent matching of particular parts, template-style or not, may be less important in determining how to classify a shape.

Feature models

This class of models is based on the early “Pandemonium” model of shape recognition (Selfridge, 1959; Selfridge and Neisser, 1960). In these models, there are no templates for entire shapes; rather, there are mini-templates or ‘feature detectors’ for simple geometric features such as vertical and horizontal lines, curves, angles, ‘T’-junctions, etc. There are detectors for every feature at every location in the input array, and these detectors send out a graded signal encoding the degree of match between the target feature and the part of the input array they are ‘looking at’. For every feature (e.g., an open curve), the levels of activation of all its detectors across the input array are summed, or the number of occurrences of the feature is counted (see e.g., Lindsay and Norman, 1977), so the output of this first stage is a set of numbers, one for each feature.

The stored representation of a shape consists of a list of the features composing the shape, in the form of a vector of weights for the different features, a list of how many tokens of each feature are present in the shape, or both. For example, the representation of the shape of the letter ‘A’ might specify high weights for (1) a horizontal segment, (2) a right-leaning diagonal segment, (3) a left-leaning diagonal segment, (4) an upward-pointing acute angle, and so on, and low or negative weights for curved and vertical segments. The intent is to use feature weights or counts to give each shape a characterization that is invariant across transformations of it.
For example, since the features are all independent of location, any feature specification will be invariant across translations and scale changes; and if features referring to orientation (e.g., “left-leaning diagonal segment”) are eliminated, and only features distinguishing straight segments from curves from angles are retained, then the description will be invariant across frontal plane rotations.

The match between input and memory would consist of some comparison of the levels of activation of feature detectors in the input with the weights of the corresponding features in each of the stored shape representations, for example, the product of those two vectors, or the number of matching features minus the number of mismatching features. The shape that exhibits the highest degree of match to the input is the shape recognized.

The principal problem with feature analysis models of recognition is that no one has ever been able to show how a natural shape can be defined in terms of a vector of feature weights. Consider how one would define the shape of a horse. Naturally, one could define it by giving high weights to features like ‘mane’, ‘hooves’, ‘horse’s head’, and so on, but then detecting these features would be no less difficult than detecting the horse itself. Or, one could try to define the shape in terms of easily detected features such as vertical lines and curved segments, but horses and other natural shapes are composed of so many vertical lines and curved segments (just think of the nose alone, or the patterns in the horse’s hide) that it is hard to believe that there is a feature vector for a horse’s shape that would consistently beat out feature vectors for other shapes across different views of the horse.
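The two-stage scheme described above (sum each feature's detector activations over all locations, then compare the resulting vector with stored weight vectors by taking their product) can be sketched as follows. The feature inventory, the weight values, and the function names are hypothetical choices made for illustration; only the weights for ‘A’ follow the verbal example given earlier, and even their numerical values are assumed.

```python
# Hypothetical feature inventory and stored weight vectors (one per shape).
FEATURES = ["horizontal", "vertical", "right_diagonal",
            "left_diagonal", "acute_angle_up", "curve"]

SHAPE_WEIGHTS = {
    "A": [1.0, -1.0, 1.0, 1.0, 1.0, -1.0],
    "H": [1.0, 1.0, -1.0, -1.0, -1.0, -1.0],
    "O": [-1.0, -1.0, -1.0, -1.0, -1.0, 1.0],
}

def feature_vector(detections):
    """Stage 1: sum detector activations over all locations.

    Each detection is (feature name, location, activation); the location
    is deliberately discarded, which is what makes the result invariant
    across translations.
    """
    totals = {name: 0.0 for name in FEATURES}
    for feature, _location, activation in detections:
        totals[feature] += activation
    return [totals[name] for name in FEATURES]

def match(vector, weights):
    """Stage 2: product of the activation vector and a weight vector."""
    return sum(v * w for v, w in zip(vector, weights))

def recognize(detections):
    """Return the stored shape with the highest degree of match."""
    vector = feature_vector(detections)
    return max(SHAPE_WEIGHTS, key=lambda shape: match(vector, SHAPE_WEIGHTS[shape]))

letter_a = [
    ("right_diagonal", (0, 0), 1.0),
    ("left_diagonal", (0, 2), 1.0),
    ("horizontal", (1, 1), 1.0),
    ("acute_angle_up", (0, 1), 1.0),
]
# The same features detected elsewhere in the input array.
shifted_a = [(f, (row + 5, col + 3), a) for f, (row, col), a in letter_a]

print(recognize(letter_a))                                     # A
print(feature_vector(letter_a) == feature_vector(shifted_a))   # True
```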
One could propose that there is a hierarchy of features, intermediate ones like ‘eye’ being built out of lower ones like ‘line segment’ or ‘circle’, and higher ones like ‘head’ being built out of intermediate ones like ‘eye’ and ‘ear’ (Selfridge, for example, posited “computational demons” that detect Boolean combinations of features), but no one has shown how this can be done for complex natural shapes.

Another, equally serious problem is that in the original feature models the spatial relationships among features (how they are located and oriented with respect to one another) are generally not specified; only which ones are present in a shape and perhaps how many times. This raises serious problems in distinguishing among shapes consisting of the same features arranged in different ways, such as an asymmetrical letter and its mirror image. For the same reason, simple feature models can turn reading into an anagram problem, and can be shown formally to be incapable of detecting certain pattern distinctions such as that between open and closed curves (see Minsky and Papert, 1972).

One of the reasons that these problems are not often raised against feature models is that the models are almost always illustrated and referred to in connection with recognizing letters of the alphabet or schematic line drawings. This can lead to misleading conclusions because the computational problems posed by the recognition of two-dimensional stimuli composed of a small number of one-dimensional segments may be different in kind from the problems posed by the recognition of three-dimensional stimuli composed of a large number of two-dimensional surfaces (e.g., the latter involves compensating for perspective and occlusion across changes in the viewer’s vantage point and describing the complex geometry of curved surfaces).
Furthermore, when shapes are chosen from a small finite set, it is possible to choose a feature inventory that exploits the minimal contrasts among the particular members of the set and hence successfully discriminates among those members, but that could be fooled by the addition of new members to the set. Finally, letters or line drawings consisting of dark figures presented against a blank background with no other objects occluding or touching them avoid the many difficult problems concerning the effects on edge detection of occlusion, illumination, shadows, and so on.

Fourier models

Kabrisky (1966), Ginsburg (1971, 1973), and Persoon and Fu (1974; see also Ballard and Brown, 1982) have proposed a class of pattern recognition models that many researchers in psychophysics and visual physiology adopt implicitly as the most likely candidate for shape recognition in humans. In these models, the two-dimensional input intensity array is subjected to a spatial trigonometric Fourier analysis. In such an analysis, the array is decomposed into a set of components, each component specific to a sinusoidal change in intensity along a single orientation at a specific spatial frequency. That is, one component might specify the degree to which the image gets brighter and darker and brighter and darker, etc., at intervals of 3° of visual angle going from top right to bottom left in the image (averaging over changes in brightness along the orthogonal direction). Each component can be conceived of as a grid consisting of parallel black-and-white stripes of a particular width oriented in a particular direction, with the black and white stripes fading gradually into one another. In a full set of such grating-like components, there is one component for each stripe width or spatial frequency (in cycles per degree) at each orientation (more precisely, there would be a continuum of components across frequencies and orientations).
A Fourier transform of the intensity array would consist of two numbers for each of these components. The first number would specify the degree of contrast in the image corresponding to that frequency at that orientation (that is, the degree of difference in brightness between the bright areas and the dark areas of that image for that frequency in that orientation), or, roughly, the degree to which the image ‘contains’ that set of stripes. The full set of these numbers is the amplitude spectrum corresponding to the image. The second number would specify where in the image the peaks and troughs of the intensity change defined by that component lie. The full set of these numbers is the phase spectrum corresponding to the image. The amplitude spectrum and the phase spectrum together define the Fourier transform of the image, and the transform contains all the information in the original image. (This is a very crude introduction to the complex subject of Fourier analysis. See Weisstein (1980) and Cornsweet (1970) for excellent nontechnical tutorials.)

One can then imagine pattern recognition working as follows. In long-term memory, each shape would be stored in terms of its Fourier transform. The Fourier transform of the image would be matched against the long-term memory transforms, and the memory transform with the best fit to the image transform would specify the shape that is recognized.¹

How does matching transforms differ from matching templates in the original space domain? When there is an exact match between the image and one of the stored templates, there are neither advantages nor disadvantages to doing the match in the transform domain, because no information is lost in the transformation. But when there is no exact match, it is possible to define metrics of goodness of fit in the transform domain that might capture some of the invariances in the family of retinal images corresponding to a shape.
For example, to a first approximation the amplitude spectrum corresponding to a shape is the same regardless of where in the visual field the object is located. Therefore if the matching process could focus on the amplitude spectra of shape and input, ignoring the phase spectrum, then a shape could be recognized across all its possible translations. Furthermore, a shape and its mirror image have the same amplitude spectrum, affording recognition of a shape across reflections of it. Changes in orientation and scale of an object result in corresponding changes in orientation and scale in the transform, but in some models the transform can easily be normalized so that it is invariant with rotation and scaling. Periodic patterns and textures, such as a brick wall, are easily recognized because they give rise to peaks in their transforms corresponding to the period of repetition of the pattern. But most important, the Fourier transform segregates information about sharp edges and small details from information about gross overall shape. The latter is specified primarily by the lower spatial-frequency components of the transform (i.e., fat gratings), the former, by the higher spatial-frequency components (i.e., thin gratings). Thus if the pattern matcher could selectively ignore the higher end of the amplitude spectrum when comparing input and memory transforms, a shape could be recognized even if its boundaries are blurred, encrusted with junk, or defined by wiggly lines, dots or dashes, thick bands, and so on.

¹In Persoon and Fu’s model (1974), it is not the transform of brightness as a function of visual field position that is computed and matched, but the transform of the tangent angle of the boundary of an object as a function of position along the boundary. This model shares many of the advantages and disadvantages of Fourier analysis of brightness in shape recognition.
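The invariances just listed can be checked numerically. The sketch below is a deliberate simplification: it uses a one-dimensional intensity array with wrap-around displacement instead of a two-dimensional image, a direct discrete Fourier transform, and a hypothetical goodness-of-fit metric (`spectrum_distance`, the largest amplitude difference up to a chosen frequency). The “wiggly” version of the shape is simulated, as an assumption, by adding a ripple at the highest representable frequency.

```python
import cmath

def dft(signal):
    """Discrete Fourier transform of a 1-D array (direct O(n^2) sum)."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def amplitude_spectrum(signal):
    """Magnitude of each Fourier component; the phase is discarded."""
    return [abs(component) for component in dft(signal)]

def circular_shift(signal, offset):
    """Displace the pattern, wrapping around the ends of the array."""
    n = len(signal)
    return [signal[(t - offset) % n] for t in range(n)]

def spectrum_distance(a, b, max_freq=None):
    """Largest amplitude difference over frequencies 0..max_freq.

    With max_freq=None all frequencies up to the highest one are compared.
    """
    spec_a, spec_b = amplitude_spectrum(a), amplitude_spectrum(b)
    top = len(a) // 2 if max_freq is None else max_freq
    return max(abs(spec_a[k] - spec_b[k]) for k in range(top + 1))

shape = [0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
moved = circular_shift(shape, 5)
mirrored = shape[::-1]
wiggly = [value + 0.2 * (-1) ** t for t, value in enumerate(shape)]

print(spectrum_distance(shape, moved))               # near 0: translation
print(spectrum_distance(shape, mirrored))            # near 0: reflection
print(spectrum_distance(shape, wiggly))              # large: fine detail differs
print(spectrum_distance(shape, wiggly, max_freq=4))  # near 0: gross shape agrees
```

Ignoring the phase spectrum is what buys the translation and reflection invariance here, and it is also what discards the information about where the shape is.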
Another advantage of Fourier transforms is that, given certain assumptions about neural hardware, they can be extracted quickly and matched in parallel against all the stored templates (see e.g., Pribram, 1971).

Upon closer examination, however, matching in the transform domain begins to lose some of its appeal. The chief problem is that the invariances listed above hold only for entire scenes or for objects presented in isolation. In a scene with more than one object, minor rearrangements such as moving an object from one end of a desk to another, adding a new object to the desk top, removing a part, or bending the object, can cause drastic changes in the transform. Furthermore, the transform cannot be partitioned or selectively processed in such a way that one part of the transform corresponds to one object in the scene, and another part to another object, nor can this be done within the transform of a single object to pick out its parts (see Hoffman and Richards (1984) for arguments that shape representations must explicitly define the decomposition of an object into its parts). The result of these facts is that it is difficult or impossible to recognize familiar objects in novel scenes or backgrounds by matching transforms of the input against transforms of the familiar objects. Furthermore, there is no straightforward way of linking the shape information implicit in the amplitude spectrum with the position information implicit in the phase spectrum so that the perceiver can tell where objects are as well as what they are. Third, changes in the three-dimensional orientation of an object do not result in any simple cancelable change in its transform, even if we assume that the visual system computes three-dimensional transforms (e.g., using components specific to periodic changes in binocular disparity).
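The chief problem, that the invariances hold for isolated objects but not for scenes, can be illustrated with a small numerical sketch. As before, a one-dimensional array with wrap-around stands in for the image, and the two “objects” (`book`, `pencil`) and all function names are hypothetical. Moving the pencil leaves its own amplitude spectrum untouched, yet changes the amplitude spectrum of the scene that also contains the book.

```python
import cmath

def amplitude_spectrum(signal):
    """Amplitude of each component of a direct discrete Fourier transform."""
    n = len(signal)
    return [
        abs(sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)))
        for k in range(n)
    ]

def place(obj, position, width=16):
    """Put a small object (a list of intensities) into an empty 1-D scene."""
    scene = [0.0] * width
    for i, value in enumerate(obj):
        scene[(position + i) % width] += value
    return scene

def combine(scene_a, scene_b):
    """Superimpose two scenes point by point."""
    return [a + b for a, b in zip(scene_a, scene_b)]

def max_difference(spec_a, spec_b):
    """Largest difference between two amplitude spectra."""
    return max(abs(a - b) for a, b in zip(spec_a, spec_b))

book = [1.0, 1.0]   # hypothetical two-point object
pencil = [1.0]      # hypothetical one-point object

# The pencil alone, at two different locations.
pencil_here = amplitude_spectrum(place(pencil, 8))
pencil_there = amplitude_spectrum(place(pencil, 12))

# The same two pencil locations, now with a stationary book in the scene.
scene_here = amplitude_spectrum(combine(place(book, 1), place(pencil, 8)))
scene_there = amplitude_spectrum(combine(place(book, 1), place(pencil, 12)))

print(max_difference(pencil_here, pencil_there))  # near 0: isolated object
print(max_difference(scene_here, scene_there))    # clearly nonzero: whole scene
```

The scene spectrum changes because the components contributed by the two objects add with a relative phase that depends on their separation, so no portion of the amplitude spectrum belongs to the book alone.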
The appeal of Fourier analysis in discussions of shape recognition comes in part from the body of elegant psychophysical research (e.g., Campbell and Robson, 1968) suggesting that the visual system partitions the information in the retinal image into a set of channels each specific to a certain range of spatial frequencies (this is equivalent to sending the retinal information through a set of bandpass filters and keeping the outputs of those filters separate). This gives the impression that early visual processing passes on to the shape recognition process not the original array but something like a Fourier transform of the array. However, filtering the image according to its spatial frequency components is not the same as transforming the image into its spectra. The psychophysical evidence for channels is consistent with the notion that the recognition system operates in the space domain, but rather than processing a single array, it processes a family of arrays, each one containing information about intensity changes over a different scale (or, roughly, each one bandpass-filtered at a different center frequency). By processing several bandpass-filtered images separately, one obtains some of the advantages of Fourier analysis (segregation of gross shape from fine detail) without the disadvantages of processing the Fourier transform itself (i.e., the utter lack of correspondence between the parts of the representation and the parts of the scene).

Structural descriptions

A fourth class of theories about the format in which visual input is matched against memory holds that shapes are represented symbolically, as structural descriptions (see Minsky, 1975; Palmer, 1975a; Winston, 1975). A structural description is a data structure that can be thought of as a list of propositions whose arguments correspond to parts and whose predicates correspond to properties of the parts and to spatial relationships among them.
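To make the notion of a list of propositions concrete, the following sketch writes out a toy structural description of a letter ‘L’. It is an illustrative assumption, not a description taken from any of the cited theories: the part names, predicate names, and the helper `as_graph` are all hypothetical. The helper regroups the same propositions into nodes (parts with their properties) and labelled links (spatial relations between parts).

```python
from typing import NamedTuple

class Proposition(NamedTuple):
    predicate: str      # a property of a part, or a spatial relation between parts
    arguments: tuple    # the part or parts that the predicate applies to

# Hypothetical structural description of the letter 'L'.
letter_l = [
    Proposition("vertical-bar", ("stem",)),
    Proposition("horizontal-bar", ("foot",)),
    Proposition("attached-to-bottom-of", ("foot", "stem")),
    Proposition("to-the-right-of", ("foot", "stem")),
]

def as_graph(description):
    """Regroup propositions: one-place predicates label nodes, two-place ones label links."""
    nodes, links = {}, []
    for prop in description:
        for part in prop.arguments:
            nodes.setdefault(part, [])
        if len(prop.arguments) == 1:
            nodes[prop.arguments[0]].append(prop.predicate)
        else:
            links.append((prop.arguments[0], prop.predicate, prop.arguments[1]))
    return nodes, links

nodes, links = as_graph(letter_l)
```

Because shape is carried by the predicates and their arguments, further propositions about location, size, or orientation could be appended to the list without altering the ones that specify shape.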
Often these propositions are depicted as a graph whose nodes correspond to the parts or to properties, and whose edges linking the nodes correspond to the spatial relations (an example of a structural description can be found in the upper left portion of Fig. 6). The explicit representation of spatial relations is one aspect of these models that distinguishes them from feature models and allows them to escape from some of the problems pointed out by Minsky and Papert (1972).

One of the chief advantages of structural descriptions is that they can factor apart the information in a scene without necessarily losing information in it. It is not sufficient for the recognition system simply to supply a list of labels for the objects that are recognized, for we need to know not only what things are but also how they are oriented and where they are with respect to us and each other, for example, when we are reaching for an object or driving. We also need to know about the visibility of objects: whether we should get closer, turn up the lights, or remove intervening objects in order to recognize an object with more confidence. Thus the recognition process in general must not boil away or destroy the information that is not diagnostic of particular objects (location, size, orientation, visibility, and surface properties) until it ends up with a residue of invariant information; it must factor apart or decouple this information from information about shape, so that different cognitive processes (e.g., shape recognition versus reaching) can access the information relevant to their particular tasks without becoming overloaded, distracted, or misled by the irrelevant information that the retina conflates with the relevant information.
Thus one of the advantages of a structural description is that the shape of an object can be specified by one set of propositions, and its location in the visual field, orientation, size, and relation to other objects can be specified in different propositions, each bearing labels that processing operations can use for selective access to the information relevant to them.

Among the other advantages of structural descriptions are the following. By representing the different parts of an object as separate elements in the representation, these models break up the recognition process into simpler subprocesses, and more important, are well-suited to model our visual system’s reliance on decomposition into parts during recognition and its ability to recognize novel rearrangements of parts such as the various configurations of a hand (see Hoffman and Richards (1984)). Second, by mixing logical and spatial relational terms in a representation, structural descriptions can differentiate among parts that must be present in a shape (e.g., the tail of the letter ‘Q’), parts that may be present with various probabilities (e.g., the horizontal cap on the letter ‘T’), and parts that must not be present (e.g., a tail on the letter ‘O’) (see Winston, 1975). Third, structural descriptions represent information in a form that is useful for subsequent visual reasoning, since the units in the representation correspond to objects, parts of objects, and spatial relations among them. Nonvisual information about objects or parts (e.g., categories they belong to, their uses, the situations that they are typically found in) can easily be associated with parts of structural descriptions, especially since many theories hold that nonvisual knowledge is stored in a propositional format that is similar to structural descriptions (e.g., Minsky, 1975; Norman and Rumelhart, 1975).
Thus visual recognition can easily invoke knowledge about what is recognized that may be relevant to visual cognition in general, and that knowledge in turn can be used to aid in the recognition process (see the discussion of top-down approaches to recognition below).

The main problem with the structural description theory is that it is not really a full theory of shape recognition. It specifies the format of the representation used in matching the visual input against memory, but by itself it does not specify what types of entities and relations each of the units belonging to a structural description corresponds to (e.g., ‘line’ versus ‘eye’ versus ‘sphere’; ‘next-to’ versus ‘to-the-right-of’ versus ‘37-degrees-with-respect-to’), nor how the units are created in response to the appropriate patterns of retinal stimulation (see the discussion of feature models above). Although most researchers in shape recognition would not disagree with the claim that the matching process deals with something like structural descriptions, a genuine theory of shape recognition based on structural descriptions must specify these components and justify why they are appropriate. In the next section, I discuss a theory proposed by David Marr and H. Keith Nishihara which makes specific proposals about each of these aspects of structural descriptions.

Two fundamental problems with the traditional approaches

There are two things wrong with the textbook approaches to visual representation and recognition. First, none of the theories specifies where perception ends and where cognition begins. This is a problem because there is a natural factoring apart of the process that extracts information about the geometry of the visible world and the process that recognizes familiar objects. Take the recognition of a square.
We can recognize a square whether its contours are defined by straight black lines printed on a white page, by smooth rows and columns of arbitrary small objects (Kohler, 1947; Koffka, 1935), by differences in lightness or in hue between the square and its background, by differences in binocular disparity (in a random-dot stereogram), by differences in the orientation or size of randomly scattered elements defining visual textures (Julesz, 1971), by differences in the directions of motion of randomly placed dots (Ullman, 1982; Marr, 1982), and so on. The square can be recognized as being a square regardless of how the boundaries are found; for example, we do not have to learn the shape of a square separately for boundaries defined by disparity in random-dot stereograms, by strings of asterisks, etc., nor must we learn the shapes of other figures separately for each type of edge once we have learned how to do so for a square. Conversely, it can be demonstrated that the ultimate recognition of the shape is not necessary for any of these processes to find the boundaries (the boundaries can be seen even if the shape they define is an unfamiliar blob, and expecting to see a square is neither necessary nor sufficient for the perceiver to see the boundaries; see Gibson, 1966; Marr, 1982; Julesz, 1971). Thus the process that recognizes a shape does not care about how its boundaries were found, and the processes that find the boundaries do not care how they will be used. It makes sense to separate the process of finding boundaries, degree of curvature, depth, and so on, from the process of recognizing particular shapes (and from other processes such as reasoning that can take their input from vision).

A failure to separate these processes has tripped up the traditional approaches in the following ways.
First, any theory that derives canonical shape representations directly from the retinal arrays (e.g., templates, features) will have to solve all the problems associated with finding edges (see the previous paragraph) at the same time as solving the problem of recognizing particular shapes, an unlikely prospect. On the other hand, any theory that simply assumes that there is some perceptual processing done before the shape match, but does not specify what that processing is, is in danger of explaining very little, since the putative preprocessing could solve the most important part of the recognition process that the theory is supposed to address (e.g., a claim that a feature like ‘head’ is supplied to the recognition process). When assumptions about perceptual preprocessing are explicit, but are also incorrect or unmotivated, the claims of the recognition theory itself could be seriously undermined: the theory could require that some property of the world is supplied to the recognition process when there is no physical basis for the perceptual system to extract that property (e.g., Marr (1982) has argued that it is impossible for early visual processes to segment a scene into objects).

The second problem with traditional approaches is that they do not pay serious attention to what in general the shape recognition process has to do, or, put another way, what problem it is designed to solve (see Marr, 1982). This requires examining the input and desired output of the recognition process carefully: on the one hand, how the laws of optics, projective geometry, materials science, and so on, relate the retinal image to the external world, and on the other, what the recognition process must supply the rest of cognition with. Ignoring either of these questions results in descriptions of recognition mechanisms that are unrealizable, useless, or both.
The Marr-Nishihara theory

The work of David Marr represents the most concerted effort to examine the nature of the recognition problem, to separate early vision from recognition and visual cognition in general, and to outline an explicit theory of three-dimensional shape recognition built on such foundations. In this section, I will briefly describe Marr’s theory. Though Marr’s shape recognition model is not without its difficulties, there is a consensus that it addresses the most important problems facing this class of theories, and that its shortcomings define many of the chief issues that researchers in shape recognition must face.

The 2½-D sketch

The core of Marr’s theory is a claim about the interface between perception and cognition, about what early, bottom-up visual processes supply to the recognition process and to visual cognition in general. Marr, in collaboration with H. Keith Nishihara, proposed that early visual processing culminates in the construction of a representation called the 2½-D sketch. The 2½-D sketch is an array of cells, each cell dedicated to a particular line of sight from the viewer’s vantage point. Each cell in the array is filled with a set of symbols indicating the depth of the local patch of surface lying on that line of sight, the orientation of that patch in terms of the degree and direction in which it dips away from the viewer in depth, and whether an edge (specifically, a discontinuity in depth) or a ridge (specifically, a discontinuity in orientation) is present at that line of sight (see Fig. 1). In other words, it is a representation of the surfaces that are visible when looking in a particular direction from a single vantage point. The 2½-D sketch is intended to gather together in one representation the richest information that early visual processes can deliver.
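The contents of a cell in the 2½-D sketch, as just described, can be pictured as a small record. The sketch below is only a schematic reading of that description, not Marr and Nishihara's formulation: the field names `slant` and `tilt`, the thresholds, and the function `mark_discontinuities` are assumptions introduced for illustration, and a single row of cells stands in for the full array.

```python
from dataclasses import dataclass

@dataclass
class Cell:
    depth: float          # distance to the surface patch along this line of sight
    slant: float          # degree to which the patch dips away from the viewer (radians)
    tilt: float           # direction in which it dips (radians)
    edge: bool = False    # discontinuity in depth at this line of sight
    ridge: bool = False   # discontinuity in orientation at this line of sight

def mark_discontinuities(row, depth_jump=1.0, slant_jump=0.5):
    """Flag edges and ridges by comparing neighboring cells (thresholds are assumed)."""
    for left, right in zip(row, row[1:]):
        if abs(right.depth - left.depth) > depth_jump:
            right.edge = True
        elif abs(right.slant - left.slant) > slant_jump:
            right.ridge = True
    return row

# One row of the array: a near surface, then a far surface that folds away.
row = mark_discontinuities([
    Cell(depth=2.0, slant=0.0, tilt=0.0),
    Cell(depth=2.0, slant=0.0, tilt=0.0),
    Cell(depth=5.0, slant=0.0, tilt=0.0),
    Cell(depth=5.0, slant=0.8, tilt=0.0),
    Cell(depth=5.1, slant=0.8, tilt=0.0),
])
edges = [cell.edge for cell in row]
ridges = [cell.ridge for cell in row]
```

Every value in such a record is local to one line of sight and is expressed relative to the viewer; nothing in it names an object, a part, or a global shape.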
Marr claims that no top-down processing goes into the construction of the 2½-D sketch, and that it does not contain any global information about shape (e.g., angles between lines, types of shapes, object or part boundaries), only depths and orientations of local pieces of surface.

The division between the early visual processes that culminate in the 2½-D sketch and visual recognition has an expository as well as a theoretical advantage: since the early processes are said not to be a part of visual cognition (i.e., not affected by a person’s knowledge or intentions), I will discuss them only in bare outline, referring the reader to Marr (1982) and Poggio (1984) for details.

Figure 1. Schematic drawing of Marr and Nishihara’s 2½-D sketch. Arrows represent surface orientation of patches relative to the viewer (the heavy dots are foreshortened arrows). The dotted line represents locations where orientation changes discontinuously (ridges). The solid line represents locations where depth changes discontinuously (edges). The depths of patches relative to the viewer are also specified in the 2½-D sketch but are not shown in this figure. From Marr (1982).

The 2½-D sketch arises from a chain of processing that begins with mechanisms that convert the intensity array into a representation in which the locations of edges and other surface details are made explicit. In this ‘primal sketch’, array cells contain symbols that indicate the presence of edges, corners, bars, and blobs of various sizes and orientations at that location. Many of these elements can remain invariant over changes in overall illumination, contrast, and focus, and will tend to coincide in a relatively stable manner with patches of a single surface in the world.
Thus they are useful in subsequent processes that must examine similarities and differences among neighboring parts of a surface, such as gradients of density, size, or shape of texture elements, or (possibly) processes that look for corresponding parts of the world in two images, such as stereopsis and the use of motion to reconstruct shape.

A crucial property of this representation is that the edges and other features are extracted separately at a variety of scales. This is done by looking for points where intensity changes most rapidly across the image using detectors of different sizes that, in effect, look at replicas of the image filtered at different ranges of spatial frequencies. By comparing the locations of intensity changes in each of the (roughly) bandpass-filtered images, one can create families of edge symbols in the primal sketch, some indicating the boundaries of the larger blobs in the image, others indicating the boundaries of finer details. This segregation of edge symbols into classes specific to different scales preserves some of the advantages of the Fourier models discussed above: shapes can be represented in an invariant manner across changes in image clarity and surface detail (e.g., a person wearing tweeds versus polyester).

The primal sketch is still two-dimensional, however, and the next stage of processing in the Marr and Nishihara model adds the third dimension to arrive at the 2½-D sketch. The processes involved at this stage compute the depths and orientations of local patches of surfaces using the binocular disparity of corresponding features in the retinal images from the two eyes (e.g., Marr and Poggio, 1977), the relative degrees of movement of features in successive views (e.g., Ullman, 1979), changes in shading (e.g., Horn, 1975), the size and shape of texture elements across the retina (Cutting and Millard, 1984; Stevens, 1981), the shapes of surface contours, and so on.
These processes cannot indicate explicitly the overall three-dimensional shape of an object, such as whether it is a sphere or a cylinder; their immediate output is simply a set of values for each patch of a surface indicating its relative distance from the viewer, orientation with respect to the line of sight, and whether either depth or orientation changes discontinuously at that patch (i.e., whether an edge or ridge is present).

The 2½-D sketch itself is ill-suited to matching inputs against stored shape representations for several reasons. First, only the visible surfaces of shapes are represented; for obvious reasons, bottom-up processing of the visual input can provide no information about the back sides of opaque objects. Second, the 2½-D sketch is viewpoint-specific; the distances and orientations of patches of surfaces are specified with respect to the perceiver’s viewing position and viewing direction, that is, in part of a spherical coordinate system centered on the viewer’s vantage point. That means that as the viewer or the object moves with respect to one another, the internal representation of the object in the 2½-D sketch changes and hence does not allow a successful match against any single stored replica of a past 2½-D representation of the object (see Fig. 2a). Furthermore, objects and their parts are not explicitly demarcated.

Figure 2. The orientation of a hand with respect to the retinal vertical V (a viewer-centered reference frame), the axis of the body B (a global object-centered reference frame), and the axis of the lower arm A (a local object-centered reference frame). The retinal angle of the hand changes with rotation of the whole body (middle panel); its angle with respect to the body changes with movement of the elbow and shoulder (right panel). Only its angle with respect to the arm remains constant across these transformations.

Shape recognition and 3-D models

Marr and Nishihara (1978) have proposed that the shape recognition process (a) defines a coordinate system that is centered on the as-yet unrecognized object, (b) characterizes the arrangement of the object’s parts with respect to that coordinate system, and (c) matches such characterizations against canonical characterizations of objects’ shapes stored in a similar format in memory. The object is described with respect to a coordinate system that is centered on the object (e.g., its origin lies on some standard point on the object and one or more of its axes are aligned with standard parts of the object), rather than with respect to the viewer-centered coordinate system of the 2½-D sketch, because even though the locations of the object’s parts with respect to the viewer change as the object as a whole is moved, the locations of its parts with respect to the object itself do not change (see Fig. 2b). A structural description representing an object’s shape in terms of the arrangement of its parts, using parameters whose meaning is determined by a coordinate system centered upon that object, is called the 3-D model description in Marr and Nishihara’s theory.

Centering a coordinate system on the object to be represented solves only some of the problems inherent in shape recognition. A single object-centered description of a shape would still fail to match an input object when the object bends at its joints (see Fig. 2c), when it bears extra small parts (e.g., a horse with a bump on its back), or when there is a range of variation among objects within a class. Marr and Nishihara address this stability problem by proposing that information about the shape of an object is stored not in a single model with a global coordinate system but in a hierarchy of models each representing parts of different sizes and each with its own coordinate system.
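The benefit of nesting coordinate systems can be made concrete with the hand, arm, and body of Fig. 2. The sketch below is a deliberately simplified two-dimensional illustration under assumed values: the function `to_parent`, the constants, and the particular positions are hypothetical and are not taken from Marr and Nishihara's model. The position of the hand is stored once, relative to the arm; its position relative to the torso and to the viewer is computed from the current joint angles.

```python
import math

def to_parent(point, origin, angle):
    """Re-express a point given in a child frame in the frame of its parent."""
    x, y = point
    c, s = math.cos(angle), math.sin(angle)
    return (origin[0] + c * x - s * y, origin[1] + s * x + c * y)

HAND_IN_ARM = (1.0, 0.0)       # stored description: hand at the far end of the arm axis
ELBOW_IN_TORSO = (0.5, 1.0)    # where the arm frame is attached in the torso frame
TORSO_IN_VIEWER = (0.0, 0.0)   # where the torso frame sits in the viewer's frame

def hand_in_torso(elbow_angle):
    return to_parent(HAND_IN_ARM, ELBOW_IN_TORSO, elbow_angle)

def hand_in_viewer(body_angle, elbow_angle):
    return to_parent(hand_in_torso(elbow_angle), TORSO_IN_VIEWER, body_angle)

arm_straight = hand_in_torso(0.0)                 # hand relative to torso, arm extended
arm_bent = hand_in_torso(math.pi / 2)             # same hand after bending the elbow
body_upright = hand_in_viewer(0.0, 0.0)           # hand relative to viewer
body_turned = hand_in_viewer(math.pi / 2, 0.0)    # same pose after rotating the body
```

Bending the elbow changes the hand's coordinates in the torso frame, and rotating the body changes them in the viewer's frame, but the stored value `HAND_IN_ARM` is never altered. Only a description of that local kind could be matched against memory across all of these poses.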
Each of these local coordinate systems is centered on a part of the shape represented in the model, aligned with its axis of elongation, symmetry, or (for movable parts) rotation.

For example, to represent the shape of a horse, there would be a top-level model with a coordinate system centered on the horse’s torso. That coordinate system would be used to specify the locations, lengths, and angles of the main parts of the horse: the head, limbs, and tail. Then subordinate models are defined for each of those parts: one for the head, one for the front right leg, etc. Each of those models would contain a coordinate system centered on the part that the model as a whole represents, or on a part subordinate to that part (e.g., the thigh for the leg subsystem). The coordinate system for that model would be used to specify the positions, orientations, and lengths of the subordinate parts that make up the part in question. Thus, within the head model, there would be a specification of the locations and angles of the neck axis and of the head axis, probably with respect to a coordinate system centered on the neck axis. Each of these parts would in turn get its own model, also consisting of a coordinate axis centered on a part, plus a characterization of the parts subordinate to it. An example of a 3-D model for a human shape is shown in Fig. 3.

Employing a hierarchy of coordinate systems solves the stability problems alluded to above, because even though the position and orientation of the hand relative to the torso can change wildly and unsystematically as a person bends the arm, the position of the hand relative to the arm does not change (except possibly by rotating within the range of angles permitted by bending of the wrist). Therefore the description of the shape of the arm remains constant only when the arrangement of its parts is specified in terms of angles and positions relative to the arm axis, not relative to the object as a whole (see Fig.
2). For this to work, of course, positions, lengths, and angles must be specified in terms of ranges (see Fig. 3d) rather than by precise values, so as to accommodate the changes resulting from movement or individual variation among exemplars of a shape. Note also that the hierarchical arrangement of 3-D models compensates for individual variation in a second way: a horse with a swollen or broken knee, for example, will match the 3-D model defining the positions of a horse’s head, torso, limbs, and tail relative to the torso axis, even if the subordinate limb model itself does not match the input limb.

Organization and accessing of shape information in memory

Marr and Nishihara point out that using the 3-D model format, it is possible to define a set of values at each level of the hierarchy of coordinate systems that correspond to a central tendency among the members of well-defined classes of shapes organized around a single ‘plan’. For example, at the top level of the hierarchy defining limbs with respect to the torso, one can define one set of values that most quadruped shapes cluster around, and a different set of values that most bird shapes cluster around. At the next level down one can define values for subclasses of shapes such as songbirds versus long-legged waders.

This modular organization of shape descriptions, factoring apart the arrangement of parts of a given size from the internal structure of those parts, and factoring apart shape of an individual type from the shape of the class of objects it belongs to, allows input descriptions to be matched against memory in a number of ways. Coarse information about a shape specified in a top-level coordinate system can be matched against models for general classes (e.g., quadrupeds) first, constraining the class of shapes that are checked the next level down, and so on. Thus when recognizing the shape of a person, there is no need to match it against shape descriptions of particular types of guppies, parakeets, or beetles once it has been concluded that the gross shape is that of a primate. (Another advantage of using this scheme is that if a shape is successfully matched at a higher level but not at any of the lower levels, it can still be classified as falling into a general class or pattern, such as being a bird, even if one has never encountered that type of bird before.) An alternative way of searching shape memory is to allow the successful recognition of a shape in a high-level model to trigger the matching of its subordinate part-models against as-yet unrecognized parts in the input, or to allow the successful recognition of individual parts to trigger the matching of their superordinate models against the as-yet unrecognized whole object in the input containing that part. (For empirical studies on the order in which shape representations are matched against inputs, see Jolicoeur et al. 1984a; Rosch et al. 1976; Smith et al. 1978. These studies suggest that the first index into shape memory may be at a ‘basic object’ level, rather than the most abstract level, at least for prototypical exemplars of a shape.)

Representing shapes of parts

Once the decomposition of a shape into its component axes is accomplished, the shapes of the components that are centered on each axis must be specified as well.
Figure 3. Marr and Nishihara’s 3-D model description for a human shape. A shows how the whole shape is decomposed into a hierarchy of models, each enclosed by a rectangle. B shows the information contained in the model description: the subordinate models contained in each superordinate, and the location and orientation of the defining axis of each subordinate with respect to a coordinate system centered on a part of the superordinate. The meanings of the symbols used in the model are illustrated in C and D: the endpoint of a subordinate axis is defined by three parameters in a cylindrical coordinate system centered on a superordinate part (left panel of C); the orientation and length of the subordinate axis are defined by three parameters in a spherical coordinate system centered on the endpoint and aligned with the superordinate part (right panel of C). Angles and lengths are specified by ranges rather than by exact values (D). From Marr and Nishihara (1978).

Marr and Nishihara conjecture that shapes of parts may be described in terms of generalized cones (Binford, 1971). Just as a cone can be defined as the surface traced out when a circle is moved along a straight line perpendicular to the circle while its diameter steadily shrinks, a generalized cone can be defined as the surface traced out when any planar closed shape is moved along any smooth line with its size smoothly changing in any way. Thus to specify a particular generalized cone, one must specify the shape of the axis (i.e., how it bends, if at all), the two-dimensional shape of the generalized cone’s cross-section, and the gradient defining how its area changes as a function of position along the axis. (Marr and Nishihara point out that shapes formed by biological growth tend to be well-modeled by generalized cones, making them good candidates for internal representations of the shapes of such parts.)
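The three ingredients of a generalized cone named above (axis, cross-section, and the rule by which size changes along the axis) can be illustrated by sampling points on such a surface. The sketch below makes simplifying assumptions that are not part of the definition: the axis is a straight line, the cross-section is given as a short list of boundary points, and the size rule is a linear scale factor rather than an area. The function `sweep` and all values are hypothetical.

```python
import math

def sweep(cross_section, scale, length, steps):
    """Sample a generalized cone with a straight axis along z.

    cross_section: boundary points (x, y) of a planar closed shape
    scale: function giving the size factor at position t in [0, 1] along the axis
    Returns one ring of (x, y, z) points per sampled position.
    """
    rings = []
    for i in range(steps + 1):
        t = i / steps
        k = scale(t)
        rings.append([(k * x, k * y, t * length) for x, y in cross_section])
    return rings

# An ordinary cone: a circular cross-section that shrinks steadily to a point.
circle = [(math.cos(2 * math.pi * j / 8), math.sin(2 * math.pi * j / 8)) for j in range(8)]
cone = sweep(circle, lambda t: 1.0 - t, length=2.0, steps=4)

# A generalized cone: a square cross-section that swells and then narrows again.
square = [(1.0, 1.0), (-1.0, 1.0), (-1.0, -1.0), (1.0, -1.0)]
bulged = sweep(square, lambda t: 1.0 + 0.5 * math.sin(math.pi * t), length=3.0, steps=6)
```

Only the cross-section and the scaling rule differ between the two examples. A bent axis would additionally require the cross-section to be reoriented at each step so that it stays perpendicular to the axis.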
In addition, surface primitives such as rectangular, circular, or bloblike markings can also be specified in terms of their positions with respect to the axis model.

Deriving 3-D descriptions from the 2½-D sketch

Unfortunately, this is an aspect of the Marr and Nishihara model that has not been developed in much detail. Marr and Nishihara did outline a limited process for deriving 3-D descriptions from the two-dimensional silhouette of the object. The process first carves the silhouette into parts at extrema of curvature, using a scheme related to the one proposed by Hoffman and Richards (1984). Each part is given an axis coinciding with its direction of elongation, and lines are created joining endpoints to neighboring axes. The angles between axes and lines are measured and recorded, the resulting description is matched against top-level models in memory, and the best-matched model is chosen. At that point, constraints on how a part is situated and oriented with respect to the superordinate axis in that model can be used to identify the viewer-relative orientation of the part axis in the 2½-D sketch. That would be necessary if the orientation of that part cannot be determined by an examination of the sketch itself, such as when its axis is pointing toward the viewer and hence is foreshortened. Once the angle of an axis is specified more precisely, it can be used in selecting subordinate 3-D models for subsequent matching.

The Marr and Nishihara model is the most influential contemporary model of three-dimensional shape recognition, and it is not afflicted by many of the problems that afflict the textbook models of shape representation summarized earlier. Nonetheless, the model does have a number of problems, which largely define the central issues to be addressed in current research on shape recognition. In the next section, I summarize some of these problems briefly.
Current problems in shape recognition research

Choice of shape primitives to represent parts

The shape primitives posited by Marr and Nishihara (generalized cones centered on axes of elongation or symmetry) have two advantages: they can easily characterize certain important classes of objects, such as living things, and they can easily be derived from their silhouettes. But Hoffman and Richards (1984) point out that many classes of shapes cannot be easily described in this scheme, such as faces, shoes, clouds, and trees. Hoffman and Richards take a slightly different approach to the representation of parts in a shape description. They suggest that the problem of describing parts (i.e., assigning them to categories) be separated from the problem of finding parts (i.e., determining how to carve an object into parts). If parts are only found by looking for instances of certain part categories (e.g., generalized cones) then parts that do not belong to any of those categories would never be found. Hoffman and Richards argue that, on the contrary, there is a psychologically plausible scheme for finding part boundaries that is ignorant of the nature of the parts it defines. The parts delineated by these boundaries at each scale can be categorized in terms of a taxonomy of lobes and blobs based on the patterns of inflections and extrema of curvature of the lobe’s surface. (Hoffman (1983) has worked out a taxonomy for primitive shape descriptors, called ‘codons’, for two-dimensional plane curves.)
They argue not only that the decomposition of objects into parts is more basic for the purposes of recognition than the description of each part, but that the derivation of part boundaries and the classification of parts into sequences of codon-like descriptors might present fewer problems than the derivation of axis-based descriptions, because the projective geometry of extrema and inflections of curvature allows certain reliable indicators of these extrema in the image to be used as a basis for identifying them (see Hoffman, 1983).

Another alphabet of shape primitives that has proven useful in computer vision consists of a set of canonical volumetric shapes such as spheres, parallelepipeds, pyramids, cones, and cylinders, with parameterized sizes and (possibly) aspect ratios, joined together in various ways to define the shape of an object (see e.g., Hollerbach, 1975; Badler and Bajcsy, 1978).

It is unlikely that a single class of primitives will be sufficient to characterize all shapes, from clothes lying in a pile to faces to animals to furniture. That means that the derivation process must be capable of determining, prior to describing and recognizing a shape, which types of primitives are appropriate to it. There are several general schemes for doing this. A shape could be described in parallel in terms of all the admissible representational schemes, and descriptions in inappropriate schemes could be rejected because they are unstable over small changes in viewing position or movement, or because no single description within a scheme can be chosen over a large set of others within that scheme. Or there could be a process that uses several coarse properties of an object, such as its movement, surface texture and color, dimensionality, or sound to give it an initial classification into broad categories such as animal versus plant versus artifact, each with its own scheme of primitives and their organization (e.g., see Richards (1979, 1982) on “playing 20 questions” with the perceptual input).

Assigning frames of reference to a shape

In a shape representation, size, location, and orientation cannot be specified in absolute terms but only with respect to some frame of reference. It is convenient to think of a frame of reference as a coordinate system centered on or aligned with the reference object, and transformations within or between reference frames as being effected by an analogue of matrix multiplication taking the source coordinates as input and deriving the destination coordinates as output. However, a reference frame need not literally be a coordinate system. For example, it could be an array of arbitrarily labelled cells, where each cell represents a fixed position relative to a reference object. In that case, transformations within or between such reference frames could be effected by fixed connections among corresponding source and destination cells (e.g., a network of connections linking each cell with its neighbor to the immediate right could effect translation when activated iteratively; see e.g., Trehub, 1977).

If a shape is represented for the purpose of recognition in terms of a coordinate system or frame of reference centered on the object itself, the shape recognition system must have a way of determining what the object-centered frame of reference is prior to recognizing the object. Marr and Nishihara conjecture that a coordinate system used in recognition may be aligned with an object’s axes of elongation, bilateral symmetry, radial symmetry (for objects that are radially symmetrical in one plane and extended in an orthogonal direction), rotation (for jointed objects), and possibly linear movement.
Each of these is suitable for aligning a coordinate system with an object because each is derivable prior to object recognition and each is fairly invariant for a type of object across changes in viewing position.

This still leaves many problems unsolved. For starters, these methods only fix the orientation of one axis of the cylindrical coordinate system. The direction of the cylindrical coordinate system for that axis (i.e., which end is zero), the orientation of the zero point of its radial scale, and the handedness of the radial scale (i.e., whether increasing angle corresponds to going clockwise or counterclockwise around the scale) are left unspecified, as is the direction of one of the scales used in the spherical coordinate system specified within the cylindrical one (assuming its axes are aligned with the axis of the cylindrical system and the line joining it to the cylindrical system) (see Fig. 3c). Furthermore, even the choice of the orientation of the principal axis will be difficult when an object is not elongated or symmetrical, or when the principal axis is occluded, foreshortened, or physically compressed. For example, if the top-level description of a cow shape describes the dispositions of its parts with respect to the cow’s torso, then when the cow faces the viewer the torso is not visible, so there is no way for the visual system to describe, say, the orientations of the leg and head axes relative to its axis.

There is evidence that our assignment of certain aspects of frames of reference to an object is done independently of its intrinsic geometry. The positive-negative direction of an intrinsic axis, or the assignment of an axis to an object when there is no elongation or symmetry, may be done by computing a global up-down direction. Rock (1973, 1983) presents extensive evidence showing that objects’ shapes are represented relative to an up-down direction.
For example, a square is ordinarily ‘described’ internally as having a horizontal edge at the top and bottom; when the square is tilted 45°, it is described as having vertices at the top and bottom and hence is perceived as a different shape, namely, a diamond. The top of an object is not, however, necessarily the topmost part of the object’s projection on the retina: Rock has shown that when subjects tilt their heads and view a pattern that, unknown to them, is tilted by the same amount (so that it projects the same retinal image), they often fail to recognize it. In general, the up-down direction seems to be assigned by various compromises among the gravitational upright, the retinal upright, and the prevailing directions of parallelism, pointing, and bilateral symmetry among the various features in the environment of the object (Attneave, 1968; Palmer and Bucher, 1981; Rock, 1973). In certain circumstances, the front-back direction relative to the viewer may also be used as a frame of reference relative to which the shape is described; Rock et al. (1981) found that subjects would fail to recognize a previously-learned asymmetrical wire form when it was rotated 90° about the vertical axis.

What about the handedness of the angular scale in a cylindrical coordinate system (e.g., the θ parameter in Fig. 3)? One might propose that the visual system employs a single arbitrary direction of handedness for a radial scale that is uniquely determined by the positive-negative direction of the long axis orthogonal to the scale. For example, we could use something analogous to the ‘right hand rule’ taught to physics students in connection with the orientation of a magnetic field around a wire (align the extended thumb of your right hand with the direction of the flow of current, and look which way your fingers curl). There is evidence, however, that the visual system does not use any such rule.
Shepard and Hurwitz (1984, in this issue; see also Hinton and Parsons, 1981; Metzler and Shepard, 1975) point out that we do not in general determine how parts are situated or oriented with respect to the left-right direction on the basis of the intrinsic geometry of the object (e.g., when we are viewing left and right hands). Rather, we assign the object a left-right direction in terms of our own egocentric left and right sides. When an object’s top and bottom do not correspond to an egocentric or gravitational top-bottom direction, we mentally rotate it into such an orientation, and when two unfamiliar objects might differ in handedness, we rotate one into the orientation of the other (taking greater amounts of time for greater angles of rotation; mental rotation is discussed further later in this paper). Presumably this failure to assign objects intrinsic left and right directions is an evolutionary consequence of the fact that aside from human artifacts and body parts, virtually no class of ecologically significant shapes need be distinguished from their enantiomorphs (Corballis and Beale, 1976; Gardner, 1967).

To the extent that a shape is described with respect to a reference frame that depends on how the object is oriented with respect to the viewer or the environment, shape recognition will fail when the object moves with respect to the viewer or environment. In cases where we do succeed at recognizing an object across its different dispositions and where object-centered frames cannot be assigned, there are several possible reasons for such success. One is that multiple shape descriptions, corresponding to views of the object with different major axes occluded, are stored under a single label and corresponding parts in the different descriptions are linked.
Another is that the representation of the object is rotated into a canonical orientation or until the description of the object relative to the frame matches a memorized shape description; alternatively, the reference frame or canonical orientation could be rotated into the orientation of the object. Interestingly, there is evidence from Cooper and Shepard (1973) and Shepard and Hurwitz (1984) that the latter option (rotating an empty reference frame) is difficult or impossible for humans to do: advance information about the orientation of an upcoming visual stimulus does not spare the perceiver from having to rotate the stimulus mentally when it does appear in order to judge its handedness. (Footnote 2: Hinton and Parsons (1981) have shown that when the various stimuli to be judged all conform to a single shape schema (e.g., alphanumeric characters with a vertical spine and strokes attached to the right side of the spine, such as ‘R’, ‘L’, and ‘F’), advance information about orientation saves the subject from having to rotate the stimulus. However, it is possible that in their experiment subjects rotated a concrete image of a vertical spine plus a few strokes, rather than an empty reference frame.) A third possibility stems from Hoffman and Richards’s (1984) suggestion that part segmentation may be independent of orientation, and that only the representations of spatial relations among parts are orientation-sensitive. If so, recognition of an isolated part can be used as an index to find the objects in memory that contain that part. Finally, in some cases recognition might fail outright with changes in orientation but the consequences might be innocuous.
Because of the pervasiveness of gravity, many shapes will rarely be seen in any position but the upright (e.g., faces, trees), and many of the differences in precise shape among objects lacking axes of symmetry, movement, rotation, or elongation are not ecologically significant enough for us to distinguish among them in memory (e.g., differences among bits of gravel or crumpled newspaper). Naturally, to the extent that any of the suggestions made in this paragraph are true, the importance of Marr and Nishihara’s argument for canonical object-centered descriptions lessens (see footnote 3).

Frames of reference for the visual field

We not only represent the shapes of objects internally; we also represent the locations and orientations of objects and surfaces in the visual field. The frames of reference that we use to represent this information will determine the ease with which we can make various spatial judgments. The relevant issues are the alignment of the frames of reference, and the form of the frames of reference.

Early visual representations are in a viewer-centered and approximately spherical frame of reference; that is, our eyes give us information about the world in terms of the azimuth and elevation of the line of sight at which the features are found relative to the retina, and their distance from the viewing position (this is the coordinate system used for the 2½-D sketch). Naturally, this is a clumsy representation to use in perceiving invariant spatial relations, since the information will change with eye movements. The system can compensate for eye movements by superimposing a head-centered coordinate system on top of the retinal system and moving the origin of that coordinate system in conjunction with eye movement commands.
Thus every cell in the 2½-D sketch would be represented by the fixed ‘address’ defined with respect to the retina, and also by its coordinates with respect to the head, which would be dynamically adjusted during eye movements so that fixed locations in the world retain a constant coordinate address within the head-centered system. A third coordinate system, defined over the same information, could represent position with respect to the straight ahead direction of the body and it could be updated during head movements to represent the invariant position of surfaces across those movements. Other coordinate systems could be defined over these visible surface representations as well, such as coordinate systems aligned with the gravitational upright and horizontal ground (see Shepard and Hurwitz, 1984), with fixed salient landmarks in the world, or with the prevailing directions of large surfaces (e.g., the walls in a tilted room).

Footnote 3: Specifying the origin of the object-centered coordinate system presents a slightly different set of issues than specifying the orientation of its axes. An origin for an object-centered frame can be determined by finding its visual center of mass or by assigning it to one end of a principal axis. It is noteworthy that there are no obvious cases where we fail to recognize an object when it is displaced, where we see a shape as ambiguous by virtue of assigning different ‘centers’ or ‘reference locations’ to it (analogous to the diamond/tilted square ambiguity), or where we have to mentally translate an object in order to recognize it or match it against a comparison object. This indicates either that the procedure that assigns an origin to an object on the basis of its intrinsic geometry always yields a unique solution for an object, or that, as Hinton (1979a) suggests, we do not compute an origin at all in shape descriptions, only a set of significant directions.
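The eye-movement compensation described above, in which a fixed location keeps a constant head-centered address while its retinal address changes, can be illustrated with a minimal Python sketch. It assumes a deliberately simplified one-dimensional case: only horizontal eye rotation, a stationary head, and angles in degrees that simply add. The function names `retinal_address` and `head_centered_address` are hypothetical and are not taken from the text.

```python
def retinal_address(head_azimuth_deg, eye_azimuth_deg):
    """Azimuth at which a feature falls relative to the line of sight.

    Assumption: the head is still and only horizontal eye rotation matters,
    so the retinal azimuth is the head-relative azimuth minus the eye rotation.
    """
    return head_azimuth_deg - eye_azimuth_deg


def head_centered_address(retinal_azimuth_deg, eye_azimuth_deg):
    """Shift a retinal address by the current eye rotation.

    This plays the role of moving the origin of the head-centered system
    in step with the eye movement command.
    """
    return retinal_azimuth_deg + eye_azimuth_deg


if __name__ == "__main__":
    landmark = 20  # a fixed location, 20 degrees to the right of the head's midline
    for eye in (-10, 0, 15):  # three successive fixations
        on_retina = retinal_address(landmark, eye)
        # Columns: eye rotation, retinal address, head-centered address
        print(eye, on_retina, head_centered_address(on_retina, eye))
```

In this sketch the retinal address differs at every fixation, while the head-centered address is the same each time, which is the property that makes the second system more convenient for representing invariant spatial relations.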
These coordinate systems for objects’ positions with respect to one’s body or with respect to the environment could be similar to those used to represent the parts of an object with respect to the object as a whole. Presumably they are also like coordinate systems for objects’ shapes in being organized hierarchically, so that a paper clip might be represented by its position with respect to the desk tray it is in, whose position is specified with respect to the desk, whose position is specified with respect to the room. Beyond the visual world, the origin and orientation of large frames of reference such as that for a room could be specified in a hierarchy of more schematic frames of reference for entities that cannot be seen in their entirety, such as those for floor plans, buildings, streets, neighborhoods and so on (see e.g., Kuipers, 1978; Lynch, 1960; McDermott, 1980).

The possible influence of various frames of reference on shape perception can be illustrated by an unpublished experiment by Laurence Parsons and Geoff Hinton. They presented subjects with two Shepard-Metzler cube figures, one situated 45° to the left of the subject, another at 45° to the right. The task was to turn one of the objects (physically) to whatever orientation best allowed the subject to judge whether the two were identical or whether one was a mirror-reversed version of the other (subjects were allowed to move their heads around the neck axis). If objects were represented in coordinate systems centered upon the objects themselves, subjects would not have to turn the object at all (we know from the Shepard and Metzler studies that this is quite unlikely to be true for these stimuli). If objects are represented in a coordinate system aligned with the retina, subjects should turn one object until the corresponding parts of the two objects are perpendicular to each other, so that they will have the same orientations with respect to their respective lines of sight.
And if shapes are represented in a coordinate system aligned with salient environmental directions (e.g., the walls), one object would be turned until its parts are parallel to those of the other, so that they will have the same orientations with respect to the room. Parsons and Hinton found that subjects aligned one object so that it was nearly parallel with another, with a partial compromise toward keeping the object’s retinal projections similar (possibly so that corresponding cube faces on the two objects would be simultaneously visible). This suggests that part orientations are represented primarily with respect to environmentally-influenced frames.

The choice of a reference object, surface, or body part is closely tied to the format of the coordinate system aligned with the frame of reference, since rotatable objects (such as the eyes) and fixed landmarks easily support coordinate systems containing polar scales, whereas reference frames with orthogonal directions (e.g., gravity and the ground, the walls of a room) easily support Cartesian-like coordinate systems.

The type of coordinate system employed has effects on the ease of making certain spatial judgments. As mentioned, the 2½-D sketch represents information in a roughly spherical coordinate system, with the result that the easiest information to extract concerning the position of an edge or feature is its distance and direction with respect to the vantage point. As Marr (1982) points out, this representation conceals many of the geometric properties of surfaces that it would be desirable to have easy access to; something closer to a Cartesian coordinate system centered on the viewer would be much handier for such purposes.
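The contrast between the two kinds of coordinate system can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: a two-dimensional top-down view, a viewer at the origin looking along the positive y axis, and surface patches reduced to a position and a normal vector. The names `cartesian_orientation` and `polar_orientation` and the patch values are hypothetical.

```python
import math

STRAIGHT_AHEAD = (0.0, 1.0)  # assumed viewing direction of a viewer at the origin


def signed_angle_deg(u, v):
    """Signed angle in degrees from vector u to vector v (no normalization needed)."""
    dot = u[0] * v[0] + u[1] * v[1]
    cross = u[0] * v[1] - u[1] * v[0]
    return math.degrees(math.atan2(cross, dot))


def cartesian_orientation(patch):
    """Orientation of the patch normal relative to the fixed straight-ahead direction."""
    return signed_angle_deg(STRAIGHT_AHEAD, patch["normal"])


def polar_orientation(patch):
    """Orientation of the patch normal relative to the line of sight to that patch."""
    return signed_angle_deg(patch["position"], patch["normal"])


# Two patches with the same normal (hence parallel), seen in different directions.
patch_a = {"position": (-2.0, 4.0), "normal": (1.0, -2.0)}
patch_b = {"position": (2.0, 6.0), "normal": (1.0, -2.0)}

if __name__ == "__main__":
    print("Cartesian-like:", cartesian_orientation(patch_a), cartesian_orientation(patch_b))
    print("Line-of-sight: ", polar_orientation(patch_a), polar_orientation(patch_b))
```

With the Cartesian-like measure the two patches receive the same orientation value, so their parallelism can be read off by direct comparison. With the line-of-sight measure the values differ, and recovering parallelism would require also taking each patch's direction from the viewer into account.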
For example, if two surfaces in different parts of the visual field are parallel, their orientations as measured in a spherical coordinate system will be different, but their orientations as measured in a coordinate system with a parallel component (e.g., Cartesian) will be the same (see Fig. 4). If a surface is flat, the represented orientations of all the patches composing its surface will be identical in Cartesian, but not in spherical coordinates. Presumably, size constancy could also be a consequence of such a coordinate system, if a given range of coordinates in the left-right or up-down directions always stood for a constant real world distance regardless of the depth of the represented surface.

Figure 4. Effects of rectangular versus polar coordinate systems on making spatial judgments. Whether two surfaces are parallel can be assessed by comparing their angles with respect to the straight ahead direction in a rectangular coordinate system (b), but not by comparing their angles with respect to the lines of sight in a polar system (a). From Marr (1982).

One potentially relevant bit of evidence comes from a phenomenon studied by Corcoran (1977), Natsoulas (1966), and Kubovy et al. (1984, Reference note 1). When an asymmetric letter such as ‘d’ is traced with a finger on the back of a person’s head, the person will correctly report what the letter is. But when the same letter is traced on the person’s forehead, the mirror image of that letter is reported instead (in this case, ‘b’). This would follow if space (and not just visible space) is represented in a parallel coordinate system aligned with a straight ahead direction, such as that shown in Fig. 4b.
The handedness of a letter would be determined by whether its spine was situated to the left or right of the rest of its parts, such that ‘left’ and ‘right’ would be determined by a direction orthogonal to the straight ahead direction, regardless of where on the head the letter is drawn. The phenomenon would not be expected in an alternative account, where space is represented using spherical coordinates centered on a point at or behind the eyes (e.g., Fig. 4a), because then the letter would be reported as if ‘seen’ from the inside of a transparent skull, with letters traced on the back of the head reported as mirror-reversed, contrary to fact.

In many experiments allowing subjects to choose between environmental, Cartesian-like reference frames and egocentric, spherical reference frames, subjects appear to opt for a compromise (e.g., the Parsons and Hinton and Kubovy et al. studies; see also Attneave, 1972; Gilinsky, 1955; Uhlarik et al., 1980). It is also possible that we have access to both systems, giving rise to ambiguities when a single object is alternatively represented in the two systems, for example, when railroad tracks are seen either as parallel or as converging (Boring, 1952; Gibson, 1950; Pinker, 1980a), or when the corner formed by two edges of the ceiling of a room can be seen both as a right angle and as an obtuse angle.

Deriving shape descriptions

One salient problem with the Marr and Nishihara model of shape recognition in its current version is that there is no general procedure for deriving an object-centered 3-D shape description from the 2½-D sketch. The algorithm proposed by Marr and Nishihara using the two-dimensional silhouette of a shape to find its intrinsic axes has trouble deriving shape descriptions when axes are foreshortened or occluded by other parts of the object (as Marr and Nishihara pointed out).
In addition, the procedures it uses for joining up part boundaries to delineate parts, to find axes of parts once they are delineated, and to pair axes with one another in adjunct relations rely on some limited heuristics that have not been demonstrated to work other than for objects composed of generalized cones, but the perceiver cannot in general know prior to recognition whether he or she is viewing such an object. Furthermore, there is no explicit procedure for grouping together the parts that belong together in a single hierarchical level in the 3-D model description. Marr and Nishihara suggest that all parts lying within a ‘coarse spatial context’ surrounding an axis can be placed within the scope of the model specific to that axis, but numerous problems could arise when unrelated parts are spatially contiguous, such as when a hand is resting on a knee.

Some of these problems perhaps could be resolved using an essentially similar scheme when information that is richer than an object’s silhouette is used. For example, the depth, orientation, and discontinuity information in the 2½-D sketch could assist in the perception of foreshortened axes (though not when the blunt end of a tapering object faces the viewer squarely), and information about which spatial frequency bandpass channels an edge came from could help in the segregation of parts into hierarchical levels in a shape description.

A general problem in deriving shape representations from the input is that, as mentioned, the choice of the appropriate reference frame and shape primitives depends on what type of shape it is, and shapes are recognized via their description in terms of primitives relative to a reference frame. In the remainder of this section I describe three types of solutions to this chicken-and-egg problem.
Top-down processing

One response to the inherent difficulties of assigning descriptions to objects on the basis of their retinal images is to propose that some form of ancillary information based on a person’s knowledge about regularities in the world is used to choose the most likely description or at least to narrow down the options (e.g., Gregory, 1970; Lindsay and Norman, 1977; Minsky, 1975; Neisser, 1967). For example, a cat-owner could recognize her cat upon seeing only a curved, long, grey stripe extending out from underneath her couch, based on her knowledge that she has a long-tailed grey cat that enjoys lying there. In support of top-down, or, more precisely, knowledge-guided perceptual analysis, Neisser (1967), Lindsay and Norman (1977), and Gregory (1970) have presented many interesting demonstrations of possible retinal ambiguities that may be resolved by knowledge of physical or object-specific regularities, and Biederman (1972), Weisstein and Harris (1974) and Palmer (1975b) and others have shown that the recognition of an object, a part of an object, or a feature can depend on the identity that we attribute to the context object or scene as a whole.

Despite the popularity of the concept of top-down processing within cognitive science and artificial intelligence during much of the 1960s and 1970s, there are three reasons to question the extent to which general knowledge plays a role in describing and recognizing shapes.
First, many of the supposed demonstrations of top-down processing leave it completely unclear what kind of knowledge is brought to bear on recognition (e.g., regularities about the geometry of physical objects in general, about particular objects, or about particular scenes or social situations), and how that knowledge is brought to bear (e.g., altering the order in which memory representations are matched against the input, searching for particular features or parts in expected places, lowering the goodness-of-fit threshold for expected objects, generating and fitting templates, filling in expected parts). Fodor (1983) points out that these different versions of the top-down hypothesis paint very different pictures of how the mind is organized in general: if only a restricted type of knowledge can influence perception in a top-down manner, and then only in restricted ways, the mind may be constructed out of independent modules with restricted channels of communication among them. But if all knowledge can influence perception, the mind could consist of an undifferentiated knowledge base and a set of universal inference procedures which can be combined indiscriminately in the performance of any task. Exactly which kind of top-down processing is actually supported by the data can make a big difference in one’s conception of how the mind works; Fodor argues that so far most putative demonstrations of top-down phenomena are not designed to distinguish among possible kinds of top-down processing and so are uninformative on this important issue.

A second problem with extensive top-down processing is that there is a great deal of information about the world that is contained in the light array, even if that information cannot be characterized in simple familiar schemes such as templates or features (see Gibson, 1966, 1979; Marr, 1982).
Given the enormous selection advantage that would be conferred on an organism that could respond to what was really in the world as opposed to what it expected to be in the world whenever these two descriptions were in conflict, we should seriously consider the possibility that human pattern recognition has the most sophisticated bottom-up pattern analyses that the light array and the properties of our receptors allow. And as Ullman (1984, this issue) points out, we do appear to be extremely accurate perceivers even when we have no basis for expecting one object or scene to occur rather than another, such as when watching a slide show composed of arbitrary objects and scenes.

Two-stage analysis of objects

Ullman (1984) suggests that our visual systems may execute a universal set of ‘routines’ composed of simple processes operating on the 2½-D sketch, such as tracing along a boundary, filling in a region, marking a part, and sequentially processing different locations. Once universal routines are executed, their outputs could characterize some basic properties of the prominent entities in the scene such as their rough shape and spatial relationships. This characterization could then trigger the execution of routines specific to the recognition of particular objects or classes of objects. Because routines can be composed of arbitrary sequences of very simple but powerful processes, it might be possible to compile both a core of generally useful routines, plus a large set of highly specific routines suitable for the recognition of very different classes of objects, rather than a canonical description scheme that would have to serve for every object type. (In Ullman’s theory visual routines would be used not only for the recognition of objects but also for geometric reasoning about the surrounding visual environment such as determining whether one object is inside another or counting objects.)
Richards (1979, 1982) makes a related proposal concerning descriptions for recognition, specifically, that one might first identify various broad classes of objects such as animal, vegetable, or mineral by looking for easily sensed hallmarks of these classes such as patterns of movement, color, surface texture, dimensionality, even coarse spectral properties of their sounds. Likely reference frames and shape primitives could then be hypothesized based on this first-stage categorization.

Massively parallel models

There is an alternative approach that envisions a very different type of solution from that suggested by Richards, and that advocates very different types of mechanisms from those described in this issue by Ullman. Attneave (1982), Hinton (1981) and Hrechanyk and Ballard (1982) have outlined related proposals for a model of shape recognition using massively parallel networks of simple interconnected units, rather than sequences of operations performed by a single powerful processor (see Ballard et al., 1983; Feldman and Ballard, 1982; Hinton and Anderson, 1981, for introductions to this general class of computational architectures).

A favorite analogy for this type of computation (e.g., Attneave, 1982) is the problem of determining the shape of a film formed when an irregularly shaped loop of wire is dipped in soapy water (the shape can be characterized by quantizing the film surface into patches and specifying the height of each patch). The answer to the problem is constrained by the ‘least action’ principle ensuring that the height of any patch of the film must be close to the heights of all its neighboring patches. But how can this information be used if one does not know beforehand the heights of all the neighbors of any patch? One can solve the problem iteratively, by assigning every patch an arbitrary initial height except for those patches touching the wire loop, which are assigned the same heights as the piece of wire they are attached to. Then the height of each of the other patches is replaced by the average height of its neighbors. This is repeated through several iterations; eventually the array of heights converges on a single set of values corresponding to the shape of the film, thanks to constraints on height spreading inward from the wire. The solution is attained without knowing the height of any single interior patch a priori, and without any central processor.

Similarly, it may be possible to solve some perceptual problems using networks of simple units whose excitatory and inhibitory interconnections lead the entire network to converge to states corresponding to certain geometric constraints that must be satisfied when recognition succeeds. Marr and Poggio (1976) proposed such a ‘cooperative’ model for stereo vision that simultaneously finds the relative distance from the viewer of each feature in a pair of stereoscopic images and which feature in one image corresponds with a given feature in the other. It does so by exploiting the constraints that each feature must have a single disparity and that neighboring features mostly have similar disparities.

In the case of three-dimensional shape recognition, Attneave, Hinton, and Hrechanyk and Ballard point out that there are constraints on how shape elements and reference frames may be paired that might be exploitable in parallel networks to arrive at both simultaneously. First, every part of an object must be described with respect to the same object-centered reference frame (or at least, every part of an object in a circumscribed region at a given scale of decomposition; see the discussion of the Marr and Nishihara model).
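Returning briefly to the soap-film analogy introduced above, the iterative averaging it describes can be sketched in a few lines of Python. This is a minimal illustration rather than a model from the literature: the square grid, the use of the four nearest neighbors, the fixed number of iterations, and the names `relax_film` and `demo_film` are all assumptions made for the example.

```python
def relax_film(heights, on_wire, iterations=200):
    """Repeatedly replace each free patch by the average of its neighbors.

    heights: 2-D list of initial patch heights (arbitrary for free patches).
    on_wire: 2-D list of booleans; True marks patches clamped to the wire.
    Every patch is updated from the previous iteration's values, so no patch
    needs to know the final height of any neighbor in advance.
    """
    rows, cols = len(heights), len(heights[0])
    grid = [list(row) for row in heights]
    for _ in range(iterations):
        updated = [row[:] for row in grid]
        for r in range(rows):
            for c in range(cols):
                if on_wire[r][c]:
                    continue  # wire patches keep the height of the wire
                neighbors = [
                    grid[rr][cc]
                    for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                    if 0 <= rr < rows and 0 <= cc < cols
                ]
                updated[r][c] = sum(neighbors) / len(neighbors)
        grid = updated
    return grid


def demo_film(size=5):
    """A square wire loop whose height rises linearly from left to right."""
    edge = (0, size - 1)
    on_wire = [[r in edge or c in edge for c in range(size)] for r in range(size)]
    # Interior patches start at an arbitrary height (10.0); wire patches at height c.
    heights = [
        [float(c) if on_wire[r][c] else 10.0 for c in range(size)]
        for r in range(size)
    ]
    return relax_film(heights, on_wire)


if __name__ == "__main__":
    for row in demo_film():
        print([round(h, 3) for h in row])
```

In this example the interior heights settle onto the tilted plane spanned by the wire even though they start from an arbitrary value, because the constraint imposed at the boundary spreads inward with each pass of local averaging.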
For example, if one part is described as the front left limb of an animal standing broadside to the viewer and facing to the left, another part of the same object cannot simultaneously be described as the rear left limb of that animal facing to the right. Second, a description of parts relative to an object-centered frame is to be favored if that description corresponds to an existing object description in memory. For example, a horizontal part will be described as the downward-pointing leg of a chair lying on its back rather than as the forward-facing leg of an unfamiliar upright object. These constraints, it is argued, can be used to converge on a unique correct object-centered description in a network of the following sort. There is a retina-based unit for every possible part at every retinal size, location, and orientation. There is also an object-based unit for every orientation, location, and size of a part with respect to an object axis. Of course, these units cannot be tied to individual retina-based units, but each object-based unit can be connected to the entire set of retina-based units that are geometrically consistent with it. Every shape description in memory consists of a shape unit that is connected to its constituent object-based units. Finally, all the pairs of object- and retina-based units that correspond to a single orientation of the object axis relative to the viewer are themselves tied together by a mapping unit, such that the system contains one such unit for each possible spatial relation between object and viewer. An example of such a network, taken from Hinton (1981), is shown in Fig. 5.
Figure 5. A portion of a massively parallel network model for shape recognition. Triangular symbols indicate special multiplicative connections: the product of activation levels of a retina-based and a mapping unit is transmitted to an object-based unit, and the product of the activation levels in those retina-based and object-based units is transmitted to the mapping unit. From Hinton (1981). (Figure labels: shape units, mapping units.)
The system’s behavior is characterized as follows. The visual input activates retina-based units. Retina-based units activate all the object-based units they are connected to (this will include all object-based units that are geometrically compatible with the retinal features, including units that are inappropriate for the current object). Object-based units activate their corresponding shape units (again, both appropriate and inappropriate ones). Joint activity in particular retina- and object-based units activates the mapping units linking the two, that is, the mapping units that represent vantage points (relative to an object axis) for which those object-based features project as those retina-based features. Similarly, joint activity in retina-based and mapping units activates the corresponding object-based units. Shape units activate their corresponding object-based units; and (presumably) shape units inhibit other shape units and mapping units inhibit other mapping units. Hinton (1981) and Hrechanyk and Ballard (1982) argue that such networks should enter into a positive feedback loop converging on a single active shape unit, representing the recognized shape, and a single active mapping unit, representing the orientation and position of its axis with respect to the viewer, when a familiar object is viewed.
In general, massively parallel models are effective at avoiding the search problems that accompany serial computational architectures.
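The feedback loop among retina-based, object-based, mapping, and shape units can be illustrated with a deliberately tiny toy model. It is a sketch under strong assumptions and not the network of Hinton (1981): the "retina" is a ring of eight positions, a mapping unit stands for a pure shift along that ring, the two templates are invented, and inhibition within a pool is approximated by renormalizing activations so that they sum to one. All names (recognize, TEMPLATES, SHIFT) are hypothetical.

```python
def normalize(values):
    """Crude stand-in for mutual inhibition: rescale a pool to sum to 1."""
    total = sum(values)
    return [v / total for v in values]


def recognize(retina, templates, iterations=30):
    n = len(retina)
    names = list(templates)
    mapping = [1.0 / n] * n                   # one mapping unit per shift
    shape = [1.0 / len(names)] * len(names)   # one shape unit per template

    for _ in range(iterations):
        # Retina-based x mapping activity drives object-based units.
        obj = [sum(mapping[k] * retina[(j + k) % n] for k in range(n))
               for j in range(n)]
        # Object-based units drive shape units, which compete.
        support = [sum(t * o for t, o in zip(templates[name], obj))
                   for name in names]
        shape = normalize([s * u for s, u in zip(shape, support)])
        # Shape units feed activation back to their object-based units.
        expected = [sum(s * templates[name][j]
                        for s, name in zip(shape, names))
                    for j in range(n)]
        # Retina-based x object-based activity drives mapping units,
        # which also compete.
        evidence = [sum(expected[j] * retina[(j + k) % n] for j in range(n))
                    for k in range(n)]
        mapping = normalize([m * e for m, e in zip(mapping, evidence)])

    return dict(zip(names, shape)), mapping


# Two invented part layouts in object-centered coordinates.
TEMPLATES = {
    "A": [1, 1, 0, 1, 0, 0, 0, 0],
    "B": [1, 0, 1, 0, 0, 1, 0, 0],
}

# Present shape A displaced by three positions on the ring.
SHIFT = 3
retina = [0] * 8
for j, present in enumerate(TEMPLATES["A"]):
    if present:
        retina[(j + SHIFT) % 8] = 1

shape, mapping = recognize(retina, TEMPLATES)
best_shape = max(shape, key=shape.get)
best_shift = max(range(len(mapping)), key=mapping.__getitem__)
print(best_shape, best_shift)
```

Both pools start out uniform, so neither the shape nor the shift is known at the outset. With these inputs the loop settles on shape A and shift 3 together, which is the kind of joint convergence the authors argue for. Nothing in this toy addresses whether the scheme scales to realistic numbers of parts, orientations, and shapes.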
In effect, the models are intended to assess the goodness-of-fit between all the transformations of an input pattern and all the stored shape descriptions in parallel, finding the pair with the highest fit at the same time. Since these models are in their infancy, it is too early to evaluate the claims associated with them. Among other things, it will be necessary to determine: (a) whether the model can be interfaced to preprocessing systems that segregate an object from its background and isolate sets of parts belonging to a single object-centered frame at a single scale; (b) whether the initial activation of object-based units by retina-based units is selective enough to head the network down a path leading toward convergence to unique, correct solutions; (c) whether the number of units and interconnections among units needed to represent the necessary combinations of shapes, parts, and their dispositions is neurologically feasible; and (d) whether these networks can overcome their current difficulty at representing and recognizing relations among parts in complex objects and scenes, in addition to the parts themselves.
Visual imagery
Visual imagery has always been a central topic in the study of cognition. Not only is it important to understand our ability to reason about objects and scenes that are remembered rather than seen, but the study of imagery is tied to the question of the number and format of mental representations, and of the interface between perception and cognition. Imagery may also be a particularly fruitful topic for study among the higher thought processes because of its intimate connection with shape recognition, benefitting from the progress made in that area. Finally, the subject of imagery is tied to scientific and literary creativity, mathematical insight, and the relation between cognition and emotion (see the papers in Sheikh, 1983); though the scientific study o