Tomaso Poggio and Shimon Edelman
Nature 343:263-266, 1990
A 3D object gives rise to an infinite variety of 2D images or views, because of the infinite number of possible poses relative to the viewer, and because of arbitrarily different illumination conditions. Is it possible to synthesize a module that can recognize an object from any viewpoint, after it learns its 3D structure from a small set of perspective views? We show that a recently proposed network scheme for the approximation of multivariate functions provides a key part of the solution to the problem. The results are especially interesting as an application of a new technique for learning from examples. They also have implications for computer vision and possibly for understanding the process of object recognition in natural vision.
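A minimal sketch of the underlying idea (not the network described in the paper): recognition of a novel view is treated as interpolation of a multivariate function from a few stored example views, here with Gaussian radial basis functions. All names, feature dimensions, and parameter values below are illustrative assumptions.
```python
import numpy as np

def gaussian_rbf(X, centers, sigma=1.0):
    """Gaussian radial basis activations between rows of X and the stored centers."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def train_view_interpolator(train_views, targets, sigma=1.0, reg=1e-6):
    """Fit RBF coefficients so that the module outputs `targets` on the training views."""
    G = gaussian_rbf(train_views, train_views, sigma)
    return np.linalg.solve(G + reg * np.eye(len(train_views)), targets)

def recognize(view, train_views, c, sigma=1.0):
    """Output for a novel view: a weighted sum of its similarities to the stored views."""
    return (gaussian_rbf(view[None, :], train_views, sigma) @ c).item()

# toy usage: each "view" is a feature vector derived from a 2D projection of the object
rng = np.random.default_rng(0)
train_views = rng.normal(size=(10, 12))    # 10 stored views, 12 features each
targets = np.ones(10)                      # the module should output ~1 for this object
c = train_view_interpolator(train_views, targets)
novel_view = train_views[0] + 0.1 * rng.normal(size=12)   # a slightly different view
print(recognize(novel_view, train_views, c))               # close to 1 for nearby views
```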
[get this paper]
Shimon Edelman
Biological Cybernetics 72:207-220, 1995.
In human vision, the processes and the representations involved in identifying specific individuals are frequently assumed to be different from those used for basic-level classification, because classification is largely viewpoint-invariant, but identification is not. This assumption was tested in psychophysical experiments, in which objective similarity between stimuli (and, consequently, the level of their distinction) varied in a controlled fashion. Subjects were trained to discriminate between two classes of computer generated 3D objects, one resembling monkeys, and the other dogs. Both classes were defined by the same set of 56 parameters, which encoded sizes, shapes, and placement of the limbs, the ears, the snout, etc. Interpolation between parameter vectors of the class prototypes yielded shapes that changed smoothly between monkey and dog. Within-class variation was induced in each trial by randomly perturbing all the parameters. After the subjects reached 90% correct performance on a fixed canonical view of each object, discrimination performance was tested for novel views that differed by up to 60 deg from the training view. In experiment 1 (in which the distribution of parameters in each class was unimodal) and in experiment 2 (bimodal classes), the stimuli differed only parametrically and consisted of the same geons (parts), yet were recognized virtually independently of viewpoint in the low-similarity condition. In experiment 3, the prototypes differed in their arrangement of geons, yet the subjects' performance depended significantly on viewpoint in the high-similarity condition. In all three experiments, higher inter-stimulus similarity was associated with an increase in the mean error rate and, for misorientation of up to 45 deg, with an increase in the degree of viewpoint dependence. These results suggest that a geon-level difference between stimuli is neither strictly necessary nor always sufficient for viewpoint-invariant performance. Thus, basic and subordinate-level processes in visual recognition may be more closely related than previously thought.
[get this paper]
Shimon Edelman and Heinrich H. Bülthoff
Vision Research 32:2385-2400, 1992.
How does the human visual system represent and recognize novel three-dimensional objects? Variation in response time over different views of objects, obtained in subordinate-level recognition tasks, hints that objects may be represented by collections of specific views, rather than by viewpoint-independent models. We report results of four experiments that provide further evidence in support of the viewpoint-specific representation hypothesis. In the first experiment we tested the recognition of objects seen repeatedly from the same set of viewpoints. Although the response times in this experiment became uniform with practice, the differences in error rate for the different views remained stable. In the second experiment, this result was replicated in the presence of a variety of depth cues in the test views, including binocular stereo. In the third experiment, recognition under monocular and stereoscopic conditions was compared over four testing sessions. In those two experiments, we found that the addition of stereo depth reduced the mean error rate, but did not affect the general pattern of performance over different views, and its development with practice. Finally, the fourth experiment probed the ability of subjects to generalize recognition to unfamiliar views of objects previously seen at a limited range of attitudes, both under mono and stereo. The same increase in the error rate with misorientation relative to the training attitude was obtained in the two conditions. Taken together, these results support the notion that 3D objects are represented by multiple specific views, possibly augmented by partial viewer-centered three-dimensional information, if it is available through stereopsis.
[get this paper]
Yair Weiss, Shimon Edelman, and Manfred Fahle
Neural Computation 5:695-718, 1993.
Performance of human subjects in a wide variety of early visual processing tasks improves with practice. HyperBF networks (Poggio and Girosi, Science, 247:978-982, 1990) constitute a mathematically well-founded framework for understanding such improvement in performance, or perceptual learning, in the class of tasks known as visual hyperacuity. The present article concentrates on two issues raised by the recent psychophysical and computational findings reported in Poggio et al., Science 256:1018-1021 (1992). First, we develop a biologically plausible extension of the HyperBF model that takes into account basic features of the functional architecture of early vision. Second, we explore various learning modes that can coexist within the HyperBF framework and focus on two unsupervised learning rules which may be involved in hyperacuity learning. Finally, we report results of psychophysical experiments that are consistent with the hypothesis that activity-dependent presynaptic amplification may be involved in perceptual learning in hyperacuity.
[get this paper]
Shimon Edelman and Daphna Weinshall
Biological Cybernetics, 64:209-219, 1991.
We explore representation of 3D objects in which several distinct 2D views are stored for each object. We demonstrate the ability of a two-layer network of thresholded summation units to support such representations. Using unsupervised Hebbian relaxation, the network learned to recognize ten objects from different viewpoints. The training process led to the emergence of compact representations of the specific input views. When tested on novel views of the same objects, the network exhibited a substantial generalization capability. In simulated psychophysical experiments, the network's behavior was qualitatively similar to that of human subjects.
[get this paper]
Florin Cutzu and Shimon Edelman
Vision Research, 34:3037-3056, 1994.
Human performance in the recognition of 3D objects, as measured by response times and error rates, frequently depends on the orientation of the object with respect to the observer. We investigated the dependence of response time (RT) and error rate (ER) on stimulus orientation for a class of random wire-like objects. First, we found no evidence for universally valid canonical views: the best view according to one subject's data was often hardly recognized by other subjects. Second, a subject by subject analysis showed that the RT/ER scores were not linearly dependent on the shortest angular distance in 3D to the best view, as predicted by the mental rotation theories of recognition. Rather, the performance was significantly correlated with an image-plane feature by feature deformation distance between the presented view and the best (shortest-RT and lowest-ER) view. Our results suggest that measurement of image-plane similarity to a few (subject-specific) feature patterns is a better model than mental rotation for the mechanism used by the human visual system to recognize objects across changes in their 3D orientation.
[get this paper]
Shimon Edelman and Daphna Weinshall
in Perceptual Constancies, V. Walsh and J. Kulikowski, eds., Cambridge U. Press, 1994 (to appear).
The appearance of a three-dimensional object (that is, the pattern formed by its projection onto the retina of an eye or onto the imaging plane of a camera) depends on the point of view of the observer. The collective human awareness of this dependence is attested to by the widespread use of expressions that involve the metaphor of point of view, in languages as different as English, Russian, and Hebrew. Nevertheless, as far as recognition is concerned, the matters of viewpoint seem to be of secondary importance: the human visual system exhibits an impressive ability to recognize a familiar object viewed from an unfamiliar perspective. This phenomenon has been termed shape constancy, by analogy with other perceptual constancies.
Computational understanding of shape constancy can be gained both by attempting to build artificial vision systems for object recognition, and by modeling human performance in this task. Maintaining a constant interpretation of the three-dimensional world in the face of changing viewing conditions has long been a major goal of computer vision. The first part of this chapter classifies and reviews several approaches to 3D object recognition developed within this field. In the second part of the chapter, we list the central characteristics of shape constancy in human vision, and compare the virtues and the shortcomings of the computational approaches, this time considered as models of human performance. We conclude with a general discussion of the phenomenon of shape constancy within the framework of the computational study of perception.
[get this paper]
Shimon Edelman
Minds and Machines 5:45-68, 1995.
It is proposed to conceive of representation as an emergent phenomenon that is supervenient on patterns of activity of coarsely tuned and highly redundant feature detectors. The computational underpinnings of the outlined concept of representation are (1) the properties of collections of overlapping graded receptive fields, as in the biological perceptual systems that exhibit hyperacuity-level performance, and (2) the sufficiency of a set of proximal distances between stimulus representations for the recovery of the corresponding distal contrasts between stimuli, as in multidimensional scaling. The present preliminary study appears to indicate that this concept of representation is computationally viable, and is compatible with psychological and neurobiological data.
[get this paper]
Yair Weiss and Shimon Edelman
Weizmann Institute CS-TR 93-09; also Network 6:19-41, 1995.
We consider the representational capabilities of systems of receptive fields found in early mammalian vision, under the assumption that the successive stages of processing remap the retinal representation space in a manner that makes objectively similar stimuli (such as different views of the same 3D object) closer to each other, and dissimilar stimuli farther apart. We present theoretical analysis and computational experiments that compare the similarity between stimuli as they are represented at the successive levels of the processing hierarchy, from the retina to the nonlinear cortical units. Our results indicate that the representations at the higher levels of the hierarchy are indeed more useful for the classification of natural objects such as human faces.
[get this paper]
Shimon Edelman
Biological Cybernetics, 70:37-45, 1993.
Idealized models of receptive fields (RFs) can be used as building blocks for the creation of powerful distributed computation systems. The present report concentrates on investigating the utility of collections of RFs in representing 3D objects under changing viewing conditions. The main requirement in this task is that the pattern of activity of RFs vary as little as possible when the object and the camera move relative to each other. I propose a method for representing objects by RF activities, based on the observation that, in the case of rotation around a fixed axis, differences of activities of RFs that are properly situated with respect to that axis remain invariant. Results of computational experiments suggest that a representation scheme based on this algorithm for the choice of stable pairs of RFs would perform consistently better than a scheme involving random sets of RFs. The proposed scheme may be useful under object or camera rotation, both for ideal Lambertian objects, and for real-world objects such as human faces.
[get this paper]
Kalanit Grill-Spector, Shimon Edelman, and Rafael Malach
Proc. NIPS'94.
The maximization of diversity of neuronal response properties has been recently suggested as an organizing principle for the formation of such prominent features of the functional architecture of the brain as the cortical columns and the associated patchy projection patterns. We report a computational study of two aspects of this hypothesis. First, we show that maximal diversity is attained when the ratio of dendritic and axonal arbor sizes is equal to one, as it has been found in many cortical areas and across species. Second, we show that maximization of diversity leads to better performance in two case studies: in systems of receptive fields implementing steerable/shiftable filters, and in matching spatially distributed signals, a problem that arises in visual tasks such as stereopsis, motion processing, and recognition.
[get this paper]
Yael Moses, Shimon Ullman, and Shimon Edelman
Weizmann CS-TR 93-14, 1993 (also Perception, 1996).
An image of a face depends not only on its shape, but also on the viewing position, illumination conditions, and facial expression. Any face recognition system must overcome the changes in face appearance induced by these factors. In this paper we address two questions: how well humans can generalize the recognition of faces to novel images, and at which computational level this generalization is performed. To answer these questions we studied the performance of subjects in a face discrimination task, and we compared it for upright and inverted faces. For upright faces, we found remarkably good generalization to novel conditions (i.e., new illumination and viewpoint). For inverted faces, the generalization to novel views was significantly worse, although the performance on the training images was similar in both cases.
Our results indicate that at least some of the processes that support generalization across viewpoint and illumination are neither universal (because subjects did not generalize as easily for inverted faces as for upright ones), nor strictly object-specific (because in upright faces nearly perfect generalization was possible from a single view, by itself insufficient for building a complete object-specific model). We propose that generalization in face recognition occurs at an intermediate level that is applicable to a class of objects, and that at this level upright and inverted faces initially constitute distinct object classes.
[get this paper]
Shimon Edelman
CVGIP:IU, 60:92-94, 1994.
According to the paradigmatic reconstructionist approach to vision, a visual system must first reconstruct the world internally, then extract from the resulting representation whatever features are necessary for the task at hand. Recent developments in computational vision and visual neuroscience show that many of the features needed for tasks ranging from spatial discrimination to object recognition can be extracted from the image directly, much as in Gibson's hypothesis of direct perception. In the emerging synthesis between Gibson's position and that of Marr, representation, and not necessarily reconstruction, plays a central role. This new synthesis seems to constitute a reasonable compromise between the extreme version of the purposive vision credo, which, paraphrasing Brooks, is vision without representation, and the reigning paradigm of reconstruction without purpose.
[get this paper]
Yacov Hel-Or and Shimon Edelman
Proc. ICPR'94, Jerusalem, 1994, A:316-320.
Nonmetric multidimensional scaling (MDS) is a family of algorithms that allow one to derive a quantitative representation of data from a set of qualitative measurements which must satisfy certain simple constraints. As a tool for vision, MDS combines the advantages of both qualitative and classical approaches, by relying, on the one hand, on an ordinal-scale input representation, and by supporting, on the other hand, the extraction of metric information. The present paper illustrates an application of MDS to the recovery of depth from the rank order of binocular disparity differences for a set of points. Our results indicate that multidimensional scaling constitutes a promising approach to the integration of biological and computational insights into the problem of depth perception.
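A minimal sketch of the kind of computation involved, under simplifying assumptions: only the rank order of pairwise disparity differences is taken as input, and scikit-learn's nonmetric MDS is used to recover a one-dimensional depth configuration consistent with those ranks. This is an illustration, not the paper's implementation.
```python
import numpy as np
from scipy.stats import rankdata
from sklearn.manifold import MDS

rng = np.random.default_rng(1)
true_depth = np.sort(rng.uniform(0.0, 1.0, size=8))       # unknown depths of 8 points

# Only the *rank order* of pairwise disparity differences is assumed available.
diffs = np.abs(true_depth[:, None] - true_depth[None, :])
ranks = rankdata(diffs[np.triu_indices(8, k=1)])           # ordinal measurements
D = np.zeros_like(diffs)
D[np.triu_indices(8, k=1)] = ranks
D = D + D.T                                                 # symmetric dissimilarity matrix

# Nonmetric MDS recovers a 1D configuration consistent with the rank order.
mds = MDS(n_components=1, metric=False, dissimilarity="precomputed", random_state=0)
recovered = mds.fit_transform(D).ravel()
print(np.corrcoef(true_depth, recovered)[0, 1])             # high |r|, up to sign and scale
```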
[get this paper]
Shimon Edelman
Neural Computation 7:407-422, 1995.
How does the brain represent visual objects? In simple perceptual generalization tasks, the human visual system performs as if it represents the stimuli in a low-dimensional metric psychological space. In theories of 3D shape recognition, the role of feature-space representations (as opposed to structural or pictorial descriptions) has been for a long time a major point of contention. If shapes are indeed represented as points in a feature space, patterns of perceived similarity among different objects must reflect the structure of this space. The feature space hypothesis can then be tested by presenting subjects with complex parameterized 3D shapes, and by relating the similarities among subjective representations, as revealed in the response data by multidimensional scaling, to the objective parameterization of the stimuli. The results of four such tests, accompanied by computational simulations, support the notion that discrimination among 3D objects may rely on a low-dimensional feature space representation, and suggest that this space may be spanned by explicitly encoded class prototypes.
[get this paper]
Shimon Edelman
Psycoloquy 5:50, Sept. 25, 1994
The computational building blocks of biological information processing systems are highly interconnected networks of simple units with graded overlapping receptive fields, arranged in maps. In view of this basic constraint, it is proposed that the present stage in the study of cognition should concentrate on gaining understanding of the cognitive system at the level of the distributed computational mechanism. The model of script understanding introduced in the target book ["Subsymbolic Natural Language Processing", Risto Miikkulainen, Cambridge, MA: MIT Press, 1993] appears promising, both because it treats seriously the question of architecture of the language processor, and because its architectural features resemble those used in modeling other cognitive modalities such as vision.
[get this paper]
Maria Lando and Shimon Edelman
Weizmann Institute CS-TR 95-02, 1995 (also Proc. IWAFGR'95, June 1995, Zurich, and Network, 6:551-576, 1995).
We describe a computational model of face recognition, which generalizes from single views of faces, by taking advantage of prior experience with other faces, seen under a wider range of viewing conditions. The model represents face images by vectors of activities of graded overlapping receptive fields (RFs). It relies on high spatial frequency information to estimate the viewing conditions, which are then used to normalize (via a transformation specific for faces), and identify, the low spatial frequency representation of the input. The class-specific transformation approach allows the model to replicate a series of psychophysical findings on face recognition, and constitutes an advance over current face recognition methods, which are incapable of generalization from a single example.
[get this paper]
Florin Cutzu and Shimon Edelman
Weizmann Institute CS-TR 95-01, 1995.
Using a small number of prototypical reference objects to span the internal shape representation space has been suggested as a general approach to the problem of object representation in vision. We have investigated the ability of human subjects to form the low-dimensional metric shape representation space predicted by this approach. In each of a series of experiments, which involved pairwise similarity judgment, and delayed match to sample, subjects were confronted with several classes of computer-rendered 3D animal-like shapes, arranged in a complex pattern in a common high-dimensional parameter space. We combined response time and error rate data into a measure of view similarity, and submitted the resulting proximity matrix to nonmetric multidimensional scaling (MDS). In the two-dimensional MDS solution, views of the same shape were invariably clustered together, and, in each experiment, the relative geometrical arrangement of the view clusters of the different objects reflected the true low-dimensional structure in parameter space (star, triangle, square, line) that defined the relationships between the stimulus classes. These findings are now being used to guide the development of a detailed computational theory of shape vision based on similarity to prototypes.
[get this paper; a shorter version is here]
Sharon Duvdevani-Bar and Shimon Edelman
Weizmann Institute CS-TR 95-11, 1995.
A representational scheme under which the ranking between represented dissimilarities is isomorphic to the ranking between the corresponding shape dissimilarities can support perfect shape classification, because it preserves the clustering of shapes according to the natural kinds prevailing in the external world. We discuss the computational requirements of rank-preserving representation, and examine its plausibility within a prototype-based framework of shape vision.
[get this paper]
(a commentary on D. Amit, "The Hebbian paradigm reintegrated")
Shimon Edelman
Behavioral and Brain Sciences, 18:630-631, December 1995.
A theory of representation is incomplete if it states "representations are X" where X can be symbols, cell assemblies, functional states, or the flock of birds from Theaetetus, without explaining the nature of the link between the universe of X's and the world. Amit's thesis, equating representations with reverberations in Hebbian cell assemblies, will only be considered a solution to the problem of representation when it is complemented by a theory of how a reverberation in the brain can be a representation of anything.
[get this paper]
Nathan Intrator and Shimon Edelman
Connection Science, 1996.
We consider training classifiers for multiple tasks as a method for improving generalization and obtaining a better low-dimensional representation. To that end, we introduce a hybrid training methodology for MLP networks; the utility of the hidden-unit representation is assessed by embedding it into a 2D space using multidimensional scaling. The proposed methodology is tested on a highly nonlinear image classification task.
[get this paper]
Shimon Edelman
Weizmann Institute CS-TR 95-29, 1995.
Many of the lower-level areas in the mammalian visual system are organized retinotopically, that is, as maps which preserve to a certain degree the topography of the retina. A unit that is a part of such a retinotopic map normally responds selectively to stimulation in a well-delimited part of the visual field, referred to as its receptive field (RF). Receptive fields are probably the most prominent and ubiquitous computational mechanism employed by biological information processing systems. This paper surveys some of the possible computational reasons behind the ubiquity of RFs, by discussing examples of RF-based solutions to problems in vision, from spatial acuity, through sensory coding, to object recognition.
[get this paper]
Shimon Edelman and Sharon Duvdevani-Bar
Neural Computation, 1997
A representational scheme under which the ranking between represented similarities is isomorphic to the ranking between the corresponding shape similarities can support perfectly correct shape classification, because it preserves the clustering of shapes according to the natural kinds prevailing in the external world. This note discusses the computational requirements of representation that preserves similarity ranks, and points out the straightforwardness of its connectionist implementation.
[get this paper]
Heinrich H. Bülthoff and Shimon Edelman
PNAS 89:60-64, 1992
Does the human brain represent objects for recognition by storing a series of two-dimensional snapshots, or are the object models, in some sense, three-dimensional analogs of the objects they represent? One way to address this question is to explore the ability of the human visual system to generalize recognition from familiar to novel views of three-dimensional objects. Three recently proposed theories of object recognition --- viewpoint normalization or alignment of 3D models (Ullman, 1989), linear combination of 2D views (Ullman and Basri, 1991) and nonlinear view interpolation (Poggio and Edelman, 1990) --- predict different patterns of generalization to novel views. We have exploited the conflicting predictions to test the three theories directly, in a psychophysical experiment involving computer-generated wire-like objects. Our results suggest that the human visual system is better described as recognizing these objects by nonlinear 2D view interpolation than by alignment or other methods that rely on object-centered 3D models.
[get this paper]
Shimon Edelman, Florin Cutzu, and Sharon Duvdevani-Bar
Proc. COGSCI'96
We present a unified approach to visual representation, addressing both the needs of superordinate and basic-level categorization and of identification of specific instances of familiar categories. According to the proposed theory, a shape is represented by its similarity to a number of reference shapes, measured in a high-dimensional space of elementary features. This amounts to embedding the stimulus in a low-dimensional proximal shape space. That space turns out to support representation of distal shape similarities which is veridical in the sense of Shepard's (1968) notion of second-order isomorphism (i.e., correspondence between distal and proximal similarities among shapes, rather than between distal shapes and their proximal representations). Furthermore, a general expression for similarity between two stimuli, based on comparisons to reference shapes, can be used to derive models of perceived similarity ranging from continuous, symmetric, and hierarchical, as in the multidimensional scaling models (R. N. Shepard, 1980), to discrete and non-hierarchical, as in the general contrast models (A. Tversky, 1977; R. N. Shepard and P. Arabie, 1979).
[get this paper]
Yael Karov and Shimon Edelman
Weizmann CS-TR 96-06, 1996 (also in Computational Linguistics, 24:41-59, 1998)
We describe a method for automatic word sense disambiguation using a text corpus and a machine-readable dictionary (MRD). The method is based on word similarity and context similarity measures. Words are considered similar if they appear in similar contexts; contexts are similar if they contain similar words. The circularity of this definition is resolved by an iterative, converging process, in which the system learns from the corpus a set of typical usages for each of the senses of the polysemous word listed in the MRD. A new instance of a polysemous word is assigned the sense associated with the typical usage most similar to its context. Experiments show that this method can learn even from very sparse training data, achieving over 92% correct disambiguation performance.
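A toy sketch of how the circular definition can be resolved by iteration, under simplifying assumptions: word similarity is recomputed from context similarity and vice versa until the estimates settle. The corpus, matrices, and update rule below are illustrative and do not reproduce the paper's algorithm.
```python
import numpy as np

# Toy corpus: each "context" is a bag of word indices.
contexts = [[0, 1, 2], [0, 1, 3], [2, 4, 5], [3, 4, 5]]
n_words = 6

# occurrence matrix: O[c, w] = 1 if word w appears in context c
O = np.zeros((len(contexts), n_words))
for c, ws in enumerate(contexts):
    O[c, ws] = 1.0

W = np.eye(n_words)            # word-word similarity, initialized to identity
for _ in range(10):            # iterate the circular definition toward a fixed point
    # contexts are similar if they contain similar words
    C = O @ W @ O.T
    C /= np.maximum(np.sqrt(np.outer(np.diag(C), np.diag(C))), 1e-12)
    # words are similar if they appear in similar contexts
    W = O.T @ C @ O
    W /= np.maximum(np.sqrt(np.outer(np.diag(W), np.diag(W))), 1e-12)

print(np.round(W, 2))          # words 0 and 1 (which share contexts) come out highly similar
```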
[get this paper]
Shimon Edelman
Weizmann CS-TR 96-08, 1996; to appear in Behavioral and Brain Sciences
Intelligent systems are faced with the problem of securing a principled (ideally, veridical) relationship between the world and its internal representation. I propose a unified approach to visual representation, addressing both the needs of superordinate and basic-level categorization and of identification of specific instances of familiar categories. According to the proposed theory, a shape is represented by its similarity to a number of reference shapes, measured in a high-dimensional space of elementary features. This amounts to embedding the stimulus in a low-dimensional proximal shape space. That space turns out to support representation of distal shape similarities which is veridical in the sense of Shepard's (1968) notion of second-order isomorphism (i.e., correspondence between distal and proximal similarities among shapes, rather than between distal shapes and their proximal representations). Furthermore, a general expression for similarity between two stimuli, based on comparisons to reference shapes, can be used to derive models of perceived similarity ranging from continuous, symmetric, and hierarchical, as in the multidimensional scaling models (Shepard, 1980), to discrete and non-hierarchical, as in the general contrast models (Tversky, 1977; Shepard and Arabie, 1979).
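A minimal sketch of the central idea, under stated assumptions: each stimulus is re-represented by its similarities to a few reference shapes, yielding a low-dimensional proximal space whose inter-point dissimilarities can then be compared, by rank correlation, to the distal ones. The feature space, similarity measure, and parameter values below are illustrative.
```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(3)

def chorus_embed(stimuli, references, sigma=3.0):
    """Represent each stimulus by its Gaussian similarity to each reference shape."""
    d2 = ((stimuli[:, None, :] - references[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))      # shape: (n_stimuli, n_references)

distal = rng.normal(size=(30, 8))              # 30 shapes in an 8-D distal parameter space
references = distal[rng.choice(30, size=5, replace=False)]   # a few reference shapes
proximal = chorus_embed(distal, references)                   # 5-D proximal representation

# second-order isomorphism: proximal dissimilarities should mirror distal dissimilarities
rho, _ = spearmanr(pdist(distal), pdist(proximal))
print(f"distal-vs-proximal dissimilarity rank correlation: {rho:.2f}")
```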
[get this paper]
Florin Cutzu and Shimon Edelman
Vision Research, 1997.
We report results from perceptual judgment, delayed matching to sample, and long-term memory recall experiments, which indicate that the human visual system can support metrically veridical representations of similarities among 3D objects. In all the experiments, animal-like computer-rendered stimuli formed regular planar configurations in a common 70-dimensional parameter space. These configurations were fully recovered by multidimensional scaling from proximity tables derived from the subject data. This is possible if shapes are encoded by their similarities to a number of reference (prototypical) shapes (as in the computational model that accompanies the psychophysical data), but not if the system stores merely the distinctive features of the objects, or their structural descriptions (which were the same for all the stimuli).
[get this paper]
Shimon Edelman, Heinrich H. Bülthoff, and Isabelle Bülthoff
Spatial Vision, 12:107-123, 1999
To explore the nature of the representation space of 3D objects, we studied human performance in forced-choice classification of objects composed of four geon-like parts, emanating from a common center. The two class prototypes were distinguished by qualitative contrasts (bulging vs. waist-like limbs). Subjects were trained to discriminate between the two prototypes (shown briefly, from a number of viewpoints, in stereo) in a 1-interval forced-choice task, until they reached a 90% correct-response performance level. In the first experiment, 11 subjects were tested on shapes obtained by varying the prototypical parameters both orthogonally (Ortho) and in parallel (Para) to the line connecting the prototypes in the parameter space. For the eight subjects who performed above chance, the error rate increased with the Ortho parameter-space displacement between the stimulus and the corresponding prototype (the effect of the Para displacement was marginal). Clearly, the parameter-space location of the stimuli mattered more than the qualitative contrasts (which were always present). To find out whether both prototypes or just the nearest neighbor of the test shape influenced the decision, in the second experiment we tested 18 new subjects on a fixed set of shapes, while the test-stage distance between the two classes assumed one of three values (Far, Intermediate, and Near). For the 13 subjects who performed above chance, the error rate (on physically identical stimuli) in the Near condition was higher than in the other two conditions. The results of the two experiments contradict the prediction of theories that postulate exclusive reliance on qualitative contrasts, and support the notion of a metric representation space, with the subjects' performance determined by distances to more than one reference point or prototype.
[get this paper]
Nathan Intrator and Shimon Edelman
Network, 1997.
Learning to recognize visual objects from examples requires the ability to find meaningful patterns in spaces of very high dimensionality. We present a method for dimensionality reduction which effectively biases the learning system by combining multiple constraints via an extensive use of class labels. The use of multiple class labels steers the resulting low-dimensional representation to become invariant to those directions of variation in the input space that are irrelevant to classification; this is done merely by making class labels independent of these directions. We also show that prior knowledge of the proper dimensionality of the target representation can be imposed by training a multiple-layer bottleneck network. A series of computational experiments involving parameterized fractal images and real human faces indicate that the low-dimensional representation extracted by our method leads to improved generalization in the learned tasks, and is likely to preserve the topology of the original space.
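A minimal PyTorch sketch of the general scheme (not the networks used in the paper): a shared low-dimensional bottleneck is trained jointly on several classification tasks defined over the same inputs, so that input directions irrelevant to all the label sets tend to be squeezed out of the representation. Layer sizes, labels, and training details are illustrative assumptions.
```python
import torch
import torch.nn as nn

class MultiTaskBottleneck(nn.Module):
    """Shared low-dimensional bottleneck feeding several classification heads."""
    def __init__(self, n_in, n_bottleneck, n_classes_per_task):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_in, 64), nn.Tanh(),
            nn.Linear(64, n_bottleneck), nn.Tanh(),   # the low-dimensional representation
        )
        self.heads = nn.ModuleList(
            [nn.Linear(n_bottleneck, k) for k in n_classes_per_task]
        )

    def forward(self, x):
        z = self.encoder(x)
        return z, [head(z) for head in self.heads]

# toy usage: 100 inputs (flattened images), two independent label sets over the same inputs
torch.manual_seed(0)
x = torch.randn(100, 256)
labels = [torch.randint(0, 4, (100,)), torch.randint(0, 3, (100,))]

model = MultiTaskBottleneck(256, 2, [4, 3])
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

for _ in range(50):
    opt.zero_grad()
    z, outs = model(x)
    loss = sum(loss_fn(o, y) for o, y in zip(outs, labels))  # all tasks share the bottleneck
    loss.backward()
    opt.step()

z, _ = model(x)      # z is the learned 2-D representation of the inputs
print(z.shape)
```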
[get this paper]
Shimon Edelman and Nathan Intrator
in Mechanisms of Perceptual Learning, D. Medin, R. Goldstone, and P. Schyns, eds., 1997
Psychophysical findings accumulated over the past several decades indicate that perceptual tasks such as similarity judgment tend to be performed on a low-dimensional representation of the sensory data. Low dimensionality is especially important for learning, as the number of examples required for attaining a given level of performance grows exponentially with the dimensionality of the underlying representation space. In this chapter, we argue that, whereas many perceptual problems are tractable precisely because their intrinsic dimensionality is low, the raw dimensionality of the sensory data is normally high, and must be reduced by a nontrivial computational process, which, in itself, may involve learning. Following a survey of computational techniques for dimensionality reduction, we show that it is possible to learn a low-dimensional representation that captures the intrinsic low-dimensional nature of certain classes of visual objects, thereby facilitating further learning of tasks involving those objects.
[get this paper]
Shimon Edelman, Nathan Intrator, and Tomaso Poggio
manuscript
Nearest-neighbor correlation-based similarity computation in the space of outputs of complex-type receptive fields can support robust recognition of 3D objects. Our experiments with four collections of objects resulted in mean recognition rates between 84% (for subordinate-level discrimination among 15 quadruped animal shapes) and 94% (for basic-level recognition of 20 everyday objects), over a 40 deg x 40 deg range of viewpoints, centered on a stored canonical view and related to it by rotations in depth (comparable figures were obtained for image-plane translations). This result has interesting implications for the design of a front end to an artificial object recognition system, and for the understanding of the faculty of object recognition in primate vision.
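A minimal sketch of the decision rule described above, under simplifying assumptions: each image is summarized by a feature vector, standing in for the outputs of complex-type receptive fields, and a test view is assigned the label of the stored canonical view with which it correlates most strongly. The feature extraction front end is omitted.
```python
import numpy as np

def correlation(a, b):
    """Pearson correlation between two feature vectors."""
    a = (a - a.mean()) / (a.std() + 1e-12)
    b = (b - b.mean()) / (b.std() + 1e-12)
    return float(np.mean(a * b))

def recognize(test_features, stored_features, labels):
    """Nearest-neighbor classification by maximal correlation to a stored canonical view."""
    scores = [correlation(test_features, f) for f in stored_features]
    return labels[int(np.argmax(scores))]

# toy usage: pretend each row is the vector of complex-type receptive field outputs
rng = np.random.default_rng(4)
stored = rng.normal(size=(20, 500))              # one canonical view per object
labels = [f"object_{i}" for i in range(20)]
test = stored[7] + 0.5 * rng.normal(size=500)    # a changed view, crudely modeled as noise
print(recognize(test, stored, labels))           # -> "object_7" for moderate perturbations
```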
[get this paper]
Marcus Dill and Shimon Edelman
Perception, 2001, in press.
The positional specificity of short-term visual memory for a variety of 3D shapes was investigated in a series of same-different discrimination experiments, using computer-rendered stimuli displayed either at the same or at different locations in the visual field. For animal-like shapes, we found complete translation invariance, regardless of the inter-stimulus similarity, and irrespective of direction and size of the displacement (experiments 1 and 2). Invariance to translation was obtained also with animal-like stimuli that had been "scrambled" by randomizing the relative locations of their parts (experiment 3). The invariance broke down when the stimuli were made to differ in their composition, but not in the shapes of the corresponding parts (experiments 4 and 5). We interpret this pattern of findings in the context of several current theories of recognition, focusing in particular on the issue of the representation of object structure.
[get this paper]
Shimon Edelman
Trends in Cognitive Sciences, 1997.
Visual categorization, or making sense of novel shapes and shape classes, is a computationally challenging and behaviorally important task, which is not widely addressed in computer vision or visual psychophysics (where the stress is rather on the generalization of recognition across changes of viewpoint). This paper examines the categorization abilities of four current approaches to object representation: structural descriptions, geometric models, multidimensional feature spaces, and similarities to reference shapes. It is proposed that a scheme combining features of all four approaches is a promising candidate for a comprehensive and computationally feasible theory of categorization.
[get this paper]
Sharon Duvdevani-Bar and Shimon Edelman
Intl. J. of Computer Vision, 33:201-228, 1999.
One of the difficulties of object recognition stems from the need to overcome the variability in object appearance caused by factors such as illumination and pose. The influence of these factors can be countered by learning to interpolate between stored views of the target object, taken under representative combinations of viewing conditions. Difficulties of another kind arise in daily life situations that require categorization, rather than recognition, of objects. We show that, although categorization cannot rely on interpolation between stored examples, knowledge of several representative members, or prototypes, of each of the categories of interest can still provide the necessary computational substrate for the categorization of new instances. The resulting representational scheme based on similarities to prototypes is computationally viable, and is readily mapped onto the mechanisms of biological vision revealed by recent psychophysical and physiological studies.
[get this paper]
Shimon Edelman and Sharon Duvdevani-Bar
Proc. Edinburgh Workshop on Similarity and Categorization, 75-81, November 1997.
Visual objects can be represented by their similarities to a small number of reference shapes or prototypes. This method yields low-dimensional (and therefore computationally tractable) representations, which support both the recognition of familiar shapes and the categorization of novel ones. In this note, we show how such representations can be used in a variety of tasks involving novel objects: viewpoint-invariant recognition, recovery of a canonical view, estimation of pose, and prediction of an arbitrary view. The unifying principle in all these cases is the representation of the view space of the novel object as an interpolation of the view spaces of the reference shapes.
[get this paper]
Shimon Edelman and Fiona Newell
COGS CSRP 500, University of Sussex, November 1998
Theories of object representation can be classified as structural, holistic or hybrid, depending on their approach to the mereology and compositionality of shapes. We tested the predictions of some of the current theories in three experiments, by quantifying the effects of various priming cues on response times to 3D objects. In experiment 1, there were two possible locations for the stimulus components: left-right and top-bottom. The prime could be identical to the stimulus, identical in location but with different parts, identical in the complement of differently located parts, or altogether different. Both location and part identity effects were significant. In experiment 2 we added a part-neutral (empty frame) prime condition; the effect of location, but not of part, remained significant. In experiment 3, which included an additional location-neutral prime condition, only the location effect, again, was significant. These findings are not entirely compatible either with the structural description theories of representation (which predict priming by "disembodied" parts or geons) or with the holistic theories (which do not predict priming by "shapeless" location on its own). They may be interpreted in terms of a hybrid theory, according to which conjunctions of shape and location are explicitly represented, and therefore amenable to priming.
[get this paper]
Shimon Edelman and Nathan Intrator
Spatial Vision, 13:255-264, 2000.
The ability to deal with object structure --- to determine what is where in a given object, rather than merely to categorize or identify it --- has been hitherto considered the prerogative of "structural description" approaches, which represent shapes as categorical compositions of generic parts taken from a small alphabet. In this note, we propose a simple extension to a theoretically motivated and extensively tested appearance-based model of recognition and categorization, which should make it capable of representing object structure. We describe a pilot implementation of the extended model, survey independent evidence supporting its modus operandi, and outline a research program focused on achieving a range of object processing capabilities, including reasoning about structure, within a unified appearance-based framework.
[get this paper]
Shimon Edelman, K. Grill-Spector, T. Kushnir, and R. Malach
Psychobiology, 26:309-321, 1998
Reports of columnar organization of macaque inferotemporal cortex (Tanaka 1992, Tanaka 1993) indicate that ensembles of cells responding to particular objects may be both sufficiently extensive and properly localized to allow their detection and discrimination by means of functional magnetic resonance imaging (fMRI). A recently developed theory of object representation by ensembles of coarsely tuned units (Edelman and Duvdevani-Bar, 1997; Edelman, 1998) and its implementation as a computer model of recognition and categorization (Cutzu and Edelman, 1998; Edelman and Duvdevani-Bar, 1997) provide a computational framework in which such findings can be interpreted in a straightforward fashion. Taken together, these developments in the study of object representation and recognition suggest that direct visualization of the internal representations may be easier than previously thought. In this paper, we show how fMRI techniques can be used to investigate the internal representation of objects in human visual cortex. Our initial results reveal that the activation of most voxels in object-related areas remains unaffected by a coarse scrambling of the natural images used as stimuli, and that a map of the representation space of object categories in individual subjects can be derived from the distributed pattern of voxel activation in those areas.
[get this paper]
Shimon Edelman and Nathan Intrator
Proc. NIPS*2000, 10-16, MIT Press, 2001
We describe a unified framework for the understanding of structure representation in primate vision. A model derived from this framework is shown to be effectively systematic in that it has the ability to interpret and associate together objects that are related through a rearrangement of common "middle-scale" parts, represented as image fragments. The model addresses the same concerns as previous work on compositional representation through the use of what+where receptive fields and attentional gain modulation. It does not require prior exposure to the individual parts, and avoids the need for abstract symbolic binding.
[get this paper]
Shimon Edelman
Journal of Biological Systems, 6, 265-280, 1998
The paper outlines a computational approach to face representation and recognition, inspired by two major features of biological perceptual systems: graded-profile overlapping receptive fields, and object-specific responses in the higher visual areas. This approach, according to which a face is ultimately represented by its similarities to a number of reference faces, led to the development of a comprehensive theory of object representation in biological vision, and to its subsequent psychophysical exploration and computational modeling.
[get this paper]
Shimon Edelman, Benjamin P. Hiles, Hwajin Yang and Nathan Intrator
Proc. NIPS*2001, MIT Press, 2002 (in press)
To find out how the representations of structured visual objects depend on the co-occurrence statistics of their constituents, we exposed subjects to a set of composite images with tight control exerted over (1) the conditional probabilities of the constituent fragments, and (2) the value of Barlow's criterion of "suspicious coincidence" (the ratio of joint probability to the product of marginals). We then compared the part verification response times for various probe/target combinations before and after the exposure. For composite probes, the speedup was much larger for targets that contained pairs of fragments perfectly predictive of each other, compared to those that did not. This effect was modulated by the significance of their co-occurrence as estimated by Barlow's criterion. For lone-fragment probes, the speedup in all conditions was generally lower than for composites. These results shed light on the brain's strategies for unsupervised acquisition of structural information in vision.
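A small sketch of the two quantities referred to above, computed from per-image presence indicators for a pair of fragments: the conditional probabilities and Barlow's suspicious-coincidence ratio, i.e., the joint probability divided by the product of the marginals. The data below are illustrative.
```python
import numpy as np

def cooccurrence_stats(has_a, has_b):
    """Conditional probabilities and Barlow's suspicious-coincidence ratio for two fragments."""
    has_a, has_b = np.asarray(has_a, float), np.asarray(has_b, float)
    p_a, p_b = has_a.mean(), has_b.mean()
    p_ab = (has_a * has_b).mean()
    return {
        "P(B|A)": p_ab / p_a,
        "P(A|B)": p_ab / p_b,
        "suspicious_coincidence": p_ab / (p_a * p_b),  # > 1: A and B co-occur more than chance
    }

# toy usage: per-image indicators of whether fragments A and B are present
has_a = [1, 1, 1, 1, 0, 0, 0, 0]
has_b = [1, 1, 1, 0, 0, 0, 0, 0]   # B almost always accompanies A
print(cooccurrence_stats(has_a, has_b))
```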
[get this paper]
Shimon Edelman
Trends in Cognitive Sciences 6:125-131, 2002
Understanding the perception of all but the most impoverished and artificial scenes presents a different (and likely far greater) kind of challenge than understanding face recognition, reading, or identification (or even categorization) of standalone objects. This article surveys central issues in the interpretation of structured objects and scenes (starting with basics, such as the meaning of seeing), and outlines a theoretical approach to this formidable task, motivated by some recent developments in neuroscience and neurophilosophy.
[get this paper]
Shimon Edelman and Nathan Intrator
Cognitive Science 27:73-110 (2003)
The problem of representing the spatial structure of images, which arises in visual object processing, is commonly described using terminology borrowed from propositional theories of cognition, notably, the concept of compositionality. The classical propositional stance mandates representations composed of symbols, which stand for atomic or composite entities and enter into arbitrarily nested relationships. We argue that the main desiderata of a representational system --- productivity and systematicity --- can (indeed, for a number of reasons, should) be achieved without recourse to the classical, proposition-like compositionality. We show how this can be done, by describing a systematic and productive model of the representation of visual structure, which relies on static rather than dynamic binding and uses coarsely coded rather than atomic shape primitives.
[get this paper]
Zach Solan, Eytan Ruppin, David Horn and Shimon Edelman
Proc. NIPS*2002, MIT Press, 2003
The principle of complementary distributions (Harris, 1954; Harris, 1991), according to which morphemes that occur in identical contexts belong, in some sense, to the same category, has been advanced as a means for extracting syntactic structures from corpus data. We extend this principle by applying it recursively, and by using mutual information for estimating category coherence. The resulting model learns, in an unsupervised fashion, highly structured, distributed representations of syntactic knowledge from corpora. It also exhibits promising behavior in tasks usually thought to require representations anchored in a grammar, such as systematicity.
[get this paper]
Shimon Edelman, Nathan Intrator and Judah S. Jacobson
Lecture Notes in Computer Science, vol. 2025, H. H. Bülthoff, T. Poggio, S. W. Lee and C. Wallraven, eds., 629-643, Springer, 2002
To learn a visual code in an unsupervised manner, one may attempt to capture those features of the stimulus set that would contribute significantly to a statistically efficient representation (as dictated, e.g., by the Minimum Description Length principle). Paradoxically, all the candidate features in this approach need to be known before statistics over them can be computed. This paradox may be circumvented by confining the repertoire of candidate features to actual scene fragments, which resemble the "what+where" receptive fields found in the ventral visual stream in primates. We describe a single-layer network that learns such fragments from unsegmented raw images of structured objects. The learning method combines fast imprinting in the feedforward stream with lateral interactions to achieve single-epoch unsupervised acquisition of spatially localized features that can support systematic treatment of structured objects.
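A heavily simplified sketch of a single-pass imprinting scheme of the kind described, under stated assumptions: an image is scanned with a window, each fragment either recruits a new unit or strengthens the best-matching existing one, and a winner-take-all step stands in for the lateral interactions. Window size, threshold, and learning rate are illustrative, not the model's actual parameters.
```python
import numpy as np

def learn_fragments(image, win=8, step=8, match_threshold=0.9):
    """Single-pass acquisition of localized fragment detectors by imprinting."""
    units = []                                      # each unit stores a normalized fragment
    h, w = image.shape
    for r in range(0, h - win + 1, step):
        for c in range(0, w - win + 1, step):
            patch = image[r:r + win, c:c + win].ravel()
            patch = patch / (np.linalg.norm(patch) + 1e-12)
            if units:
                sims = [float(u @ patch) for u in units]
                best = int(np.argmax(sims))         # winner-take-all lateral competition
                if sims[best] > match_threshold:
                    # strengthen the winning unit toward the current fragment
                    units[best] = units[best] + 0.1 * (patch - units[best])
                    units[best] /= np.linalg.norm(units[best])
                    continue
            units.append(patch)                     # imprint a new unit on a novel fragment
    return units

rng = np.random.default_rng(5)
image = rng.random((64, 64))                        # stand-in for an unsegmented object image
print(len(learn_fragments(image)), "fragment detectors acquired")
```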
[get this paper]
Zach Solan, David Horn, Eytan Ruppin and Shimon Edelman
Proc. NIPS*2003, MIT Press, 2004 (in press)
We describe a pattern acquisition algorithm that learns, in an unsupervised fashion, a streamlined representation of linguistic structures from a plain natural-language corpus. This paper addresses the issues of learning structured knowledge from a large-scale natural language data set, and of generalization to unseen text. The implemented algorithm represents sentences as paths on a graph whose vertices are words (or parts of words). Significant patterns, determined by recursive context-sensitive statistical inference, form new vertices. Linguistic constructions are represented by trees composed of significant patterns and their associated equivalence classes. An input module allows the algorithm to be subjected to a standard test of English as a Second Language (ESL) proficiency. The results are encouraging: the model attains a level of performance considered to be "intermediate" for 9th-grade students, despite having been trained on a corpus (CHILDES) containing transcribed speech of parents directed to small children.
[get this paper]
Shimon Edelman, Zach Solan, David Horn and Eytan Ruppin
invited for Syntax, Semantics and Statistics, a NIPS*2003 workshop
We compare our model of unsupervised learning of linguistic structures, ADIOS (Solan et al, NIPS'03), to some recent work in computational linguistics and in grammar theory. Our approach resembles the Construction Grammar in its general philosophy (e.g., in its reliance on structural generalizations rather than on syntax projected by the lexicon, as in the current generative theories), and the Tree Adjoining Grammar in its computational characteristics (e.g., in its apparent affinity with Mildly Context Sensitive Languages). The representations learned by our algorithm are truly emergent from the (unannotated) corpus data, whereas those found in published works on cognitive and construction grammars and on TAGs are hand-tailored. Thus, our results complement and extend both the computational and the more linguistically oriented research into language acquisition. We conclude by suggesting how empirical and formal study of language can be best integrated.
[get this paper]
Shimon Edelman
unpublished manuscript
Computer vision systems are, on most counts, poor performers, when compared to their biological counterparts. The reason for this may be that computer vision is handicapped by an unreasonable assumption regarding what it means to see, which became prevalent as the notions of intrinsic images and of representation by reconstruction took over the field in the late 1970's. Learning from biological vision may help us to overcome this handicap.
[get this paper]
Zach Solan, David Horn, Eytan Ruppin, and Shimon Edelman
Proc. 4th International Conference on Language Evolution, M. Tallerman, ed., Oxford University Press (to appear)
We examined the role of fitness, commonly assumed without proof to be conferred by the mastery of language, in shaping the dynamics of language evolution. To that end, we introduced island migration (a concept borrowed from population genetics) into the shared lexicon model of communication (Hurford, 1989; Nowak, 1999). The effect of fitness on language coherence was compared to a control condition of neutral drift. We found that in the neutral condition (no coherence-dependent fitness) even a small migration rate (less than 1%) suffices for one language to become dominant, albeit after a long time. In comparison, when fitness-based selection is introduced, the subpopulations stabilize quite rapidly to form several distinct languages. Our findings support the notion that language confers increased fitness. The possibility that a shared language evolved as a result of neutral drift appears less likely, unless migration rates over evolutionary times were extremely small.
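A heavily simplified sketch of the comparison described above, under stated assumptions: agents on several islands each hold one of a few languages, reproduction is either neutral or weighted by local coherence, and a small fraction of agents migrates between islands each generation. This toy simulation is not the shared lexicon model used in the paper.
```python
import numpy as np

def simulate(n_islands=4, n_agents=100, n_langs=4, migration=0.01,
             fitness_based=True, generations=200, seed=0):
    """Track how strongly one language comes to dominate across islands."""
    rng = np.random.default_rng(seed)
    # start with each island speaking mostly its own language
    pop = np.stack([np.full(n_agents, i % n_langs) for i in range(n_islands)])
    for _ in range(generations):
        for i in range(n_islands):
            langs, counts = np.unique(pop[i], return_counts=True)
            if fitness_based:
                weights = counts.astype(float)           # fitness grows with local coherence
            else:
                weights = np.ones_like(counts, float)    # neutral drift
            probs = (counts * weights) / (counts * weights).sum()
            pop[i] = rng.choice(langs, size=n_agents, p=probs)   # resample next generation
        # island migration: a small fraction of agents swaps islands each generation
        n_migrants = max(1, int(migration * n_agents))
        for i in range(n_islands):
            j = rng.integers(n_islands)
            idx_i = rng.choice(n_agents, n_migrants, replace=False)
            idx_j = rng.choice(n_agents, n_migrants, replace=False)
            pop[i][idx_i], pop[j][idx_j] = pop[j][idx_j].copy(), pop[i][idx_i].copy()
    return np.bincount(pop.ravel(), minlength=n_langs).max() / pop.size

print("coherence-based fitness:", simulate(fitness_based=True))
print("neutral drift:          ", simulate(fitness_based=False))
```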
[get this paper]
Zach Solan, David Horn, Eytan Ruppin, and Shimon Edelman
Proc. Natl. Acad. Sci. 102:11629-11634 (August 16, 2005)
We address the problem, fundamental to linguistics, bioinformatics and certain other disciplines, of using corpora of raw symbolic sequential data to infer underlying rules that govern their production. Given a corpus of strings (such as text, transcribed speech, chromosome or protein sequence data, sheet music, etc.), our unsupervised algorithm recursively distills from it hierarchically structured patterns. The ADIOS (Automatic DIstillation of Structure) algorithm relies on a statistical method for pattern extraction and on structured generalization, two processes that have been implicated in language acquisition. It has been evaluated on artificial context-free grammars with thousands of rules, on natural languages as diverse as English and Chinese, and on protein data correlating sequence with function. This is the first time an unsupervised algorithm is shown capable of learning complex syntax, generating grammatical novel sentences, and proving useful in other fields that call for structure discovery from raw data, such as bioinformatics.
[get this paper]
Shimon Edelman and Heidi Waterfall
Physics of Life Reviews (2008, in press)
One of the greatest challenges facing the cognitive sciences is to explain what it means to know a language, and how the knowledge of language is acquired. The dominant approach to this challenge within linguistics has been to seek an efficient characterization of the wealth of documented structural properties of language in terms of a compact generative grammar: ideally, the minimal necessary set of innate, universal, exception-less, highly abstract rules that jointly generate all and only the observed phenomena and are common to all human languages. We review developmental, behavioral, and computational evidence that seems to favor an alternative view of language, according to which linguistic structures are generated by a large, open set of constructions of varying degrees of abstraction and complexity, which embody both form and meaning and are acquired through socially situated experience in a given language community, by probabilistic learning algorithms that resemble those at work in other cognitive modalities.
Shimon Edelman
Journal of Experimental and Theoretical AI (JETAI), in press.
Reverse-engineering the brain involves adopting and testing a hierarchy of working hypotheses regarding the computational problems that it solves, the representations and algorithms that it employs, and the manner in which these are implemented. Because problem-level assumptions set the course for the entire research program, it is particularly important to be open to the possibility that we have them wrong, but tacit algorithm- and implementation-level hypotheses can also benefit from occasional scrutiny. The present paper focuses on the extent to which our computational understanding of how the brain works is shaped by three such rarely discussed assumptions, which span the levels of Marr's hierarchy: (i) that animal behavior amounts to a series of stimulus/response bouts, (ii) that learning can be adequately modeled as being driven by the optimization of a fixed objective function, and (iii) that massively parallel, uniformly connected layered or recurrent network architectures suffice to support learning and behavior. In comparison, a more realistic approach acknowledges that animal behavior in the wild is characterized by dynamically branching serial order and is often agentic rather than reactive. Arguably, such behavior calls for open-ended learning of world structure and may require a neural architecture that includes precisely wired circuits reflecting the serial and branching structure of behavioral tasks.
Shimon Edelman
Chomsky's Legacy (Christina Behme, ed.)
What does it mean to know language? Since the Chomskian revolution, one popular answer to this question has been: to possess a generative grammar that exclusively licenses certain syntactic structures. Decades later, not even an approximation to such a grammar, for any language, has been formulated; the idea that grammar is universal and innately specified has proved barren; and attempts to show how it could be learned from experience invariably come up short. To move on from this impasse, we must rediscover the extent to which language is like any other human behavior: dynamic, social, multimodal, patterned, and purposive, its purpose being to promote desirable actions (or thoughts) in others and self. Recent psychological, computational, neurobiological, and evolutionary insights into the shaping and structure of behavior may then point us toward a new, viable account of language.
Shimon Edelman
Language Sciences, in press.
Similar to other complex behaviors, language is dynamic, social, multimodal, patterned, and purposive, its purpose being to promote desirable actions or thoughts in others and self (Edelman, 2017). An analysis of the functional characteristics shared by complex sequential behaviors suggests that they all present a common overarching computational problem: dynamically controlled constrained navigation in concrete or abstract situation spaces. With this conceptual framework in mind, I compare and contrast computational models of language and evaluate their potential for explaining linguistic behavior and for elucidating the brain mechanisms that support it.