Shimon Edelman
Department of Psychology
232 Uris Hall, Cornell University
Ithaca, NY 14853-7601, USA
-
Nathan Intrator
School of Computer Science
Tel Aviv University
Ramat Aviv 69978, Israel
Hummel's reference to ``countless other ways'' in which our model of structure processing is deficient, along with the claim that it cannot account for ``any of the hallmarks of the human ability to perceive spatial relations,'' exemplifies the tactic favored by the defenders of the structural description (SD) approach across cognition: broaden the scope of the discussion to reduce the impact of a refutation of a particular instance of the theory. A similar situation prevails in linguistics, where no amount of evidence seems to suffice to send transformational generative grammar to the dumpster: the theory of grammar keeps mutating over the decades [Edelman and Christiansen, 2003], and there is always some ``performance-related'' excuse for sticking to rule-based explanations. This will not do: the proponents of SD should decide whether or not their theory is refutable in principle, given a well-defined set of observations, such as those offered in [Hummel, 2000], and a well-defined model, such as that of [Hummel and Biederman, 1992] or [Hummel, 2001].
Refuting the SD approach was not, however, our goal in the article commented upon by Hummel. Nor do we ignore the ability of SD models to deal with any entities standing in any relation to each other (as Hummel claims in the penultimate paragraph). On the contrary, it is precisely this property that makes SD models too powerful (hence biologically irrelevant, unless properly constrained), as we discuss at length in section 1.2. Consequently, we set out to develop a limited, yet conceptually and computationally plausible alternative to SD, to implement it using biologically uncontroversial mechanisms -- a category that includes what+where neurons, but excludes dynamic binding [Kirschfeld, 1995; Lennie, 1998] -- and to test the implementation on real images that have not been manually preprocessed (something that no SD model is capable of at present). In this admittedly circumscribed project we have succeeded.
We note that Hummel's equation of what+where with conjunctive coding betrays a basic misunderstanding of our approach. It is the combination of a what+where code at one level with a bottom-up/top-down ``computation cube'' scheme that enables our model to exhibit (limited) systematicity. This computational stance leads to predictions of analogous limitations in human performance, which we state in section 5.4. We also note that the crucial psychophysical experiments have yet to be performed: we are aware of no published studies of the representation of structure that use unfamiliar-looking or even just complex shapes and configurations while discouraging scrutiny and abstract, extra-visual reasoning on the part of the subject.
In summary, we believe (appropriating Mahatma Gandhi's quip concerning Western civilization) that SD would be a good idea -- (1) if the visual world were uniquely describable in terms of compositions of parts, (2) if human observers were systematic in their processing of novel arrangements of unfamiliar visual stimuli, and (3) if the detection of the parts and relations were computationally feasible. Unfortunately, the stance expressed by the first condition is philosophically naive [Akins, 1996; Smith, 2001], the second is empirically unfounded, and the third is increasingly perceived by the computer vision community as unrealistic [Barrow and Tenenbaum, 1993; Dickinson et al., 1997]. It looks as though limited systematicity may prove to be the only game in town.