From Perceptrons to Parallel Minds: The PDP Revolution in Cognitive Science


The early history of machine learning begins in the 1950s, when Frank Rosenblatt and Marvin Minsky built the first artificial learning machines. In those early systems, scientists designed the connections between the units by hand. Minsky described one such machine as consisting of a retina — an array of binary input units arranged spatially — connected to a layer of predicate units, each computing a simple local function, and finally to one or more decision units. Only the connections to the decision units were adjustable during learning, which makes this a single-layer perceptron.

Minsky and Papert (1969) showed that single-layer perceptrons have fundamental limitations: they cannot compute certain functions, such as exclusive-or (XOR). Their critique was mathematically correct — but at the time, there were no effective methods for training networks with multiple layers. As later work would show, a multilayer perceptron, which includes one or more intermediate layers of processing units, can compute far more complex functions. “As we shall see in the course of this book, the limitations of the one-step perceptron in no way apply to the more complex networks” (1, p. 65).

“It can be shown that a multilayered perceptron system, including several layers of predicates between the retina and the decision stage, can compute functions such as parity, using reasonable numbers of units each computing a very local predicate… Essentially, then, although Minsky and Papert were exactly correct in their analysis of the one-layer perceptron, the theorems don’t apply to systems which are even a little more complex. In particular, it doesn’t apply to multilayer systems nor to systems that allow feedback loops” (1, p. 112).
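To make the contrast concrete, here is a minimal sketch in Python (not from the book) of a two-layer threshold network computing XOR. The weights and thresholds are chosen by hand purely for illustration:

```python
# A hand-wired two-layer threshold network computing XOR.
# Weights are illustrative choices; no learning is involved.

def step(x):
    """Binary threshold activation: fire (1) when net input exceeds 0."""
    return 1 if x > 0 else 0

def xor_net(x1, x2):
    # Hidden unit h1 computes OR(x1, x2); hidden unit h2 computes AND(x1, x2).
    h1 = step(1.0 * x1 + 1.0 * x2 - 0.5)   # OR
    h2 = step(1.0 * x1 + 1.0 * x2 - 1.5)   # AND
    # The output fires for "OR but not AND", i.e. exclusive-or.
    return step(1.0 * h1 - 2.0 * h2 - 0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_net(a, b))   # 0 0->0, 0 1->1, 1 0->1, 1 1->0
```

The intermediate (hidden) layer is what a single-layer perceptron lacks: no single weighted sum of x1 and x2 can separate the XOR cases.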

By the time the PDP group began its work in the 1980s, it had become possible to test whether these multilayer models could indeed overcome the limitations identified by Minsky and Papert. The central question was whether the interactions of a large number of simple processing units — each sending excitatory or inhibitory signals to the others — could themselves give rise to learning, without central programming or symbolic representations.

The PDP approach emphasizes that many traditional macro-level psychological constructs — such as schemata, prototypes, rules, and productions — emerge from the interactions among elements in a distributed network.

“The basic perspective of this book is that many of the constructs of macrolevel descriptions such as schemata, prototypes, rules, productions, etc. can be viewed as emerging out of interactions of the microstructure of distributed models” (1, p. 125).

In this sense, the PDP approach represents a connectionist computational framework within cognitive science. It offers an alternative to purely symbolic or serial models of cognition, and provides a bridge between psychological theories of learning and the biological mechanisms of the brain.

McClelland and colleagues argued that such an approach is not only conceptually elegant, but also biologically and computationally plausible:

“The biological hardware is just too sluggish for sequential models of the microstructure to provide a plausible account, at least of the microstructure of human thought. And the time limitation only gets worse, not better, when sequential mechanisms try to take large numbers of constraints into account. Each additional constraint requires more time in a sequential machine, and, if the constraints are imprecise, the constraints can lead to a computational explosion. Yet people get faster, not slower, when they are able to exploit additional constraints. Parallel distributed processing models offer alternatives to serial models of the microstructure of cognition. They do not deny that there is a macrostructure, just as the study of subatomic particles does not deny the existence of interactions between atoms” (1, p. 12).

The Structure of a Parallel Distributed Processing System

In many traditional models of cognition, knowledge is stored as a static copy of a particular pattern. Retrieving that knowledge simply means locating the stored pattern in long-term memory and bringing it into working memory. In such systems, there is little or no distinction between the representation held in storage and the one that becomes active during thought or perception.

In contrast, PDP models represent knowledge in a fundamentally different way. Rather than storing literal copies of patterns, the system encodes the relationships among its elements — that is, the connection strengths between units. The system preserves not the pattern itself but the configuration of weights that allows it to reconstruct the pattern when given the right inputs. In this view, remembering is not a matter of copying something back into consciousness but of re-creating it through the network’s internal dynamics.

Because knowledge is embedded in the connections themselves, learning in PDP systems means adjusting these connection weights so that the network produces appropriate patterns of activation under the appropriate circumstances. Learning, therefore, is a process of gradual tuning: through repeated exposure and feedback, the system modifies its internal connections until the desired patterns emerge naturally in response to given inputs.
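As an illustration of this kind of gradual tuning, the sketch below applies a simple error-correction (delta) rule to a single linear unit. The learning rate, input pattern, and target are arbitrary choices for the example, not values from the PDP volumes:

```python
import numpy as np

# Delta-rule sketch: nudge each weight in proportion to the error between
# the desired and actual activation, repeated over many exposures.

rng = np.random.default_rng(0)
n_in, lr = 4, 0.1
w = rng.normal(scale=0.1, size=n_in)      # initial connection weights

def activation(x, w):
    return w @ x                          # linear unit: weighted sum of inputs

x = np.array([1.0, 0.0, 1.0, 1.0])        # an input pattern
target = 1.0                              # the activation we want it to evoke

for _ in range(50):                       # repeated exposure with feedback
    error = target - activation(x, w)     # feedback signal
    w += lr * error * x                   # gradual tuning of the connections

print(round(float(activation(x, w)), 3))  # ~1.0 after training
```

Nothing here stores the pattern itself; only the weights change, and the “knowledge” is the configuration they settle into.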

Core Principles of the PDP Model

McClelland and Rumelhart outlined eight essential components that define the structure of a parallel distributed processing system (1, p. 46):

  1. A set of processing units that serve as the basic elements of the system.
  2. A state of activation associated with each unit.
  3. An output function determining how each unit’s activation influences others.
  4. A pattern of connectivity specifying how units are linked together.
  5. A propagation rule describing how activation spreads through the network.
  6. An activation rule defining how each unit updates its own activation based on incoming signals.
  7. A learning rule that governs how connection weights change through experience.
  8. An environment within which the system operates and learns.

These components together form the architecture of a PDP network, a system capable of dynamic, distributed computation.
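The skeleton below lays these eight components out in code. It is one possible arrangement under illustrative assumptions: the sigmoid activation rule, the Hebbian learning rule, and all names are choices made for this sketch, not the book’s own formulation:

```python
import numpy as np

class PDPNetwork:
    def __init__(self, n_units, seed=0):
        rng = np.random.default_rng(seed)
        # 1. Processing units: the n_units elements indexed by these arrays.
        self.activation = np.zeros(n_units)        # 2. state of activation
        # 4. Pattern of connectivity: one weight per ordered pair of units.
        self.weights = rng.normal(scale=0.1, size=(n_units, n_units))

    def output(self):                              # 3. output function
        return self.activation                     # here simply the identity

    def propagate(self):                           # 5. propagation rule
        return self.weights @ self.output()        # additive weighted sum

    def update(self, external_input):              # 6. activation rule
        net = self.propagate() + external_input
        self.activation = 1.0 / (1.0 + np.exp(-net))   # squash into (0, 1)

    def learn(self, lr=0.05):                      # 7. learning rule (Hebbian)
        self.weights += lr * np.outer(self.activation, self.activation)

# 8. Environment: the stream of external inputs the network is driven with.
net = PDPNetwork(n_units=8)
net.update(np.array([1, 0, 1, 0, 0, 1, 0, 1], dtype=float))
net.learn()
```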

Units, Representations, and Activation

The processing units in a PDP model can represent a wide range of things, depending on the level of analysis. In some models, a unit might correspond to a concrete concept such as a visual feature, a letter, or a word. In others, the units are more abstract — generic elements over which meaningful patterns can be distributed.

When we refer to a distributed representation, we mean that no single unit stands for an entire concept or object. Instead, the network represents each concept as a pattern of activation spread across many units, with each unit contributing to the representation of multiple concepts. There is no central executive or controlling module; rather, the system’s behavior emerges from the simultaneous interactions of numerous simple units, each performing its own small computation. A unit’s role is simply to receive signals from its neighbors, integrate them according to a defined rule, and update its own activation accordingly.
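As a toy illustration, the sketch below encodes two made-up concepts as overlapping activation patterns over the same eight units, so that several units participate in both representations:

```python
import numpy as np

# Two concepts as overlapping activation patterns over the same eight units.
# No single unit stands for either concept; the patterns are invented.

cat = np.array([1, 0, 1, 1, 0, 0, 1, 0])
dog = np.array([1, 1, 0, 1, 0, 0, 1, 1])

print(np.flatnonzero(cat & dog))   # units active in both patterns: [0 3 6]
```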

Units interact continuously, sending signals to one another. The strength of each outgoing signal depends on the sender’s level of activation, while the influence that signal has on the receiving unit depends on the connection weight between them. Typically, each unit contributes additively to the total input of its connected neighbors, meaning that the overall input to a given unit is the weighted sum of all the signals it receives. This input then determines how the unit’s activation changes over time, completing the cycle of parallel, distributed processing.
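In code, the net input to a single unit is just a dot product. The numbers below are arbitrary, chosen so the arithmetic is easy to check, and the tanh activation rule is one possible choice among many:

```python
import numpy as np

# Net input to one unit: the weighted sum of its neighbors' activations.

a = np.array([0.9, 0.2, 0.5])   # sender activations
w = np.array([0.4, -0.8, 0.6])  # connection weights into the receiving unit

net = w @ a                     # 0.4*0.9 - 0.8*0.2 + 0.6*0.5 = 0.50
print(round(float(net), 2), round(float(np.tanh(net)), 3))  # 0.5 0.462
```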

In PDP models, memory works very differently from conventional computers. Items are not stored as separate copies but as patterns of activation across the same network of units. A partial cue can trigger interactions among units, allowing the network to reconstruct the full pattern. Learning occurs by adjusting connection strengths so that new patterns can emerge, but no single unit “stores” an item — each connection participates in multiple patterns. Memory, in this view, is dynamic, distributed, and emergent, rather than fixed or localized.
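One classic way to make this concrete is a Hebbian auto-associator with Hopfield-style dynamics, used here as a stand-in for the book’s various specific models. The stored patterns and the partial cue are invented for the example:

```python
import numpy as np

# Two patterns stored in one weight matrix by Hebbian learning; every
# connection participates in both. A partial cue lets unit interactions
# reconstruct the full pattern.

patterns = np.array([[1, -1, 1, -1, 1, -1],
                     [1, 1, -1, -1, 1, 1]])

W = sum(np.outer(p, p) for p in patterns).astype(float)
np.fill_diagonal(W, 0)                        # no self-connections

state = np.array([1., -1., 1., 0., 0., 0.])   # cue: half the units unknown
for _ in range(5):                            # let the interactions settle
    state = np.where(W @ state >= 0, 1.0, -1.0)

print(state)   # recovers the first stored pattern: [ 1. -1.  1. -1.  1. -1.]
```

No single unit or connection holds the item; retrieval is the network settling into a stable state consistent with the cue.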

Bibliography

  1. Rumelhart, David E., James L. McClelland, and the PDP Research Group. Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 1–2, MIT Press, 1986.
  2. Boden, Margaret A. Artificial Intelligence and Natural Man. 2nd ed., MIT Press, 1987.
  3. Boden, Margaret A. Artificial Intelligence. 2nd ed., Academic Press, 1990.
  4. Hebb, Donald O. The Organization of Behavior: A Neuropsychological Theory. Wiley, 1949.
  5. McCulloch, Warren S., and Walter Pitts. “A Logical Calculus of the Ideas Immanent in Nervous Activity.” The Bulletin of Mathematical Biophysics, vol. 5, 1943, pp. 115–133. https://doi.org/10.1007/BF02478259
  6. Minsky, Marvin, and Seymour Papert. Perceptrons: An Introduction to Computational Geometry. MIT Press, 1969.