In response to: “The Galilean Challenge” (Vol. 3, No. 1).

To the editors:

Language is an enabling miracle. It elevates human intelligence into a realm of transcendent self-actualization where no animal intelligence can hope to follow. The remarkable ideational outputs of writers, artists, musicians, scientists, and others reflect the stylized flowering of an internal language of thought that, in all cases, develops from the formative experience of language learning early in life. But diversified intellectual creativity, powered by an abstract language faculty, is not the only miracle of which humans are capable.

Late in the final round of the 2010 Masters golf tournament, Phil Mickelson was clinging to a one-shot lead. On the thirteenth hole, he pulled his tee shot badly and the ball came to rest on a large patch of pine straw behind two gigantic trees. The opening between the trees was narrow, the footing uncertain, and the nearest tree restricted the follow-through for any shot taken from this location. The green was 187 yards away with a water hazard, in the form of a creek, directly in front. The only sensible course of action from such a lie was to poke the ball back onto the fairway, suffer the extra shot, and move on. The situation confronting Mickelson was not something he could have anticipated. Golfers practice shots from awkward lies, but an infinitude of uncharted possibilities awaits any golfer who veers that far off the beaten path. Yet, in a more general sense, Mickelson’s entire athletic life served as prelude to that shot. Against the advice of his caddy and contrary to all conventional golf wisdom, he took the high-risk shot… and the ball ended up five feet from the flag. Mickelson went on to win the tournament.

What does this story have to do with language processing or the Galilean Challenge? Nothing specifically. But to the extent that our brains embody information-processing organs remarkably adept at utilizing the entirety of previous experience across a broad behavioral domain for the singular purpose of improving future performance in a potentially novel circumstance—well, maybe it has quite a lot to do with the Galilean Challenge. Indeed, we argue that something akin to the Galilean Challenge confronts most motile organisms within the sensorimotor domain, and, just as humans effortlessly solve the problem of language processing, animals solve the problem of sensorimotor control with an efficacy far exceeding that of artificially engineered systems. We return to this point later. For now, we address Noam Chomsky’s claim that there is a distinction between internal language and its sensorimotor externalization: “The internal faculty of language that allows linguistic computation is not dependent on the externalization of language mediated by our sensorimotor systems.”

The essay supports this contention by noting that certain universally observed features of spoken language, such as structure-dependence, do not comport with the principles of communicative efficiency, contrary to what one might expect. Others have noted that certain features of speech (its ability to free up the hands while communicating, the long distances it travels, etc.) provided evolutionary advantages.1 We take a different tack. If indeed speech is one possible sensorimotor instantiation of a more universal mental propensity, then a rationale based on sensorimotor constraints must exist as to why the spoken word has been preferred throughout human history to, say, the signed word.2 We provide such a rationale.

The Need for Speed

For any communication system, transmission rate is critical. If we ignore pauses, human speech occurs at 10–15 phonemes per second (at least one phoneme every 100 ms), which amounts to roughly three to four words per second, an extremely rapid rate relative to the timescales of movement control. Indeed, if the goal of the speech motor system, like that of other end-effector systems, were only to generate movement trajectories along with specific end-postures, then the slow response time of muscles would preclude such rapidity. Speech, however, is produced by the acoustic interaction of rapidly pulsed airflow as it passes from the larynx through a vocal tract whose configuration is dynamically changing as the result of speech articulator motion. The coupled system generates informative feature-based content in the acoustic signal, phonemes and sub-phonemes, on timescales shorter than gestural lengths. To a very gross approximation, vowels arise from the interaction of voiced airflow with specific vocal tract postures, while consonants embody specific gestural perturbations to the system as it navigates across vowels. It is, in part, the continuous conveyance of phonemic information across both movement targets and intermediary transitional states that enables the speech system to achieve its high throughput. Categorical perception of phonemes and top-down influences, such as familiarity with speech, also contribute to a rapid transmission rate.

A competing mode of sensorimotor communication would need to achieve a comparable transmission rate. The most obvious candidate would be a signing system. Any such system primarily involving movements of the arms, legs, or torso can be immediately discounted as too slow.3 Only a system utilizing relatively small excursions of low-inertia end-effectors could mediate a viable visuomotor alternative to speech, implying a system based on hand-finger movements. How would such a system compare in terms of speed?

First, we note that there exists a subset of sign languages called fingerspelling, in which each letter of the alphabet is accorded a unique hand/finger posture, or movement trajectory in some cases. The upper bound communication rate for such systems is approximately five letters per second, a value consistent with physiologic and biomechanical considerations.4 Fingerspelling thus realizes a communication rate roughly half that of speech. Whatever its speed, pure fingerspelling is hardly ever used,5 perhaps because it does not constitute a true language and instead merely maps the structures of a written language onto signing. How do the true sign languages work and what are their communication rates?

Sign languages such as American Sign Language (ASL) and British Sign Language (BSL) utilize extended two-handed movement trajectories to compose the fundamental language unit—a sign—which corresponds loosely to a word. These more elaborate gestures take longer to execute, occurring at roughly two signs per second or nearly half the rate at which words are generated in speech. Yet the rate at which a fixed series of propositions can be conveyed across the two modalities of communication is nearly equal.6 How is this possible? The answer is that the structure of sign language is naturally matched to the unique attributes of visuomotor space, so that fewer signs convey the same meaning. We argue that having a larger fundamental gestural unit—a sign is akin to a word, not a phoneme—makes sense in terms of reducing expression length through the incorporation of these linguistic devices.7

  1. Use of peripersonal space: Given the physical space in front of the signer, particular referents can be stored at certain locations and subsequently invoked. Peripersonal space is utilized like a personalized theater stage, obviating the need for certain signs to be repeated and acting as its own grammatical device.
  2. Modification of signed movements: Just as words can be inflected, the movements constituting signs can be modified in a wide variety of ways providing added linguistic information without the need for additional signs.
  3. Inclusion of facial expression: A parade of facial gestures and head movements accompany standard signing practice and provide a parallel channel for adding meaning.

Note that these devices require a word-level unit of expression to be effective since they act largely in parallel with the sign production process. The net result is a significant reduction in the number of signs needed per proposition, together with an uptick in the attentional burden placed on the viewer.8 In summary, when a one-dimensional acoustic space with a single production channel is swapped for a three-dimensional visual space with multiple production channels, the strategy of rapid serialized production of highly modularized speech commands is switched to a slower parallelized production of more holistic signed commands. Simply put, time is traded for space to equalize communication rate. From this perspective, the need for speed compelled the grammatical structure of sign language to self-organize into a less divisible form, whereby the fundamental communication unit—a sign—is tantamount to a word.9
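The trade of time for space can be made concrete with a little arithmetic. The sketch below takes the rates quoted above (about four spoken words per second at the upper end, about two signs per second) and asks how many signs per proposition would equalize the proposition rate. The function names and the figure of eight words per proposition are hypothetical, chosen only for illustration; they are not measurements from the cited studies.

```python
def propositions_per_second(units_per_second, units_per_proposition):
    """Proposition rate implied by a unit rate and a per-proposition unit cost."""
    return units_per_second / units_per_proposition


def units_per_proposition_for_parity(units_per_second, target_rate):
    """Units per proposition a channel may spend to match a target proposition rate."""
    return units_per_second / target_rate


WORDS_PER_SECOND = 4.0       # upper end of the "three to four words per second" quoted above
SIGNS_PER_SECOND = 2.0       # "roughly two signs per second"
WORDS_PER_PROPOSITION = 8.0  # hypothetical value, for illustration only

speech_rate = propositions_per_second(WORDS_PER_SECOND, WORDS_PER_PROPOSITION)
signs_needed = units_per_proposition_for_parity(SIGNS_PER_SECOND, speech_rate)

print(f"speech: {speech_rate} propositions per second")
print(f"signs per proposition needed for parity: {signs_needed}")
```

Under these assumed figures, signing matches speech only if each proposition takes about half as many signs as words, which is the kind of compression that the three devices listed above would have to supply.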

Modularity and Early Learning

If sign language occurs at a rate similar to speech, why is speech so obviously preferred? The answer requires consideration of perception as well as production. Some of the perceptual benefits of speech are obvious. Acoustic signals are perspective-invariant, whereas proper viewing of a sign requires a largely unobstructed view from a limited range of angular perspectives. This makes it difficult to communicate through sign language while engaged in a task that requires visual attention elsewhere, such as hunting or foraging. We argue that the most significant advantages of speech are found in a unique confluence of two interacting circumstances: 1) the perceptual and motor components of speech are highly modularized into low-level perceptual and motor units (phonemes), thereby reducing sensorimotor task complexity; and 2) the relevant perceptual and motor systems are sufficiently mature at birth to promote sub-task learning in speech prior to learning in other domains.

In stark contrast to the human visual system, we are born with our auditory systems nearly fully developed and capable of perceiving the superset of six hundred consonants and two hundred vowels that span the world’s languages.10 Most individual languages utilize a reduced alphabet of around forty phonemes, so that rather than having to learn to perceive/produce an extended acoustic signal or each element of a superset of eight hundred phonemes, one must learn to perceive/produce expertly only forty elements.11
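One way to see why an inventory of about forty units is a workable compromise is a simple counting argument, offered here as an idealized illustration rather than a model of any real lexicon. With k distinct units there are k^n possible sequences of length n, so telling N words apart requires sequences of at least ceil(log_k N) units. The function name and the vocabulary size of 50,000 are hypothetical, and the sketch ignores phonotactic constraints and redundancy.

```python
def min_word_length(inventory_size, vocabulary_size):
    """Smallest n such that inventory_size ** n >= vocabulary_size.

    Idealized assumption: every sequence of units counts as a usable word.
    Integer arithmetic is used to avoid floating-point error in logarithms.
    """
    if inventory_size < 2 or vocabulary_size < 1:
        raise ValueError("need at least two units and at least one word")
    length = 0
    capacity = 1  # number of distinct sequences of the current length
    while capacity < vocabulary_size:
        capacity *= inventory_size
        length += 1
    return length


VOCABULARY_SIZE = 50_000  # hypothetical vocabulary size, for illustration only

for inventory in (2, 40, 800):
    print(inventory, "units ->", min_word_length(inventory, VOCABULARY_SIZE), "units per word")
# 2 units -> 16 units per word
# 40 units -> 3 units per word
# 800 units -> 2 units per word
```

Under these assumptions, moving from two units to forty shortens the minimum word length from sixteen to three, whereas expanding to the full superset of eight hundred saves only one further unit while multiplying the number of categories to be learned twentyfold.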

The reduction of speech into a small number of fundamental building blocks elegantly reduces the level of sensorimotor skill required to master speech. In particular, significant variability unavoidably corrupts the speech signal at both the perceptual and production ends, arising from motor noise, speaker dependence, interfering sound sources, etc. Yet the duration of a phonetic gesture is <100 ms, too narrow a time window for feedback to play an important part in the control strategy—as it does with other end-effectors. Without online correction, high performance can only be maintained by 1) cultivating robust categorical perception of phonemes across the phonetic landscape and 2) developing (through repetitive alphabetic use) feedforward expertise on producing individual phonemes.

Fortuitously, our perceptual and motor systems are sufficiently mature at birth for both types of learning to commence during the first months of life. Indeed, at six months, babies begin to categorically perceive phonemes in their native language and to lose perceptual sensitivity to acoustic features that define non-native phonemes. Canonical babbling and language-specific speech production occur at seven and ten months, respectively, well before other motor milestones are met. We therefore argue that babies begin the learning processes for single phoneme perception and production earlier than other comparable perceptual or motor behaviors. Given that brain plasticity is highest at birth and declines thereafter, the learning is likely to be particularly effective with an outsized portion of neural resources devoted to these related competencies. Of course, speech is about a lot more than single phoneme perception or production. But our main point is that early expertise in single-unit phoneme perception and production contributes significantly to later speech proficiency—note how foreign languages seem like an undifferentiated blur of strange sounds impossible to segment.12 Presumably, the baby’s brain excels at learning the prerequisites to the segmentation task; more generally, baby brains are capable of learning important sensorimotor sub-tasks of speech prior to other forms of learning, thereby helping to establish the primacy of speech as a mode of communication.

Language to Speech and Back Again

Once speech was widely employed by adults in one generation, the exposure of infants to speech led to greater speech facility in the next generation. We hypothesize that the resultant feedback cycle could proceed at an explosive rate, precipitating a broad “speechification” of the human race within a relatively short timeframe. If this hypothesis holds, then the mechanism by which speech evolved is unique. It did not proceed, like biological evolution, through a process of natural selection.13 Nor did it proceed like cultural evolution, where various forms of expertise continue to expand as one generation builds on the accumulated knowledge of previous generations. Rather, widespread use of speech in one generation biased—through mere exposure—the developmental processes of infants in the next generation, preconditioning their neural circuits to manifest an enhanced propensity for speech. In short, the speech faculty is given a head start by speech prevalence.

Chomsky noted early on that language embodies an agreed-upon meaning-to-sound mapping.14 At some point, a singularity was reached: humans were capable of fashioning language, implying that an abstract system for computing with meaning existed and was mapped to sounds. In this sense, the internal language of thought had to precede its sensorimotor externalization, although how this internal language arose and what neural circuits compose it remain a profound mystery.15 We more tractably argue that though the computational structure for language may have initially arisen to support thought, its widespread communicative use triggered a transformative leap in our ability to command language, a second point of singularity from which human intelligence irrevocably diverged from that of its simian relatives. Here we invoke a developmental argument using another golf analogy. Tiger Woods ranks as one of the greatest golfers. He clearly possesses sensorimotor gifts that other humans do not. That said, Tiger began watching the game of golf at six months, developed a swing by the age of two, and shot a forty-five on a nine-hole golf course by the age of three. These facts in no way diminish his extraordinary innate capacity, but they do speak to the transformative power of specialization early in life. His brain became, in part, a neural circuit customized for depositing a small white ball into a small round hole using as few strokes as possible.16

Similarly, speech affords a wealth of opportunity for cultivating the precursors of linguistic skills during infancy, well before the language faculty has matured. A baby is constantly being bombarded with high-definition, multi-modal sensory rendering of the surrounding world out of which every mammal ultimately must make sufficient sense to survive—the “blooming, buzzing confusion” of James. While the infant is trying to use an inborn brain capacity to discern the most useful concepts/categories/rules by which to interpret the environment, his or her endeavor is ceaselessly punctuated by low-dimensional vocalizations which themselves have been deliberately preconfigured to correspond precisely to such useful demarcations. The sounds provide constant implicit pointers—at the levels of both speech and language—regarding how to go about conceptualizing experience. For example, certain phonemes tend to co-occur, certain words accord to certain objects, etc. Critically, these types of patterns are neither the essence nor end-product of speech/language, which is something much greater, but they do provide the structure necessary for facilitating unsupervised/reinforcement-based learning processes at the time when no other learning processes are as yet available.17 Accordingly, we propose that the internal faculty of language and its external actualization as speech mutually accelerate each other’s development.18 If language were not developmentally ubiquitous, the situation would be akin to Tiger Woods playing golf as a child without exposure to others playing golf—skill improvement would inevitably take place, but at a considerably sub-optimal rate.

Language across the Meaning Mapping

Returning to Chomsky’s contention that the internal faculty of language is not dependent on its sensorimotor actualization, we have offered support by showing that there exists a sensorimotor rationale explaining why speech should be preferred to signing. We have further suggested that sensorimotor constraints may also determine key structural features of sign language. Like Chomsky, we believe that these types of differences are generally subordinate to a universal grammar, in the sense that language is first and foremost about the manipulation of meaning. Given that there is likely some universality across humans as to how meaning is experienced and some consistency as to how this meaning is represented in an internal language of thought, those structural imperatives will dominate, even as the internal language is mapped onto a variety of sensorimotor externalizations that differ within and across modalities.

Even knowing what language is does not mean we understand how it works—i.e., the Galilean Challenge still remains. In the Grammaire générale, language is described as

the marvelous invention by which using 25 or 30 sounds we can create the infinite variety of words, which having nothing themselves in common with what is passing in our minds nonetheless permit us to express all our secrets… in effect, everything that we can conceive and the most diverse movements of our soul.19

How, Chomsky asks, is this remarkable feat—command over an infinite universe of expression with markedly finite means—accomplished via the free creative use of language? We do not know the answer. But the sense of helplessness we experience when confronting this question prompts a sense of déjà vu, through which we may be able usefully to analogize our ignorance.

Every sensorimotor circumstance confronting a motile organism is slightly different, because the space of sensorimotor possibility is infinite. Every time a predator chases its prey, the chase evolves differently. The golf shot described in the opening section of this essay was not one Phil Mickelson could have anticipated. Yet animals and humans demonstrate a remarkable sensorimotor flexibility that no robot can even hope to replicate.20 How this adaptability arises remains a mystery, but a key feature of this intelligence is an ability to generalize, i.e., a uniquely biological capability to unconsciously bring to bear salient aspects of all previous sensorimotor experience for the purpose of improving future sensorimotor behavior in novel circumstances.21 Fluent communication via speech may not be that dissimilar. Every instant of our lives yields a different set of circumstances, prompting different thoughts. Language provides a commonly shared mapping from concepts to words, together with a set of rules for organizing words to express more complicated meanings. Presumably these tools of language have arisen because they embody an effective cognitive infrastructure for extracting, manipulating, and rendering meaning. We then become remarkably adept at the skill of language through constant use, just as we do with other highly practiced skills, until we reach a stage where we are rarely at a loss for words; that is, we have generalized across the meaning map. The computational structure of language can be recovered through study, just as the correct mechanics of a golf swing can be analyzed, but the inner details of how this type of intelligence operates are, like those of sensorimotor intelligence, largely inaccessible to consciousness.22

The roboticist Hans Moravec remarked:

Encoded in the large, highly evolved sensory and motor portions of the human brain is a billion years of experience about the nature of the world and how to survive in it. The deliberate process we call reasoning is, I believe, the thinnest veneer of human thought, effective only because it is supported by this much older and much more powerful, though usually unconscious, sensorimotor knowledge. We are all prodigious olympians in perceptual and motor areas, so good that we make the difficult look easy. Abstract thought, though, is a new trick, perhaps less than 100 thousand years old. We have not yet mastered it.

In the case of language, as distinct from abstract reasoning, perhaps the sentiment should be reversed. We are not merely Olympians when it comes to language, the most enabling invention in human history—we are all Tiger Woods. Like him, we too, at a ridiculously young age, comically wield an outsized tool in our hands, only to see the child transformed into an adult whose deft instrumental touch works seeming miracles. In this way, the Galilean Challenge reifies the larger magic of the brain’s knack for mastering complex behaviors; we learn mostly by doing and, when we have done something often enough to get good at it, we do it mostly without thinking. The whole thing seems rather mysterious.

Robert Ajemian and Emilio Bizzi

Robert Ajemian is a research scientist studying the neural control of movement in the motor systems neuroscience lab of Emilio Bizzi at MIT.

Emilio Bizzi is an MIT Institute Professor, an Investigator in the McGovern Institute, and the Eugene McDermott Professor in the Brain Sciences and Human Behavior.

  1. For discussion, see Adam Kendon, “Reflections on the ‘Gesture-first’ Hypothesis of Language Origins,” Psychonomic Bulletin and Review 24 (2017): 163–70.
  2. Given the obvious predominance of visual perception in humans, where half of the cerebral cortex is devoted to visual processing, one might expect the opposite. See, for example, Robert Snowden, Peter Thompson, and Tom Troscianko, Basic Vision: An Introduction to Visual Perception (Oxford: Oxford University Press, 2006).
  3. Three factors pertain: the slower response time of larger muscles (20–50 ms activation time constants with longer deactivation time constants), the need to compensate for interaction torques across joints, and the extent of joint excursion necessary to create visually distinct signals. See, for example, Felix Zajac, “Muscle and Tendon: Properties, Models, Scaling, and Application to Biomechanics and Motor Control,” Critical Reviews in Biomedical Engineering 17 (1989): 359–411. Also, John Hollerbach and Tamar Flash, “Dynamic Interactions Between Limb Segments During Planar Arm Movement,” Biological Cybernetics 44 (1982): 67–77.
  4. Intrinsic finger muscles are small with relatively rapid muscle response times and the inertia of fingers is low. However, the alphabet of hand/finger postures requires orientating the wrist which, in turn, requires the recruitment of larger and slower forearm, elbow, and shoulder muscles (the entire complex composes a kinematic chain). See, for example, Ferdinando Mussa-Ivaldi, Pietro Morasso, and Renato Zaccaria, “Kinematic Networks,” Biological Cybernetics 60 (1988): 1–16. 
  5. It is generally used as a supplement to other sign languages. See, for example, Wayne Forman, “The ABCs of New Zealand Sign Language: Aerial Spelling,” Journal of Deaf Studies and Deaf Education 8 (2003): 92–96.
  6. Edward Klima and Ursula Bellugi, The Signs of Language (Cambridge, MA: Harvard University Press, 1979). 
  7. Edward Klima and Ursula Bellugi, The Signs of Language (Cambridge, MA: Harvard University Press, 1979).
  8. Viewers must attend to multiple gestures in parallel and keep aspects of previous gestures in working memory. 
  9. We emphatically depart from the standard view that sign language possesses a true phonology—i.e., an underlying basis of a finite number of meaningless, contrastive units that together compose morphemes. See, for example, Wendy Sandler, “The Phonological Organization of Sign Languages,” Language and Linguistics Compass 6 (2012): 162–82. We do not deny that one can decompose signs into component parts such as sub-movements, hand configurations, locations, etc., to help distinguish them. The question is whether the decomposition is meaningful. For signing, the motor command necessary to implement one of these sub-components depends on the values of the others due to the transmission of torques along a kinematic chain—i.e., the command signal for motor phonemes depends critically on movement context. Thus, there is no pseudo-invariant motor command that corresponds to a phoneme in signing, unlike in speech—where co-articulation effects are minor by comparison and compensated for by categorical perception. This realization is a simple matter of biomechanics, not interpretation. Our view is that a phoneme is functionally real if and only if it corresponds to a pseudo-invariant motor-to-sensory mapping, enabling the unit to be chunked at the neural level as a distinct production-perception association. For support, we note that developing sign languages may lack even a presumptive phonology—see Wendy Sandler et al., “The Gradual Emergence of Phonological Form in a New Language,” Natural Language & Linguistic Theory 29 (2011): 503–43. 
  10. The visual system requires 6–8 months to attain near adult levels with respect to most visual functions, and full visual acuity does not arise until after the age of two. See, for example, Martin Banks, “Infant Visual Development,” Acta Psychologica Sinica 17 (1985): 271–77. 
  11. The number of phonemes, ~40 in most languages, presumably reflects a trade-off between having as few phonemes as possible and minimizing word length. If there were only two phonemes (e.g., Morse code), word length would be unacceptably long. Our point is that sign language does not participate in this tradeoff or reap the corresponding sensorimotor benefits of modularity because the fundamental unit of a sign exists at a larger scale. 
  12. In fact, adults who have learned a second language have great difficulty in ever comprehending or speaking the language at the same rates as native speakers. 
  13. Nothing prevents natural selection from operating in parallel, which it likely does, though the restricted time frame places limits on the degree of evolutionary change that is possible. 
  14. See, for example, Noam Chomsky, “The General Properties of Language,” in Brain Mechanisms Underlying Speech and Language, ed. Frederic Darley (New York: Grune and Stratton, 1967). 
  15. We are agnostic as to whether these evolutionary changes to the brain were specific for language or whether they embodied more general cognitive strategies that were to a large extent appropriated for language. The most likely answer is a combination of the two. At a general level, extracting greater levels of meaning from the environment as the content for the internal language requires greater levels of feature abstraction (more association cortex) and storing those features for longer periods of time (larger temporal lobe for memory). Both types of brain changes have been observed as humans evolved, but these non-specific changes hardly suffice to answer such a difficult question.
  16. The mammalian brain, as the ultimate adaptive system, excels at self-organizing, including re-allocating neural resources, so as to improve performance in the tasks in which it tends to be engaged. For fine motor sports like golf or tennis, if extensive practice has not begun before the teenage years, an individual’s chances to succeed at the highest level are greatly reduced, if not eliminated altogether. 
  17. In neural network/machine learning theory, supervised learning is the most efficient form of learning, and it requires a teaching signal against which an output can be quantitatively compared for system improvement. Early in life before a critical mass of understanding has been reached, this type of learning is less frequent. 
  18. The patterned low-dimensional vocalizations not only help infants parse experience, but also provide added incentive to do so by triggering the imitation instinct. Through exploratory behavior, the infant discovers a capability of reproducing facsimiles of these sounds in a contextually correct manner, thereby cultivating a sense of control over the “blooming, buzzing confusion.” 
  19. Antoine Arnauld and Claude Lancelot, Grammaire générale et raisonnée de Port-Royal (Paris: Munier, 1803). First published in 1660. Translation by the editors.
  20. Robots can be designed to perform a stereotyped behavior, with little ability to adapt to unforeseen contingencies. Biological sensorimotor intelligence is, in contrast, amazingly adaptive. 
  21. We do not mean generalization in the highly limited, almost vacuous, sense of a deep neural network, an ability which bears virtually no resemblance to the generalization we are describing in either the sensorimotor or speech/language domains.
  22. Here we are referring to the well-known distinction in neuroscience between declarative memory and implicit or non-declarative memory. See, for example, Larry Squire and Stuart Zola, “Structure and Function of Declarative and Nondeclarative Memory Systems,” Proceedings of the National Academy of Sciences 93 (1996): 13,515–22. Both golf and speech are implicit skills: an articulate speaker can no more tell a less expressive speaker how to be more articulate than a professional golfer can tell a novice how to hit a golf ball. You can give tips and pointers, but the knowledge only comes from doing.