To the editors:
Sign languages arise spontaneously in deaf communities, are acquired during childhood through normal exposure without instruction, and exhibit all of the facets of complexity found in spoken languages. Yet if language had evolved without respect to modality, whether spoken or signed, we would expect to find hearing communities that just happen to use sign language rather than spoken language; we do not. What, then, can be learned from naturally occurring sign languages? Sign demonstrates what is impossible in speech, and speech, what is impossible in sign. Comparing spoken and signed languages helps to isolate and distinguish the roles of the body and of computational limits in determining the form of language in both modalities.
While we agree with Iris Berent that spoken and signed languages belong to one human language system, we maintain that each modality organizes the system differently. Spoken and signed languages differ because of both the affordances of the human body and the limits of the medium of transmission. The differences mean that simply projecting attested patterns from one modality to another, as in Berent’s approach, is not sufficient to show abstract or algebraic universals. One must instead situate such abstract properties in a theory that determines the range and types of possible linguistic processes for a language user. This is exactly what the mathematical theory of computation provides. Understanding which classes of computations are possible in both speech and sign allows for precise analysis of the similarities and differences in the types of processes allowed by the language system as a whole. The universality of computation also protects theories from being speech-centric. It ensures that they are flexible and adaptable to discoveries about both speech and sign, the only two natural human language systems.
In 1960, William Stokoe demonstrated that the signs of sign languages are made up of discrete meaningless building blocks. These are recombined to create large vocabularies in the same way that sounds are combined in spoken languages.1 This discovery was a milestone. But the building blocks—the phonology—of each type of language are so powerfully shaped by the physical channel of production and perception that the rule systems of the two are orthogonal. In contrast to the position taken in Berent’s article, sign language research shows that the body is intrinsic to the mental system of language, and not external to it.2
Signs consist of features that distinguish one sign from another. The Israeli Sign Language (ISL) signs for THINK and LEARN are distinguished only by their different handshapes. For THINK the index finger moves toward the temple, and for LEARN the fingers, pressed together against the thumb, make the same movement to the same location.3
These two sign words are distinguished by meaningless components, index finger handshape vs. closed fingers handshape—comparable to fog and dog in English, which are alike except for their first meaningless sound, [f] and [d]. Sign words resemble spoken words at this abstract level of minimal contrasts. Still, many aspects of phonological structure are profoundly different in the two modalities. All known sign languages typically require a simple word to be formed of a single handshape, movement, and major location.4 This is presumably because the body can produce discrete simultaneous articulations and the visual system can perceive them, within the constraints of working memory.5 Spoken languages, by contrast, vary widely in their phonological structure. Chinese words are primarily two syllables long, while Hawaiian words can consist of many syllables, as in the name of the state fish of Hawaii, humuhumunukunukuāpuaʻa, or triggerfish. Hawaiian syllables are very simple, mostly consisting of a single consonant followed by a vowel. By contrast, Polish has jaw-busting words like Szczebrzeszyn, the name of a city in the southeast of the country.
Forms are altered by context in both signed and spoken language, irrespective of meaning. In English, the [n] in the compound sunblock may be pronounced as [m] by closing the lips, influenced by the following closed-lip sound, [b]. The result is the pronunciation su[m]block. This process, known as assimilation, has nothing to do with the meaning of sun or block, but only with the form of the articulations themselves. Assimilation is an important piece of evidence for organization at the phonological level of structure, since form and not meaning determines organization.
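To make the form-driven character of assimilation concrete, here is a minimal sketch in Python of nasal place assimilation as a purely local rewrite over strings of sound symbols. The segment inventory and rule format are illustrative choices for this sketch, not a claim about any particular phonological formalism.

```python
# Toy sketch: nasal place assimilation as a purely local, form-driven rewrite.
# The rule consults only adjacent sound symbols; word meaning plays no role.

LABIALS = {"b", "p", "m"}  # closed-lip consonants

def assimilate_nasal(segments):
    """Rewrite /n/ as [m] whenever the next segment is a labial consonant."""
    output = []
    for i, seg in enumerate(segments):
        nxt = segments[i + 1] if i + 1 < len(segments) else None
        if seg == "n" and nxt in LABIALS:
            output.append("m")  # place of articulation copied from the neighbor
        else:
            output.append(seg)
    return output

# "sunblock" -> "su[m]block": only the neighboring form matters, not the meaning.
print("".join(assimilate_nasal(list("sunblock"))))  # prints: sumblock
```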
Sign languages also contain form-driven assimilations, following different rules according to the body’s capacity for producing and perceiving movement. The handshape of the ISL sign THINK is altered to resemble the handshape of STOP when the two come together in the compound word THINK^STOP, which means STUNNED.6 The sign THINK is signed with an extended index finger moving toward and contacting the temple. To sign STOP, both hands, shaped with both index and middle fingers extended, begin beside the cheeks and then move down. In the compound sign STUNNED, both hands use this two-finger handshape from STOP, contact the head as in THINK, and move downward. The result is a new sign with a single movement and handshape: the canonical shape of the sign. This assimilation is not due to any characteristic of the neighboring handshape—as in the English su[m]block example—but to constraints within signed languages requiring the word optimally to have only one handshape and only one movement.7 Any formulation of the English sunblock > su[m]block change and any formulation of the ISL changes (one finger > two fingers; movement to the head > downward movement) would differ so substantively that no single rule could be stated that applies to both language types.
Though there is a phonological level in each type of language, in contrast with Berent’s claim, the rules for combination and alternation are substantively distinct. These rules are determined by the body and its medium of expression: visual versus acoustic. What the language types and their rules share are the abstract phenomena of contrastive features and assimilation.
Stokoe’s proposal that there are three major feature categories in signs—handshape, location, and movement—contrasts with the two major categories of spoken language, consonant and vowel, not only in number, but also in combinatorial possibilities. Spoken language consonants combine linearly with other consonants and with vowels, as in the word consonant: CVCCVCVCC. Sign language words typically contain only one instance of each major category and no continuous strings of just one element. THINK has the extended index-finger handshape, straight movement, and the head as the major body area or location. The physical elements of producing and perceiving phonological form are implicated in both speech and sign. The two types of languages are made up of completely different materials, and the materials drive their form. It is then no surprise that constraints on the combination of phonological elements and possibilities for alternation are also substantively different.
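As a toy illustration of the linear combinatorics at issue, the following sketch maps a written English word to a CV skeleton. Treating orthographic vowel letters as vowels is a simplification made only to render the notation concrete; a real analysis would operate on phonological segments.

```python
# Toy sketch: map a written English word to its CV skeleton.
# Orthographic vowel letters stand in for vowel sounds here.

VOWEL_LETTERS = set("aeiou")

def cv_skeleton(word):
    return "".join("V" if ch in VOWEL_LETTERS else "C" for ch in word.lower())

print(cv_skeleton("consonant"))  # prints: CVCCVCVCC
```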
In spoken language, there are constraints on how many consonants are allowed to cluster together, or on how similar all vowels may be in a word. These differ from language to language. In sign languages, there are constraints on how the two hands can interact,8 on how the fingers and their shapes can combine,9 and on how many movements can occur in sequence within a sign.10
These generalizations distinguish sign languages from spoken languages. They also hold for every established sign language that has been studied. The generalizations described here are inextricably bound to the body. The body drives phonological form, in contrast to Berent’s claim that this form is the domain of a disembodied mind. Even the idea that phonology is a universal and unified property of all languages is challenged by the findings concerning a new sign language that emerged in relative isolation among the Al-Sayyid Bedouins. Wendy Sandler et al. discovered that communicative and functional language can begin without a crystallized phonological system.11 Phonology as a system is not there at the outset but takes time and interaction in a community to develop, as shown also by the emergence of tactile sign in a community of the DeafBlind.12
The forms of complex words also show parallels in signed and spoken languages. But the differences, due to the body, are just as striking. Both types of language have complex words. Yet the differences in their kinds of complexity preclude inferences from one language type to the other. Signed languages have inflectional morphology, which depends on sentence structure, and some derivational morphology, which forms new words. But the system is not isomorphic with that of spoken language.
Mark Aronoff et al. showed that inflections in signed languages are highly influenced by iconicity as exploited by the body.13 A type of inflection found in many sign languages is verb agreement.14 In verbs of transfer, the beginning and endpoint are determined by points referring to the source and the goal of the transfer.15 From the verb SHOW, I SHOW YOU is conveyed by moving the hands from the signer toward the addressee; YOU SHOW ME moves in the opposite direction; and I SHOW ALL moves in an outward horizontal arc in front of the signer.16 These characteristics are common across sign languages and unknown in spoken languages.17 Although many spoken languages exhibit verb agreement, in those languages agreement applies across the verbal system; it is not restricted to verbs of transfer in any known spoken language. This distribution and the rules for implementing it become clear once we acknowledge the role of spatial iconicity in a system recruited by the body.
In light of these results, how does the search for grammatical universals proceed? In the 1930s, Alonzo Church and Alan Turing were seeking a resolution to a foundational crisis in mathematics over whether mathematical truths could be established by purely mechanical procedures. They uncovered a profound separation between computable and noncomputable functions.18 The sense of the word computable in Turing’s usage predated actual modern computers and referred to intellectual behavior: what a person following explicit rules could calculate. He and others showed that computation forms a universal, algebraic category that can be studied irrespective of its physical realization.
The Chomsky–Schützenberger hierarchy decomposed the computable functions into nested subclasses of increasingly restrictive and less complex types of computation.19 Each class corresponded to a restriction on the computing device, describing the type of grammar or machine that could uniquely compute the class. The more restrictive the computational system, the less it is capable of describing.
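In its classic four-level formulation over classes of formal languages, the nesting can be stated as a chain of proper inclusions, with the regular languages at the most restrictive end:

```latex
\[
\textit{Regular} \subsetneq \textit{Context-Free} \subsetneq \textit{Context-Sensitive} \subsetneq \textit{Computably Enumerable}
\]
```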
Since it was possible to precisely define how powerful a discrete mathematical system was, one could then ask what level of computational expressivity would be necessary and sufficient to describe, learn, or use the processes in human language. “[T]hese are all theories of proliferating systems of structures,” Noam Chomsky remarked, “designed for the problem of trying to locate this particular structure, language, within that system.”20 The rigor of this approach has earned it a place as a cornerstone of computer science and cognitive theory.
Which levels of computation are necessary and sufficient to properly characterize linguistic processes for spoken and signed languages? A further question is whether spoken and signed languages fall into the same class, and thus need the same cognitive power. This is a much broader and yet more rigorous question of amodality than the one Berent asks. It goes beyond the imprecise question of whether the particular rules of the two types of language are somehow similar, asking instead whether a given rule is evidence that signed language requires access to particular classes of computational power.
Consider word doubling of the sort used in the experiments reported by Berent. Word doubling is an example of a general copying function w → ww, which takes a sequence and maps it to its double, as in Indonesian plurals: ‘buku’ → ‘bukubuku’ (book → books). Copying is a computable function, but it does not need the full power of computability.21 Nor do any of the substantive variants in Berent’s experiments, such as copying only certain parts of a word (‘buku’ → ‘bukuku’).
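For concreteness, here is a minimal Python sketch of total copying and of one partial variant of the kind just mentioned. Treating the copied portion as the final consonant-vowel pair is an illustrative simplification, not an analysis of Indonesian.

```python
# Toy sketches of the copying functions discussed in the text.

def total_copy(w):
    """w -> ww: total reduplication, as in Indonesian buku -> bukubuku."""
    return w + w

def final_cv_copy(w):
    """Copy only part of the word, here the final consonant-vowel pair:
    buku -> bukuku.  (A stand-in for 'copy the final syllable'; real
    syllabification is more involved than taking the last two letters.)"""
    return w + w[-2:]

print(total_copy("buku"))     # prints: bukubuku
print(final_cv_copy("buku"))  # prints: bukuku
```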
Doubling inhabits a highly restricted formal class known as the regular functions, one of the best-understood concepts in computer science and discrete mathematics. For a process to be regular, the memory required to compute it must not grow past some fixed, finite bound. This subsumes any cognitive processing mechanism in which the amount of information inferred or retained is limited by a fixed finite bound.
Not all copying is a regular function. We can think of infinitely many possible but unattested types of copying besides doubling. An exponential function w → w^|w| copies a word as many times as the word’s length, which would make Indonesian plurals look like ‘buku’ → ‘bukubukubukubuku.’ The word is repeated four times because it contains four sound segments. If it had five, it would repeat five times. This function is provably supra-regular and is unattested in any human language, signed or spoken.22
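The unattested pattern is just as easy to write down, which is precisely the point: the problem is not that it cannot be stated, but that its output grows with the square of the input length, beyond the linear growth obeyed by the regular functions. A minimal sketch:

```python
# Toy sketch of the unattested pattern w -> w^|w|: repeat the word once per
# segment it contains (letters stand in for sound segments here).

def exponential_copy(w):
    return w * len(w)  # the number of copies depends on the input's length

print(exponential_copy("buku"))  # prints: bukubukubukubuku (four copies, four segments)

# The output has length len(w) ** 2, so it outgrows any linear bound,
# unlike the attested copying patterns sketched above.
```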
A stunning discovery is that the entirety of attested human phonology and morphology for spoken languages sits comfortably in the regular functions, with the overwhelming majority restricted to the less powerful sub-regular classes.23 Laboratory experiments show that human participants readily learn patterns drawn from these classes and consistently fail to learn patterns outside them, even when there are plausible alternative hypotheses consistent with the constraint-based theory Berent’s study adopts.24 In Berent’s terms, humans may have an instinct or bias to learn sub-regular patterns and not more complex ones. This computational foundation allows for a refined version of Berent’s experiments. Can speakers and signers learn some regular function present in the other modality, but not their own? Does modality affect whether speakers can learn patterns outside the regular class? The broader lesson is that these restrictive characterizations allow for a precise statement of the generative capacity that linguistic cognition must have in order to express such functions.
This mathematical foundation provides a solid base from which to compare spoken and signed phonology. Our work has shown that many signed phonological and morphological processes inhabit the same classes as their spoken counterparts. This finding leads us to an intriguing hypothesis: Any divergences in generative capacity between the two modalities may reflect the difference in physical systems. If the generative capacity of signed reduplication differs from spoken reduplication, the difference reflects a unique characteristic of the signed modality.25 One example of this greater capacity might be embedded copying in American Sign Language.26
An important aspect of this research clarifies the relationship between mental representations and generative capacity. Spoken and signed modalities both contain sequential and simultaneous structure, but they differ in how central each is in the creation of forms.27 Sign privileges simultaneity, while speech privileges sequentiality. Advances in mathematical linguistics have begun to examine the relationship between simultaneity and generative capacity in spoken and signed processes. The apparent difference between the signed and spoken representations is likely due to a more abstract constraint preserving their computational limits. Signed processes, which are able to exploit simultaneity because of the affordances of the body, may constrain their sequential properties to maintain the sub-regular limits on generative capacity.28
The main lesson is that the capacities for speech and sign provide a window into the nature of the universal capacities for language shared by all humans, even if one modality does not manifest them directly. A formal envelope of mathematical universals allows principled distinctions and similarities to be drawn between spoken and signed cognition and is predicted to do so even in the emerging communities of tactile language where the substantive restrictions are still unknown.
Mark Aronoff, Jonathan Rawski, and Wendy Sandler
Iris Berent replies:
Polarization is rampant these days, and language researchers are not immune. In linguistics, as in cognitive science, party lines run along the abstraction–embodiment debate, a distant vestige of the bloody nature–nurture wars.
Mind–body in this case does not refer to Descartes’s dualism. Indeed, cognitive science assumes that all mental states are brain—i.e., bodily—states. Whether endorsing abstract rules or denying that they exist, cognitive scientists are committed to the claim that all causes of behavior lie within the human body. Quibbles about whether cognition is embodied, then, are misdirected. What these debates are really about is whether some causes of cognition lie “below the head,” so to speak: whether cognition is linked to sensory and motor bodily functions.
In its extreme form, the embodiment position reduces cognition to sensorimotor constraints; it asserts that there is no such thing as abstract concepts and representations.29 The brain has no symbol for cup in general. Instead, the notion of a cup amounts to the sum of a person’s sensations of specific cups and motor interactions with them—how a cup feels, its smoothness, roundness, coldness, and its color as registered by the eyes.
By the same token, a person’s knowledge of language structure is governed not by algebraic rules but by sensorimotor restrictions. Why do we blog, for example, rather than lbog? Lbog is banned because it is harder for the mouth to utter and for the ear to discern. Since our appreciation that lbog is difficult arises only once our lips and tongues utter these sounds, it would seem that these embodied restrictions on language structure arise from experience alone.
Whether language is in fact embodied has been a topic of active debate in linguistics and cognitive science.30 But unfortunately, the notions of embodiment and abstraction are rarely spelled out. The passionate defense of embodiment by Mark Aronoff and his colleagues does little to correct this problem—they never explain what embodiment really means. When basic concepts are blurred and blood starts to boil, positions on this nuanced question can get needlessly polarized.
In the heat of battle, allegiances must be clear-cut. If a researcher happens to conclude that speakers follow abstract rules of language, then he is automatically seen as asserting that these rules are entirely senseless and arbitrary; they can serve no functional purpose with respect to the transmission of language by the human body. As an extra bonus, some might falsely accuse him of stating that these rules must be innate. Conversely, if he believes that the design of the language system is adaptive, then, strangely enough, he is often seen as stating that language is the product of nurture, that rules are simply fruit of the imagination, and that language is entirely governed by the whims of the body.
This is not a good conversation starter. In the interest of opening up a more productive discussion, it is important to clarify these issues.
Aronoff et al. are right to ascribe to me the position that some rules of language, including some rules of phonology, are algebraic, amodal, and abstract. They are wrong to conclude that from this it follows that the design of language, and especially of phonology, is entirely arbitrary—that embodied constraints play no role. I assume that is what they mean by their assertion that “the body is intrinsic to the mental system of language, and not external to it.”
How can one have it both ways, you wonder? The answer lies in three critical distinctions—scope of one’s argument, cognitive causation (proximal vs. distal), and grain size of linguistic units.
I assert that some rules of language are amodal, inasmuch as they apply to both speech and sign. When naive English speakers, for instance, who know nothing about a sign language see signs for the first time, they can apply some of the rules from their spoken language to extract the linguistic structure of those signs.31
I do not assert that all linguistic rules can transfer across language modalities, and I certainly recognize that many rules of phonology differ across modalities. A phonological restriction on voicing, for instance, has no business applying to manual signs. This much should be obvious. Claims to the contrary indicate confusion about the scope of my assertion.
Nonetheless, I do assert that, at least in one case—that of doubling—the relevant rule is amodal. This discovery is significant, because it demonstrates that the computational machinery of language is algebraic.32 The operation Y = 2X refers to a class, where X is any integer, as opposed to specific integers such as 1, 2, 3…. Similarly, the restriction on doubling—e.g., “avoid XX,” where X is any syllable—refers to a broad class, “any syllable,” rather than to specific instances such as dada or baba. As long as an English speaker can spontaneously recognize a signed syllable, which they demonstrably can,33 and represent the formal function of identity, XX, which even newborns do,34 the doubling rule from the English grammar ought to freely apply to American Sign Language signs, as indeed it does.35
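The contrast between a constraint stated over a variable and a list of memorized instances can be sketched directly; the syllable strings below are purely illustrative.

```python
# Toy sketch: an algebraic constraint over a variable ("avoid XX, where X is
# any syllable") versus a memorized list of specific doubled forms.

def violates_xx(syllables):
    """True if any syllable is immediately repeated.  The variable X ranges
    over whatever syllables occur, regardless of their content or modality."""
    return any(a == b for a, b in zip(syllables, syllables[1:]))

MEMORIZED_DOUBLES = {("da", "da"), ("ba", "ba")}  # instance-based alternative

def violates_memorized(syllables):
    return any(pair in MEMORIZED_DOUBLES for pair in zip(syllables, syllables[1:]))

print(violates_xx(["laf", "laf"]))         # True: the rule generalizes to new syllables
print(violates_memorized(["laf", "laf"]))  # False: the list knows nothing of "laf"
```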
In sharing this finding as part of my essay, I sought to counter the popular belief that language, and especially phonology, is all about talking: that one’s linguistic intuitions are determined solely by the mechanics of the lips and tongue. This folk phonology runs so deep that people often take it for granted. The counterintuitive phenomenon of cross-modal transfer debunks this myth.
This proposal also counters the claims of Aronoff et al. in two important respects. First, I do not vaguely claim that rules in the two modalities are somehow similar, as these authors suggest. I am committed to the strong claim that, in the case of doubling, the rules are one and the same. Second, since algebraic variables encode structure (XX), not speech instances, e.g., baba, the rule treats manual signs and aural speech alike. This seems to fly directly in the face of the claim by Aronoff et al. that body shapes language structure. But in reality, it may not. To explain why, we need to take a closer look at how the body can play a role in cognition generally, and the language system specifically. Here, the distinction between proximal and distal causation is critical.
The algebraic hypothesis defines proximal causes of linguistic intuitions. In this view, people state that laflaf sounds funny because laflaf violates the *XX rule—the rule, then, is a proximal cause of their intuitions. Still, why do languages adopt such a rule in the first place?
Here, embodiment could matter distally. Repetition taxes the perceptual and motor systems,36 so it stands to reason that repetition in language will be highly regulated. These bodily constraints could provide the impetus—a distal cause that determines what rules make it into universal grammar in the course of language evolution. Distal bodily pressures could beget universal grammar, which, in turn, could beget the doubling rule, which drives speakers’ intuitions. Crucially, once a rule is adopted into a grammar, it is now the rule that is doing the talking. As such, it is the rule, not bodily pressure, that is the proximal cause of speakers’ intuitions. Still, bodily pressures matter distally.37
And of course, going beyond the grammar, the language system could also include nongrammatical analog mechanisms that are heavily embodied, and those can demonstrably shape speakers’ intuitions. The phonetic system is a case in point.
Phonetics is a transducer. For instance, spoken language phonetics takes speech, which is analog and continuous, and outputs discrete categories, such as the voicing distinction between bee and pea. A large literature shows that this process is highly embodied: when the brain seeks to determine what a person hears—bee or pea—it simulates how she would articulate these speech sounds. When people hear bee, they activate the motor area of the brain that controls the lips; when they hear tea, it is the area that controls the tongue that fires.38 And when these areas are stimulated by transcranial magnetic stimulation, listeners’ perception of these sounds changes accordingly.39 Work from my lab has shown that one’s perception of voicing in pea and tea changes even when one slightly presses on one’s lips, relative to pressing on the tongue.40 All these demonstrations show that the articulatory motor system plays a direct causal role in the phonetic categorization of speech sounds. And here, the body is the direct proximal cause of perception, not merely a distal one.
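As a schematic illustration of what it means for phonetics to transduce a continuous signal into discrete categories, consider a toy classifier over voice onset time (VOT). The 25-millisecond boundary is a rough textbook value for English bilabial stops, not a parameter drawn from the studies cited above, and real categorization recruits the motor simulation just described rather than a bare threshold.

```python
# Toy sketch: phonetics as a transducer from a continuous cue to a discrete
# category.  Voice onset time (VOT, in milliseconds) is continuous; the output
# category ("b" vs. "p") is discrete.  The 25 ms boundary is an illustrative
# textbook-style value, not a measured parameter from the work cited above.

VOT_BOUNDARY_MS = 25.0

def categorize_bilabial_stop(vot_ms):
    return "p" if vot_ms >= VOT_BOUNDARY_MS else "b"

for vot in (0.0, 10.0, 30.0, 60.0):
    print(vot, "->", categorize_bilabial_stop(vot))
# Small shifts near the boundary, of the sort produced by perturbing the
# articulators, could move a percept from one discrete category to the other.
```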
But what is true within phonetics—outside grammar—may not be true for the phonological rules within the grammar. Here, one can manipulate the lips all one wants, either directly, by pressing on them, or by stimulating the lip motor area via transcranial magnetic stimulation; in any case, the application of phonological rules concerning syllable structure remains unaffected.41 This, to clarify, does not show that phonological rules are arbitrary. As noted, the body still plays a role. But this role of the body is apparently a distal one.
That language comprises components of different grain sizes—for example, phonetics versus phonology—and that these can differ with respect to their level of embodiment should come as no surprise. Likewise, it should only be expected that within the grammar—say, in phonology—some rules might be incentivized by distal bodily pressures. This makes perfect sense, given that phonology has a double duty to fulfill: it needs to generate novel forms by combination, but it also needs to transmit them through the human body. The solution is to favor rules that make bodily sense—this is exactly what would be expected of an adaptive system.42
This proposal allows for the possibility that some algebraic rules could apply amodally, whereas others could differ across language modalities, just as Aronoff et al. point out. Whether these modality differences could further shape the computations that are attested in speech and signs, as Aronoff et al. suggest, is an interesting question—and remains to be seen.
How the language system arose in humans is another critical question that I will not consider here. Aronoff et al. submit that, as it does not emerge fully de novo, phonology cannot be innate in the first generation of emerging sign languages.43 I am not sure this follows. Innate systems need not be fully assembled at birth, nor immune to epigenetic triggers. Birdsong shows how a quintessentially innate system of communication could emerge gradually, across generations, through complex interactions of nature and nurture.44
The nuanced view of embodiment and innateness that I have painted here has some concrete methodological implications. When one seeks to evaluate the role of embodiment and abstraction in the language system, one ought to proceed with caution. As noted, embodiment can play distinct roles at different levels of analysis, and these roles can be either proximal or distal. When Aronoff et al. outline the undeniable correlations between the design of language and bodily pressures, it is irresponsible to jump to conclusions about causation—specifically the conclusion that the body is internal to the language system, presumably as a proximal cause of language structure. As we know too well, correlations and causations are not one and the same. So rather than simply ask, “Is language embodied?” the more appropriate question is “How?” It is time to move beyond the party line and do the hard work of sorting this out.