To the editors:
A hallmark of human language is its combinatorial nature, which facilitates the communication of infinite meanings and messages. First, a limited set of meaningless sounds, or phonemes, can be productively combined into large lexicons of meaning-bearing units: morphemes and words. Second, according to a language’s syntactic rules, the resultant meaning-bearing units can be assembled into larger sequences and phrases to communicate more complex and novel messages.1 The accumulating data on nonhuman animals demonstrate that modest combinatorial vocal structures also exist in the communication systems of distantly related species.2 Such findings suggest converging evolutionary drivers along with shared social and ecological conditions that might have led to the emergence of basic combinatorial abilities. Studies on animal combinatoriality,3 particularly empirical work on birds, have evoked heated discussion among linguists. Some researchers doubt whether data on nonhuman animals is comparable to the linguistic structures found in human language and question the extent to which such data is informative about the evolutionary origins of the human language faculty.4
In his essay, Riny Huybregts takes particular issue with findings that chestnut-crowned babblers (Pomatostomus ruficeps) demonstrate combinatorial reuse of meaningless sounds to generate meaningful calls. He discusses how the bird case differs from the ability of humans to generate meaningful words from a finite set of meaningless phonemes. Chestnut-crowned babblers are highly social, group-living passerine birds from outback Australia. They produce around eighteen functionally distinct calls, including multi-element calls that are made up of smaller acoustic elements, some of which reoccur across different call types.5 Of these, two calls are of particular interest in recent studies: AB flight calls, which are produced during short flight movements and serve as a contact call to coordinate group cohesion, and BAB prompt calls, also termed provisioning calls, which stimulate nestling begging behavior during food provisioning.6 A series of experiments conducted on wild birds in aviaries demonstrated that the individual A and B notes played back on their own did not result in differential or other relevant responses in birds. These standalone calls can be considered as meaningless and not encoding functionally or contextually relevant information.7 Habituation-discrimination experiments, which test a subject’s ability to spontaneously discern differences among presented stimuli, further showed that birds discriminated among the A and B notes of each call. 
They perceived the same notes compared across calls as equivalent, in turn suggesting the reuse of the same sounds across the two calls.8 Related experiments confirmed that the equivalent note types can be exchanged between calls, with the resulting synthetic calls still being perceived appropriately by babblers.9 Together, these studies indicate that an ability to generate meaningful units from recombination of smaller meaningless building blocks is not limited to human language, but has also evolved in a very distantly related species.10
While Huybregts acknowledges that examples from nonhuman animals can provide insights into the evolutionary beginnings of combinatorial capacities in human language, he points out several differences that arguably complicate comparisons. Huybregts considers that there is an intrinsic link between language’s two combinatorial layers—phonemes combining into morphemes and words (phonology), and words combining into phrases (compositionality). He designates this duality of patterning to be a defining feature of language, which implies that systems lacking either one of these layers cannot provide any comparable insights into language’s combinatorial system. There is no question that the generative power of most human languages is the product of both of these combinatorial layers acting simultaneously to facilitate the expression of infinite messages from the recombinatorial and recursive use of meaningless and meaning-bearing units. Yet, their co-occurrence does not imply that both layers necessarily have a single, “recent and abrupt” evolutionary origin, as Huybregts suggests, or even the same neural basis or selective drivers. Diverse work, including studies on language acquisition, emerging sign languages, and computational modeling, along with the clear combinatoriality of music, which lacks duality of patterning, suggests that these two layers can emerge independently from each other.11
Huybregts claims that the chestnut-crowned babbler system cannot bear comparison to language-like phonological structures since it lacks the capacity to merge meaning-bearing units. But, given that language’s combinatorial layers can emerge independently, this is no argument against this particular combinatorial system being comparable to the way in which humans recombine smaller, meaningless sounds into meaning-bearing units. Nor is his argument that these calls lack coarticulation correct: the relevant experiment has simply not been done.
Other animals, including another babbler species, the pied babblers (Turdoides bicolor), do combine meaningful calls into call combinations that encode related and derivable meaning.12 And many birds incorporate call units, which have standalone meanings, into their highly combinatorial songs.13 Recent advances in machine learning methodologies can be used for segmenting animal calls and testing for reoccurring elements across call types. It would be unsurprising if these technologies demonstrate similar, superficial recombinatorial element reuse in the vocalizations of other species, including those already demonstrated to possess composition-like capacities.14
The duality of patterning issue aside, a major difference between the chestnut-crowned babbler sound combinations and the phonological layer of human languages is the lack of productivity in the babbler system. In the babbler system, two sounds give rise to just two signals when theoretically 14 could be generated—i.e., 2 one-symbol + 4 two-symbol + 8 three-symbol codes. This dichotomy was already highlighted in the original babbler studies, which postulated that the evolutionary emergence of the babbler combinatorial structures may not be driven by the need for “unbounded and creative” communication, as Huybregts puts it, but instead by a need for reliable signal transmission and perception.15 Huybregts raises similar considerations, which we are currently following up in our own research. More precisely, research in game theory and computational linguistics suggests that the sound combinations of language’s phonological systems may not function to increase the communication output of humans, but instead to enhance its transmission efficacy.16 In line with these findings, mathematical models on the emergence of phonological forms indicate thresholds above which adding new sounds to a sound repertoire would incur costs acting on the discriminability between signals. Crucially, these models build on information theory, specifically Claude Shannon’s noisy-channel coding theorem, postulating that the presence and amount of noise—e.g., environmental factors interfering with signal perception—influences the successful recognition of individual sounds.
As a consequence, stringing simple sound units into more easily distinguishable sound arrangements can enhance discriminability among otherwise similar-sounding signals.17 In the case of chestnut-crowned babblers, this suggests that if each sound unit, A and B, alone were assigned to one of the two babbler call meanings, though they would be discriminable in quiet conditions, their acoustic discriminability might blur in the presence of noise, resulting in perception errors. Combining the sounds into more distinct combinations—e.g., AB and BAB calls—creates signal trajectories in time and the acoustic space that are more reliably discriminable when noise hampers signal transmission and perception.18 The babbler signal pair should be maximally contrastable,19 and, for transmission efficiency, shorter codes should be applied to the more frequent signal.20 Both are true for the babbler’s AB flight and BAB prompt calls. Different corresponding sounds at each of the calls’ positions and dissimilar signal lengths result in reliably distinguishable sound combinations. The more frequent, shorter AB flight call is incessantly emitted during aerial movements, but the longer BAB prompt call is used only temporarily during periods of nestling feeding.21
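The counting argument and the positional contrast described above can be made concrete with a short sketch. It is an illustration only: the letters "A" and "B" stand in for the two babbler notes, and the function names and the simple position-by-position mismatch count are assumptions introduced here, not the discriminability measure used in the cited models.

```python
from itertools import product, zip_longest

SOUNDS = ("A", "B")  # stand-ins for the two babbler notes
MAX_LEN = 3          # longest code considered in the text


def possible_codes(sounds=SOUNDS, max_len=MAX_LEN):
    """Return every string of length 1..max_len over the given sounds."""
    return [
        "".join(combo)
        for length in range(1, max_len + 1)
        for combo in product(sounds, repeat=length)
    ]


def positional_mismatches(x, y):
    """Count aligned positions at which two codes differ.

    A position that exists in only one code counts as a mismatch,
    so differences in length are included (hypothetical measure).
    """
    return sum(a != b for a, b in zip_longest(x, y))


codes = possible_codes()
attested = {"AB", "BAB"}  # flight call and prompt call

print(len(codes))                          # 2 + 4 + 8 = 14
print(positional_mismatches("AB", "BAB"))  # differs at all 3 positions
```

Under this toy measure the two attested calls differ at every aligned position, which is one way of reading the statement that the pair is maximally contrastable.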
The babbler system’s lack of productivity does not prohibit comparison with linguistic phonology, but instead offers new perspectives on the factors that could have kickstarted the evolutionary progression of phonological systems in the absence of productivity. Specifically, the sound combinations of babblers demonstrate a very simple case in which meaningless sound elements are reused to create meaning-bearing sound combinations—a combinatorial operation that shows similarities to the way humans recombine phonemes into meaningful morphemes and words. The computational underpinnings of this operation are, at this point, unknown, but are of secondary interest at this phenomenological, descriptive level. Huybregts claims that these calls are “limited, fixed, and stimulus-bound” and potentially even “involuntarily produced and instinct-controlled,” but many birds have been shown to have voluntary control over their vocalizations, and to be able to learn to use them discriminatively in an operant setup.22
Of course, it is possible that babblers are insensitive to the calls’ combinatorial nature and perceive and store these calls as holistic units. Both child language acquisition data and computational models suggest that humanlike, productive phonology was preceded by a stage in which language users were unaware of the combinatoriality of signals. This stage occurred before the emergence of perceptual and learning strategies eventually enabled the productive exploitation of the combinatorial operation.23 The chestnut-crowned babbler research implies that no elaborate, humanlike cognitive abilities are needed for the emergence of basic combinatorial structures like that of the babblers or early human phonological systems. Instead, their emergence might be driven by universal constraints acting on signal perception, constraints present in potentially any communication system where selection has created the need for effective and reliable signal transmission and information transfer.
The differences in the extent and type of productivity and computational complexity in human and animal combinatorial systems should not discourage linguists from investigating similarities and converging features. In fact, Huybregts and others successfully and repeatedly engage in the enterprise he claims should not be possible: comparing combinatorial structures in animal and human communication systems.24 Despite their differences, there exist indisputable similarities. And it might be similarities at the simplest level that provide the deepest insights into the selective factors, universal principles, and shared environmental constraints that can drive the emergence of rudimentary combinatorial operations in human language. Huybregts’s forceful rejection of the babblers as a model system for phonological emergence in human language is unwarranted. The evidence relevant to language evolution is scarce enough even when scholars do not prematurely reject the many experiments of nature that evolution has provided, simply because they do not precisely match the details of human language.25
Sabrina Engesser & W. Tecumseh Fitch
Riny Huybregts replies:
Sabrina Engesser and W. Tecumseh Fitch fervently propose that the natural experiments that evolution has provided should not be prematurely rejected simply because there is no precise match with the details of human language. I cannot agree more. But the details in this discussion concern basic properties of human language that have no analogs elsewhere in animal communication systems.
The capacity to combine meaningless tones into meaning-bearing calls or to assemble simple meaning-bearing calls into novel, more complex meaning-bearing calls may suggest similarities to commonsense notions of language. But without proper discussion of how these notions have a clear and explicit basis in well-motivated theoretical constructs, resemblances that arise in different ways, suggesting duality of patterning, are of no consequence.26 What, then, are the basic properties of human language that define its phenotype? Combining meaningless phonemes to yield meaningful morphemes or words should have no evolutionary roots in genetic recombination simply because DNA bases coding for different amino acids or proteins would be analogous to combining speech sounds to denote different meanings.
Engesser and Fitch’s argument for finding roots of human language in animal communication systems seems to be based on the idea that, analogously to animal vocalization, language is speech with communicative use. But language is neither speech—because internal language can be spoken, signed, or haptic—nor communication—though externalized language can be used this way, or for much else. Externalization is ancillary to language, which is essentially a computational operating system. Specifically, language is an internal system for generating structured thought, constrained by computational resource restrictions for biological organisms. Speech and sign are simply input-output channels, clearly not necessary and obviously not sufficient for language. A recently explained result, structure dependence—i.e., the idea that communicative efficiency is invariably sacrificed for efficiency of computation—plays a central role in this argument.27
Language is a computational system coded in the brain, a generative procedure, that recursively generates an infinite array of hierarchically structured expressions, each expression formulating a thought, and each expression potentially externalized in some sensorimotor medium.28 This is the basic property (BP) of language. Its basic structure-building operation, binary Merge, is the simplest such recursive operation. Merge takes two elements X, Y, which may or may not have been constructed by a previous Merge application, and constructs from them a new element Z, the set {X, Y} containing just X, Y without changing or adding anything or imposing any linear order or other arrangement on them. Merge is essentially set formation. It follows that Merge-based language is a system of hierarchical structures, not a system of concatenation-based linear strings. For principled reasons, language is structure-dependent and ignores linear order. This observation is key in proving that the rewrite systems of the Chomsky hierarchy are irrelevant for explaining human language.29 Rewrite systems may work for externalized language or animal vocalizations but fail to explain structure dependence of internal language, a Merge system.
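The contrast between set formation and concatenation can be illustrated with a minimal sketch. The representation is an assumption made for illustration: lexical items are modeled as plain strings and Merge as the construction of an unordered, immutable set; the names `merge` and `depth` and the sample words are hypothetical, and the degenerate case of merging an element with itself is ignored.

```python
def merge(x, y):
    """Binary Merge sketched as set formation: build the unordered set {x, y}.

    Nothing is added to or changed in x and y, and no linear order is imposed.
    """
    return frozenset((x, y))


def depth(obj):
    """Depth of hierarchical embedding; a bare lexical item has depth 0."""
    if not isinstance(obj, frozenset):
        return 0
    return 1 + max(depth(member) for member in obj)


# The order of the arguments is irrelevant, unlike string concatenation.
assert merge("eagles", "swim") == merge("swim", "eagles")
assert "eagles" + "swim" != "swim" + "eagles"

# The output of Merge can be an input to Merge, which yields hierarchy.
phrase = merge("instinctively", "swim")
clause = merge("eagles", phrase)
print(depth(clause))  # two levels of embedding
```

The sketch captures only the two properties stressed above: the output carries no left-to-right order, and recursion produces nested structure rather than a longer string.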
Structure dependence of human language is widely misunderstood in the computational linguistic and comparative biological literature. In particular, it is misunderstood in a recent review of so-called animal syntax by Toshitaka Suzuki and Klaus Zuberbühler, two leading biologists in animal communication studies.30 The authors suggest an “interesting amendment” to the computational operation Merge of natural language that yields a hierarchy of n-merge systems (0 ≤ n ≤ 3). The apparent interest is that Merge may come in different varieties that only differ in complexity but not in kind. But the authors include no discussion of “complexity metrics” or characterization of “kind.” They provide only the stipulations:
- first, that 0-merge systems apply to meaning-bearing units but do not combine, i.e., C(Ø, α) = α;
- second, that 1-merge systems combine unit elements only, i.e., C(α, β) = α^β, though not recursively, and construct simple complexes sufficient for all animal syntax;
- third, that 2-merge systems allow recursive combination of unit elements with previously merged complexes, i.e., C(α, φ) = α^φ; and,
- finally, that 3-merge systems allow merges of two already constructed complexes, i.e., C(φ, ψ) = φ^ψ, where α, β are variable unit elements and φ, ψ are variable strings of unit elements.
In their view, animal syntax, which uses 1-merge grammars, differs from human syntax, which uses 3-merge grammars, only in the complexity of the Merge operation, possibly through an effect of memory limitations.
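The four stipulations can be written out as a small sketch. Several modeling choices here are assumptions rather than anything stated by Suzuki and Zuberbühler: unit elements are one-item tuples, complexes are tuples of two or more items, the levels are treated as cumulative, and the argument order follows the formulas exactly as listed. The name `combine` is hypothetical.

```python
def is_unit(x):
    """A unit element (α, β), modeled as a one-item tuple."""
    return isinstance(x, tuple) and len(x) == 1


def is_complex(x):
    """A previously built string of units (φ, ψ), modeled as a longer tuple."""
    return isinstance(x, tuple) and len(x) >= 2


def combine(n, left, right):
    """Apply the n-merge operation C(left, right) for 0 <= n <= 3."""
    if not 0 <= n <= 3:
        raise ValueError("n must be between 0 and 3")
    if left is None:
        # 0-merge: C(Ø, α) = α, no combination takes place.
        if is_unit(right):
            return right
        raise ValueError("C(Ø, x) is defined only for unit elements")
    allowed = (
        (n >= 1 and is_unit(left) and is_unit(right))            # C(α, β)
        or (n >= 2 and is_unit(left) and is_complex(right))      # C(α, φ)
        or (n >= 3 and is_complex(left) and is_complex(right))   # C(φ, ψ)
    )
    if not allowed:
        raise ValueError(f"{n}-merge cannot combine these arguments")
    # Every level produces its result by string concatenation.
    return left + right


pair = combine(1, ("a",), ("b",))             # ('a', 'b')
triple = combine(2, ("c",), pair)             # ('c', 'a', 'b')
both = combine(3, pair, combine(1, ("d",), ("e",)))
```

In this rendering the levels differ only in which arguments are admitted; the result at every level is an ordered string built by concatenation.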
Suzuki and Zuberbühler’s proposal is essentially a change of terminology. They rename context-free grammars or type-2 rewrite grammars to 3-merge systems, and finite state grammars or type-3 rewrite grammars to 2-merge systems. Their 1-merge systems generate the strictly finite string sets seen in the animal communication systems of Japanese tits (Parus minor), southern pied babblers (Turdoides bicolor), and possibly Campbell’s monkeys (Cercopithecus campbelli). These systems allow simple complexes of two calls only. The finite collections of simple meaning-bearing calls of vervet monkeys (Chlorocebus pygerythrus), for example, are named 0-merge systems. Suzuki and Zuberbühler assert that animal syntax requires no complexity beyond their 1-merge systems.
The amended Suzuki and Zuberbühler hierarchy simply restates part of the Chomsky hierarchy of rewrite systems.31 What the pair fail to understand is that the Merge operation of the BP is different in kind, positioning language outside the Chomsky hierarchy. The proposal is a misleading change of terminology, confusing rewrite systems with Merge systems, and fails to capture the fundamental property of structure dependence of language. There is no gradual complexity either. The recursive versus non-recursive divide, the difference between their 1-merge and 2-merge systems, is a watershed. Unbounded generation cannot be reached in steps from bounded generation. Recursion is all-or-nothing; there is no halfway recursion.32
Suzuki and Zuberbühler’s change of terminology is not innocuous. The Chomsky hierarchy is a hierarchy of classes of formal languages that are generated by rewriting systems. Specifically, all these rewrite systems use string-manipulating operations, or concatenations, that impose linear order and thus differ fundamentally from Merge systems. Merge-based systems of human language do not belong in this hierarchy since Merge, essentially a set formation operation, generates hierarchical structures without linear order. Linear order only results from the externalization operations that must satisfy conditions of sensorimotor systems that are ancillary to internal language.
Suzuki and Zuberbühler concede that their 1-merge systems—i.e., the simplest subregular finite state grammars—suffice for animal communication systems. These systems are not recursive, and they do not generate any hierarchical structure. Structure dependence is therefore ruled out on principled grounds. As a consequence, any conjunctive combinatoriality in these concatenation-based systems must be different from structure-dependent compositional meaning in Merge-based language.
There is more. Consider the proposal that the ability to recognize a non-adjacent dependency between two similar elements “represents a capability that had already evolved in humans’ last common ancestor with squirrel monkeys, and perhaps before.”33 More recently, non-adjacent dependency processing in marmosets and chimpanzees using artificial grammars was claimed to be a “crucial cognitive facilitator of language” and illustrative of an ancestral trait that evolved tens of millions of years before language arrived.34
At face value, the formal languages in these studies capture a non-adjacent dependency between similar elements at the edges of strings. The non-adjacency dependency is across an arbitrary distance in the former study—ABⁿA type expressions—and a single element in the latter—AXB and CXD type expressions, X a single element. Non-adjacency dependency may be generalized to arbitrary-distance dependency without affecting the argument—AXⁿB, CXⁿD. The argument fails, however.
First, there is only a single dependency between two positional elements—“at the edge”—across an arbitrary or unbounded number of string-based intervening elements; there is not an arbitrary or unbounded number of structural dependencies between elements in a hierarchically structured expression. The latter, not the former, is a defining property of human language.
Second, these artificial systems are generated and accepted by simple finite state grammars or automata that are part of the Chomsky hierarchy of classes of formal languages. Human language does not belong. It is a generative procedure whose basic recursive operation, Merge, is set formation. In contrast, the classes of grammars and languages that do belong are concatenation-based rewriting systems—finite state, context-free, context-sensitive, unrestricted rewriting systems—that impose linear order but fail to assign structure correctly or at all. The alleged non-adjacent dependencies in these studies are positional, not structural. There is no argument for sensitivity to structural dependence in these studies. The so-called dependencies are string-based, not structure-based. Similarities are just superficial and of no consequence, only hiding deep fundamental differences.
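That such string sets need nothing beyond finite state machinery can be shown with a brief sketch using regular expressions, which describe exactly this class of patterns when no backreferences or lookarounds are used. The letter symbols, the pattern names, and the decision to allow zero intervening elements are assumptions made for illustration, not details taken from the cited studies.

```python
import re

# ABⁿA: identical edge elements across any number of intervening B elements.
EDGE_DEPENDENCY = re.compile(r"AB*A")

# AXⁿB and CXⁿD: the first element fixes the last across any number of X.
FRAMED_DEPENDENCY = re.compile(r"(?:AX*B|CX*D)")


def accepts(pattern, string):
    """Return True if the whole string belongs to the pattern's language."""
    return pattern.fullmatch(string) is not None


print(accepts(EDGE_DEPENDENCY, "ABBBBA"))   # True
print(accepts(EDGE_DEPENDENCY, "ABBBB"))    # False
print(accepts(FRAMED_DEPENDENCY, "CXXXD"))  # True
print(accepts(FRAMED_DEPENDENCY, "AXXXD"))  # False
```

Both patterns are recognized by scanning the string from left to right with a fixed, finite number of states; no hierarchical structure is consulted at any point.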
Third, there is good reason to doubt that there is a dependency at all. The positional elements are at the edges of strings. The animals may plausibly have associated one or a limited set of stimulus tones with the beginnings and ends of input strings. This is not a dependency but just association of acoustic stimuli with edges. The animals only learned to distinguish beginning from end, which is a capacity shared with many other animals and even with heliotropic plants.35 Here as well, superficial similarities only conceal different underlying mechanisms. The mechanisms involved, auditory perception systems versus circadian regulation of directional growth pathways, clearly differ.
Fourth, it is instructive to see how the argument fails at even simpler levels of abstraction. The argument for non-adjacent dependencies was motivated by reference to vowel harmony in Hungarian, where word-initial and word-final vowels harmonize across intervening neutral stem vowels; that is, they are either back or front, and are either rounded or unrounded for front vowels. Again, appearances are deceptive. Vowel harmony dependencies are structural, not positional. Vowel harmony is not string-based but satisfies strict cyclicity, relying on hierarchical structure, working its way up from stem to a succession of adjoined suffixes, for example, plural and accusative case suffixes, as in [[[könyv]-ek]-et] (“books”) versus [[[város]-ok]-at] (“cities”). Unrounded non-low front vowels are neutral precisely because back vowel counterparts are missing from the language. In fact, neutral vowels cannot always be skipped. There is füzet-ek (“notebooks”) rather than *füzet-ök, evidence that front vowels harmonize for roundness on the basis of the last stem vowel. There is an explanation for that. The stem-final syllable is not neutral to harmonic rounding precisely because Hungarian does have rounded as well as unrounded front vowels.
Non-adjacent dependencies are linear-based and belong to rewrite systems. They are not structure-based, and thus fall within the Chomsky hierarchy and outside Merge-based language. For that reason alone, they are disqualified from comparison. Rather than showing that animal communication systems have properties of rudimentary human language, proof of sensitivity to positional dependencies would provide strong evidence that animals possess the capacity for “impossible languages.”36
Once the true nature of the BP is made explicit, two results immediately follow. First, compositionality in animal communication systems, assuming it exists, has none of the defining properties of compositional meaning in human language. It therefore cannot be an evolutionary antecedent or model for language. Second, the positional effects and reuse of tones in animal vocalizations are localized in the sensorimotor systems for externalization, the part that is ancillary to human language, and, therefore, even if there is an analogy or convergence, it is of peripheral interest only.
There is no implication of any “intrinsic link between language’s two combinatorial layers,” a position wrongly ascribed to me. In fact, the two combinatorial layers of duality of patterning must be independent and asymmetric systems—the BP vs. externalization. This result is an important discovery of generative research about human language, and was recently explained as a consequence of the way human language is organized.37
Compositional meaning depends on linguistic structure. Structure dependence is the first genuine Merge-based explanation of a fundamental property of language.38 The simplest structure-building operation Merge explains that internal language must be blind to the linear order or similar external arrangements that only result from the externalization of Merge-based constructs in some sensorimotor modality. Compositional meaning of human language is constructed from hierarchically structured labeled expressions grounded in internally constructed meaning-bearing elements that may be used to refer to elements of the human internal and external world.39
To illustrate hierarchical structure, consider the expression Eagles that fly instinctively swim, which is ambiguous between [[Eagles that fly instinctively] swim] and [[Eagles that fly] instinctively swim]. The adverb is associated with “fly” in the former but with “swim” in the latter. The ambiguity disappears when the adverb is displaced: Instinctively, eagles that fly swim. The adverb modifies the linearly more distant verb “swim” rather than the more proximate verb “fly,” a direct consequence of structure dependence and unexplained under linear proximity.40 Projectability of labels may be illustrated with the ambiguous sentence Visiting relatives can be boring. “Visiting” can be labeled either a verbal noun or a verbal adjective. Whichever it is yields different interpretations on a par with “visiting these relatives is boring” or “these visiting relatives are boring.” An example of mind-internal lexical formatives showing figure-ground reversal may be John walked through the door he had painted earlier. Here the same word is used in two different senses, to mean both a physical object and an aperture, and these senses can be combined to yield a semantically well-formed expression.41
Animal communication systems do not possess any of these elements. In particular, animal communication systems do not possess Merge, the operation that yields hierarchical structure and seems to be uniquely human and uniquely linguistic. As Suzuki and Zuberbühler admit, animal syntax permits the combination of just one call with another and, therefore, is not hierarchical. Consequently, any conjunctive combinatoriality in animal communication systems must be based on deeply different organizational principles that have no analogs in the structure-dependent compositionality of language.42 The compositionality of animal communication systems is finite, linearly structured, concatenation-based, and non-recursive. Human language is none of these. It is unbounded, hierarchically structured, Merge-based, and recursive. The resemblance with linguistic compositionality is only superficial and without consequence. Furthermore, Merge-based hierarchical systems of the BP could not have evolved from string-based concatenation systems. Consequently, structure dependence and, therefore, the compositional meaning of human language can have no evolutionary roots in animal communication systems, which only impose linear order on vocal elements.
Consider combinatoriality in the vocal systems of nonhuman animals. Engesser and Fitch argue that the AB flight and BAB prompt calls of babblers are meaningful calls that are composed of the shared tones A and B. English speakers similarly combine /oʊ/ and /k/ in “oak” and “coke” to yield minimal meaning-bearing words. Engesser and Fitch suggest that the primitive combinatorial properties of babbler calls show convergence with the much richer phonological systems of human language. Perhaps. At this level of generality, anything goes. Does one swallow make a summer? Furthermore, babblers distinguish a call BAB from CAB or simply AB. This resembles the ability of humans to recognize minimal pairs, such as “table” as distinct from “cable” or “able.” But it is only the ability to recognize when one sound comes before another; it has little to do with real phonology.
Engesser and Fitch correctly observe that many birds have “voluntary control over their vocalizations.” But the references that Engesser and Fitch rely on to establish voluntary control argue that this is a specific capacity of songbirds, such as zebra finches and Bengalese finches. In songbird vocalizations, we find the same tones repeated in complex songs that communicate territoriality or emotive state but convey no specific meanings. These birdsong sound systems may show complex linear structure, sometimes with loops inside loops, and can be characterized as a k-reversible finite state transition network.43 Animal communication systems apparently work differently. They have no, or only modest, combinatory properties that weakly suggest compositional meaning, are strictly finite systems, and have the simplest subregular grammars. Since songbird vocalizations are under voluntary cognitive control, and may yield quite elaborate songs, we have rich empirical evidence of seemingly combinatory processes that do not result in any compositional meaning. Taking Engesser and Fitch’s perspective seriously, we might sensibly ask: Why, then, wouldn’t compositional meaning have started in some species of songbirds? There would have been plenty of time for evolution. It never happened. It is a fair question to ask. But from our perspective, the question is simply misleading and incoherent. There is a genuine explanation. Only the BP of language, which is missing from animal communication systems, enables the Merge-generated hierarchical structure of internal language that is necessary for compositional meaning.
Paraphrasing Marc Hauser, in animal brains, sensorimotor and cognitive neural systems are “locked in place and cannot interact freely.”44 In contrast, in human brains, a preexisting ancient sensorimotor system recently and suddenly came to be linked to a biologically isolated computational system for generating complex thought, constrained by resource restrictions that hold generally for biological organisms. This is an evolutionary accident that may have happened successfully only once. Hence, the appropriate and principled question, Why only us?45
The tone combinations in the babbler call system are stimulus-bound, limited, fixed, and restricted to two distinctive calls only. There is no evidence of voluntary control. These calls are instinct-controlled and need not signal the combination of meaningless sounds into meaning-bearing calls, a productive property of human language. By restricting “voluntary control” to just two tones, and optimizing transmission efficacy, more distinctive combinations would have been possible. In fact, BA and ABA would have been equally distinctive calls, and could have been added to the AB flight and BAB prompt calls. But nothing of the sort happened. Engesser and Fitch suggest that babblers may be “insensitive to the calls’ combinatorial nature,” perceiving and storing these as holistic units. But in a self-defeating argument, they suggest that there may have been a holistic stage for AB and BAB preceding an analytic stage that would enable “the productive exploitation of the combinatorial operation.” The segmentation process does not add anything. It did not lead to more combinations. Prior to segmentation, by definition, duality of patterning is not functional. The meaningless tones that combine into meaningful calls had not yet been segmented. After segmentation, apparently, there has been no use of further recombination. What then could have been the convergent evolutionary drivers for the emergence of combinatorial abilities?
Where are phonological systems localized? In internal language or in externalization? Suppose phonology is localized in internal language and used in externalization, an amalgam of internal language and sensorimotor systems external to language. We think in externalized language (“inner language”), and we determine that “table” rhymes with “cable,” forming a minimal pair, without needing to pronounce them. The elements that narrow phonology yields as outputs (phones/features) are mind-internal objects, which may be spelled out (and interpreted) by articulatory systems (and perceptual systems).46 Animals do not have inner language. Therefore, the alleged repositioning of tones in animal communication systems could not have been a factor in the evolution of language. These systems are a case of involuntary externalization without internal language.
Assume then that the locus of phonology is externalization itself. While this position allows combinatorial phonology to have roots in animal communication systems, it would also assign such systems a peripheral role for evolution of language, namely in the processes that are external to language, essentially an internal system coded in the brains of human individuals. It also raises serious problems. Phonology applies computational operations to hierarchical structure, such as cyclic stress assignment, syllabification into suffixes, vowel reduction, and vowel harmony.47 It yields mind-internal objects with properties that are natural for internal language, but not for externalization systems. It is not logical to conclude from this that babbler vocalization is based on principles of human phonology.
Finally, Engesser and Fitch assert that lack of coarticulation has not been demonstrated experimentally. But that does not seem entirely correct. Formant transitions characteristic of human speech are absent from babbler calls as the experiments of Engesser and colleagues already make clear.48 Permutation of tone A of the prompt call BAB and tone A of the flight call AB did not cause any discriminatory confusion, in contrast to analogous permutations in human phonological systems. Apparently, there is no coarticulation yielding formant transitions in animal vocalization systems.
Nevertheless, human speech and animal vocalizations are naturally expected to share some properties. The analogies and convergences that have been discussed in the literature concern aspects of externalization only. In fact, as mentioned in the original article, comparative biology has been remarkably successful in uncovering significant similarities between birdsong and human speech at the behavioral, neural, genomic, and cognitive levels that relate mainly to the manner of their externalization. In particular, the sensorimotor systems for producing language or birdsong require similar linear arrangements of differently organized structures, and they show convergent neurogenetic organization for brain regions involved in auditory–vocal imitation learning, perception, and production.49 No premature rejection here.