Reconstructing Ancestral Proteins

Nelson, Chase

Members of a family resemble one another. Such similarities are most often noticed in contemporaries, as when siblings have the same smile. But they also provide insight about individuals long departed. Absent a historical record, the best bet for inferring the traits possessed by a forebear is to examine all their descendants in the present. It matters neither how long ago the ancestor lived nor how copious their progeny; if 90% of the offspring are tall, it seems reasonable to suppose the primogenitor was too. As the aphorism suggests, the present is the key to the past.

What is true of people is also true of proteins, and for the same reasons. Proteins are molecular machines built from strings of amino acids, typically a few hundred residues in length, chained together with covalent bonds like beads on a necklace. Most organisms employ an alphabet of twenty amino acids in their proteins, with an overall function defined by their precise order—leucine at position 5, lysine at position 26. If two proteins contain identical amino acids at a sufficient number of positions, common ancestry can be inferred and they can be grouped into a family. If every currently known family member contains, say, lysine at position 5, chances are high that the same was true of the ancestral protein.
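
The reasoning in the last sentence can be made concrete with a small sketch. What follows is a simple majority-rule consensus over an alignment, not the phylogeny-aware maximum likelihood reconstruction used in the studies discussed later. The function name, the 90% threshold, and the toy sequences are all assumptions introduced for illustration.

```python
from collections import Counter


def column_consensus(aligned_seqs, min_fraction=0.9):
    """Majority-rule guess at an ancestral sequence (illustrative only).

    aligned_seqs: equal-length strings, one per family member.
    A position is reported as 'X' when no single residue reaches
    min_fraction of the family.
    """
    if not aligned_seqs:
        raise ValueError("need at least one sequence")
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("sequences must be aligned to equal length")

    consensus = []
    for column in zip(*aligned_seqs):
        residue, count = Counter(column).most_common(1)[0]
        consensus.append(residue if count / len(column) >= min_fraction else "X")
    return "".join(consensus)


# Hypothetical toy family; positions 1, 4, and 5 are identical in every member.
family = ["MKTALQ", "MKSALQ", "MRTALQ", "MKTALE"]
print(column_consensus(family))                     # strict 90% threshold
print(column_consensus(family, min_fraction=0.75))  # looser threshold
```

Only the invariant positions survive the strict threshold, which is the sense in which unanimity among descendants makes the ancestral state a safe bet. A real reconstruction weights each descendant by its place on the evolutionary tree rather than counting votes.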

But proteins aren’t always what they used to be. This is true in the sense that one amino acid occasionally gets swapped for another over the course of evolution. But it is also true in terms of general properties. Each amino acid has its own size, hydrophobicity, and charge. As a result, neighboring residues along the chain undergo biophysical interactions, giving rise to a series of twists and turns. The chain progressively folds into a particular three-dimensional shape and thereby assumes a particular function. Swapping an amino acid can disturb this shape and thereby alter the function, most commonly by changing the suite of molecules with which the protein has the ability to interact.

Inferring the exact amino acids in ancient proteins has long been a topic of interest among theoreticians. Writing in 1963, Linus Pauling and Emile Zuckerkandl introduced the concept of chemical paleogenetics. They reasoned that sequencing enough proteins in the present could allow sufficient numbers to be grouped into families, thus allowing their extinct protein ancestors to be inferred and synthesized in the lab.¹ These proteins could then be used to test a wide range of evolutionary hypotheses about ancient organisms, even those that left no trace in the fossil record. The field blossomed in the 1980s with leaders such as Steven Benner, followed later by Gina Cannarozzi, Betül Kaçar, Joseph Thornton, and many others.

An influential example of ancestral protein reconstruction, conducted by Eric Gaucher and colleagues, involved bacterial elongation factors of the Tu family.² Focusing on bacteria that are currently mesophilic—organisms that grow optimally at temperatures between 20 and 40°C—these researchers found the ancestral protein to function optimally between 55 and 65°C, suggesting that the ancestor was instead thermophilic. This example highlights the necessity of protein reconstruction: it may not be enough to examine the average properties of a protein’s modern descendants. The ancestral molecules themselves must be constructed.

Other applications of ancestral protein reconstruction have involved testing general hypotheses about genome evolution and evaluating specific gene functions to gain a better understanding of the connection between genotype and phenotype.³ Proteins can even be compared directly to their ancestors over an evolutionary tree, circumventing the historical, and therefore statistical, dependencies that have long plagued comparisons among modern forms alone.⁴

Protein Specificity

Given that proteins are associated with repertoires of target molecules, a particular protein can be categorized as relatively specific, having few targets, or general, having many targets. From a theoretical standpoint, the existence of an overall historical trend in the direction of, say, general-to-specific could allow a principle of protein evolution to be established. Indeed, Roy Jensen proposed just such a principle in 1976, reasoning that early cells must have possessed a small collection of genes encoding proteins with relatively general activities.⁵

Protein specificity may seem a mere theoretical curiosity, but it is perhaps the most far-reaching line of research to spring from ancestral protein reconstruction and for one lucrative reason: protein engineering. A protein binding its target is not the end but the beginning of its functional story. The binding of a single molecule might set off a change in the protein’s conformation that ultimately stretches the target until it rips in two. Or, the binding of multiple molecules might stimulate the protein to smash them together until a new product is forged. In either case, the binding of targets by proteins speeds up perhaps every chemical reaction carried out by the cell.

If ancestral proteins tended to be more general than their modern forms, they would have been capable of binding a wider range of targets and driving a greater diversity of potential reactions. They might also have exhibited other valuable features, including increased stability at high temperatures and the ability to fold without the aid of modern so-called chaperone proteins.⁶ In such cases, efforts to engineer new protein applications would be much better spent investigating variations on ancestral proteins instead of their more limited descendants. There have indeed been glimmers that ancestral proteins were more general.⁷

The S100 Family

In a 2018 study, a team including evolutionary biochemists Lucas Wheeler and Michael Harms addressed the question of protein generality by focusing on the S100 protein family, so named due to their solubility in 100% saturated ammonium sulfate.⁸ This family comprises just over twenty members in the human genome, most of them encoded by a cluster of genes located on the long arm of chromosome 1. For the vast majority of S100 proteins, the binding of Ca²⁺ ions transiently exposes a hydrophobic binding pocket which, if occupied by a well-fitting target molecule, forms a stable structure that holds both the Ca²⁺ and the target in place.⁹ Members of the S100 protein family function as calcium sensors, maintaining homeostasis by responding to intracellular Ca²⁺ fluctuations in real time. Many play key roles in cell replication and immune detection.¹⁰

Wheeler and colleagues focused on just two S100 family members: S100A5 and S100A6. This pair is thought to have resulted from the duplication of a single ancestral gene, dubbed ancA5/A6, around 318 million years ago in the common ancestor of mammals, birds, and reptiles.¹¹ The initial result was two identical and therefore redundant copies, but such tandem gene duplication events eventually allow each descendant gene to specialize in just a subset of the ancestor’s tasks.¹² Both S100A5 and S100A6 are known to be overexpressed in certain cancers and to bind some of the same targets, but have slightly different, if poorly defined, functions.¹³

S100 is a particularly well-chosen family for investigating the evolution of protein specificity: this family binds a set of targets so diverse that no one pattern or motif can be used to summarize them. It can at least be said that the targets are typically short protein fragments, or peptides, twelve or more amino acids in length and occurring within larger proteins. Binding probably occurs through a combination of shape complementarity and hydrophobic interactions.¹⁴ Known peptide targets include regions of the proteins sodium/calcium exchanger 1 (NCX1), Siah-interacting protein (SIP), and two commercially available peptides. Focusing on just these four targets, the team first determined which were bound by each human S100 copy: S100A5 bound the two commercial peptides and NCX1, but not SIP, while S100A6 bound only one commercial peptide and SIP, but not NCX1 or the other commercial peptide. The team then used maximum likelihood methods to reconstruct the ancestral ancA5/A6 protein and assess its own binding: remarkably, all four targets were bound by the ancestor.¹⁵ It seemed that in the time since their duplication from ancA5/A6, S100A5 and S100A6 had partitioned binding partners as one might split belongings after a divorce. What was one had become two, and neither quite amounted to what it once was. These results hinted at a trend toward increasingly specific proteins in evolution.

Sequence Space

Despite these results, a sample size of four targets was a limited basis on which to draw any conclusions. As the authors themselves cautioned,

[I]t could be that the proteins both acquired more peptides that we did not sample … while becoming more specific for the chosen set of targets. … Particularly given the large number of targets for these proteins, distinguishing these possibilities will require an unbiased, high-through[p]ut approach to measuring specificity.¹⁶

In a new study, Wheeler and Harms increase the sample size of targets from four to approximately 100,000.¹⁷ To do this, they first note that S100 proteins might conceivably bind peptide targets not produced by any present-day organism. Thus, rather than limiting themselves to known interaction partners, they examine a random sample drawn from all 20¹² (that is, 4.096 × 10¹⁵) possible peptides that are twelve amino acids in length. Their final sample is a commercial library of some 10⁹ unique peptides.
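
The size of this sequence space, and how thinly even a large library samples it, can be checked with a few lines of arithmetic. The sketch below takes the library size as exactly 10⁹ unique peptides, which is an assumption: the text gives only an order of magnitude.

```python
ALPHABET_SIZE = 20      # amino acids
PEPTIDE_LENGTH = 12     # residues per peptide
LIBRARY_SIZE = 10**9    # assumed exact; the text says "some 10^9"

sequence_space = ALPHABET_SIZE ** PEPTIDE_LENGTH
fraction_sampled = LIBRARY_SIZE / sequence_space

print(f"possible 12-residue peptides: {sequence_space:.3e}")
print(f"fraction covered by the library: {fraction_sampled:.2e}")
```

Under that assumption the library covers roughly one peptide in every four million, so the sample is best thought of as a random draw from sequence space rather than a survey of it.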

Not every one of the 10⁹ potential targets is actually bound by S100A5 or S100A6. To home in on those that are, the authors turn to quantitative phage display, in which gene fragments encoding random peptide targets are inserted into the surface protein gene of bacteriophages. Each phage expresses one target. The whole pool of phages is then mixed with S100 to allow binding, after which the S100 is isolated. Those phages expressing a target bound by an S100 will remain stuck, while the others are washed away. Finally, the genes of the bound phages are sequenced en masse to identify which 12-amino-acid peptides were bound. Another experiment is run in parallel with a competitor peptide already known to occupy S100’s binding pocket, allowing the results to be normalized in the event that some of the targets bind indiscriminately to other parts of the S100s.
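
One way to picture the normalization step is as a comparison of peptide frequencies between the two sequencing pools. The sketch below is not the authors' pipeline: the log2 frequency ratio, the pseudocount, the function name, and the two peptide sequences are all assumptions chosen to illustrate the idea.

```python
import math
from collections import Counter


def log2_enrichment(reads_plain, reads_blocked, pseudocount=1.0):
    """Score peptides by log2(frequency without competitor / frequency with).

    reads_plain:   peptide reads recovered without the competitor peptide.
    reads_blocked: peptide reads recovered with the competitor present.
    The pseudocount keeps peptides absent from one pool from dividing by zero.
    """
    plain, blocked = Counter(reads_plain), Counter(reads_blocked)
    peptides = set(plain) | set(blocked)
    plain_total = sum(plain.values()) + pseudocount * len(peptides)
    blocked_total = sum(blocked.values()) + pseudocount * len(peptides)

    scores = {}
    for peptide in peptides:
        freq_plain = (plain[peptide] + pseudocount) / plain_total
        freq_blocked = (blocked[peptide] + pseudocount) / blocked_total
        scores[peptide] = math.log2(freq_plain / freq_blocked)
    return scores


# Hypothetical 12-residue peptides, invented for this example.
POCKET_BINDER = "LKWFAHRDELSG"   # displaced when the competitor fills the pocket
STICKY_BINDER = "GSGSGSKKKRRR"   # clings elsewhere, unaffected by the competitor

scores = log2_enrichment(
    reads_plain=[POCKET_BINDER] * 7 + [STICKY_BINDER] * 1,
    reads_blocked=[POCKET_BINDER] * 1 + [STICKY_BINDER] * 7,
)
for peptide, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(peptide, round(score, 2))
```

A peptide that binds in the pocket loses ground when the competitor is present and so scores high, whereas one that sticks elsewhere does not. Because the scores are relative frequencies, an indiscriminate binder can even score below zero in this toy setup.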

The results are something of a surprise: instead of specificity increasing as targets are lost, the authors observe a pattern of shifting specificity. S100A5 and S100A6 still bind just a subset of the ancestor’s targets; but, when assaying this large sample, it becomes clear that the loss of old targets is offset by a gain of new targets. The S100s have not, in fact, become more specific than their ancestor. Using alternative ancestors, less likely but still plausible, yields generally consistent results.¹⁸
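
The difference between narrowing and shifting comes down to whether lost targets are matched by gained ones. A minimal sketch follows, with target sets treated as plain collections of labels. The function, the category names, and the target labels are hypothetical, and real binding is graded rather than all-or-nothing.

```python
def compare_target_sets(ancestor_targets, descendant_targets):
    """Classify how a descendant's target repertoire differs from its ancestor's."""
    ancestor = set(ancestor_targets)
    descendant = set(descendant_targets)
    lost = ancestor - descendant
    gained = descendant - ancestor

    if lost and gained:
        pattern = "shifted"      # old targets dropped, new ones picked up
    elif lost:
        pattern = "narrowed"     # strictly fewer targets: more specific
    elif gained:
        pattern = "broadened"    # strictly more targets: more general
    else:
        pattern = "unchanged"

    return {
        "retained": sorted(ancestor & descendant),
        "lost": sorted(lost),
        "gained": sorted(gained),
        "pattern": pattern,
    }


# Hypothetical target labels, not peptides from the study.
ancestor = {"t1", "t2", "t3", "t4"}
small_sample_view = compare_target_sets(ancestor, {"t3", "t4"})
large_sample_view = compare_target_sets(ancestor, {"t3", "t4", "t5", "t6"})
print(small_sample_view["pattern"], large_sample_view["pattern"])
```

If only targets known to bind the ancestor are assayed, gains such as t5 and t6 are invisible by construction, and a shift is indistinguishable from a narrowing.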

Limiting studies to a handful of known interaction partners had given the wrong picture. To more accurately characterize a protein’s intrinsic specificity, unbiased sampling of targets—including those not known in modern organisms—is necessary.

Protein Evolution

The study of molecular evolution seeks to reveal how the building blocks of organisms have changed over time, both in pattern and in process. Wheeler and Harms have charted new territory in describing the pattern of evolution at the protein level, albeit for a single pair of closely related proteins. Can any implications be drawn from their work about what drives protein evolution?

One of the most impressive answers to this question comes from a 2008 study by Shozo Yokoyama and colleagues.¹⁹ These researchers resurrected the protein ancestors and intermediates of thirty-eight modern-day rhodopsins, visual pigments used by vertebrates to see dim light. Such light is most prevalent at dusk and has a wavelength between 400 and 500 nanometers (nm), with shorter wavelengths penetrating to deep-sea environments. Because the wavelength of maximal absorption, λ_max, can be easily assayed in the lab, a rhodopsin’s biological function is straightforward to assess quantitatively. As might be expected, deep-sea fish tend to have rhodopsins with a lower λ_max than shallow-water fish and terrestrial animals, ~480 nm versus ~500 nm.²⁰ The researchers therefore sought to characterize each amino acid change leading to these different λ_max values.

Across the rhodopsin’s length of 354 amino acids, 203 positions underwent an amino acid change but only twelve of these affected λ_max. The team’s more recent work on visual pigments yields similar results: “[T]he effects of the small proportion of adaptive sites on the evolutionary rates are buried among those of the neutral changes.”²¹ Most amino acid changes that are accepted during visual pigment evolution have little or no discernible effect on function. Statistical methods of sequence analysis generally fail to predict those that do.²² With notable exceptions, the same is true of most proteins—neutral changes dominate.²³

The researchers made another stunning observation: similar shifts in λ_max and specific amino acid changes both recurred multiple times in evolution. The independent recurrence of the same evolutionary substitutions could result from at least two mechanisms. In the first, all mutations occur at similar rates, but those causing functional differences are promoted by natural selection. In the second, specific mutations recur at disproportionately high rates. Although selection is sure to have played an important role in protein evolution, the fact that specific mutations recur, sometimes multiple times, points to the contribution of the second option: there is a bias in the mutational input itself.

Take the most important example in mammals: the CpG dinucleotide, that is, a C followed by a G on the same DNA strand. In most of the genome, C → T mutation rates are ten times higher at CpG sites than at Cs in other contexts.²⁴ Because such mutations are more likely to occur, they are also more likely to recur, substantially biasing the pool of variation upon which selection can act. Although the C → T mutation rate is not quite so high in protein-coding genes, it is still elevated. Given that new amino acids arise due to changes in the underlying DNA, it is conceivable that the known CpG bias could give rise to a bias in protein evolution. The DNA triplet ACG, for example, encodes the amino acid threonine. But because ACG contains a CpG dinucleotide, it is subject to an elevated rate of C → T transitions resulting in ATG, which encodes methionine. The extent to which protein evolution is driven by directionality in the mutational input itself remains to be systematically evaluated, but work by Jay Storz and colleagues suggests it is substantial.²⁵
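
The threonine-to-methionine example generalizes to every codon that contains a CpG. The sketch below enumerates them using the standard genetic code. It makes two simplifying assumptions: it considers only CpGs that lie wholly within a codon, ignoring those spanning two adjacent codons, and it considers only C → T changes on the coding strand, ignoring the G → A changes that arise from the opposite strand.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def cpg_transitions(codon):
    """Return (mutant_codon, old_aa, new_aa) for C->T at each CpG in the codon."""
    results = []
    for i in range(2):  # a CpG can start at codon position 1 or 2
        if codon[i:i + 2] == "CG":
            mutant = codon[:i] + "T" + codon[i + 1:]
            results.append((mutant, CODON_TABLE[codon], CODON_TABLE[mutant]))
    return results


print(cpg_transitions("ACG"))  # the example from the text

for codon in sorted(CODON_TABLE):
    for mutant, old_aa, new_aa in cpg_transitions(codon):
        print(f"{codon} -> {mutant}: {old_aa} -> {new_aa}")
```

Amino acids are given in one-letter code, with `*` marking a stop codon. Every one of the eight CpG-containing codons changes its encoded amino acid under this mutation, which is why a bias at the DNA level could plausibly carry through to the protein level.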

Global Trends

Ancestral proteins can only be inferred for modern proteins similar enough to be grouped into families, of which S100s are one example. This leaves out deeper protein relationships between families. As a rule of thumb, a pair of proteins matching at fewer than ~30% of their positions cannot be confidently aligned. This is because such levels of similarity are likely due to chance alone.²⁶ As a result, it is only possible to scratch the surface of evolutionary history—only those proteins which diverged relatively recently remain similar enough to compare with confidence. The deepest questions about the origins of novel gene families remain shrouded in mystery.
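
The rule of thumb depends on a simple quantity, the percentage of aligned positions at which two proteins match. A minimal sketch of that calculation follows. It assumes the two sequences are already aligned, with `-` marking gaps, and it excludes gapped columns from the denominator. Other conventions for the denominator exist, and the toy sequences are invented.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of ungapped aligned columns at which two sequences match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to equal length")

    compared = 0
    matches = 0
    for residue_a, residue_b in zip(aligned_a, aligned_b):
        if residue_a == gap or residue_b == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        if residue_a == residue_b:
            matches += 1

    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


ALIGNMENT_THRESHOLD = 30.0  # the approximate cutoff quoted in the text

# Hypothetical aligned pair: 5 matches over 7 ungapped columns.
identity = percent_identity("MKTALQ-E", "MRTAIQGE")
print(f"{identity:.1f}% identical; above threshold: {identity >= ALIGNMENT_THRESHOLD}")
```

The figure is only meaningful once an alignment exists, which is the circularity behind the rule of thumb: below the threshold, the alignment itself cannot be trusted.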

This includes protein specificity. The ancestor of S100A5 and S100A6 may not have been more general than its descendants—but is the same true of other proteins and protein families? If so, how far back can this trend, or lack thereof, be extrapolated? Which targets were actually present in their environments? How did the primordial archetypes which gave rise to the modern protein families evolve, and had their own ancestors been more general? And, are more general proteins easier to chance upon in sequence space, that is, could they be reasonably expected to have arisen as evolutionary starting points?

To answer these questions, a lot more work just like that of Wheeler and Harms will need to be done.²⁷
