In a publication titled “Russian Doll Genes and Complex Chromosome Rearrangements in Oxytricha trifallax,” Jasper Braun et al. explore what they describe as “architectures that transcend simple twists and turns of the DNA.” The paper is short, dry, clear, and interesting.
Oxytricha trifallax is a unicellular eukaryotic species and a ciliate, one widely known for beautiful, but bizarre, genetic acrobatics. Members of O. trifallax possess two nuclei within their single-cell frame. The two nuclei are analogous to the diploid somatic cells and the haploid reproductive cells found in multicellular eukaryotes. In ciliates, the physically larger nucleus is called the macronucleus; the smaller one, the micronucleus. Chromosomes in the macronucleus are accessed for ordinary biochemical affairs. DNA in the micronucleus is involved only in reproduction.
Under ordinary circumstances, O. trifallax reproduces asexually by cloning. Under conditions of stress, one cell meets another in sexual conjugation. What is odd is that, in O. trifallax, all conjugal events begin and end with exactly two individuals. Each cell exchanges 50 percent of its micronuclear DNA. Both leave transformed. After O. trifallax recombines and dissociates from its partner, it goes on increasing its numbers by cloning.
The macro- and micronucleus within O. trifallax differ in the quantity of sequence information they contain, and display a different ordering of genes and gene pieces. During sexual conjugation, the old macronucleus disintegrates. After conjugation, DNA from the micronucleus is copied and amplified. The modified copy forms a new macronucleus. This is no trivial process. Perhaps 90–95 percent of the DNA sequence information in the micronucleus is expunged at precise locations as a new macronucleus is formed.
The macronucleus is physically bigger than the micronucleus, yet it holds less sequence information. It is bigger because what little sequence information survives the purge from the micro- to the macronucleus is heavily amplified. The ensuing macronucleus is poor in information, but rich in copies of itself. The organizations of the two nuclei are dissimilar. The micronucleus has roughly one hundred long chromosomes. Each houses many genes and gene fragments. The macronucleus has on the order of twenty thousand distinct nanochromosomes, each present in something like a thousand copies. Each nanochromosome is small by familiar standards. Containing between one and eight genes, each is provided with regulatory elements and telomeres, and is, on average, just three thousand nucleotides long. Nanochromosomes form when the chromosomes of the micronucleus are broken into shards, and then stitched back together and amplified.
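The contrast between physical size and information content can be made concrete with back-of-the-envelope arithmetic. The sketch below uses only the round figures quoted above; the variable names are illustrative, and the results are order-of-magnitude estimates rather than measured values.

```python
# Round figures quoted in the review; all are order-of-magnitude estimates.
distinct_nanochromosomes = 20_000   # "on the order of twenty thousand"
copies_each = 1_000                 # "something like a thousand copies"
mean_length_nt = 3_000              # "on average, just three thousand nucleotides"

# Unique sequence information: each distinct nanochromosome counted once.
unique_nt = distinct_nanochromosomes * mean_length_nt

# Physical DNA content: every copy of every nanochromosome counted.
total_nt = unique_nt * copies_each

print(f"unique sequence: {unique_nt:.1e} nt")
print(f"total DNA:       {total_nt:.1e} nt")
```

On these figures the macronucleus carries roughly a thousand times more DNA than it has distinct sequence, which is the sense in which it is poor in information but rich in copies.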
Gene ordering is frequently different in micro- and macronuclei. If DNA segments in the micronucleus are out of order, there must be a sophisticated process whereby the cell snips out the intervening sequence and puts the segments in the correct order. A macronuclear destined sequence (MDS) is a DNA fragment of the micronucleus that joins the macronucleus in a nanochromosome. An internally eliminated sequence (IES) is a DNA segment that lies, in the micronucleus, between two MDSs destined for a given nanochromosome. IESs are snipped out from between MDSs, and the MDSs are then joined together.
How is the correct order regained in the macronucleus? The cell solves this problem in a manner akin to how computers store and retrieve files. There are short nucleotide pointers at the beginning, the end, or both ends of each MDS.[1] The cell recognizes these pointers chemically and uses them for correct gene placement. Specialized RNA molecules, already present in the cell, help align each MDS with the next appropriate one. One nucleotide pointer is attached to the end of a particular MDS. An identical copy is attached to the beginning of the next MDS. The cell matches these twins together, although apparently only one survives into the macronucleus. Since each pointer occurs twice, Braun et al. refer to the resulting lists of pointers as double occurrence words (DOWs). Pointers and the MDSs to which they attach are named according to their position in the nanochromosome.
Suppose a macronuclear nanochromosome encodes a single gene, and that this nanochromosome is formed from three separate gene fragments. Let these three fragments be designated M1, M2, and M3. In the micronucleus, each MDS has at least one attached pointer, and in many cases two. In the micronucleus, MDS order may be scrambled relative to the macronucleus. Instead of M1, M2, M3, the micronuclear ordering might be M3, M1, M2. Pointers, or words, at the beginning and end of each MDS serve to guide the cell. Writing each pointer in square brackets on the side of the MDS where it sits, the micronuclear arrangement is [2]M3, M1[1], [1]M2[2]. The correct order is obtained when identical pointer numbers are joined consecutively: M1[1], [1]M2[2], [2]M3, yielding M1, M2, M3. This is the final order in the macronucleus.
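The pointer-matching step in this three-fragment example can be sketched as a walk along a linked list. This is only an illustration of the bookkeeping, not a model of the cell's chemistry: the `MDS` record and the `assemble` function are hypothetical names, inversions are ignored, and every pointer is assumed to occur exactly twice.

```python
from typing import NamedTuple, Optional

class MDS(NamedTuple):
    name: str
    head: Optional[int]  # pointer at the beginning; None for the first MDS
    tail: Optional[int]  # pointer at the end; None for the last MDS

def assemble(micronuclear: list[MDS]) -> list[str]:
    """Chain MDSs by matching each tail pointer to the identical head pointer."""
    by_head = {m.head: m for m in micronuclear}
    current = by_head[None]          # the MDS with no head pointer starts the chain
    order = [current.name]
    while current.tail is not None:  # follow pointers until the last MDS
        current = by_head[current.tail]
        order.append(current.name)
    return order

# Micronuclear order M3, M1, M2 with the pointers from the example.
scrambled = [MDS("M3", 2, None), MDS("M1", None, 1), MDS("M2", 1, 2)]
print(assemble(scrambled))  # ['M1', 'M2', 'M3']
```

Indexing the fragments by their leading pointer makes each lookup a single dictionary access, which mirrors the file-retrieval analogy used above.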
In the micronucleus of O. trifallax there are genes within other genes. One gene may open to contain another, and that one may open to contain another. Suppose a given nanochromosome is composed of M1 and M2. Now suppose there is another, different nanochromosome, M*, composed of just one MDS. In the micronucleus, their order could be arranged in a number of ways. One possibility is M1, M*, M2. The intervening sequence between M1 and M2 is snipped. This excised sequence itself contains an MDS for a different nanochromosome. This the cell recognizes, preserving the MDS sandwiched in the middle.
This arrangement allows for varying degrees of complexity. In the micronucleus, the complete set of pieces comprising one gene may exist between two MDSs for another gene. In turn, all of the MDSs for this gene may be contained between two MDSs for another gene, and so on. Sometimes all of the MDSs for a given nanochromosome fall neatly between two MDSs for another nanochromosome. Such gene pieces are referred to as embedded. At other times, only some, but not all, are positioned this way. In this case, the pieces are interleaved. Embedded or interleaved genes within genes may be scrambled. Or they may exist in perfect unscrambled consecutive order.
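The distinction between embedded and interleaved pieces can be illustrated with a small classifier. It encodes one simplified reading of the definitions above, which is an assumption of this sketch rather than the procedure used by Braun et al.: pieces count as embedded when all of them sit in a single gap between two consecutive MDSs of the other nanochromosome, and as interleaved when at least one, but not all in a single gap, sits between that nanochromosome's MDSs. The function name `classify` and the labels are hypothetical.

```python
from bisect import bisect_left

def classify(layout: list[str], outer: str, inner: str) -> str:
    """Classify how the MDSs of `inner` sit relative to the MDSs of `outer`.

    `layout` lists, in micronuclear order, the nanochromosome each MDS belongs to.
    """
    outer_pos = [i for i, g in enumerate(layout) if g == outer]
    inner_pos = [i for i, g in enumerate(layout) if g == inner]
    # Gap k lies between the k-th and (k+1)-th MDS of `outer`;
    # gaps 0 and len(outer_pos) lie outside `outer` altogether.
    gaps = {bisect_left(outer_pos, p) for p in inner_pos}
    inside = {k for k in gaps if 0 < k < len(outer_pos)}
    if len(gaps) == 1 and inside:
        return "embedded"
    if inside:
        return "interleaved"
    return "separate"

print(classify(["A", "B", "B", "A"], "A", "B"))  # embedded
print(classify(["A", "B", "A", "B"], "A", "B"))  # interleaved
print(classify(["A", "A", "B"], "A", "B"))       # separate
```

The classifier looks only at positions, so it says nothing about whether the nested pieces are themselves scrambled.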
Results
In considering sequence data for O. trifallax, Braun et al. paid specific attention to the amount and nature of gene nesting, and the specific mathematical patterns present in the list of pointers. The team also asked whether there were any connections or correlations between patterns of gene nesting and patterns of gene scrambling.
The insertion depth index measures how deeply other genes are nested inside a given gene. The embedding index measures the reverse: how deeply a gene is itself nested inside others. Still another metric characterizes scrambling. There are repeat and return words in scrambled genes. A repeat word occurs when some run of characters in the pointer list recurs later in the list. 12341234 is a repeat word. A return word occurs when some run of characters in the pointer list appears later in reverse order. 12344321 is a return word. Pointer lists can be long, comprising combinations of small repeat and return words. Braun et al. classify the complexity of gene scrambling using these words. Classification is done by successively eliminating instances of repeat and return words, and then counting how many eliminations it takes to reduce the pointer list to the empty word.
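The two examples can be checked mechanically. The sketch below handles only the simplest case, in which the entire word is a single repeat or return word, and it assumes each pointer is written as a single character. The function names are illustrative and do not come from the paper.

```python
def is_double_occurrence(word: str) -> bool:
    """True if every symbol in the word appears exactly twice."""
    return all(word.count(c) == 2 for c in set(word))

def is_repeat_word(word: str) -> bool:
    """True if the word is some run u followed by the same run u."""
    half = len(word) // 2
    return half > 0 and len(word) % 2 == 0 and word[:half] == word[half:]

def is_return_word(word: str) -> bool:
    """True if the word is some run u followed by u in reverse order."""
    half = len(word) // 2
    return half > 0 and len(word) % 2 == 0 and word[:half] == word[half:][::-1]

for w in ("12341234", "12344321"):
    print(w, is_double_occurrence(w), is_repeat_word(w), is_return_word(w))
```

In a longer pointer list the two halves of a repeat or return word need not be adjacent, so a fuller treatment would search for matching runs anywhere in the list.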
Consider the pointer list 121342566534, which contains a small return word, 5665, in the middle. Eliminating this return word leaves 12134234, in which 34 occurs twice, and so it too can be eliminated. This leaves 1212, which can also be eliminated, leaving the empty word. Over 90 percent of scrambled genes can be characterized by combinations of repeat and return words. Those that remain have an irreducible core of characters. Braun et al. found other mathematical patterns in these cores.
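The three eliminations in this example can be replayed step by step. The helper below is a minimal sketch: `eliminate` is a hypothetical name, the run to remove is supplied by hand rather than discovered automatically, and pointers are assumed to be single characters. It is not the classification procedure of Braun et al., whose rules for choosing and counting eliminations may differ.

```python
def eliminate(word: str, u: str) -> str:
    """Remove one repeat pair (u ... u) or return pair (u ... reversed u)."""
    first = word.find(u)
    if first < 0:
        raise ValueError(f"{u!r} does not occur in {word!r}")
    rest = first + len(u)
    # Look for the partner after the first occurrence: u itself, then u reversed.
    for partner in (u, u[::-1]):
        second = word.find(partner, rest)
        if second >= 0:
            return word[:first] + word[rest:second] + word[second + len(partner):]
    raise ValueError(f"no repeat or return partner for {u!r} in {word!r}")

word = "121342566534"
steps = 0
for u in ("56", "34", "12"):   # the order followed in the text
    word = eliminate(word, u)
    steps += 1
    print(steps, repr(word))
# 1 '12134234'
# 2 '1212'
# 3 ''
```

Run in this order, the example reaches the empty word after three eliminations.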
Eight percent of nanochromosomes display some level of genes within genes in the micronucleus. Some had an insertion depth index of four. Braun et al. also found a correlation between the amount of MDS scrambling and the insertion depth index. Genes with a higher insertion depth index were more likely to be scrambled. This same phenomenon did not appear with the embedding index. Whether a gene was surrounded by many genes, or by few, there was no correlation with the amount of gene scrambling that gene might exhibit.
Braun et al. also succeeded in characterizing pointer lists known as tangled cords. Most pointer lists can be fully reduced to an empty word by successive elimination of repeat and return words. Most, but not all. Many of the double occurrence words conceal a simple recursive pattern.[2] The first term of the pattern consists of a small repeat word. A new symbol placed both before and after its final symbol forms a second word. Another new symbol placed both before and after the last symbol of that word forms a third. Thus: 1212, 121323, 12132434, 1213243545, … . Braun et al. found that tangled cords were widespread in those pointer lists that could not be reduced to an empty word.
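The recursive construction can be written out in a few lines. This sketch follows the verbal rule given above, starting from 1212 and wrapping a new symbol around the final symbol at each step. The function name `tangled_cord` and its indexing from 1 are choices made here for illustration.

```python
def tangled_cord(n: int) -> list[int]:
    """Return the n-th word of the pattern 1212, 121323, 12132434, ..."""
    if n < 1:
        raise ValueError("n must be at least 1")
    word = [1, 2, 1, 2]
    for new in range(3, n + 2):
        last = word.pop()           # detach the final symbol
        word += [new, last, new]    # wrap a new symbol around it
    return word

for n in range(1, 5):
    print("".join(map(str, tangled_cord(n))))
# 1212
# 121323
# 12132434
# 1213243545
```

Symbols are kept as integers so that the pattern continues to make sense beyond nine pointers, where single digits run out.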
Braun et al. close by posing a series of questions. Is there any relationship between genes within genes, and other properties of the genome? Do higher rates of either gene scrambling or genes within genes correspond to higher rates of gene expression? How do nested architectures arise? Do gene scrambling and nesting reflect nothing more than luck, or do nested patterns reflect some sort of cellular algorithm?
The paper has a few minor shortcomings. The notion of a tangled cord is fine, but Braun et al. might have been a bit too lenient in what they counted as a tangled cord. One of the authors, Lukas Nabergall, notes that 21 pointer words were classified as containing tangled cords in the study, but only two were exact.[3] The rest were merely close, but were nonetheless counted as tangled cords. These pointer words matched the pattern of a tangled cord only after single or multiple letters were inserted, deleted, or inverted.
Readers could benefit from a more developed discussion of pointers. It would seem pointers can only take one of three generic forms with respect to a given MDS, written here with each pointer in square brackets on the side of the MDS where it sits:
- M1[1], if it is the first MDS of a nanochromosome in the macronucleus sequence;
- [i-1]Mi[i], if it is a middle MDS in sequence; and
- [i-1]Mi, if it is an ending MDS.
So, for example, M1[1], [1]M2[2], [2]M3, respectively.
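These three forms are enough to generate a pointer list from any micronuclear ordering of a single nanochromosome's MDSs. The sketch below does so under stated assumptions: no MDS is inverted, only one nanochromosome is involved, the first MDS carries only the trailing pointer 1, and the last carries only its leading pointer. The function names are hypothetical.

```python
from typing import Optional

def flanking_pointers(i: int, n: int) -> tuple[Optional[int], Optional[int]]:
    """Pointers before and after MDS i of n, per the three generic forms."""
    head = i - 1 if i > 1 else None   # the first MDS has no leading pointer
    tail = i if i < n else None       # the last MDS has no trailing pointer
    return head, tail

def pointer_word(micronuclear_order: list[int]) -> list[int]:
    """Read off the pointers in the order they appear in the micronucleus."""
    n = len(micronuclear_order)
    word: list[int] = []
    for i in micronuclear_order:
        word += [p for p in flanking_pointers(i, n) if p is not None]
    return word

print(pointer_word([3, 1, 2]))  # [2, 1, 1, 2]
print(pointer_word([1, 2, 3]))  # [1, 1, 2, 2]
```

Under these assumptions every pointer from 1 to n-1 appears exactly twice, so the output is always a double occurrence word.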
We could expand this number from three to six if we count the inversions of each of these patterns. But for a word like 121342566534, which is mentioned as an example in the text, to what gene fragment, or MDS, is the second 2, the one sitting in the middle between the 4 and the 5, attached? It must be attached to some middle MDS. Such an MDS is flanked by the pointers i-1 and i, or by the same pair inverted, and so would require either a 3 or a 1 next to the 2 in the pointer list, not a 4 or a 5. No explanation is given, but the reader could gain valuable insight from one.
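The observation about the pointer 2 can be verified by listing the neighbors of each occurrence in the example word. The helper `neighbours` is a hypothetical name, and pointers are again assumed to be single characters.

```python
def neighbours(word: str, symbol: str) -> list[tuple[str, str]]:
    """Return the (left, right) neighbours of every occurrence of `symbol`."""
    found = []
    for k, c in enumerate(word):
        if c == symbol:
            left = word[k - 1] if k > 0 else ""
            right = word[k + 1] if k + 1 < len(word) else ""
            found.append((left, right))
    return found

print(neighbours("121342566534", "2"))  # [('1', '1'), ('4', '5')]
```

The first occurrence of 2 has a 1 on both sides, as the generic forms would lead one to expect. The second has neither a 1 nor a 3 beside it, which is the puzzle raised here.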