Primer of Phylogenetic Networks
*Editor’s Note: In “Genealogical Trees and Networks: Insights from Evolutionary Biology,” Ryan McDermott discusses the application of genealogical tree and network analogies in the humanities. He notes that while genealogical concepts are widely invoked in the historical disciplines, relatively few scholars have access to a robust background in genealogy proper. What follows here is a “Primer of Phylogenetic Networks” by David Morrison, phylogeneticist and bioinformatician. You can see the following concepts at work in many articles in this journal, including McDermott’s follow-up essay, “Genealogies in Motion: Trees of Consanguinity.”*
Introduction
Phylogenetic analysis is now widely used for comparative analyses in all areas of biology. Traditionally, these analyses have used trees to display their results, but in the last 30 years interest has developed in networks, which are a generalization of trees that can be used to investigate both vertical evolution (parent to offspring) and horizontal evolution (transfer among offspring).
In a study of a phylogenetic history, if there are no conflicting patterns of variation in the data then the traditional use of a phylogenetic tree is appropriate; but if there are apparently conflicting data patterns then a phylogenetic network will be more informative. It is necessary to distinguish two types of phylogenetic network, with the distinctly different objectives of displaying the data variation (data-display networks) versus analyzing the evolutionary history (evolutionary networks). The display networks are used for exploratory data analysis (EDA) while the evolutionary networks are used for formal phylogenetic analysis.
This primer presents a brief introduction to these two types of network, showing their relationship to each other, as well as to the original character data and the phylogenetic trees derived from those data. A small dataset from the literature is used for illustrative purposes.
I assume that you already have some familiarity with phylogenetic trees, and that you are interested primarily in expanding your knowledge to include networks.
The Data
Donoghue et al. (Systematic Botany 29:188-198, 2004) were interested in assessing the evolutionary relationships within the plant genus Viburnum, for which they sequenced 1,131 nucleotides of the chloroplast trnK intron and 556 nucleotides of the nuclear ribosomal Internal Transcribed Spacer (ITS). Five closely related species sampled from North America will be discussed here, as a simple but real dataset.
A summary of the nucleotide sequence data for the five species is shown in Figure 1. Each row in the alignment is a DNA sequence from one of the species, and each nucleotide position (column) along the sequence represents a character, with the different nucleotides in a column being different character states of the same character. All of the constant characters (i.e. the same nucleotide in all 5 of the sequences) have been deleted from the Figure (because they tell us nothing about the relationships among the species). Characters 1-12 are from the trnK gene and 13-43 are from the ITS gene (out of 1,131 and 556 nucleotide positions in the original gene sequences, respectively). Note that the data are strictly binary (i.e. no more than two character states per character), and so the data patterns are relatively straightforward.
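The filtering described above can be sketched in a few lines of Python. This is a minimal illustration only: the alignment is a made-up toy (it is not the Viburnum data), and the function names `variable_columns` and `is_binary` are hypothetical helpers introduced here.

```python
# Toy alignment: hypothetical sequences, NOT the Viburnum data from Figure 1.
alignment = {
    "taxon_a": "ACGTTACG",
    "taxon_b": "ACGTCACG",
    "taxon_c": "ATGTCACA",
    "taxon_d": "ATGTTACA",
    "taxon_e": "ACGTTACG",
}

def variable_columns(aln):
    """Return (position, column) pairs for columns with more than one state.

    Positions are 1-based, matching the character numbering used in the text.
    """
    rows = list(aln.values())
    columns = zip(*rows)  # transpose: one tuple per alignment position
    return [(i + 1, "".join(col)) for i, col in enumerate(columns)
            if len(set(col)) > 1]

def is_binary(columns):
    """True if no column has more than two character states."""
    return all(len(set(col)) <= 2 for _, col in columns)

informative = variable_columns(alignment)
print(informative)             # the columns that survive the filter
print(is_binary(informative))  # whether the surviving data are strictly binary
```

Here the five constant columns are dropped and three variable columns remain, each with exactly two states.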
We will first look at the use of networks for displaying these data and exploring the data patterns in the dataset, and then proceed to the use of networks for evolutionary analysis of the data.
Data Display and Exploration
Looking at a sequence alignment is a valuable way to view and understand the character data, since homologous states of the same character are aligned. The sequence of characters along the DNA can thus be easily compared across the rows, and the comparable character states at each DNA position can easily be compared down the columns.
So, in one important sense the sequence alignment is the phylogenetic data, expressed in a tabular format. However, the relationships among the taxa, whether phylogenetic or otherwise, are not necessarily easy to see from the alignment alone.
This can be addressed to some extent by uniquely coloring each nucleotide, as shown in Figure 2. This makes the comparison of the columns easier by providing visual clues about the patterns among the sequences. For example, it becomes obvious that positions 2 and 11 have the same distribution of states among the sequences. However, it is not necessarily obvious that positions 7, 12, 13, 14, 15, 16, 19, 25, 28, 34, 36, 37, 38 and 43 also have this same distribution, because in these cases the pattern is formed by different color combinations.
This issue could be addressed by re-arranging the columns so that similar patterns are adjacent, as shown in Figure 3. This indicates that the characters show 9 distinct patterns of relationship among the sequences, out of the 15 patterns possible. These patterns are supported by a wide variety of combinations of the four nucleotides, although 23 out of the 43 characters have C-T substitutions. Note, also, that this approach does lose the inherent order of the nucleotide positions (i.e. the characters are now unordered rather than ordered).
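One way to make this grouping concrete is to recode each column so that only the grouping of the sequences matters, not which nucleotides produce it. The sketch below assumes strictly binary columns and uses made-up toy columns; `split_pattern`, `group_by_pattern` and `possible_patterns` are hypothetical names. The count of 15 follows from the fact that n sequences can be divided into two non-empty groups in 2^(n-1) - 1 ways.

```python
from collections import defaultdict

def split_pattern(column):
    """Encode a binary column as a 0/1 string, with the first taxon always 0.

    Columns such as 'CCTTC' and 'GGAAG' group the taxa identically, so they
    receive the same code even though the nucleotides differ.
    Assumes the column has at most two states.
    """
    first = column[0]
    return "".join("0" if state == first else "1" for state in column)

def group_by_pattern(columns):
    """Map each pattern to the list of character numbers that show it."""
    groups = defaultdict(list)
    for position, column in columns:
        groups[split_pattern(column)].append(position)
    return dict(groups)

def possible_patterns(n_taxa):
    """Number of ways to split n taxa into two non-empty groups."""
    return 2 ** (n_taxa - 1) - 1

# Toy columns (hypothetical), one character per taxon in a fixed taxon order.
toy_columns = [(1, "CCTTC"), (2, "TCCTT"), (3, "GGAAG"), (4, "AAAAT")]
print(group_by_pattern(toy_columns))
print(possible_patterns(5))
```

In the toy input, characters 1 and 3 fall into the same group despite using different nucleotides, which is the situation that coloring alone does not make obvious.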
There are 9 data patterns, so building the network takes 9 steps from the initial arrangement, shown as panels (a) to (j) in Figure 4. At each step an edge (or set of parallel edges) is added to the growing network, representing one of the data patterns. The characters involved are marked on the appropriate edge at each step in the Figure. The length of each edge in the Figure is proportional to the number of characters defining (or supporting) that edge.
So, the edge in 4(b) is the first one created, representing the fact that the character states of characters 1, 5, 20, 35 and 39 split the sequences into two groups—the length of the edge represents the five character-state differences between the two groups. Edges are added (in any order) representing the other possible character-state differences, thus creating smaller and smaller groups, until each taxon is in a group of its own.
At step 4(c) the new groups formed are compatible with the two existing groups, in the sense that the new prunifolium-rufidulum group is a subset of an existing group. Therefore, the new edge to be added simply splits that group.
However, at step 4(d) there is a conflict between the two new groups being formed and the three groups that exist at step 4(c). In order to put lantanoides and lentago in a group together in 4(d) when they are in different groups in 4(c), a pair of edges needs to be added rather than a single edge—each new edge originates in a different group. Thus, characters 32 and 41 are represented by two edges rather than one. Note that characters 1, 5, 20, 35 and 39 are now represented by a pair of parallel edges, as well. The same thing happens at steps 4(e) and 4(f)—a set of parallel edges appears rather than a single edge, each one originating in a different pre-existing group. So, both characters 3 and 42 are represented by three edges; and so too are characters 1, 5, 20, 35 and 39, characters 32 and 41, and characters 17, 18, 27 and 29.
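The difference between the compatible case at step 4(c) and the conflicting case at step 4(d) can be checked mechanically. A common test, sketched here, is that two splits are compatible when at least one of the four intersections between their sides is empty. The prunifolium-rufidulum and lantanoides-lentago groups are named in the text; the membership of the first split (lantanoides with nudum) is an assumption made for this illustration, and `compatible` is a hypothetical helper.

```python
TAXA = frozenset({"lantanoides", "lentago", "nudum", "prunifolium", "rufidulum"})

def compatible(group_a, group_b, taxa=TAXA):
    """Two splits are compatible if at least one of the four intersections
    between their sides is empty. Each split is given by one of its sides."""
    a, b = frozenset(group_a), frozenset(group_b)
    a_rest, b_rest = taxa - a, taxa - b
    return any(not (x & y) for x in (a, a_rest) for y in (b, b_rest))

first_split = {"lantanoides", "nudum"}          # ASSUMED membership for step 4(b)
nested_split = {"prunifolium", "rufidulum"}     # group named in the text, step 4(c)
conflicting_split = {"lantanoides", "lentago"}  # group named in the text, step 4(d)

print(compatible(first_split, nested_split))
print(compatible(first_split, conflicting_split))
```

A compatible split can be drawn as a single new edge, while an incompatible one forces a set of parallel edges, as described above.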
Since the data are binary, every character splits the sequences into two groups, each group being defined by its shared nucleotides. We say that each nucleotide position forms a bipartition of the sequences; and it is the combinations of these bipartitions that form the 9 data patterns. The median network thus displays all of the bipartitions. This is illustrated in Figure 5 for two of the bipartitions.
Note that a splits graph is actually a separation network, in the sense that the lines separate the sequences into groups rather than connecting sequences together. This is obvious from the way the graph is formed, as shown in Figure 4. In Figure 6, all of the characters are marked on the branch representing their bipartition. Clearly, the data are not very tree-like in this example, as the characters show a great deal of conflict. Instead there is a network of reticulating interconnections. Also, note that there is no pattern that is unique to prunifolium, and hence there is no edge separating it from the main part of the network.
Trees
If we perform a tree-building analysis of these data then the conflicts among the characters must be resolved in some way. The simplest approach is to use the character weights. For example, if three characters support a particular data pattern and two characters conflict with it, then the three characters would outweigh the two, and the pattern supported by only two characters would be ignored when building the tree. The tree is thus built solely from the best-supported unconflicting patterns.
This is shown step by step in Figure 7. This figure is also available as an animation.
There are 5 patterns that conflict pairwise (see Figure 6), and only two of these can be included in the tree. Thus, there are three steps to deriving the tree, each step eliminating one of the conflicting patterns. In order, the deleted patterns are those of character 42, then character 3, then characters 32 and 41, leaving the remaining characters without conflicts. Biologically, characters 3, 32, 41 and 42 are thus treated as homoplasies on the phylogenetic tree rather than homologies.
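The weighting idea can be sketched as a greedy selection: take the patterns from best supported to least supported, and keep each one only if it is compatible with everything already kept. This is a related simplification, not the exact deletion sequence of Figure 7. The character lists and their counts come from the text; every group membership below, and the pairing of groups with characters, is an assumption made for illustration. The function names are hypothetical.

```python
TAXA = frozenset({"lantanoides", "lentago", "nudum", "prunifolium", "rufidulum"})

def compatible(a, b, taxa=TAXA):
    """Splits are compatible if one of the four side intersections is empty."""
    a, b = frozenset(a), frozenset(b)
    return any(not (x & y) for x in (a, taxa - a) for y in (b, taxa - b))

def greedy_tree_splits(weighted_splits):
    """Keep splits from heaviest to lightest, dropping any that conflict
    with a split that has already been kept."""
    kept, dropped = [], []
    for name, group, weight in sorted(weighted_splits, key=lambda s: -s[2]):
        if all(compatible(group, g) for _, g, _ in kept):
            kept.append((name, group, weight))
        else:
            dropped.append((name, group, weight))
    return kept, dropped

# (characters, one side of the split, number of supporting characters)
# Group memberships are ASSUMED; only the character counts follow the text.
splits = [
    ("1,5,20,35,39", {"lantanoides", "nudum"}, 5),
    ("17,18,27,29", {"prunifolium", "rufidulum"}, 4),
    ("32,41", {"lantanoides", "lentago"}, 2),
    ("3", {"lentago", "prunifolium"}, 1),
    ("42", {"nudum", "rufidulum"}, 1),
]
kept, dropped = greedy_tree_splits(splits)
print([name for name, _, _ in kept])
print([name for name, _, _ in dropped])
```

Under these assumed memberships, the two best-supported patterns are kept and the patterns for characters 32 and 41, 3, and 42 are dropped, matching the set of characters the text treats as homoplasies.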
This parsimony analysis displays a subset of the character patterns as an unrooted tree. However, in order to represent a phylogeny the tree must be rooted. The root is necessary in order to determine the direction of genealogical history along the edges, which indicates the historical relationships among the species.
For example, the final tree in Figure 7 does not indicate whether lantanoides and nudum are sisters (which they will be if the root is in the right-hand half of the tree) or whether they are more distant relatives (which they will be if the root is in the left-hand half of the tree).
In this instance the root is on the edge connecting lantanoides to the other species, based on outgroup analysis involving other species not used in this dataset. We can draw the rooted Parsimony Tree in either of the two ways shown in Figure 9.
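Rooting on the edge that leads to a single outgroup-side species has a simple effect on the splits of the unrooted tree: each split becomes a rooted cluster, namely the side that does not contain that species. The sketch below shows this, with lantanoides as the rooting taxon as stated in the text. The two internal splits of the unrooted tree are assumed memberships, and `root_by_outgroup` is a hypothetical helper.

```python
def root_by_outgroup(tree_splits, taxa, outgroup):
    """Turn each unrooted split into a rooted cluster: the side that does
    not contain the outgroup. Clusters are returned largest first."""
    taxa = frozenset(taxa)
    clusters = []
    for group in tree_splits:
        group = frozenset(group)
        clusters.append(taxa - group if outgroup in group else group)
    return sorted(clusters, key=len, reverse=True)

taxa = {"lantanoides", "lentago", "nudum", "prunifolium", "rufidulum"}
# Internal splits of the unrooted tree; memberships are ASSUMED for illustration.
unrooted = [{"lantanoides", "nudum"}, {"prunifolium", "rufidulum"}]

rooted = root_by_outgroup(unrooted, taxa, "lantanoides")
for cluster in rooted:
    print(sorted(cluster))
```

With the root placed there, lantanoides and nudum are no longer candidates for sisters: the cluster that excludes lantanoides groups the other species together instead.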
Other Networks and Trees
Obviously, the dataset used here is very simple. Technically, it has binary data that are only ever pairwise incompatible. Larger datasets will rarely meet these two criteria, so that the data analysis becomes much more complicated. There are many network methods that have been developed to deal with these complexities, most of which have been implemented in computer programs.
For example, the median network is often far too complex to be of much use as a representation of the data. Therefore, strategies have been developed to produce networks that reduce this complexity. One possible strategy is to start with the median network and then try to simplify it to some specified degree. Another strategy is to start with a simple network and then add complexity to it to some specified degree. We will look at an example of both of these here.
For the median network shown above, the reticulated part of the network consists of five nodes that support a pair of boxes each, as labelled in Figure 11(a). If any of these nodes is deleted then a pair of boxes would disappear. Thus, we could simplify the network by deleting one or more of these five nodes. If we do this then we have a Reduced Median Network. (Note that there is also one node that supports five boxes; however, we do not need to concern ourselves with nodes that support more than a pair of boxes, because these will automatically be reduced as a result of dealing with the pairs.)
Next, we root these trees as we did for Figure 9, with the root on the branch connecting lantanoides to the other species. The rooted trees are shown in Figure 18. These two trees differ only in the placement of prunifolium: the trnK tree says that prunifolium is the sister to lentago, while the ITS tree says that it is the sister to rufidulum.
The hybridization model says that we should connect prunifolium to both of these tree locations, thus producing the hybridization network. (Technically, we connect the points where the trees differ in rooted-subtree-prune-regraft operations.) This network is also shown in the Figure, clearly indicating that prunifolium is a hybrid between lentago and rufidulum. This is, indeed, exactly what was predicted in the original study, based on the previous biological data.
Note that the hybridization network is produced by (a) removing some of the conflicting data patterns from the network by creating a tree for each gene (so that only the best-supported characters are used in the network), and then (b) displaying the remaining between-tree conflict as reticulations in the network. Also, characters 32, 41 and 42 are still homoplasious with respect to the network (i.e. they cannot be plotted at a single location).
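The comparison of the two rooted gene trees can be sketched by listing the clusters (the sets of species below each internal node) of each tree and looking at the ones that are not shared. The sister relationships of prunifolium come from the text; the rest of each topology is an assumption made for illustration, and `clusters` is a hypothetical helper. Intersecting the conflicting clusters is only a heuristic that works for this small case; it is not a general hybridization-network method.

```python
def clusters(tree):
    """Collect the leaf set below every internal node of a nested-tuple tree."""
    found = set()

    def leaves(node):
        if isinstance(node, str):  # a leaf is a species name
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        found.add(below)
        return below

    leaves(tree)
    return found

# Only the placement of prunifolium is stated in the text; the remaining
# topology of both trees is ASSUMED for this sketch.
trnk_tree = ("lantanoides", ("nudum", (("lentago", "prunifolium"), "rufidulum")))
its_tree = ("lantanoides", ("nudum", ("lentago", ("prunifolium", "rufidulum"))))

only_trnk = clusters(trnk_tree) - clusters(its_tree)
only_its = clusters(its_tree) - clusters(trnk_tree)
# The species common to all conflicting clusters is the one placed differently.
shared = frozenset.intersection(*only_trnk, *only_its)

print([sorted(c) for c in only_trnk])
print([sorted(c) for c in only_its])
print(sorted(shared))
```

Each tree contributes exactly one cluster that the other lacks, and the species found in both of those clusters is the one that a reticulation would need to connect to two places.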
Assumptions
The analyses presented here are based on a set of assumptions, of course. Perhaps the most important one for methods that work directly with characters is that the characters have not been modified multiple times. That is, there have been no "hidden" character changes. For the data presented here, this means that each nucleotide at each alignment position has not been subject to multiple substitutions.
However, some of the analyses also require that the data be binary. That is, there are no more than two character states for each character. This implies that each character has been modified only once during the evolutionary history of the organisms being studied. For nucleotide data this is called the "infinite sites model" of substitution.
Summary
In this primer I have shown the direct connection that exists between characters and networks, and between the different types of networks. The basic networks are derived from the patterns in the character data; and many of the other network and tree types can be derived from simple network presentations of those patterns.
I have illustrated the importance of EDA in phylogenetics. In the example used, there are many conflicting patterns in the data, and this needs to be taken into account when interpreting the biology, after performing the mathematical analyses.
I have also demonstrated the analysis of suspected hybridization and recombination in a dataset. The recombination analysis is performed on the ordered character data, while the hybridization analysis combines trees derived from those data. Normally, such analyses cannot be performed by hand, due to their mathematical complexity.
Further Reading
Huson D.H., Bryant D. (2006) Application of phylogenetic networks in evolutionary studies. Molecular Biology and Evolution 23: 254–267.
Huson D.H., Rupp R., Scornavacca C. (2011) Phylogenetic Networks: Concepts, Algorithms and Applications. Cambridge University Press, Cambridge.
Huson D.H., Scornavacca C. (2011) A survey of combinatorial methods for phylogenetic networks. Genome Biology and Evolution 3: 23–35.
Morrison D.A. (2005) Networks in phylogenetic analysis: new tools for population biology. International Journal for Parasitology 35: 567–582.
Morrison D.A. (2010) Using data-display networks for exploratory data analysis in phylogenetic studies. Molecular Biology and Evolution 27: 1044–1057.
Morrison D.A. (2011) Introduction to Phylogenetic Networks. RJR Productions, Uppsala.