Systematic Biology 2008 57(4):653-657; doi:10.1080/10635150802302476
© 2008 Society of Systematic Biologists

Phylogeny Estimation and Alignment via POY versus Clustal + PAUP*: A Response to Ogden and Rosenberg (2007)

Edited by Tim Collins

Samuli Lehtonen

Department of Biology, Section of Biodiversity and Environmental Science, University of Turku FI-20014, Turku, Finland; E-mail: samile{at}utu.fi

Received February 5, 2008; Revised March 25, 2008; Accepted May 7, 2008

The performance of the traditional two-step approach (independent alignment and tree search) employed in molecular systematics was recently compared to direct optimization (DO; Wheeler, 1996) by Ogden and Rosenberg (2007a). In their analyses the two-step approach clearly outperformed DO both in alignment and topological accuracy, and the performance of DO was found to decrease with increasing search effort. The results suggested to Ogden and Rosenberg (2007a) that the two-step approach is superior to DO and therefore should be preferred when analyzing data sets similar to those generated in their study.

The settings they used in the DO and two-step analyses, however, were not comparable and biased their results in favor of the two-step approach. Specifically, more complete TBR branch swapping algorithms were used in the two-step analyses but not in the DO analyses. Moreover, dynamic homologies were approximated with static homologies during cladogram optimization. In addition, branch swapping during DO was limited to minimum cost trees, whereas the default setting of POY performs swapping on all trees found during the search. Finally, they carried out only four replicates in DO versus 100 in the PAUP* analyses. The same limitations applied to their more extensive analyses, which were based on four replicates, used static homologies during swapping, and were restricted to minimum trees only. Such analytical shortcuts are required to perform a large number of DO analyses in a short period of time, but in my opinion Ogden and Rosenberg's (2007a) use of POY compromised the main objective of their study: the comparison of DO and the traditional approach.


    Material and Methods
Inspired by the study of Ogden and Rosenberg (2007a), I generated 10 random topologies of 16 taxa (Fig. 1) under a Yule model (Steel and McKenzie, 2001) in Mesquite (Maddison and Maddison, 2004). The maximum evolutionary distance between paired taxa was set to 2.0. For each topology 10 simulations were conducted using MySSP (Rosenberg, 2005a). Simulations were based on the Hasegawa-Kishino-Yano (HKY) model (Hasegawa et al., 1985) with transition-transversion bias κ = 3.6, and with initial and expected base frequencies of A and T = 0.2 and G and C = 0.3. Initial sequence length was set to 2000 base pairs. Insertions and deletions were modeled using an expected rate of one insertion for every 100 substitutions and one deletion for every 40 substitutions, and the lengths of indels were determined from a Poisson distribution with a mean equal to four bases. Hence, the settings followed the simulations with maximum evolutionary distance of 2.0 made by Ogden and Rosenberg (2007a).
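As an illustration of the substitution model named above, the following minimal sketch builds an HKY instantaneous rate matrix from the stated base frequencies and κ = 3.6. It assumes the common textbook parameterization in which transitions are multiplied by κ and the matrix is scaled to one expected substitution per unit time; MySSP may parameterize the transition-transversion bias differently, and the function name `hky_rate_matrix` is hypothetical.

```python
BASES = "ACGT"
# Transitions are purine<->purine (A<->G) and pyrimidine<->pyrimidine (C<->T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def hky_rate_matrix(freqs, kappa):
    """Return a normalized HKY rate matrix as a nested dict q[i][j].

    Assumption: off-diagonal rate i->j is freqs[j], multiplied by kappa
    for transitions; the matrix is scaled so that the expected rate at
    equilibrium equals 1.
    """
    if abs(sum(freqs[b] for b in BASES) - 1.0) > 1e-9:
        raise ValueError("base frequencies must sum to 1")
    q = {i: {} for i in BASES}
    for i in BASES:
        for j in BASES:
            if i != j:
                factor = kappa if (i, j) in TRANSITIONS else 1.0
                q[i][j] = freqs[j] * factor
        # Diagonal entries make every row sum to zero.
        q[i][i] = -sum(q[i][j] for j in BASES if j != i)
    scale = -sum(freqs[i] * q[i][i] for i in BASES)
    return {i: {j: q[i][j] / scale for j in BASES} for i in BASES}


# Base frequencies and kappa as stated in the text.
Q = hky_rate_matrix({"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}, kappa=3.6)
```

With these frequencies G and C are equally common, so the ratio of the A-to-G rate to the A-to-C rate in `Q` equals κ.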


Figure 1 The 10 random Yule trees used to simulate the data of this study.

 
Before the data were analyzed, all gaps were removed from the simulated sequences. Analyses using the traditional two-step method were run by aligning sequences with default values (gap opening = 15, gap extension = 6.66) in ClustalX (Thompson et al., 1997), and then analyzing the multiple sequence alignments in PAUP* (Swofford, 1998) with 100 random additions and TBR branch-swapping; otherwise, default settings were used (e.g., gaps were treated as missing data). Three sets of DO analyses were performed, each one using equal costs for gap opening, gap extension, transversions, and transitions. First, the command line (-gap 1 -nooneasis -noleading -norandomizeoutgroup -quick -staticapprox -notbr -replicates 4) of Ogden and Rosenberg (2007a) was run with POY v. 3.0.11 (Wheeler et al., 2003). These analyses are later referred to as "POY3-O&R" analyses. Second, I examined the impact of TBR branch-swapping by rerunning all the POY searches, but allowing TBR branch-swapping (command -notbr replaced with -tbr; otherwise the command line was identical). These analyses are referred to as "POY3-TBR" analyses. Third, in order to make a more equitable comparison between the two-step method and DO, I ran a set of POY analyses using the default search strategy. Due to the extreme time demand of extensive DO analyses, I used a newer, much faster version of POY (Varón et al., 2007). The command line used was the basic search of the POY4 manual (Varón et al., 2007) with an additional command "transform" to set equal weighting [transform((all, tcm:(1,1))) build (100) select (unique) swap (threshold:10) select ()]. This search generates 100 random addition trees, discards duplicate trees, and performs SPR and TBR branch swapping for all trees and for all new trees that are up to 10% longer than the tree under swapping. I consider this to be the default search strategy of POY and refer to it hereafter as POY4-basic. Commands employed in these POY analyses are briefly described in Table 1.
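The gap-removal step that precedes every analysis can be sketched as follows. This is only an illustration under the assumption that the simulated alignment is held as a mapping from taxon name to aligned sequence with "-" as the gap symbol; the function name `strip_gaps` and the example sequences are hypothetical and are not taken from the study.

```python
def strip_gaps(alignment, gap_chars="-"):
    """Return unaligned sequences by deleting every gap character.

    `alignment` maps a taxon name to its aligned sequence string.
    """
    return {
        name: "".join(ch for ch in seq if ch not in gap_chars)
        for name, seq in alignment.items()
    }


# Hypothetical three-taxon alignment.
true_alignment = {
    "taxon1": "ACG--TTA",
    "taxon2": "AC-GATTA",
    "taxon3": "ACGGA-TA",
}
unaligned = strip_gaps(true_alignment)
```

The resulting ungapped sequences are what both ClustalX and POY would receive as input, so neither method has access to the true positional homologies.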


 
Table 1 Brief descriptions of commands used in POY analyses.

 
Performance of these approaches was quantified by determining the congruence of the inferred trees directly with the correct topologies by calculating the symmetric-difference distance (also known as Robinson-Foulds distance; Robinson and Foulds, 1981) between the inferred trees and true topologies with PAUP*. The alignments produced by Clustal and POY were compared to true alignments by calculating the proportion of aligned sites that are truly homologous (total alignment accuracy; Rosenberg, 2005b). The possible differences in the performance between Clustal+PAUP* and POY analyses were examined with SAS Statistical Package, version 9.1 (SAS Institute, Cary, NC, USA).
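The symmetric-difference (Robinson-Foulds) distance used above counts the bipartitions present in one tree but not in the other. The distances in this study were computed with PAUP*; the following is only a minimal sketch of the same measure, assuming trees are written as nested Python tuples of taxon names and compared as unrooted trees. The function names and example trees are hypothetical.

```python
def leaf_set(tree):
    """Collect the taxon names of a tree given as nested tuples."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree, taxa):
    """Return the non-trivial bipartitions (splits) of an unrooted tree."""
    reference = min(taxa)  # each split is stored as the side lacking this taxon
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        # Splits separating a single taxon (or nothing) carry no information.
        if 1 < len(clade) < len(taxa) - 1:
            found.add(clade if reference not in clade else taxa - clade)
        return clade

    visit(tree)
    return found


def symmetric_difference_distance(tree_a, tree_b):
    """Number of splits found in exactly one of the two trees."""
    taxa = leaf_set(tree_a)
    if taxa != leaf_set(tree_b):
        raise ValueError("trees must contain the same taxa")
    return len(bipartitions(tree_a, taxa) ^ bipartitions(tree_b, taxa))


# Hypothetical six-taxon example: one misplaced taxon changes two splits.
true_tree = ((("A", "B"), "C"), ("D", ("E", "F")))
inferred_tree = ((("A", "C"), "B"), ("D", ("E", "F")))
distance = symmetric_difference_distance(true_tree, inferred_tree)
```

A distance of zero means the inferred topology is exactly correct; because the comparison is made on unrooted splits, the position of the root in the nested tuples does not affect the result.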


    Results and Discussion
Ogden and Rosenberg's (2007a) statements that Clustal+PAUP* analyses outperform POY analyses and that more extensive POY analyses yield poorer results than less extensive POY analyses seem incorrect based on the following results of my study.

First, POY4-basic searches produced more accurate topologies than Clustal+PAUP* 46% of the time (Table 2), whereas Clustal+PAUP* topologies were more accurate than POY4-basic topologies 23% of the time. POY4-basic found the exactly correct topology 20 times and Clustal+PAUP* 15 times (out of 100). In those cases where POY4-basic outperformed Clustal+PAUP*, it did so by an average symmetric difference distance of 2.5, and in the cases where Clustal+PAUP* outperformed POY4-basic, the average difference was 2.2. The worst symmetric difference distance was 6 for a POY4-basic topology but 12 for a Clustal+PAUP* topology. In total, 12% of the Clustal+PAUP* topologies were more inaccurate than the most inaccurate POY4-basic topology.


 
Table 2 The relative performance of Clustal+PAUP* and POY4-basic analyses (see text for details). No statistically significant difference was found in the performance of POY4-basic and Clustal + PAUP* when tree distances were compared, even though Clustal alignments are more accurate than POY-implied alignments. Note that default alignment parameters differ markedly between these two approaches.
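The kind of paired summary reported here (how often one method gives the smaller tree distance on the same simulated data set, and by how much on average) can be tallied as in the sketch below. The function name `paired_summary` and the distance values are hypothetical placeholders, not data from this study.

```python
def paired_summary(dist_a, dist_b):
    """Summarize paired tree distances for two methods on the same data sets.

    Smaller distances are better. Returns win percentages, the tie
    percentage, and the mean margin of each method's wins (None if no wins).
    """
    if len(dist_a) != len(dist_b) or not dist_a:
        raise ValueError("need two non-empty lists of equal length")
    a_margins = [b - a for a, b in zip(dist_a, dist_b) if a < b]
    b_margins = [a - b for a, b in zip(dist_a, dist_b) if b < a]
    n = len(dist_a)

    def mean(values):
        return sum(values) / len(values) if values else None

    return {
        "a_better_pct": 100 * len(a_margins) / n,
        "b_better_pct": 100 * len(b_margins) / n,
        "tie_pct": 100 * (n - len(a_margins) - len(b_margins)) / n,
        "a_mean_margin": mean(a_margins),
        "b_mean_margin": mean(b_margins),
    }


# Hypothetical distances for four simulated data sets.
summary = paired_summary([0, 2, 4, 6], [2, 2, 2, 10])
```

Ties are reported separately because the two win percentages need not sum to 100%.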

 
Second, on average, the symmetric difference distance of Clustal+PAUP* topologies and true topologies was 3.13 (± 2.77 [standard deviation]), whereas the symmetric difference distance of POY4-basic topologies and true topologies was on average 2.49 (± 1.79 [standard deviation]; Table 3). However, the difference was not statistically significant (Wilcoxon two-sample test: Z = 0.91, P = 0.36, n = 100).
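For readers unfamiliar with the Wilcoxon two-sample (rank-sum) test, the sketch below computes its Z statistic with the normal approximation and a tie correction. The tests in this study were run in SAS; this sketch assumes no continuity correction, so it is not expected to reproduce the SAS output exactly. The function name and the example values are hypothetical.

```python
import math


def wilcoxon_rank_sum_z(x, y):
    """Return (Z, two-sided P) for the rank-sum test, normal approximation.

    Assumptions: midranks for ties, tie-corrected variance, and no
    continuity correction.
    """
    pooled = list(x) + list(y)
    n1, n2, n = len(x), len(y), len(pooled)
    order = sorted(range(n), key=lambda i: pooled[i])
    ranks = [0.0] * n
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pooled[order[j + 1]] == pooled[order[i]]:
            j += 1
        t = j - i + 1                     # size of this group of tied values
        tie_term += t ** 3 - t
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1   # average of the 1-based ranks
        i = j + 1
    w = sum(ranks[:n1])                   # rank sum of the first sample
    mean_w = n1 * (n + 1) / 2
    var_w = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_w <= 0:
        return 0.0, 1.0                   # all observations identical
    z = (w - mean_w) / math.sqrt(var_w)
    return z, math.erfc(abs(z) / math.sqrt(2))


# Hypothetical tree distances for two methods.
z, p = wilcoxon_rank_sum_z([1, 2, 3], [4, 5, 6])
```

Because symmetric difference distances take few distinct values, ties are frequent, which is why the tie correction matters for data of this kind.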


 
Table 3 Means and standard deviations of symmetric difference distances (tree distance) and total alignment accuracies (n = 100 in each case). Lower values in tree distance indicate better congruence. In alignment accuracy higher values are better with a maximum value of 1 (= 100% similarity between true and hypothesized alignments). On average, POY4-basic topologies are the best, although no statistically significant difference between POY4-basic and Clustal+PAUP* topological accuracy was found.

 
Third, POY3-TBR outperformed POY3-O&R 33% of the time (and was outperformed by POY3-O&R 24% of the time) and resulted in trees that were on average better (Table 3). POY3-TBR found the correct topology 13 times, and POY3-O&R 9 times. On average, the more extensive POY4-basic search found clearly better topologies and outperformed both POY3-O&R and POY3-TBR 41% of the time (and was outperformed by at least one of the two 19% of the time). The same trend was evident in tree lengths: 60% of the POY3-TBR trees were more parsimonious than POY3-O&R trees of a corresponding simulation, and 98% of the POY4-basic trees were more parsimonious than the corresponding POY3-TBR trees.

Hence, the more extensive POY searches clearly resulted in better topologies than less extensive POY searches and were able to find the exactly correct trees much more often. This means that Ogden and Rosenberg's (2007a:186) justification to draw general conclusions from their less extensive POY searches was mistaken. It should also be noted that the performance of POY and Clustal+PAUP* varied more from topology to topology than between the different simulations on the same topology (Kruskal-Wallis test: χ² = 65.7, df = 9, P < 0.0001 for POY4-basic analyses, and χ² = 51.1, df = 9, P < 0.0001 for PAUP* analyses). This means that a study based on relatively few topologies but many simulation replicates per topology may be biased. The results presented here show that the two-step approach is not better than DO in recovering topologies, as claimed by Ogden and Rosenberg (2007a).
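The Kruskal-Wallis comparison across topologies can be sketched as follows, with one group of tree distances per topology (10 groups give df = 9, as reported above). This is a minimal illustration of the tie-corrected H statistic only; it does not compute a P value, and the function name and example groups are hypothetical.

```python
def kruskal_wallis_h(groups):
    """Return (H, df) for the Kruskal-Wallis test with tie correction.

    `groups` is a list of lists, e.g. one list of tree distances per topology.
    """
    pooled = [value for group in groups for value in group]
    n = len(pooled)
    order = sorted(range(n), key=lambda i: pooled[i])
    ranks = [0.0] * n
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pooled[order[j + 1]] == pooled[order[i]]:
            j += 1
        t = j - i + 1                     # size of this group of tied values
        tie_term += t ** 3 - t
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1   # midrank (1-based)
        i = j + 1
    total = 0.0
    start = 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        total += rank_sum ** 2 / len(group)
        start += len(group)
    h = 12 / (n * (n + 1)) * total - 3 * (n + 1)
    correction = 1 - tie_term / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all observations are identical")
    return h / correction, len(groups) - 1


# Hypothetical example with three clearly separated groups.
h, df = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```

H is compared against a chi-square distribution with df degrees of freedom, which is why the test statistic is reported as χ² in the text.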

It has been shown that topological accuracy tends to increase by decreasing alignment error, although in some reported cases the quality of the tree and alignment seem to be independent (Ogden and Rosenberg, 2006). Interestingly, in these simulations POY-implied alignments were always poorer than the corresponding Clustal alignments as measured by total alignment accuracy (Table 2), despite the fact that no significant difference in topological accuracy was found. This suggests that POY-implied alignments behave somehow differently than Clustal alignments. Total alignment accuracy score is based on aligned sites only, and gaps that are aligned with gaps or nucleotides are ignored (Ogden and Rosenberg, 2006), but it is difficult to say whether this could have a different impact on POY and Clustal alignment accuracy scores. Alignments are highly sensitive to the parameters employed (Morrison and Ellis, 1997), and the parameters used here in Clustal and POY analyses were drastically different. Clustal alignment accuracy varied from 81.4% to 16.9% with an average of 50%, and the accuracy of POY4-basic alignments varied from 63.7% to 14.2% with an average of 36.2%. Thus, both methods produced relatively incorrect alignments on average but were still able to reconstruct quite accurate trees. It is my opinion (but not only mine; see, e.g., Morrison and Ellis, 1997) that the ability to correctly resolve evolutionary relationships of the sequences (i.e., topological accuracy) allows more reasonable comparison of different phylogenetic reconstruction methods, optimality criteria, and parameter settings than sequence alignments as such. Furthermore, the main usage of POY is to recover phylogenies without the need to align sequences.
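To make the scoring rule concrete, the sketch below computes a pairwise alignment accuracy in which only residue-residue pairings count and any column position holding a gap is ignored. It assumes that the denominator is the number of residue pairs aligned in the true alignment; the exact definition of total alignment accuracy is given by Rosenberg (2005b) and may differ in detail. This is not the code used in this study, and all names and sequences are hypothetical.

```python
from itertools import combinations


def residue_columns(row):
    """List the alignment column of each residue, in sequence order."""
    return [col for col, ch in enumerate(row) if ch != "-"]


def aligned_pairs(alignment):
    """Set of residue pairs (seq_a, pos_a, seq_b, pos_b) sharing a column.

    Gap-gap and gap-residue pairings never enter the set.
    """
    pairs = set()
    for a, b in combinations(sorted(alignment), 2):
        column_to_pos_b = {
            col: pos for pos, col in enumerate(residue_columns(alignment[b]))
        }
        for pos_a, col in enumerate(residue_columns(alignment[a])):
            if col in column_to_pos_b:
                pairs.add((a, pos_a, b, column_to_pos_b[col]))
    return pairs


def alignment_accuracy(true_alignment, test_alignment):
    """Fraction of truly homologous residue pairs recovered (assumed form)."""
    true_pairs = aligned_pairs(true_alignment)
    if not true_pairs:
        raise ValueError("true alignment contains no aligned residue pairs")
    return len(true_pairs & aligned_pairs(test_alignment)) / len(true_pairs)


# Hypothetical two-sequence example.
true_aln = {"s1": "ACG-T", "s2": "A-GCT"}
test_aln = {"s1": "ACGT", "s2": "AGCT"}
accuracy = alignment_accuracy(true_aln, test_aln)
```

In this example the ungapped test alignment recovers two of the three truly homologous pairs, which illustrates how an alignment can score well below 1 while still pairing most of the informative positions correctly.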

The most extensive DO analyses of the present study were performed with POY4 (Varón et al., 2007), which has faster algorithms and better heuristics than the POY3 used by Ogden and Rosenberg (2007a). Although the programs differ, they are expected to do the same thing. The great speed-up provided by POY4 made it possible to fairly compare DO and two-step approaches over relatively many simulations. Yet, my test runs on the HP CP4000 BL ProLiant supercluster (Finnish IT Center for Science [CSC]) suggest that reanalyzing Ogden and Rosenberg's (2007a) 15,400 simulations would require years of nonstop computing in a supercomputer cluster with a basic POY4 search and probably decades with a comparable search in POY3. Thus, analyzing thousands of simulated data sets, even with default POY settings, is not very practical. But in any real study where months or even years are spent in collecting data, a few more hours used for the data analysis should not matter much.

It should be remembered that by treating gaps as missing data all the phylogenetic information related to indel events is lost. For this reason it is generally advised to code gaps as a fifth character state (Giribet and Wheeler, 1999; Simmons and Ochoterena, 2000; Ogden and Rosenberg, 2007b), but in this study the default setting of PAUP* was used. PAUP* might have performed better if the indel information had been used. POY4-basic analyses were not especially extensive either, and more sophisticated refining techniques could have been used (see, for example, Goloboff, 1999). Therefore, neither PAUP* nor POY analyses in this study were optimal, and probably both approaches could perform better.
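The loss of indel information can be illustrated with a small Fitch parsimony sketch that scores a single alignment column under the two gap treatments. It assumes a binary tree written as nested tuples and equal costs for all changes; it is not the PAUP* implementation, and the function names, tree, and column are hypothetical.

```python
NUCLEOTIDES = frozenset("ACGT")


def tip_states(symbol, gap_mode):
    """State set for a tip: a gap is either 'missing' or a fifth state."""
    if symbol == "-":
        # Missing data is compatible with any nucleotide.
        return NUCLEOTIDES if gap_mode == "missing" else frozenset("-")
    return frozenset(symbol)


def fitch_length(tree, column, gap_mode="missing"):
    """Minimum number of changes for one column on a binary tree (Fitch)."""
    steps = 0

    def visit(node):
        nonlocal steps
        if isinstance(node, str):
            return tip_states(column[node], gap_mode)
        left, right = (visit(child) for child in node)
        shared = left & right
        if shared:
            return shared
        steps += 1          # no shared state: one change is required
        return left | right

    visit(tree)
    return steps


# Hypothetical column in which two sister taxa share a deletion.
tree = (("t1", "t2"), ("t3", "t4"))
column = {"t1": "-", "t2": "-", "t3": "G", "t4": "G"}
as_missing = fitch_length(tree, column, gap_mode="missing")
as_fifth_state = fitch_length(tree, column, gap_mode="fifth")
```

With gaps as missing data the column requires no changes and therefore cannot support the grouping of t1 and t2, whereas with gaps as a fifth state the shared deletion costs one step and becomes potential evidence for that group.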

Both in this study and in Ogden and Rosenberg's (2007a) study, the less extensive POY searches were clearly outperformed by traditional Clustal+PAUP* analyses, but this does not mean that the traditional two-step approach would be a better choice than DO, as suggested by Ogden and Rosenberg (2007a). It merely shows that an effective tree search strategy is highly important, especially in DO, but even a default POY search recovers topologies at least as well as the default Clustal+PAUP* approach. More research by means of more realistic simulations, e.g., several data partitions of varying evolutionary rates analyzed with unbiased search strategies, might further illuminate the strengths and pitfalls of the two-step approach and DO.


    Acknowledgment
Michael S. Rosenberg kindly provided code for calculating total alignment accuracy, and Toni Lehtonen translated it to the programming language Python for use in this study. I thank Pälvi Salo for her help with statistical tests and reviewers of previous drafts of this paper for their helpful critiques. This study was funded by an Academy of Finland grant to Hanna Tuomisto.


    References
    De Laet J., Wheeler W. POY version 3.0.11 (2003) (W. C. Wheeler, D. Gladstein, and J. De Laet, May 6 2003). Command line documentation.

    Giribet G., Wheeler W. C. On gaps. Mol. Phylogenet. Evol. (1999) 13:132–143.

    Goloboff P. A. Analyzing large data sets in reasonable times: Solutions for composite optima. Cladistics (1999) 12:199–220.

    Hasegawa M., Kishino K., Yano T. Dating the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. (1985) 22:160–174.

    Maddison W. P., Maddison D. R. Mesquite: A modular system for evolutionary analysis. Version 1.05 (2004).

    Morrison D. A., Ellis J. T. Effects of nucleotide sequence alignment on phylogeny estimation: A case study of 18S rDNA of Apicomplexa. Mol. Biol. Evol. (1997) 14:428–441.

    Ogden T. H., Rosenberg M. S. Multiple sequence alignment accuracy and phylogenetic inference. Syst. Biol. (2006) 55:314–328.

    Ogden T. H., Rosenberg M. S. Alignment and topological accuracy of the direct optimization approach via POY and traditional phylogenetics via ClustalW + PAUP*. Syst. Biol. (2007a) 56:182–193.

    Ogden T. H., Rosenberg M. S. How should gaps be treated in parsimony? A comparison of approaches using simulation. Mol. Phylogenet. Evol. (2007b) 42:817–826.

    Robinson D. F., Foulds L. R. Comparison of phylogenetic trees. Math. Biosci. (1981) 53:131–147.

    Rosenberg M. S. MySSP: Non-stationary evolutionary sequence simulation, including indels. Evol. Bioinformatics Online (2005a) 1:51–53.

    Rosenberg M. S. Evolutionary distance estimation and fidelity of pair wise sequence alignment. BMC Bioinformatics (2005b) 6:102.

    Simmons M. P., Ochoterena H. Gaps as characters in sequence-based phylogenetic analyses. Syst. Biol. (2000) 49:369–381.

    Steel M., McKenzie A. Properties of phylogenetic trees generated by Yule-type speciation models. Math. Biosci. (2001) 170:91–112.

    Swofford D. L. PAUP*: Phylogenetic analysis using parsimony (*and other methods). Version 4 (1998) Sunderland, Massachusetts: Sinauer Associates.

    Thompson J. D., Gibson T. J., Plewniak F., Jeanmougin F., Higgins D. G. The ClustalX windows interface: Flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res. (1997) 24:4876–4882.

    Varón A., Vinh L. S., Bomash I., Wheeler W. C. POY. Version 4.0 Beta 1908 (2007) http://research.amnh.org/scicomp/projects/poy.php American Museum of Natural History, New York. Documentation by A. Varón, L. S. Vinh, I. Bomash, W. Wheeler, I. Tëmkin, K. M. Pickett, J. Faivovich, T. Grant, and W. L. Smith.

    Wheeler W. Optimization alignment: The end of multiple sequence alignment in phylogenetics? Cladistics (1996) 12:1–9.

    Wheeler W. C., Gladstein D., De Laet J. POY. Version 3.0.11. American Museum of Natural History (2003) New York.

