Welcome to the new version of CaltechAUTHORS. Login is currently restricted to library staff. If you notice any issues, please email coda@library.caltech.edu
Published May 2009 | Supplemental Material + Published
Journal Article Open

Fast Statistical Alignment

Abstract

We describe a new program for the alignment of multiple biological sequences that is both statistically motivated and fast enough for problem sizes that arise in practice. Our Fast Statistical Alignment program is based on pair hidden Markov models which approximate an insertion/deletion process on a tree and uses a sequence annealing algorithm to combine the posterior probabilities estimated from these models into a multiple alignment. FSA uses its explicit statistical model to produce multiple alignments which are accompanied by estimates of the alignment accuracy and uncertainty for every column and character of the alignment—previously available only with alignment programs which use computationally-expensive Markov Chain Monte Carlo approaches—yet can align thousands of long sequences. Moreover, FSA utilizes an unsupervised query-specific learning procedure for parameter estimation which leads to improved accuracy on benchmark reference alignments in comparison to existing programs. The centroid alignment approach taken by FSA, in combination with its learning procedure, drastically reduces the amount of false-positive alignment on biological data in comparison to that given by other methods. The FSA program and a companion visualization tool for exploring uncertainty in alignments can be used via a web interface at http://orangutan.math.berkeley.edu/fsa/, and the source code is available at http://fsa.sourceforge.net/.

Additional Information

© 2009 Bradley et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Received: October 22, 2008; Accepted: April 20, 2009; Published: May 29, 2009. Ian Holmes was partially supported by NIH/NHGRI grant 1R01GM076705. Lior Pachter was partially supported by NSF CAREER award CCF 03-47992. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. FSA borrows heavily from previous work, both in its code base and its intellectual foundations. Ideas. The distance-based approach to multiple alignment was proposed in [13],[14]. This included the idea of modifying the accuracy criterion suggested [53] and [54] to include gaps and the demonstration that the resulting modified expected accuracy could be used to control the expected sensitivity and specificity. Furthermore, [13],[14] introduced the sequence annealing approach to building multiple alignments, via the description of alignments using partially ordered sets [31],[55],[56]. The graph-based approach to alignment was formalized by [57] and these results were used in the DIALIGN program [58]. The query-specific learning method for re-estimating alignment parameters on the fly was inspired by [15] and conversations with Joseph Bradley about query-specific structure learning of graphical models. The iterative refinement technique is based on ideas in [59]. The FSA algorithm was parallelized using a modification of the approach in MW [60]. The coloring in the GUI according to posterior probabilities of alignment is inspired by the AU viewer of BAli-Phy [9]. Software. The sequence annealing implementation in FSA is based on Ariel Schwartz's AMAP program [14], which implements the Pearce-Kelly algorithm [61]. FSA's query-specific learning uses code created with Gerton Lunter's HMMoC compiler for HMMs [15]. The "aligner" example distributed with HMMoC, which implements a learning procedure for gap parameters, was an inspiration for FSA's learning strategies. FSA's banding code is taken directly from the "aligner" example. FSA's sequence and alignment representation code was inspired by similar code in the dart library [62]. Several Perl packages distributed with FSA are derived from packages of the same name in dart. Author Contributions: Wrote the paper: RKB CD LP. Led the development of FSA, wrote most of the code base, and developed the query-specific learning method: RKB. Redesigned the sequence annealing algorithm, constituted the core development team, and managed the project: RKB CD LP. Developed the GUI: AR. Developed a preliminary version of the GUI: MS. Developed the iterative refinement technique: SJ. Developed the parellelization and database modes: JD CD. Provided advice on the dart library, including its algorithms, programming and software components: IH. Created the FSA webserver: LP. The authors have declared that no competing interests exist.

Attached Files

Published - journal.pcbi.1000392.PDF

Supplemental Material - Text_S1.pdf

Files

journal.pcbi.1000392.PDF
Files (1.3 MB)
Name Size Download all
md5:99f9446bab763f58edd30c40dc7d7029
1.1 MB Preview Download
md5:db884e121079ed5d85b5559e34b76427
226.8 kB Preview Download

Additional details

Created:
August 20, 2023
Modified:
October 24, 2023