Fragment assignment in the cloud with eXpress-D

Creators: Roberts, Adam; Feng, Harvey; Pachter, Lior

Abstract

Background: Probabilistic assignment of ambiguously mapped fragments produced by high-throughput sequencing experiments has been demonstrated to greatly improve accuracy in the analysis of RNA-Seq and ChIP-Seq, and is an essential step in many other sequence census experiments. A maximum likelihood method using the expectation-maximization (EM) algorithm for optimization is commonly used to solve this problem. However, batch EM-based approaches do not scale well with the size of sequencing datasets, which have been increasing dramatically over the past few years. Thus, current approaches to fragment assignment rely on heuristics or approximations for tractability.

Results: We present an implementation of a distributed EM solution to the fragment assignment problem using Spark, a data analytics framework that can scale by leveraging compute clusters within datacenters ("the cloud"). We demonstrate that our implementation easily scales to billions of sequenced fragments, while providing the exact maximum likelihood assignment of ambiguous fragments. The method is shown to be more accurate than the most widely used tools available, and it can be run in a constant amount of time when cluster resources are scaled linearly with the amount of input data.

Conclusions: The cloud offers one solution for the difficulties faced in the analysis of massive high-throughput sequencing data, which continue to grow rapidly. Researchers in bioinformatics must follow developments in distributed systems, such as new frameworks like Spark, for ways to port existing methods to the cloud and help them scale to the datasets of the future. Our software, eXpress-D, is freely available at: http://github.com/adarob/express-d.
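To make the batch EM idea in the abstract concrete, the following is a minimal single-machine sketch in Python. It is not the eXpress-D implementation: the function name `em_fragment_assignment`, the input layout (each fragment given as a list of compatible target indices), and the simplified model (no target lengths, alignment scores, or bias terms) are all assumptions made for illustration.

```python
def em_fragment_assignment(alignments, num_targets, iterations=100):
    """Estimate relative target abundances by batch EM (illustrative only).

    alignments: list of lists; alignments[i] holds the indices of the
        targets that fragment i maps to (hypothetical input layout).
    num_targets: total number of targets.
    Returns a list of abundances that sums to 1.
    """
    num_fragments = len(alignments)
    # Start from a uniform abundance estimate.
    rho = [1.0 / num_targets] * num_targets

    for _ in range(iterations):
        expected_counts = [0.0] * num_targets

        # E-step: split each fragment across its compatible targets in
        # proportion to the current abundance estimates.
        for targets in alignments:
            total = sum(rho[t] for t in targets)
            for t in targets:
                expected_counts[t] += rho[t] / total

        # M-step: new abundances are the normalized expected counts.
        rho = [c / num_fragments for c in expected_counts]

    return rho


if __name__ == "__main__":
    # Two fragments map only to target 0, one only to target 1,
    # and one maps ambiguously to both.
    fragments = [[0], [0], [0, 1], [1]]
    print(em_fragment_assignment(fragments, num_targets=2))
```

In this simplified model the E-step treats every fragment independently and the M-step is a sum over fragments, which is the map-and-reduce shape that makes the computation a natural fit for a cluster framework such as Spark.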

Additional Information

© Roberts et al.; licensee BioMed Central Ltd. 2013. This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Received: 13 September 2013. Accepted: 18 November 2013. Published: 7 December 2013.

Acknowledgements: We thank Matei Zaharia, Kristal Curtis, and Reynold Xin for discussions on the feasibility of the Spark implementation. Adam Roberts was supported by an NSF graduate research fellowship. Lior Pachter was partially supported by NIH HG006129.

Authors' contributions: AR developed the method. AR and HF implemented the method and analyzed the results. AR, HF, and LP wrote the manuscript. All authors read and approved the final manuscript.

Availability and usage: eXpress-D and Spark are open source software that can be downloaded from their respective websites, http://github.com/adarob/express-d and http://spark.incubator.apache.org/. For ease of use, the eXpress-D source code includes a copy of a Spark script that allows users to launch, set up, and manage EC2 clusters running Spark and HDFS. The script can be used to launch all nodes in the cluster using a customized Amazon Machine Image (AMI), a type of templated operating system [24], that is preloaded with eXpress-D source and binaries. Target and fragment datasets can then be loaded into HDFS or S3 for distributed execution. The eXpress-D wiki page includes more detail about using the script to launch clusters, as well as notes on cluster configuration and tuning.

Competing interests: The authors declare that they have no competing interests.

Attached Files

Published - art_3A10.1186_2F1471-2105-14-358.pdf

Supplemental Material - 12859_2013_6238_MOESM1_ESM.ZIP

Supplemental Material - 12859_2013_6238_MOESM2_ESM.pdf

Supplemental Material - 12859_2013_6238_MOESM3_ESM.pdf

Supplemental Material - 12859_2013_6238_MOESM4_ESM.pdf

Files (772.8 kB)

Name	MD5	Size
art_3A10.1186_2F1471-2105-14-358.pdf	d9f06241932e0be6582610278762b663	519.6 kB
12859_2013_6238_MOESM4_ESM.pdf	921a6d7362d4889e9bfc346cec4e265d	85.4 kB
12859_2013_6238_MOESM1_ESM.ZIP	915d24c5ee7beabc73d696e55820c3c5	4.4 kB
12859_2013_6238_MOESM2_ESM.pdf	f2181633b866b9db90a90cef0511b675	68.3 kB
12859_2013_6238_MOESM3_ESM.pdf	5f2c56bd09c5dc1c96098c32977760b7	95.1 kB
