The advent of next-generation sequencing (NGS, also known as high-throughput sequencing or deep sequencing) in 2005 led to an explosion in the amount of sequencing data. A typical sequencing experiment now yields billions of snippets of DNA characters (reads) that are sampled from the DNA molecules in a biological specimen. Modern microarrays, by contrast, produce only on the order of millions of decimal intensity values per experiment. Because of this explosion in the number of sequences, a large fraction of the computational effort spent analyzing biological data is now dedicated to determining where these reads came from in the sample and how they fit together. As more reads are traced back to their original locations, a clearer picture emerges of which DNA sequences are present in the biological specimen and in what abundance. This knowledge then enables investigators to address relevant scientific questions, such as how a particular genome differs from a reference genome, or which genes or isoforms are differentially expressed between two conditions.

We focus on read alignment, which is often the most demanding computational problem tackled in sequencing studies. The classic application is to align sequencing reads to a high-quality reference genome, such as the human genome reference or a bacterial genome reference. The main goal in mapping genomic DNA back to the organism's genome reference sequence is to determine sequence variation. In many ways, NGS short-read technology became viable only after the Human Genome Project and the completion of model organism references over the last decade. Experiments requiring read alignment solutions are manifold, and solving the read alignment problem opens up a multitude of possible applications.
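To make the task concrete, the following minimal Python sketch shows read alignment in its most naive form: scan every position of the reference and report the offset with the fewest mismatches, thereby tolerating substitution-type sequence variation. The reference string, the reads, and the helper names (`hamming`, `align_read`) are illustrative inventions for this review, not the algorithm of any tool discussed here.

```python
# Toy illustration of the read alignment problem: for each read, find the
# reference position with the fewest mismatches (Hamming distance).
# This handles substitutions only; real aligners also model indels.

def hamming(a: str, b: str) -> int:
    """Count mismatching characters between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def align_read(reference: str, read: str) -> tuple[int, int]:
    """Return (best_position, mismatches) for read against reference."""
    best_pos, best_mm = -1, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        mm = hamming(reference[pos:pos + len(read)], read)
        if mm < best_mm:
            best_pos, best_mm = pos, mm
    return best_pos, best_mm

reference = "ACGTACGTTAGCCGATTACA"
for read in ["CGTTAGC", "GATTACA", "CGTTCGC"]:  # last read carries one mismatch
    pos, mm = align_read(reference, read)
    print(f"{read} -> position {pos}, {mm} mismatch(es)")
```

This brute-force scan costs on the order of |reference| x |read| operations per read, which is hopeless at genome scale; the practical aligners surveyed in this review instead build index structures over the reference so that candidate positions can be located quickly.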
# Next-generation sequence analysis software
In the early days of NGS and sequencing by synthesis, researchers realized that the sequence alignment software tools popular at the time were simply not efficient enough to analyze NGS-scale data sets, and were also not engineered for the problem.
Since then, a large number of publications have described new algorithms and methods to solve this problem. These efforts originated from various fields of study, often computer science but sometimes statistics and mathematics. Some primarily address efficiency, whereas others address scalability, accuracy, or interpretability. Some methods are geared toward certain sequencing technologies, whereas others are more general. In this review, we sift through and summarize this large body of work and distill essential points relevant to a wide audience of both method developers and practitioners.