Using Mate Paired Reads and Paired End in Soap De Novo
Abstract
Motivation: Many de novo genome assemblers have been proposed recently. The basis for almost all existing methods relies on the de Bruijn graph: a complex graph structure that attempts to encompass the entire genome. Such graphs can be prohibitively large, may fail to capture subtle information and are difficult to parallelize.
Result: We present a method that eschews the traditional graph-based approach in favor of a simple 3′ extension approach that has the potential to be massively parallelized. Our results show that it is able to obtain assemblies that are more contiguous, complete and less error prone compared with existing methods.
Availability: The software package can be found at http://www.comp.nus.edu.sg/~bioinfo/peasm/. Alternatively it is available from the authors upon request.
Contact: ksung@comp.nus.edu.sg; sungk@gis.a-star.edu.sg
Supplementary information: Supplementary data are available at Bioinformatics online.
1 INTRODUCTION
De novo genome assembly has been a fundamental problem in bioinformatics since the advent of DNA sequencing. The second-generation sequencing technologies such as Illumina Solexa and ABI SOLiD have introduced a new sense of vigor to the field. The short length of the sequences coupled with high coverage and a higher level of noise has transformed de novo assembly to a tractable yet challenging proposition. The ease with which paired-end read libraries can be generated on these platforms is an added advantage.
A number of works have been proposed to assemble short reads. The first few de novo assemblers developed to handle high-throughput short reads were based on base-by-base 3′ extension. SSAKE, VCAKE and SHARCGS (Dohm et al., 2007; Jeck et al., 2007; Warren et al., 2007) are examples using this principle. To resolve ambiguities, these methods adopted trivial heuristics such as 'selecting the base with maximum overlap' or 'selecting the base with the highest consensus'. Such arbitrary criteria result in substandard assemblies that are often a compromise between contiguity and error rate. Furthermore, the approaches were not scalable to handle medium or large genomes; therefore, their use is restricted to assembling BAC clones or small bacterial genomes. They were also not designed to make use of paired-end reads, thus greatly limiting their usefulness in assembling high-throughput data.
The more practical approaches for assembling high-throughput short reads are based on the de Bruijn graph approach. Velvet (Zerbino and Birney, 2008) is perhaps the most widely used method for de novo genome assembly today. It is very fast in execution, fairly memory efficient and produces reasonably accurate assemblies. Similar to all other methods based on the de Bruijn graph, Velvet requires the entire genome to be stored in a graph structure. In the presence of noise, the graph may be too large to be stored in system memory. Furthermore, the resulting assembly generated by Velvet tends to contain many errors at small repeat regions. Another approach, Euler-USR (Chaisson et al., 2009), is very similar in concept to Velvet, but employs more sophisticated error detection and correction steps. However, in practice, we noted that Velvet produces more contiguous and complete assemblies in comparison with Euler-USR. Both Velvet and Euler-USR take full advantage of paired-end read libraries.
One of the major shortcomings of de Bruijn graph approaches is the inability to parallelize the assembly process. This is a critical requirement as many powerful computers use multiple processors where numerous threads can be run seamlessly in parallel. The introduction of ABySS (Simpson et al., 2009) tackled this issue. The core assembly algorithm of ABySS is very similar to that of Velvet, but it allows the de Bruijn graph to be distributed across multiple cores/nodes, and each core/node can operate on the graph independently to a certain extent. The assembly result of ABySS is similar to that of Velvet. Nevertheless, we noticed that when executed in parallel on a single multi-core computer, ABySS does not offer any advantage over Velvet in terms of execution time or memory usage. To utilize ABySS efficiently, a multi-node computing cluster is required, which may seem a disadvantage in an era where computers are increasingly made faster by adding more cores within a single CPU. SOAPdenovo (Li et al., 2010) addressed many of these problems by introducing a de Bruijn graph-based method that can seamlessly take advantage of multi-core systems.
Allpaths/Allpaths2 (Butler et al., 2008; MacCallum et al., 2009) appears to be the most accurate method at present. It introduces an interesting hybrid approach where the genome is still stored as a large graph; however, the graph is separated into different segments and the assembly of these segments can be carried out independently. This makes it possible to run some stages of the Allpaths algorithm in parallel. The high accuracy of Allpaths stems from the fact that it tries all possible ways to assemble every segment; however, this comes at a tremendous cost in terms of time and memory usage, and therefore it will not scale well to larger genomes.
We propose the method PE-Assembler, which is capable of handling large datasets and produces highly contiguous and accurate assemblies within reasonable time. Our approach is based on a simple 3′ extension approach and does not involve representing the entire genome in the form of a graph. Fundamentally, it is similar to other 3′ extension approaches such as SSAKE, VCAKE and SHARCGS. However, it improves upon such early approaches in multiple ways. The extensive use of paired-end reads ensures that the dataset is localized within the region. Hence, our method can be run in parallel to greatly speed up the execution while staying within reasonable system requirements. Ambiguities are resolved using a multiple path extension approach, which takes into account sequence coverage, support from multiple paired libraries and more subtle information such as the span distribution of the paired-end reads.
2 METHODS
Paired-end reads are also known as paired reads or mate pairs (depending on some technical differences) in different literature. Essentially, they all refer to a pair of short reads that originate from the 5′ and 3′ ends of a DNA fragment whose length is known approximately. The length of the fragment is referred to as the insert size. For every paired-end read, its two reads are called the mates of each other. The length of each read is denoted as ReadLength. It could be of any length from 25 to 100 bp. The insert size is not exact. It may vary from MinSpan to MaxSpan.
Our program is called PE-Assembler, which aims to reconstruct the sample genome from a paired-end read library. PE-Assembler can also accept multiple paired-end read libraries of different insert sizes, which can help to resolve ambiguities that cannot be conclusively resolved using a single paired-end read library.
PE-Assembler is fundamentally based on 3′ overlap extension, similar to SSAKE and VCAKE. The procedure is illustrated in Figure 1. Given a sequence, PE-Assembler extracts all reads whose prefix aligns with the suffix of the sequence. We define this as an overlap. The suffix of each read, which overhangs from the 3′ end of the sequence, forms a feasible extension to the contig. If there is a clear consensus for a single base, then that base is appended to the end of the sequence and the process is iterated. Multiple feasible extensions are handled differently in various stages of the algorithm and are described in the following sections.
Fig. 1.
Overview of 3′ overlap extension. Both t and g are feasible extensions.
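The 3′ overlap extension can be illustrated with a minimal sketch. It is not the PE-Assembler implementation: the function names, the exact-match overlap test and the rule 'extend only when exactly one base is proposed' are simplifying assumptions standing in for the consensus criterion described in the text.

```python
from collections import Counter

def feasible_extensions(contig, reads, min_overlap):
    """Count the bases that overlapping reads propose as the next 3' base."""
    votes = Counter()
    for read in reads:
        # Try the longest overlap first; the read must overhang by at least one base.
        longest = min(len(read) - 1, len(contig))
        for olap in range(longest, min_overlap - 1, -1):
            if contig.endswith(read[:olap]):
                votes[read[olap]] += 1
                break
    return votes

def extend(contig, reads, min_overlap=4, max_steps=100):
    """Append bases while exactly one candidate base is proposed (assumed rule)."""
    for _ in range(max_steps):
        votes = feasible_extensions(contig, reads, min_overlap)
        if len(votes) != 1:
            break  # no candidate, or an ambiguity that later stages must resolve
        contig += next(iter(votes))
    return contig
```

When two reads propose different bases, as with t and g in Figure 1, `feasible_extensions` returns both candidates and `extend` stops.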
PE-Assembler is implemented as a series of five steps, which are briefly described as follows (also see Supplementary Fig. 1). First, the read screening step selects a set of reads (called 'solid' reads) as starting points for extending the assembly. This step specifically avoids reads containing sequencing errors and reads occurring in repeat regions in the genome. The second step then extends these 'solid' reads using single-end reads to make them longer than MaxSpan. Those successfully extended regions are called seeds. Seeds are long enough for extension using paired-end reads. Our third step (called contig extension) tries to extend all these seeds using paired-end reads. The resulting sequences are called contigs. The fourth step links those contigs using paired-end reads to form scaffolds (i.e. ordered sets of contigs with gaps in between). Finally, the last step tries to fill in the gaps between scaffolded contigs. Below, we detail the five steps.
2.1 Read screening
Many short read assemblers perform error correction/detection steps prior to the assembly. While this is generally effective in detecting and fixing random sequencing errors, it treats each read as a single read and therefore fails to use the pairing information. This may result in overcorrecting the reads coming from low coverage regions as the actual location of the paired-end read is not taken into account.
Our approach does not perform error correction. However, we require a pool of error-free and non-repetitive reads as starting points for the seed building step (Section 2.2). These reads are isolated by carrying out a read screening step.
The idea behind the screening step is similar to the kmer frequency-based error correction method proposed by Pevzner et al. (2001). Its details are as follows. A kmer is a length-k DNA sequence. Provided the genome is sampled at a high coverage, a kmer that occurs in the genome is likely to occur multiple times in the input reads. If a particular kmer occurs once (or very sparingly) in the input reads, such a kmer is unlikely to occur in the target genome and is likely to be the result of a sequencing error. Similarly, if a kmer occurs at a higher frequency than expected, we can conclude that it may have originated from a repeat region in the genome. A kmer that is expected to occur in the actual genome is called a 'solid' kmer while a kmer that is expected to occur within a repeat region is called a 'repeat' kmer. To classify a kmer as either a solid kmer or a repeat kmer, we scan the entire dataset of reads to extract the set of kmers and their frequencies. A kmer frequency histogram is plotted. Then, we identify the solid kmer threshold and the repeat kmer threshold from the troughs on either side of the main peak (Fig. 2). A read is said to be 'solid' if the frequencies of all its kmers are higher than the solid kmer threshold and lower than the repeat kmer threshold. Only solid reads are chosen as the starting points for the next step. Note that this stage does not discard or correct any data. The entire dataset is used in the assembly as it is.
Fig. 2.
The kmer frequency histogram. We can determine the solid kmer cutoff and repeat kmer cutoff from the two troughs.
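The screening rule can be sketched as follows. This is only an illustration: the function names are hypothetical, and the two thresholds are passed in as parameters on the assumption that they have already been read off the troughs of the histogram.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every kmer occurring in the read set."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def is_solid(read, counts, k, solid_threshold, repeat_threshold):
    """A read is solid if all its kmer frequencies lie strictly between the thresholds."""
    if len(read) < k:
        return False  # assumption: reads shorter than k are never used as start points
    return all(
        solid_threshold < counts[read[i:i + k]] < repeat_threshold
        for i in range(len(read) - k + 1)
    )
```

Reads that fail the test are not removed from the dataset; they are simply not used as starting points, in line with the note above.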
2.2 Seed building
A 'seed' is defined as a contiguous region in the target genome which is of length at least MaxSpan. To assemble a seed, we start with an unused solid read as the initial seed and carry out 3′ overlap extension as described above. However, due to the presence of small repeats or sequencing errors, there may be multiple feasible candidates for the next 3′ base.
Ambiguities arising due to repeats can be resolved with the help of paired-end reads. Throughout the seed assembly, we maintain a pool of reads whose mates map onto the current seed. In case of any ambiguity, for every read overlapping with the seed, we check if its mate overlaps with any reads in the maintained pool (Fig. 3). Those without overlap support are assumed to be noise and thus discarded.
Fig. 3.
Resolving ambiguities in the seed building step: suppose the current seed can be extended by two possible candidates 'a' and 'g'. Assume that, for reads extending 'g', their mates overlap with the reads in the pool, while the reads extending 'a' do not have such support. Then, we can safely select candidate 'g' for extension.
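The mate-support check of Figure 3 might be sketched as below. The names are hypothetical, overlaps are exact suffix/prefix matches, and all reads are assumed to be already oriented on the same strand, which a real implementation would have to handle explicitly.

```python
def reads_overlap(a, b, min_overlap):
    """True if a suffix of one read equals a prefix of the other."""
    for olap in range(min_overlap, min(len(a), len(b)) + 1):
        if a[-olap:] == b[:olap] or b[-olap:] == a[:olap]:
            return True
    return False

def supported_bases(candidates, pool, min_overlap):
    """Keep candidate bases whose extending read has a mate overlapping the pool.

    candidates: list of (next_base, mate_sequence) pairs, one per overlapping read.
    pool: reads whose mates map onto the current seed.
    """
    return {
        base
        for base, mate in candidates
        if any(reads_overlap(mate, pooled, min_overlap) for pooled in pool)
    }
```

If exactly one base remains after this filter, it can be selected for extension; otherwise the ambiguity is left to the other mechanisms described in the text.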
The above method cannot resolve ambiguities arising due to sequencing errors. In such a case, we extend every candidate base up to a distance of ReadLength. Any extension path arising due to sequencing errors is likely to be terminated prematurely. If only one candidate path can reach the full distance, then that path is assumed to be the correct extension.
At any stage, if there is no candidate for extension (likely due to low sequencing coverage) or there are multiple candidates for extension (possibly due to longer repeats), the extension is terminated. The seed will then be extended from the other side. The extension will be 'successfully' terminated once the seed reaches the length of MaxSpan.
For every successfully terminated seed, a seed verification step is performed to ensure that the seed represents a contiguous region in the target genome. Precisely, to verify the 3′ end of a seed, we require that at least one paired-end read overlaps with the 3′ end of the seed (Fig. 4). Similarly, we can verify the 5′ end of a seed. All verified seeds are immediately subjected to the contig extension step (Section 2.3). Seeds which fail the verification step are discarded.
Fig. 4.
A paired-end read is said to overlap the 3′ end of a seed if the 3′ read of the paired-end read overlaps the 3′ end of the seed and the 5′ read maps on the seed within the expected region, as determined by MinSpan and MaxSpan of the library.
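The condition in Figure 4 can be written as a small predicate. The coordinate convention (0-based start positions on the seed, both reads of equal length) and the function name are assumptions made for this sketch.

```python
def overlaps_3prime_end(seed_len, five_start, three_start, read_len, min_span, max_span):
    """Return True if a read pair supports the 3' end of a seed."""
    # The 3' read must start on the seed and overhang past its last base.
    overhangs = three_start < seed_len < three_start + read_len
    # The 5' read must lie entirely on the seed.
    on_seed = 0 <= five_start and five_start + read_len <= seed_len
    # The implied span runs from the start of the 5' read to the end of the 3' read.
    span = three_start + read_len - five_start
    return overhangs and on_seed and min_span <= span <= max_span
```

For a 300 bp seed, 35 bp reads and a library with MinSpan 195 and MaxSpan 275, a pair with its 5′ read at position 80 and its 3′ read at position 280 implies a span of 235 bp and is accepted.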
2.3 Contig extension
The contig extension step aims to extend each verified seed to form a longer contig iteratively. Again, this step relies on overlap extension to elongate the current contig, but with some differences. Since a contig is longer than MaxSpan, instead of using single reads to extend the contig, we try to identify feasible extensions from paired-end reads that overlap with the contig. Moreover, when no paired-end read is found overlapping with the contig, we identify feasible extensions from overlapping reads instead.
If a clear consensus is found among the feasible extensions, then that base is appended to the end of the contig and the process is repeated. Occasionally, there are multiple feasible candidates to extend the contig. Such a scenario may arise for three reasons. The first reason is sequencing errors. These errors can be dealt with as in the seed building step. The second reason is short tandem repeat regions. In such a case, we stop the extension and try to estimate the correct number of tandem repeats during the gap filling step. The third reason is long repeats. In such a case we also stop the extension. Note that when the repeat is longer than MaxSpan, we cannot theoretically resolve the ambiguity using the given paired-end read data. A paired-end read library of longer insert size is required to resolve such ambiguity.
The contig extension step is performed until we cannot extend the contig from either end. Then, the resultant contig is kept to be used in scaffolding.
2.4 Scaffolding
The objective of the scaffolding step is to find the correct ordering of the resulting set of contigs.
As the scaffolding step is very sensitive to the presence of repeat regions, the first step is to demarcate all repeat regions within assembled contigs. In this step, all individual reads are mapped back to the contigs and read density across all the contigs is calculated. The mode of the read density is assumed to be the expected read coverage across the genome. Any region with read density higher than 1.5 times the mode is considered a repeat region. Any reads mapped onto such a repeat region are discarded.
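A minimal sketch of the repeat demarcation rule is given below. It assumes read density is available as one integer per contig position and reports half-open intervals; both choices, and the function name, are illustrative assumptions.

```python
from collections import Counter

def repeat_regions(depth, factor=1.5):
    """Return the modal depth and the intervals whose depth exceeds factor * mode."""
    expected = Counter(depth).most_common(1)[0][0]  # mode of the read density
    regions, start = [], None
    for i, d in enumerate(depth):
        if d > factor * expected:
            if start is None:
                start = i  # a repeat region opens here
        elif start is not None:
            regions.append((start, i))  # half-open interval [start, i)
            start = None
    if start is not None:
        regions.append((start, len(depth)))
    return expected, regions
```

Reads mapping inside any reported interval would then be excluded from the scaffolding step.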
During this step, additional statistics such as the average span and standard deviation for each library are calculated. This information is used during the gap filling phase.
For the scaffolding step, we only consider the paired-end reads whose two reads map uniquely to two different contigs. Such a mapping is referred to as a chimeric mapping. Although we cannot estimate the exact span of a chimeric mapping, the minimum span for a chimeric mapping can be calculated from the distance that it has covered on the two contigs (Fig. 5). Every paired-end read mapping whose minimum span exceeds MaxSpan is discarded.
Fig. 5.
Minimum span distance of this chimeric mapping is a+b. Actual span may vary depending on the gap size between contigs X and Z.
Two contigs of specific orientation are said to be linked by an edge if there is at least a certain number of chimeric mappings between the two contigs in that orientation. The weight of the edge is the total number of such chimeric mappings, normalized by the total number of paired-end reads in the library. The maximum gap size is estimated by subtracting the average of the minimum spans of all chimeric paired-end reads of that edge from MaxSpan. Multiple fragment libraries of different insert sizes may be used at this point. Each library will result in its own distinct set of edges.
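The edge construction for a single library can be sketched as follows. The tuple layout, the `min_links` cutoff (the text only says 'a certain number') and the assumption that each contig name already encodes its orientation are all hypothetical choices for this illustration.

```python
from collections import defaultdict

def build_edges(mappings, library_size, max_span, min_links=2):
    """Build weighted contig edges from chimeric mappings.

    mappings: iterable of (contig_x, contig_z, a, b), where a and b are the
    distances covered on the two contigs, so the minimum span is a + b.
    """
    spans = defaultdict(list)
    for x, z, a, b in mappings:
        min_span = a + b
        if min_span <= max_span:  # mappings exceeding MaxSpan are discarded
            spans[(x, z)].append(min_span)
    edges = {}
    for pair, values in spans.items():
        if len(values) >= min_links:
            edges[pair] = {
                "weight": len(values) / library_size,
                "max_gap": max_span - sum(values) / len(values),
            }
    return edges
```

With MaxSpan 275 and two retained mappings of minimum span 150 and 190, the estimated maximum gap is 275 − 170 = 105 bp.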
A potential scaffold is a linear ordering of contigs. An edge between two contigs X and Z is deemed satisfied if both contigs X and Z occur within the same scaffold in the correct orientation and the total length of all contigs between X and Z is less than the maximum gap size estimated by that edge; otherwise, if X and Z cannot be arranged so that they are within the expected span, the edge is said to be contradicted. The score for each scaffold is calculated by totaling the weights of all satisfied edges and subtracting the weights of all contradicted edges.
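The scoring rule can be sketched as below. As a simplifying assumption, an edge (X, Z) is treated as correctly oriented when X precedes Z in the scaffold, and edges with a contig outside the scaffold are ignored; the function and field names are hypothetical.

```python
def scaffold_score(scaffold, lengths, edges):
    """Sum the weights of satisfied edges minus the weights of contradicted edges.

    scaffold: ordered list of contig names.
    lengths: dict mapping contig name to its length in bp.
    edges: dict mapping (x, z) to {"weight": ..., "max_gap": ...}.
    """
    position = {name: i for i, name in enumerate(scaffold)}
    score = 0.0
    for (x, z), edge in edges.items():
        if x not in position or z not in position:
            continue  # the edge says nothing about this scaffold
        i, j = position[x], position[z]
        between = sum(lengths[name] for name in scaffold[i + 1:j])
        if i < j and between < edge["max_gap"]:
            score += edge["weight"]  # satisfied
        else:
            score -= edge["weight"]  # contradicted
    return score
```

Comparing the scores of alternative orderings of the same contigs is what the greedy procedure described next relies on.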
The aim of the scaffolding algorithm is to produce a set of scaffolds such that the above score is maximized. However, an exact solution to this is computationally prohibitive. Therefore, we employ the following greedy heuristic approach.
The scaffolding procedure starts by selecting a contig at random as the initial scaffold. The procedure extends the scaffold iteratively by including contigs to the right. A contig X is said to be a right neighbor of a scaffold if there exists some contig Z in the scaffold such that (Z, X) is an edge and the total length of contigs to the right of Z in the scaffold is less than the maximum gap size of (Z, X). All right neighbors of the scaffold are potential candidates to extend the scaffold from its 3′ end. Each candidate right neighbor is temporarily added to the 3′ end of the scaffold, and all permutations of the remaining right neighbors are appended after it to obtain multiple possible orderings. Each such potential ordering is evaluated. The candidate right neighbor that results in the ordering with the highest score is permanently added to the 3′ end of the scaffold. This process is repeated until any of the following occurs: the neighborhood is empty; the best ordering score is negative; or the current region of the scaffold has already been ordered elsewhere. If scaffolding is terminated from the 3′ end, we try to extend the scaffold from the 5′ end. Once neither end is extendable, we obtain one scaffold and the entire procedure is repeated with an unused contig as the starting point to identify other scaffolds.
2.5 Gap filling
The scaffolding step reports a list (or lists) of contigs in the same order as they would be in the actual genome. The adjacent contigs are usually separated by an unknown sequence. The objective of the gap filling step is to assemble the gap region between two adjacent contigs to form a longer contig. Note that the length of the gap can be estimated using paired-end reads that map across two adjacent contigs.
For every read that occurs in the gap, its mate must map to either the left or the right contig of the gap. Hence, the gap can be filled in using such reads. As we are dealing with a localized set of data, the gap filling step can employ a less stringent minimum overlap length, thus facilitating assembly of low-coverage regions.
A key difference between the gap filling step and the seed building step is that the former can resolve convoluted repeat regions by exploiting span information of paired-end reads to a greater degree. Similar to the seed building step, the assembly is carried out using overlap extension. Whenever there are multiple extension paths due to multiple candidate bases, each path is extended up to a distance of ReadLength. Moreover, for each extension path, we can obtain the span histogram of all paired-end reads that map on this extension path. The distribution of this 'perceived' span for each extension path is compared against the span distribution of the entire library. The span distribution of the correct extension will be in line with that of the entire library, whereas distributions of incorrect extensions will exhibit a noticeable shift. This idea is demonstrated in Figure 6.
Fig. 6.
Use of average span and standard deviation to resolve ambiguities in Staphylococcus aureus assembly: (A) Reference sequence from the region in question. Bolded segments are identical. Both 'a' and 't' seem a valid choice after that region. (B) Sequence overlap shows both 'a' and 't' as potential candidates. Both paths are extended up to a distance of 'TagLength'. (C) Illustration of the two different extensions. For each path, spans of paired-end reads mapping across the branching point are kept. Spans resulting from the incorrect assembly are noticeably shorter due to the missing region. (D) Histograms of perceived paired-end read span of the two different paths and that of the entire library. Note that the span distribution of the correct path closely follows that of the library. The path with the span distribution closest to the library span distribution is chosen.
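One simple way to compare span distributions is sketched below. The distance measure used here (the shift of the mean perceived span, in units of the library standard deviation) is an assumption chosen for brevity; the text does not specify the exact comparison used by PE-Assembler.

```python
from statistics import mean

def choose_path(path_spans, library_mean, library_sd):
    """Pick the extension path whose perceived spans best match the library.

    path_spans: dict mapping a candidate base to the list of perceived spans
    of paired-end reads mapping across the branching point on that path.
    """
    def shift(spans):
        # Distance of the path's mean span from the library mean, in library SDs.
        return abs(mean(spans) - library_mean) / library_sd

    return min(path_spans, key=lambda base: shift(path_spans[base]))
```

For a library with mean span 235 bp, a path whose perceived spans cluster around 180 bp would be rejected in favor of one clustering around 233 bp.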
The adjacent contigs whose gap can be successfully bridged are merged into a single contig. The resulting set of contigs from this step represents the final output of the assembly.
2.6 Parallelization
This section discusses the issue of parallelization for the five steps. For the read screening step, since it is largely disk bound, parallelizing this step does not improve the performance noticeably. All remaining steps can be run as threads on multiple cores sharing the same memory space almost independently of each other.
For the seed building step and the contig extension step, the solid reads and the seeds are distributed to different threads for parallel execution. Provided the genome is reasonably large and the number of threads is not impractically high, we can assume most of the threads assemble different regions of the genome. Every thread will mark the reads which have so far been used in the assembly. Periodically, every thread will refer to this information to detect if the region it is currently assembling has been previously assembled by other threads. If a read is detected to be marked by other threads, the thread will rewind the assembly to the last unmarked read and terminate.
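The read-marking scheme could look roughly like the following sketch. The class and function names are hypothetical, and the check is made on every read rather than periodically, which is a simplification of the behavior described above.

```python
import threading

class ReadMarks:
    """Shared record of which thread first used each read."""

    def __init__(self):
        self._owner = {}
        self._lock = threading.Lock()

    def claim(self, read_id, thread_id):
        """Mark a read as used; return False if another thread marked it first."""
        with self._lock:
            owner = self._owner.setdefault(read_id, thread_id)
            return owner == thread_id

def assemble_region(read_ids, marks, thread_id):
    """Consume reads in assembly order, stopping at the first read marked elsewhere."""
    used = []
    for read_id in read_ids:
        if not marks.claim(read_id, thread_id):
            break  # region already assembled by another thread: keep reads up to here
        used.append(read_id)
    return used
```

A thread that runs into reads 3 and 2 after another thread has already used reads 1 to 3 keeps only the reads it claimed before the collision.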
The scaffolding step involves mapping each paired-end read back to the assembled contigs and forming a graph comprising contigs as nodes and 'chimeric' paired-end reads as edges. The graph building step is carried out in parallel. Actual scaffolding is carried out in a single thread; however, this step is not very time consuming.
Gap filling can be executed in parallel since the gaps are localized and independent of one another.
For the entire assembly process, the time taken is roughly inversely proportional to the number of cores/threads utilized. (Please refer to Section 3.)
3 RESULTS
3.1 Simulated data
To evaluate the performance of our approach, experiments were carried out on simulated datasets based on Escherichia coli and Schizosaccharomyces pombe reference genomes. Three libraries of paired-end reads with varying 'fragment lengths' were simulated from each genome (Table 1): a short fragment library of average span 200 bp, a medium fragment library of average span 1000 bp and a long fragment library of average span 10 000 bp. All reads were assumed to be 35 bp long. Precise criteria for simulation are detailed in the Supplementary section. For comparison, we executed popular de novo assembly programs such as Allpaths2, Velvet, ABySS and SOAPdenovo in addition to PE-Assembler (denoted by 'PA' in tables). Each program was run with multiple parameters and the best result for each program is quoted below. The summary results for all experiments and parameters are available in the Supplementary section.
Table 1.
Details of the simulated dataset
Organism | Escherichia coli | | | Schizosaccharomyces pombe | | | HG 18-Chr 10 | | |
---|---|---|---|---|---|---|---|---|---|
No. of contigs/chromosomes | 1 | | | 3 | | | 1 | | |
Genome length (bp) | 4 639 658 | | | 12 571 820 | | | 135 374 737 | | |
Library | 200 bp | 1 kb | 10 kb | 200 bp | 1 kb | 10 kb | 200 bp | 1 kb | 10 kb |
Read length (bp) | 35 | 35 | 35 | 35 | 35 | 35 | 75 | 75 | 75 |
Average insert size (bp) | 235 | 1035 | 10035 | 235 | 1035 | 10035 | 275 | 1075 | 10075 |
Insert size range (average ± bp) | ±40 | ±200 | ±2000 | ±40 | ±200 | ±2000 | ±40 | ±200 | ±2000 |
No. of paired reads (millions) | 3.31 | 3.31 | 3.31 | 8.98 | 8.98 | 8.98 | 45.12 | 9.02 | 9.02 |
Coverage | 50× | 50× | 50× | 50× | 50× | 50× | 50× | 10× | 10× |
Seq. error rate, % | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 |
Ligation error rate, % | 0.0 | 2.0 | 2.0 | 0.0 | 2.0 | 2.0 | 0.0 | 2.0 | 2.0 |
We adopted the following approach to evaluate each assembly result. All contigs were aligned against the reference genome using BLAT (Kent, 2002). Any contig which does not completely align to the reference genome, while allowing for small indels and mismatches, is deemed a 'large misassembly'. To evaluate the accuracy at micro level, we segmented the reference genome into contiguous, non-overlapping sequences of 1000 bp and checked if they can be mapped onto the assembly's contigs without errors. The number of error-free segments that can be mapped onto the contigs is reflective of the accuracy of the assembly. Contiguity of the assembly is measured by the N50 and N90 sizes. The completeness of the assembly is evaluated by computing the percentage of the reference genome covered by the assembled contigs. The computational complexity of each assembler is measured by its running time and memory usage. These evaluation steps are detailed in the Supplementary section.
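For reference, the N50 and N90 sizes can be computed as in the sketch below, using the common definition: the length of the contig at which the running total of contig lengths, sorted from longest to shortest, first reaches 50% or 90% of the total assembly length. The function name is hypothetical, and the paper's Supplementary section may define the statistic relative to a different total.

```python
def nxx(lengths, percent):
    """Return the Nxx size (for example N50 with percent=50) of a set of contigs."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer arithmetic avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length
    return 0
```

For contigs of 80, 70, 50, 40, 30, 20 and 10 kb (300 kb in total), the N50 is 70 kb and the N90 is 30 kb.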
The results for simulated data are summarized in Table 2. The experiments demonstrate that PE-Assembler can generate highly contiguous assemblies at a very low error rate using fewer system resources. While Velvet is fast in execution, the number of misassemblies shows that it lags behind PE-Assembler and Allpaths in terms of accuracy. Both ABySS and SOAPdenovo produce highly fragmented results with relatively small N50 sizes.
Table 2.
Comparison of simulated data results
Genome | Escherichia coli | | | | | Schizosaccharomyces pombe | | | | | HG18 chr10 | | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Assembler | PA | Velvet | Allpaths2 | ABySS | SOAP | PA | Velvet | Allpaths2 | ABySS | SOAP | PA | ABySS | SOAP |
Parameters | −minol = 25 | −k = 27, −cov = auto, −exp = auto | −k = 21 | −k = 25, −j = 3, −n = 10, −np = 8 | −k = 27, −pair_num = 3, −p = 8 | −minol = 25 | −k = 29, −cov = auto, −exp = auto | −k = 21 | −k = 25, −j = 3, −n = 10, −np = 8 | −k = 29, −pair_num = 3, −p = 8 | −minol = 50 | −k = 45, −j = 3, −n = 10 | −k = 31, −p = 20 |
Contig statistics | | | | | | | | | | | | | |
No. of contigs (>200 bp) | 6 | 56 | 44 | 283 | 199 | 31 | 181 | 164 | 650 | 348 | 4262 | 49015 | 18238 |
Average length (kb) | 777.4 | 82606 | 107.6 | 22.3 | 22.8 | 394.7 | 67.9 | 75.3 | 23.1 | 35.0 | 30.2 | 2.9 | 6.6 |
Maximum length (kb) | 2492.6 | 708.6 | 593.7 | 163.0 | 232.0 | 3519.6 | 856.1 | 851.0 | 297.3 | 468.7 | 403.5 | 65.2 | 155.8 |
Contig N50 size (kb) | 2492.6 | 398.3 | 373.3 | 63.8 | 49.9 | 1487.7 | 273.0 | 226.8 | 80.1 | 99.8 | 62.4 | 5.3 | 13.0 |
Contig N90 size (kb) | 2146.0 | 109.9 | 115.4 | 33.9 | 12.4 | 363.6 | 54.4 | 59.5 | 36.7 | 19.0 | 11.1 | 1.7 | 3.6 |
Coverage (%) | 100.00 | 99.59 | 99.85 | 99.00 | 98.87 | 97.78 | 99.35 | 98.60 | 98.38 | 98.91 | 90.89 | 92.04 | 87.14 |
Evaluation | | | | | | | | | | | | | |
Large misassemblies | 0 | 11 | 0 | 1 | 1 | 0 | 17 | 0 | 1 | 4 | 5 | 171 | 605 |
Segment maps (%) | 99.68 | 94.74 | 99.18 | 93.31 | 93.66 | 96.42 | 94.44 | 96.83 | 92.72 | 94.31 | 86.17 | 63.61 | 32.2 |
Performance (a) | | | | | | | | | | | | | |
Total execution time (min) | 21 | 10 | 227 | 43 | 5 | 101 | 40 | 734 | 98 | 11 | 748 | N/A (b) | 240 |
Peak memory usage (GB) | 2.3 | 2.9 | 29.7 | 2.9 | 5.9 | 4.5 | 7.7 | 66 | 6 | 8.1 | 15.1 | N/A (b) | 48.0 |
(a) Escherichia coli and S. pombe datasets were run using 8 threads for PE-Assembler, SOAPdenovo and ABySS. The HG18 Chr10 dataset was run using 20 threads for PE-Assembler and SOAPdenovo. For this dataset, ABySS was run across four nodes in a cluster, each running two separate threads.
(b) Execution time and memory usage are not available for ABySS. See Supplementary Material.
To demonstrate that PE-Assembler scales to large genomes, we simulated three paired-end read libraries of the aforementioned fragment sizes from chromosome 10 of HG18 and assembled them using PE-Assembler. PE-Assembler can cover 90% of the original chromosome with an N50 size exceeding 60 000 bp. ABySS and SOAPdenovo produce a large number of contigs with very low N50 values. We failed to execute both Allpaths2 and Velvet on this dataset due to their high memory usage.
Furthermore, we simulated four libraries of 500 bp fragment paired-end data with four different read lengths at 60× coverage to assess the impact of increasing read length on PE-Assembler. The results (Table 3) show that PE-Assembler benefits from the increase in read length and compares favorably against Velvet for all read lengths.
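The relationship between coverage, read length and the number of read pairs in a simulated library can be made concrete with a small calculation. The sketch below assumes that coverage is defined as total sequenced bases divided by genome length, and it uses the E. coli reference length of 4 638 902 bp from Table 4 as a stand-in for the simulated genome; the function name `read_pairs_for_coverage` is hypothetical.

```python
import math


def read_pairs_for_coverage(genome_length_bp: int, coverage: float,
                            read_length_bp: int) -> int:
    """Number of read pairs needed to reach the requested coverage.

    Assumes coverage = (pairs * 2 * read_length) / genome_length,
    i.e. both mates of every pair count towards coverage.
    """
    bases_per_pair = 2 * read_length_bp
    return math.ceil(coverage * genome_length_bp / bases_per_pair)


# Assumed genome length: E. coli reference from Table 4.
ECOLI_LENGTH_BP = 4_638_902

# 50x coverage with 35 bp reads, as in the simulated E. coli libraries.
pairs_50x = read_pairs_for_coverage(ECOLI_LENGTH_BP, 50, 35)

# 60x coverage at the four read lengths used in the read-length experiment.
pairs_60x = {length: read_pairs_for_coverage(ECOLI_LENGTH_BP, 60, length)
             for length in (35, 50, 75, 100)}
```

Under these assumptions, 50× coverage with 35 bp reads needs about 3.31 million pairs, which agrees with the figure listed for the simulated E. coli libraries in Table 1. At fixed coverage, longer reads mean proportionally fewer pairs.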
Table 3.
Performance of PE-Assembler using different read lengths
Escherichia coli | 35 bp reads | | 50 bp reads | | 75 bp reads | | 100 bp reads | |
---|---|---|---|---|---|---|---|---|
Assembler | PA | Velvet | PA | Velvet | PA | Velvet | PA | Velvet |
No. of contigs (>600 bp) | 73 | 90 | 64 | 89 | 67 | 83 | 53 | 80 |
Contig N50 size (kb) | 124.7 | 111.8 | 133.0 | 132.6 | 144.0 | 132.8 | 178.3 | 171.8 |
Contig N90 size (kb) | 31.9 | 35.0 | 35.2 | 31.5 | 35.2 | 40.1 | 41.4 | 40.1 |
Large misassemblies | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 1 |
Coverage (%) | 98.78 | 98.76 | 99.11 | 99.08 | 98.91 | 99.37 | 98.73 | 99.36 |
Execution time (a) (min) | 7 | 6 | 7 | 6 | 6 | 6 | 6 | 7 |
Peak memory usage (GB) | 1.4 | 1.7 | 1.4 | 2.1 | 1.3 | 2.7 | 1.3 | 3.5 |
(a) PE-Assembler was run using 20 parallel threads.
Velvet was run with the following k-values, respectively: 23, 31, 43 and 47. −cov_cutoff and −exp_cov were set to auto.
3.2 Experimental data
To assess our approach against wet-lab data, we used four datasets provided with Allpaths2. Each dataset contains two paired-end read libraries: one of approximate fragment length 200 bp and the other ranging from 3000 to 4500 bp (Table 4). The single reads were not used.
Table 4.
Details of the experimental datasets
Organism | Staphylococcus aureus | | Escherichia coli | | Schizosaccharomyces pombe | | Neurospora crassa | |
---|---|---|---|---|---|---|---|---|
No. of contigs/chromosomes | 3 | | 1 | | 4 | | 251 | |
Genome length (bp) | 2 903 107 | | 4 638 902 | | 12 554 318 | | 39 225 835 | |
Library (bp) | 200 | 3000 | 200 | 3000 | 200 | 3000 | 200 | 3000 |
Read length (bp) | 35 | 26 | 35 | 26 | 35 | 26 | 35 | 26 |
Average insert size (bp) | 224 | 3845 | 210 | 3771 | 208 | 3658 | 210 | 3650 |
Insert size range (bp) | 195–255 | 3175–4725 | 180–260 | 3026–4626 | 195–265 | 2935–4535 | 175–245 | 2875–4675 |
No. of paired reads (millions) | 5.52 | 3.89 | 15.04 | 5.46 | 27.58 | 25.62 | 95.66 | 61.88 |
Approximate coverage | 130× | 35× | 230× | 60× | 150× | 110× | 170× | 80× |
As a reference genome is provided for every dataset, the evaluation criteria remained the same as in Section 3.1. However, since the reference genome and the sequenced genome are not expected to be identical, some minor errors are expected and allowed when we map the assembled contigs onto the reference genome. The results are summarized in Table 5. They show that PE-Assembler is equally adept at handling experimental data. It records the highest contiguity in terms of N50 sizes across all four datasets.
Table 5.
Comparison of experimental data results
Genome | Staphylococcus aureus | | | | Escherichia coli | | | |
---|---|---|---|---|---|---|---|---|
Assembler | PA | Velvet | Allpaths2 | ABySS | PA | Velvet | Allpaths2 | ABySS |
Parameters | −minol = 25 | −k = 23, −cov = auto, −exp = auto | −k = 21 | −k = 25, −j = 2, −n = 10, −np = 8 | −minol = 25 | −k = 27, −cov = 12, −exp = auto | −k = 21 | −k = 25, −j = 2, −n = 10, −np = 8 |
Contig statistics | | | | | | | | |
No. of contigs (>200 bp) | 24 | 60 | 14 | 187 | 21 | 121 | 25 | 277 |
Average length (kb) | 119.8 | 48.0 | 205.0 | 18.3 | 176.8 | 37.5 | 184.1 | 21.4 |
Maximum length (kb) | 949.9 | 475.6 | 1122.8 | 175.1 | 895.9 | 356.6 | 1015.3 | 160.4 |
Contig N50 size (kb) | 685.8 | 314.9 | 477.2 | 63.8 | 428.8 | 105.6 | 337.1 | 55.2 |
Contig N90 size (kb) | 107.5 | 37.79 | 84.0 | 31.9 | 143.1 | 25.4 | 81.7 | 31.8 |
Coverage (%) | 99.45 | 98.99 | 99.24 | 98.28 | 99.56 | 99.19 | 99.63 | 98.96 |
Evaluation | | | | | | | | |
Large misassemblies | 0 | 5 | 0 | 1 | 0 | 4 | 0 | 1 |
Segment maps (%) | 98.48 | 96.66 | 98.55 | 94.56 | 98.73 | 95.60 | 99.18 | 94.55 |
Performance (a) | | | | | | | | |
Total execution time (min) | 17 | 8 | 95 | 13 | 34 | 25 | 222 | 29 |
Peak memory usage (GB) | 1.9 | 2.8 | 20 | 2.6 | 3.3 | 6.9 | 37.6 | 5.3 |
Genome | Schizosaccharomyces pombe | | | | Neurospora crassa | | | |
---|---|---|---|---|---|---|---|---|
Assembler | PA | Velvet | Allpaths2 | ABySS | PA | Velvet | Allpaths2 | ABySS |
Parameters | −minol = 25 | −k = 25, −cov = 3, −exp = auto | | −k = 25, −j = 2, −n = 10, −np = 8 | −minol = 25 | −k = 25, −cov = auto, −exp = auto | | −k = 25, −j = 2, −n = 10, −np = 16 |
Contig statistics | | | | | | | | |
No. of contigs (>200 bp) | 169 | 362 | 353 | 1028 | 2708 | 5079 | 1687 | 9916 |
Average length (kb) | 72.1 | 33.7 | 33.8 | 13.0 | 12.8 | 6.8 | 18.3 | 3.8 |
Maximum length (kb) | 571.1 | 443.0 | 257.2 | 136.8 | 156.2 | 71.0 | 161.2 | 56.0 |
Contig N50 size (kb) | 147.7 | 110.6 | 50.0 | 36.0 | 20.7 | 11.6 | 17.6 | 8.1 |
Contig N90 size (kb) | 40.0 | 33.2 | 12.2 | 12.3 | - | - | - | 1.0 |
Coverage (%) | 96.97 | 97.82 | 95.20 | 97.93 | 87.40 | 87.70 | 78.38 | 88.70 |
Evaluation | | | | | | | | |
Large misassemblies | 3 | 26 | 2 | 27 | 16 | 273 | 18 | 395 |
Segment maps (%) | 95.51 | 94.26 | 92.60 | 91.08 | 82.06 | 77.44 | 74.66 | 71.28 |
Performance (a) | | | | | | | | |
Total execution time (min) | 364 | 125 | 4830 (b) | 72 | 1416 | 266 | 5196 (b) | 331 |
Peak memory usage (GB) | 6.6 | 15 | N/A | 6.6 | 21 | 45 | N/A | 25.6 |
(a) All experiments were run on an 8-core machine except for the N. crassa dataset, which was run using 16 cores.
(b) Reported as in the Allpaths2 publication, where experiments were carried out on a 16-core machine.
For the two smaller genomes, the coverage statistics are nearly identical for all four approaches. Assemblies produced by Velvet and ABySS show several large misassemblies, whereas those of PE-Assembler and Allpaths2 are free of such errors. Performance-wise, PE-Assembler is more efficient in memory consumption than all other programs. Especially noteworthy is the large amount of memory consumed by Allpaths2 to assemble even the smallest of genomes.
Repeated attempts to assemble the two larger datasets using Allpaths2 failed on our system. We suspect this is due to the high memory usage of Allpaths2. Therefore, the comparison is based on the output provided on the Allpaths website. The timing quoted here is that reported in the Allpaths2 publication.
For the highly repetitive S. pombe genome, PE-Assembler produces an assembly with N50 and N90 sizes far greater than those of Allpaths2, Velvet and ABySS. PE-Assembler also shows better coverage than Allpaths2. The high number of large misassemblies in the Velvet and ABySS assemblies demonstrates the susceptibility of the de Bruijn graph approach to misassembling genomes in the presence of short repeat regions. In contrast, PE-Assembler and Allpaths2 produce only three and two large misassemblies, respectively. Of the three 'misassembled' contigs in the PE-Assembler output, two can be properly aligned against other strains of S. pombe and are therefore likely due to differences between the assembled strain and the reference. PE-Assembler's assembly for S. pombe also yields the highest number of segment maps, a testament to both its coverage and accuracy.
For the relatively larger Neurospora crassa genome, PE-Assembler's result leads in terms of contiguity and coverage. Note that Allpaths2's assembly has significantly lower coverage than the other assemblies. Also note that the N. crassa reference genome is unfinished and consists of many contigs. The number of 'large misassemblies' reported is therefore likely to be inflated.
The current version of SOAPdenovo ignores reads of length <35 bp in the scaffolding process. Therefore, we did not test SOAPdenovo against the experimental datasets, as it would not be a fair comparison.
3.3 Parallelization and running time
One of the most important aspects of our method is its ability to carry out the entire assembly process in parallel. We carried out a series of experiments to determine how parallelization affects the execution time of the assembler. The simulated E. coli dataset with 200 and 10 000 bp libraries was assembled using 1–8 separate threads on an 8-core machine. Each thread was executed on a separate CPU core.
Figure 7 shows that distributing each step across multiple CPU cores in parallel decreases the execution time in proportion to the number of CPUs utilized. Moreover, unlike the implementation in Allpaths2, the parallel implementation does not incur extra memory overhead, as the data structures are shared by all threads. In each of the experiments, the maximum memory utilization was constant at 1.3 GB.
Fig. 7.
Execution time with respect to the number of threads/cores utilized. Utilizing multiple cores dramatically reduces execution time. Theoretically, the improvement should be linear in the number of parallel threads; however, this is masked by the fact that each step has a constant IO overhead that cannot be parallelized.
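The sub-linear scaling described in the caption of Fig. 7 is what Amdahl's law predicts when a fixed part of the work, such as IO, cannot be parallelized. The sketch below is only a model: the serial fraction of 0.1 is a hypothetical value chosen for illustration, not a measurement of PE-Assembler, and the function name `amdahl_speedup` is likewise an assumption.

```python
def amdahl_speedup(threads: int, serial_fraction: float) -> float:
    """Speedup predicted by Amdahl's law.

    serial_fraction is the share of single-thread running time that cannot
    be parallelized (e.g. constant IO overhead); the remainder is assumed
    to scale perfectly with the number of threads.
    """
    if threads < 1:
        raise ValueError("threads must be >= 1")
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be in [0, 1]")
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / threads)


# Hypothetical serial fraction, for illustration only.
ASSUMED_SERIAL_FRACTION = 0.1

for n in (1, 2, 4, 8):
    print(n, round(amdahl_speedup(n, ASSUMED_SERIAL_FRACTION), 2))
```

With no serial part the model gives a speedup equal to the thread count; any non-zero serial fraction caps the speedup below that, and the gap widens as more threads are added.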
4 DISCUSSION
PE-Assembler has demonstrated that it is possible to obtain complete and highly accurate de novo genome assemblies using high-throughput sequencing data within reasonable time and memory constraints. The highlight of PE-Assembler is that it eschews the traditional graph-based approach in favor of a simple extension approach.
The advantages of this approach are numerous. Memory requirements of graph-based approaches seem to increase exponentially as genome and data size increase. This was highlighted by the inability of Velvet and Allpaths2 to cope with the simulated HG18 Chr10 dataset. In contrast, PE-Assembler produced a very usable assembly within a realistic memory limit.
Our approach is fundamentally similar to other 3′ extension approaches such as SSAKE, SHARCGS and VCAKE, but distinguishes itself by its extensive use of paired-end reads. Not only does this make the approach scalable to larger genomes and datasets by localizing data, it also contributes to its high accuracy. As is evident from the results on both simulated and experimental data, PE-Assembler is the least prone of all the algorithms to misassemble different regions of the genome into one continuous segment.
Perhaps the most important aspect of PE-Assembler is its ability to seamlessly parallelize the assembly process. Multiple threads can simultaneously assemble the genome at various positions across the genome, while a simple detection mechanism ensures that multiple assemblies of the same region are highly unlikely. Also noteworthy is that parallel assembly in PE-Assembler does not come at an extra cost in memory, as it does in other methods such as Allpaths2 or ABySS. Because it can massively parallelize the assembly process at no extra overhead, it will prove valuable in assembling mammalian genomes as well as in larger metagenomics projects. With minor modifications, this approach can be extended to run on a computer cluster across multiple nodes to further reduce the running time.
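The idea of several threads extending contigs from different positions while sharing one data structure can be sketched as follows. This is a toy illustration under stated assumptions, not PE-Assembler's actual mechanism: reads are reduced to integer positions, "extension" simply walks to the next position, and the `ClaimTable` class and all function names are hypothetical. It shows only the coordination; Python threads would not give the CPU speedup of a native implementation.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ClaimTable:
    """One shared table recording which worker has used each read."""

    def __init__(self):
        self._owner = {}
        self._lock = threading.Lock()

    def try_claim(self, read_id, worker_id):
        """Return True if `worker_id` owns `read_id` after this call."""
        with self._lock:
            return self._owner.setdefault(read_id, worker_id) == worker_id


def extend_from_seed(worker_id, seed, n_reads, claims):
    """Toy extension: walk right from `seed`, stopping at the end of the
    data or at the first read already claimed by another worker."""
    contig = []
    pos = seed
    while pos < n_reads and claims.try_claim(pos, worker_id):
        contig.append(pos)
        pos += 1
    return contig


def assemble(seeds, n_reads, threads):
    """Run one extension per seed; all workers share a single ClaimTable."""
    claims = ClaimTable()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = [
            pool.submit(extend_from_seed, i, seed, n_reads, claims)
            for i, seed in enumerate(seeds)
        ]
        return [f.result() for f in futures]


if __name__ == "__main__":
    contigs = assemble(seeds=[0, 40, 80], n_reads=100, threads=3)
    print([len(c) for c in contigs])
```

Because every read is claimed under a lock before it is used, no region is assembled twice regardless of thread scheduling, and the only shared state is the single table, so adding threads does not duplicate the data.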
ACKNOWLEDGEMENTS
The authors would like to extend their gratitude to Pauline Chen of the Research Computing Group, GIS, for her help in the evaluation and testing process. We would further like to thank Daniel Zerbino for his help in running Velvet and the reviewers for their useful feedback and insight.
Funding: This research was supported by MOE AcRF Tier 2 funding R-252-000-444-112 and the Agency for Science, Technology and Research (A*STAR).
Conflict of Interest: none declared.
REFERENCES
ALLPATHS: de novo assembly of whole-genome shotgun microreads, Genome Res., 2008, vol. 18 (pg. 810-820)
De novo fragment assembly with short mate-paired reads: Does the read length matter?, Genome Res., 2009, vol. 19 (pg. 336-346)
SHARCGS, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing, Genome Res., 2007, vol. 17 (pg. 1697-1706)
Extending assembly of short DNA sequences to handle error, Bioinformatics, 2007, vol. 23 (pg. 2942-2944)
BLAT–the BLAST-like alignment tool, Genome Res., 2002, vol. 12 (pg. 656-664)
De novo assembly of human genomes with massively parallel short read sequencing, Genome Res., 2010, vol. 20 (pg. 265-272)
ALLPATHS 2: small genomes assembled accurately and with high continuity from short paired reads, Genome Biol., 2009, vol. 10 pg. R103
An Eulerian path approach to DNA fragment assembly, Proc. Natl Acad. Sci. USA, 2001, vol. 98 (pg. 9748-9753)
ABySS: a parallel assembler for short read sequence data, Genome Res., 2009, vol. 19 (pg. 1117-1123)
Assembling millions of short DNA sequences using SSAKE, Bioinformatics, 2007, vol. 23 (pg. 500-501)
Velvet: algorithms for de novo short read assembly using de Bruijn graphs, Genome Res., 2008, vol. 18 (pg. 821-829)
Author notes
Associate Editor: John Quackenbush
© The Author 2010. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
Source: https://academic.oup.com/bioinformatics/article/27/2/167/284712