Using Mate Paired Reads and Paired End in Soap De Novo

Abstract

Motivation: Many de novo genome assemblers have been proposed recently. Almost all existing methods rely on the de Bruijn graph: a complex graph structure that attempts to encompass the entire genome. Such graphs can be prohibitively large, may fail to capture subtle information and are difficult to parallelize.

Result: We present a method that eschews the traditional graph-based approach in favor of a simple 3′ extension approach that has the potential to be massively parallelized. Our results show that it is able to obtain assemblies that are more contiguous, more complete and less error prone than those of existing methods.

Availability: The software package can be found at http://www.comp.nus.edu.sg/~bioinfo/peasm/. Alternatively, it is available from the authors upon request.

Contact: ksung@comp.nus.edu.sg; sungk@gis.a-star.edu.sg

Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

De novo genome assembly has been a fundamental problem in bioinformatics since the advent of DNA sequencing. Second-generation sequencing technologies such as Illumina Solexa and ABI SOLiD have introduced a new sense of vigor to the field. The short length of the sequences, coupled with high coverage and a higher level of noise, has transformed de novo assembly into a tractable yet challenging proposition. The ease with which paired-end read libraries can be generated on these platforms is an added advantage.

A number of methods have been proposed to assemble short reads. The first few de novo assemblers developed to handle high-throughput short reads were based on base-by-base 3′ extension. SSAKE, VCAKE and SHARCGS (Dohm et al., 2007; Jeck et al., 2007; Warren et al., 2007) are examples using this principle. To resolve ambiguities, these methods adopted simple heuristics such as 'selecting the base with maximum overlap' or 'selecting the base with the highest consensus'. Such arbitrary criteria result in substandard assemblies that are often a compromise between contiguity and error rate. Furthermore, the approaches do not scale to medium or large genomes; therefore, their use is restricted to assembling BAC clones or small bacterial genomes. They were also not designed to make use of paired-end reads, thus greatly limiting their usefulness in assembling high-throughput data.

The more practical approaches for assembling high-throughput short reads are based on the de Bruijn graph. Velvet (Zerbino and Birney, 2008) is perhaps the most widely used method for de novo genome assembly today. It is very fast in execution, fairly memory efficient and produces reasonably accurate assemblies. Like all other methods based on the de Bruijn graph, Velvet requires the entire genome to be stored in a graph structure. In the presence of noise, the graph may be too large to be stored in system memory. Furthermore, the resulting assembly generated by Velvet tends to contain many errors at small repeat regions. Another approach, Euler-USR (Chaisson et al., 2009), is very similar in concept to Velvet, but employs more sophisticated error detection and correction steps. However, in practice, we noted that Velvet produces more contiguous and complete assemblies than Euler-USR. Both Velvet and Euler-USR take full advantage of paired-end read libraries.

One of the major shortcomings of de Bruijn graph approaches is the inability to parallelize the assembly process. This is a critical requirement, as many powerful computers use multiple processors on which numerous threads can be run seamlessly in parallel. The introduction of ABySS (Simpson et al., 2009) tackled this issue. The core assembly algorithm of ABySS is very similar to that of Velvet, but it allows the de Bruijn graph to be distributed across multiple cores/nodes, and each core/node can operate on the graph independently to a certain extent. The assembly result of ABySS is similar to that of Velvet. Nevertheless, we noticed that when executed in parallel on a single multi-core computer, ABySS does not offer any advantage over Velvet in terms of execution time or memory usage. To be used efficiently, ABySS requires a multi-node computing cluster, which may be a disadvantage in an era where computers are increasingly made faster by adding more cores within a single CPU. SOAPdenovo (Li et al., 2010) addressed many of these problems by introducing a de Bruijn graph-based method that can seamlessly take advantage of multi-core systems.

Allpaths/Allpaths2 (Butler et al., 2008; MacCallum et al., 2009) appears to be the most accurate method at present. It introduces an interesting hybrid approach where the genome is still stored as a large graph; however, the graph is separated into different segments, and the assembly of these segments can be carried out independently. This makes it possible to run some stages of the Allpaths algorithm in parallel. The high accuracy of Allpaths stems from the fact that it tries all possible ways to assemble every segment; however, this comes at a tremendous cost in terms of time and memory usage, and therefore it will not scale well to larger genomes.

We propose PE-Assembler, a method that is capable of handling large datasets and produces highly contiguous and accurate assemblies within reasonable time. Our approach is based on simple 3′ extension and does not involve representing the entire genome in the form of a graph. Fundamentally, it is similar to other 3′ extension approaches such as SSAKE, VCAKE and SHARCGS. However, it improves upon such early approaches in multiple ways. The extensive use of paired-end reads ensures that the data are localized within the region being assembled. Hence, our method can be run in parallel to greatly speed up the execution while staying within reasonable system requirements. Ambiguities are resolved using a multiple path extension approach, which takes into account sequence coverage, support from multiple paired libraries and more subtle information such as the span distribution of the paired-end reads.

2 METHODS

Paired-end reads are also known as paired reads or mate pairs (depending on some technical differences) in the literature. Essentially, they all refer to a pair of short reads that originate from the 5′ and 3′ ends of a DNA fragment whose length is known approximately. The length of the fragment is referred to as the insert size. For every paired-end read, its two reads are called the mates of each other. The length of each read is denoted as ReadLength; it can be anywhere from 25 to 100 bp. The insert size is not exact; it may vary from MinSpan to MaxSpan.

Our program, PE-Assembler, aims to reconstruct the sample genome from a paired-end read library. PE-Assembler can also accept multiple paired-end read libraries of different insert sizes, which can help to resolve ambiguities that cannot be conclusively resolved using a single paired-end read library.

PE-Assembler is fundamentally based on 3′ overlap extension, similar to SSAKE and VCAKE. The procedure is illustrated in Figure 1. Given a sequence, PE-Assembler extracts all reads whose prefix aligns with the suffix of the sequence. We define this as an overlap. The suffix of each read, which overhangs from the 3′ end of the sequence, forms a feasible extension to the contig. If there is a clear consensus for a single base, then that base is appended to the end of the sequence and the process is iterated. Multiple feasible extensions are handled differently in various stages of the algorithm, as described in the following sections.

Fig. 1.

Overview of 3′ overlap extension. Both t and g are feasible extensions.
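The overlap extension of Figure 1 can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not PE-Assembler's implementation: reads are plain strings, overlaps must match exactly, the function names and the `min_overlap` parameter are hypothetical, and a 'clear consensus' is reduced to the case where all overlapping reads propose the same base.

```python
from collections import Counter

def feasible_extensions(contig, reads, min_overlap):
    """Count the bases that overlapping reads propose as the next 3' base."""
    votes = Counter()
    for read in reads:
        # Try the longest overlap first; the read must overhang by >= 1 base.
        longest = min(len(read) - 1, len(contig))
        for olap in range(longest, min_overlap - 1, -1):
            if contig.endswith(read[:olap]):
                votes[read[olap]] += 1   # first overhanging base
                break
    return votes

def extend(contig, reads, min_overlap=4, max_steps=100):
    """Append bases while exactly one candidate base is proposed."""
    for _ in range(max_steps):
        votes = feasible_extensions(contig, reads, min_overlap)
        if len(votes) != 1:   # no candidate, or an ambiguity: stop here
            break
        contig += next(iter(votes))
    return contig
```

When two reads propose different bases, as with 't' and 'g' in Figure 1, `feasible_extensions` returns both candidates and `extend` stops, leaving the ambiguity to the stage-specific rules described in the following sections.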

PE-Assembler is implemented as a series of five steps, which are briefly described as follows (also see Supplementary Fig. 1). First, the read screening step selects a set of reads (called 'solid' reads) as starting points for extending the assembly. This step specifically avoids reads containing sequencing errors and reads occurring in repeat regions of the genome. The second step then extends these 'solid' reads using single-end reads to make them longer than MaxSpan. The successfully extended regions are called seeds. Seeds are long enough for extension using paired-end reads. The third step (called contig extension) tries to extend all these seeds using paired-end reads. The resulting sequences are called contigs. The fourth step links those contigs using paired-end reads to form scaffolds (i.e. ordered sets of contigs with gaps in between). Finally, the last step tries to fill in the gaps between scaffolded contigs. Below, we detail the five steps.

2.1 Read screening

Many short read assemblers perform error correction/detection steps prior to the assembly. While this is generally effective in detecting and fixing random sequencing errors, it treats each read as a single read and therefore fails to use the pairing information. This may result in overcorrecting reads coming from low-coverage regions, as the actual location of the paired-end read is not taken into account.

Our approach does not perform error correction. However, we require a pool of error-free and non-repetitive reads as starting points for the seed building step (Section 2.2). These reads are isolated by carrying out a read screening step.

The idea behind the screening step is similar to the kmer frequency based error correction method proposed by Pevzner et al. (2001). Its details are as follows. A kmer is a DNA sequence of length k. Provided the genome is sampled at a high coverage, a kmer that occurs in the genome is likely to occur multiple times in the input reads. If a particular kmer occurs once (or very sparingly) in the input reads, that kmer is unlikely to occur in the target genome and is likely to be the result of a sequencing error. Similarly, if a kmer occurs at a higher frequency than expected, we can conclude that it may have originated from a repeat region in the genome. A kmer that is expected to occur in the actual genome is called a 'solid' kmer, while a kmer that is expected to occur within a repeat region is called a 'repeat' kmer. To classify each kmer as solid or repeat, we scan the entire dataset of reads to extract the set of kmers and their frequencies, and plot a kmer frequency histogram. Then, we identify the solid kmer threshold and the repeat kmer threshold from the troughs on either side of the main peak (Fig. 2). A read is said to be 'solid' if the frequencies of all its kmers are higher than the solid kmer threshold and lower than the repeat kmer threshold. Only solid reads are chosen as the starting points for the next step. Note that this stage does not discard or correct any data. The entire dataset is used in the assembly as it is.

Fig. 2.

The kmer frequency histogram. We can determine the solid kmer cutoff and the repeat kmer cutoff from the two troughs.
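The screening rule can be illustrated with a short Python sketch. It assumes that the two thresholds have already been read off the histogram troughs and are simply passed in; the function names and the strict inequalities on both sides are assumptions of this sketch.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Frequency of every kmer over the whole read set."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def is_solid(read, counts, k, solid_threshold, repeat_threshold):
    """True if every kmer of the read lies strictly between the two thresholds."""
    return all(
        solid_threshold < counts[read[i:i + k]] < repeat_threshold
        for i in range(len(read) - k + 1)
    )
```

Note that the counts are taken over all reads and no read is altered or removed; the function only decides which reads may serve as starting points.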

2.2 Seed building

A 'seed' is defined as a contiguous region in the target genome which is of length at least MaxSpan. To assemble a seed, we start with an unused solid read as the initial seed and carry out 3′ overlap extension as described above. However, due to the presence of small repeats or sequencing errors, there may be multiple feasible candidates for the next 3′ base.

Ambiguities arising due to repeats can be resolved with the help of paired-end reads. Throughout the seed assembly, we maintain a pool of reads whose mates map onto the current seed. In case of any ambiguity, for every read overlapping with the seed, we check if its mate overlaps with any reads in the maintained pool (Fig. 3). Those without overlap support are assumed to be noise and are thus discarded.

Fig. 3.

Resolving ambiguities in the seed building step: suppose the current seed can be extended by two possible candidates 'a' and 'g'. Assume that, for reads extending 'g', their mates overlap with the reads in the pool, while the reads extending 'a' do not have such support. Then, we can safely select candidate 'g' for extension.

The above method cannot resolve ambiguities arising due to sequencing errors. In such cases, we extend every candidate base up to a distance of ReadLength. Any extension path arising due to sequencing errors is likely to terminate prematurely. If only one candidate path can reach the full distance, then that path is assumed to be the correct extension.
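A minimal sketch of this look-ahead rule is given below. It assumes exact-match overlaps on plain strings, and it treats a path as terminated as soon as it has no candidate or meets a further ambiguity; the function names and the `min_overlap` parameter are hypothetical.

```python
def next_bases(seq, reads, min_overlap):
    """Set of bases proposed by reads whose prefix overlaps the 3' end of seq."""
    bases = set()
    for read in reads:
        for olap in range(min(len(read) - 1, len(seq)), min_overlap - 1, -1):
            if seq.endswith(read[:olap]):
                bases.add(read[olap])
                break
    return bases

def path_length(seq, base, reads, min_overlap, read_length):
    """Extend seq + base; return how many bases were added (at most read_length)."""
    path = seq + base
    added = 1
    while added < read_length:
        bases = next_bases(path, reads, min_overlap)
        if len(bases) != 1:   # dead end or new ambiguity: path terminates
            break
        path += bases.pop()
        added += 1
    return added

def resolve_by_lookahead(seq, candidates, reads, min_overlap, read_length):
    """Return the only candidate whose path reaches read_length, else None."""
    survivors = [b for b in candidates
                 if path_length(seq, b, reads, min_overlap, read_length) >= read_length]
    return survivors[0] if len(survivors) == 1 else None
```

A read carrying an error in its last base proposes a candidate that no other read supports, so its path stops after one base while the true path continues.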

At any stage, if there is no candidate for extension (likely due to low sequencing coverage) or there are multiple candidates for extension (possibly due to longer repeats), the extension is terminated. The seed is then extended from the other side. The extension is 'successfully' terminated once the seed reaches the length of MaxSpan.

For every successfully terminated seed, a seed verification step is performed to ensure that the seed represents a contiguous region in the target genome. Precisely, to verify the 3′ end of a seed, we require that at least one paired-end read overlaps with the 3′ end of the seed (Fig. 4). Similarly, we can verify the 5′ end of a seed. All verified seeds are immediately subjected to the contig extension step (Section 2.3). Seeds which fail the verification step are discarded.

Fig. 4.

A paired-end read is said to overlap the 3′ end of a seed if the 3′ read of the paired-end read overlaps the 3′ end of the seed and the 5′ read maps on the seed within the expected region, as determined by MinSpan and MaxSpan of the library.

2.3 Contig extension

The contig extension step aims to extend each verified seed to form a longer contig iteratively. Again, this step relies on overlap extension to elongate the current contig, but with some differences. Since a contig is longer than MaxSpan, instead of using single reads to extend the contig, we try to identify feasible extensions from paired-end reads that overlap with the contig. When no paired-end read is found overlapping with the contig, we identify feasible extensions from overlapping single reads instead.

If a clear consensus is found among the feasible extensions, then that base is appended to the end of the contig and the process is repeated. Occasionally, there are multiple feasible candidates to extend the contig. Such a scenario may arise for three reasons. The first reason is sequencing errors. These errors can be dealt with as in the seed building step. The second reason is short tandem repeat regions. In such a case, we stop the extension and try to estimate the correct number of tandem repeats during the gap filling step. The third reason is long repeats. In such a case, we also stop the extension. Note that when the repeat is longer than MaxSpan, we cannot theoretically resolve the ambiguity using the given paired-end read data. A paired-end read library of longer insert size is required to resolve such an ambiguity.

The contig extension step is performed until the contig can no longer be extended from either end. Then, the resultant contig is kept for use in scaffolding.

2.4 Scaffolding

The objective of the scaffolding step is to find the correct ordering of the resulting set of contigs.

As the scaffolding step is very sensitive to the presence of repeat regions, the first step is to demarcate all repeat regions within the assembled contigs. In this step, all individual reads are mapped back to the contigs and the read density across all the contigs is calculated. The mode of the read density is assumed to be the expected read coverage across the genome. Any region with a read density higher than 1.5 times the mode is considered a repeat region. Any reads mapped onto such a repeat region are discarded.
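The repeat demarcation rule can be sketched as follows. The text does not specify how read density is binned, so this sketch assumes one integer density value per position (or per window); the function name is hypothetical.

```python
from statistics import mode

def repeat_positions(density, factor=1.5):
    """Indices whose read density exceeds `factor` times the modal density."""
    expected = mode(density)   # modal density taken as the expected coverage
    return [i for i, d in enumerate(density) if d > factor * expected]
```

For a density profile such as `[10, 10, 11, 10, 30, 31, 10, 9]`, the mode is 10, so only the two positions with densities 30 and 31 exceed the cutoff of 15 and are flagged.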

During this step, additional statistics such as the average span and standard deviation of each library are calculated. This information is used during the gap filling stage.

For the scaffolding step, we only consider the paired-end reads whose two reads map uniquely to two different contigs. Such a mapping is referred to as a chimeric mapping. Although we cannot estimate the exact span of a chimeric mapping, the minimum span of a chimeric mapping can be calculated from the distance that it covers on the two contigs (Fig. 5). Every paired-end read mapping whose minimum span exceeds MaxSpan is discarded.

Fig. 5.

Minimum span distance of this chimeric mapping is a+b. Actual span may vary depending on the gap size between contigs X and Z.

Two contigs of specific orientation are said to be linked by an edge if there is at least a certain number of chimeric mappings between the two contigs in that orientation. The weight of the edge is the total number of such chimeric mappings, normalized by the total number of paired-end reads in the library. The maximum gap size is estimated by subtracting from MaxSpan the average of the minimum spans of all chimeric paired-end reads of that edge. Multiple fragment libraries of different insert sizes may be used at this point. Each library will result in its own distinct set of edges.
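The edge construction can be illustrated with a small Python sketch. Each chimeric mapping is represented only by its minimum span (a + b in Fig. 5). The required number of mappings is not stated in the text, so `min_links` is a hypothetical parameter, as is the function name.

```python
def build_edge(min_spans, max_span, library_size, min_links=2):
    """Summarize the chimeric mappings between two oriented contigs.

    min_spans:    minimum span (a + b) of each chimeric mapping
    library_size: total number of paired-end reads in the library
    Returns None if too few mappings remain, else (weight, max_gap).
    """
    # Mappings whose minimum span already exceeds MaxSpan are discarded.
    kept = [s for s in min_spans if s <= max_span]
    if len(kept) < min_links:
        return None
    weight = len(kept) / library_size            # normalized mapping count
    max_gap = max_span - sum(kept) / len(kept)   # MaxSpan minus average minimum span
    return weight, max_gap
```

For example, with MaxSpan = 275 and minimum spans of 150, 170 and 400, the third mapping is discarded and the maximum gap size is 275 − 160 = 115 bp.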

A potential scaffold is a linear ordering of contigs. An edge between two contigs X and Z is deemed satisfied if both contigs X and Z occur within the same scaffold in the correct orientation and the total length of all contigs between X and Z is less than the maximum gap size estimated for that edge; otherwise, if X and Z cannot be arranged so that they are within the expected span, the edge is said to be contradicted. The score for each scaffold is calculated by totaling the weights of all satisfied edges and subtracting the weights of all contradicted edges.
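The scoring rule can be sketched as below. As a simplification, the sketch ignores strand orientation, treats every edge as directed from its left contig to its right contig, and skips edges with a contig outside the scaffold; the function name and data layout are assumptions.

```python
def scaffold_score(order, lengths, edges):
    """Weights of satisfied edges minus weights of contradicted edges.

    order:   contig names from left to right
    lengths: dict mapping contig name to contig length
    edges:   list of (left, right, weight, max_gap)
    """
    pos = {name: i for i, name in enumerate(order)}
    score = 0.0
    for left, right, weight, max_gap in edges:
        if left not in pos or right not in pos:
            continue   # this scaffold says nothing about the edge
        # Total length of the contigs placed between the two linked contigs.
        between = sum(lengths[c] for c in order[pos[left] + 1:pos[right]])
        satisfied = pos[left] < pos[right] and between < max_gap
        score += weight if satisfied else -weight
    return score
```

With three contigs X, Y and Z, the ordering X, Y, Z satisfies all three edges in the test data of this sketch, whereas X, Z, Y contradicts two of them and therefore scores lower.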

The aim of the scaffolding algorithm is to produce a set of scaffolds such that the above score is maximized. However, an exact solution to this problem is computationally prohibitive. Therefore, we employ the following greedy heuristic approach.

The scaffolding procedure starts by selecting a contig at random as the initial scaffold. The procedure extends the scaffold iteratively by including contigs to the right. A contig X is said to be a right neighbor of a scaffold if there exists some contig Z in the scaffold such that (Z, X) is an edge and the total length of contigs to the right of Z in the scaffold is less than the maximum gap size of (Z, X). All right neighbors of the scaffold are potential candidates to extend the scaffold from its 3′ end. Each candidate right neighbor is temporarily added to the 3′ end of the scaffold, and all permutations of the remaining right neighbors are appended after it to obtain multiple possible orderings. Each such potential ordering is evaluated. The candidate right neighbor that results in the ordering with the highest score is permanently added to the 3′ end of the scaffold. This process is repeated until any of the following occurs: the neighborhood is empty; the best ordering score is negative; or the current region of the scaffold has already been ordered elsewhere. If scaffolding is terminated at the 3′ end, we try to extend the scaffold from the 5′ end. Once neither end is extendable, we obtain one scaffold and the entire procedure is repeated with an unused contig as the starting point to identify other scaffolds.

2.5 Gap filling

The scaffolding step reports a list (or lists) of contigs in the same order as they would be in the actual genome. Adjacent contigs are usually separated by an unknown sequence. The objective of the gap filling step is to assemble the gap region between two adjacent contigs to form a longer contig. Note that the length of the gap can be estimated using paired-end reads which map across two adjacent contigs.

For every read that occurs in the gap, its mate must map to either the left or the right contig of the gap. Hence, the gap can be filled in using such reads. As we are dealing with a localized set of data, the gap filling step can use a less stringent minimum overlap length, thus facilitating assembly of low-coverage regions.

A key difference between the gap filling step and the seed building step is that the former can resolve convoluted repeat regions by exploiting the span information of paired-end reads to a greater degree. Similar to the seed building step, the assembly is carried out using overlap extension. Whenever there are multiple extension paths due to multiple candidate bases, each path is extended up to a distance of ReadLength. Moreover, for each extension path, we can obtain the span histogram of all paired-end reads which map on this extension path. The distribution of this 'perceived' span for each extension path is compared against the span distribution of the entire library. The span distribution of the correct extension will be in line with that of the entire library, whereas the distributions of incorrect extensions will exhibit a noticeable shift. This idea is demonstrated in Figure 6.

Fig. 6.

Use of average span and standard deviation to resolve ambiguities in Staphylococcus aureus assembly: (A) Reference sequence from the region in question. Bolded segments are identical. Both 'a' and 't' seem a valid choice after that region. (B) Sequence overlap shows both 'a' and 't' as potential candidates. Both paths are extended up to a distance of 'TagLength'. (C) Illustration of the two different extensions. For each path, the spans of paired-end reads mapping across the branching point are kept. Spans resulting from the incorrect assembly are noticeably shorter due to the missing region. (D) Histograms of the perceived paired-end read span of the two different paths and that of the entire library. Note that the span distribution of the correct path closely follows that of the library. The path with the span distribution closest to the library span distribution is chosen.
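The comparison in Figure 6 can be sketched as follows. The text does not state how the distance between two span distributions is measured, so this sketch assumes one simple choice: the shift of the mean perceived span from the library mean, in units of the library standard deviation. The function name and data layout are hypothetical.

```python
from statistics import mean

def pick_path(path_spans, library_mean, library_sd):
    """Choose the path whose mean perceived span is closest to the library mean.

    path_spans: dict mapping each candidate base to the list of perceived
                spans of paired-end reads mapping across the branching point.
    """
    def shift(spans):
        return abs(mean(spans) - library_mean) / library_sd
    return min(path_spans, key=lambda base: shift(path_spans[base]))
```

In the situation of panel (C), the incorrect path skips a region, so its perceived spans are shorter than the library average and its shift is larger. Each path needs at least one mapped pair, since `mean` raises an error on an empty list.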

The adjacent contigs whose gap can be successfully bridged are merged into a single contig. The resulting set of contigs from this step represents the final output of the assembly.

2.6 Parallelization

This section discusses the parallelization of the five steps. The read screening step is largely disk bound, so parallelizing it does not improve performance noticeably. All remaining steps can be run as threads on multiple cores sharing the same memory space, almost independently of each other.

For the seed building step and the contig extension step, the solid reads and the seeds are distributed to different threads for parallel execution. Provided the genome is reasonably large and the number of threads is not impractically high, we can assume that most of the threads assemble different regions of the genome. Every thread marks the reads that it has so far used in the assembly. Periodically, every thread refers to this information to detect whether the region it is currently assembling has previously been assembled by other threads. If a read is detected to have been marked by another thread, the thread rewinds the assembly to the last unmarked read and terminates.

The scaffolding step involves mapping each paired-end read back to the assembled contigs and forming a graph comprising contigs as nodes and 'chimeric' paired-end reads as edges. The graph building step is carried out in parallel. The actual scaffolding is carried out in a single thread; however, this step is not very time consuming.

Gap filling can be executed in parallel since the gaps are localized and independent of one another.

For the entire assembly process, the time taken is roughly inversely proportional to the number of cores/threads utilized (see Section 3).

3 RESULTS

3.1 Simulated data

To evaluate our approach, experiments were carried out on simulated datasets based on Escherichia coli and Schizosaccharomyces pombe reference genomes. Three libraries of paired-end reads with varying 'fragment lengths' were simulated from each genome (Table 1): a short fragment library of average span 200 bp, a medium fragment library of average span 1000 bp and a long fragment library of average span 10 000 bp. All reads were assumed to be 35 bp long. Precise criteria for the simulation are detailed in the Supplementary section. For comparison, we executed popular de novo assembly programs such as Allpaths2, Velvet, ABySS and SOAPdenovo in addition to PE-Assembler (denoted by 'PA' in tables). Each program was run with multiple parameters and the best result for each program is quoted below. The summary results for all experiments and parameters are available in the Supplementary section.

Table 1.

Details of the simulated dataset

Organism	Escherichia coli			Schizosaccharomyces pombe			HG 18-Chr 10
No. of contigs/chromosomes	1			3			1
Genome length (bp)	4 639 658			12 571 820			135 374 737
Library	200 bp	1 kb	10 kb	200 bp	1 kb	10 kb	200 bp	1 kb	10 kb
Read length (bp)	35	35	35	35	35	35	75	75	75
Average insert size (bp)	235	1035	10035	235	1035	10035	275	1075	10075
Insert size range (average ± bp)	±40	±200	±2000	±40	±200	±2000	±40	±200	±2000
No. of paired reads (millions)	3.31	3.31	3.31	8.98	8.98	8.98	45.12	9.02	9.02
Coverage	50×	50×	50×	50×	50×	50×	50×	10×	10×
Seq. error rate, %	2.0	2.0	2.0	2.0	2.0	2.0	2.0	2.0	2.0
Ligation error rate, %	0.0	2.0	2.0	0.0	2.0	2.0	0.0	2.0	2.0

We adopted the following approach to evaluate each assembly result. All contigs were aligned against the reference genome using BLAT (Kent, 2002). Any contig which does not completely align to the reference genome, while allowing for small indels and mismatches, is deemed a 'large misassembly'. To evaluate the accuracy at the micro level, we segmented the reference genome into continuous, non-overlapping sequences of 1000 bp and checked whether they can be mapped onto the assembly's contigs without errors. The number of error-free segments that can be mapped onto the contigs is reflective of the accuracy of the assembly. Contiguity of the assembly is measured by the N50 and N90 sizes. The completeness of the assembly is evaluated by computing the percentage of the reference genome covered by the assembled contigs. The computational complexity of each assembler is measured by its running time and memory usage. These evaluation steps are detailed in the Supplementary section.
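For readers unfamiliar with the contiguity measures, the following sketch computes N50 and N90 using their usual definition: the contig length L such that contigs of length at least L together cover 50% (or 90%) of the total assembled length. The exact procedure used in this evaluation is given in the Supplementary section; the function name and the choice of the assembly total (rather than the genome length) as the denominator are assumptions of this sketch.

```python
def nxx(lengths, fraction):
    """Length L such that contigs of length >= L cover `fraction` of the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= fraction * total:
            return length
    return 0   # only reached for an empty list of contigs
```

For contigs of lengths 100, 80, 60, 40 and 20, `nxx(lengths, 0.5)` gives 80 and `nxx(lengths, 0.9)` gives 40.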

The results for simulated data are summarized in Table 2. The experiments demonstrate that PE-Assembler can generate highly contiguous assemblies at a very low error rate using fewer system resources. While Velvet is fast in execution, the number of misassemblies shows that it lags behind PE-Assembler and Allpaths in terms of accuracy. Both ABySS and SOAPdenovo produce highly fragmented results with relatively small N50 sizes.

Table 2.

Comparison of simulated data results

	Escherichia coli					Schizosaccharomyces pombe					HG18 chr10
	PA	Velvet	Allpaths2	ABySS	SOAP	PA	Velvet	Allpaths2	ABySS	SOAP	PA	ABySS	SOAP
Parameters	−minol = 25	−k = 27	−k = 21	−k = 25	−k = 27	−minol = 25	−k = 29	−k = 21	−k = 25	−k = 29	−minol = 50	−k = 45	−k = 31
		−cov = auto		−j = 3,−n = 10	−pair_num = 3		−cov = auto		−j = 3 −n = 10	pair_num = 3		−j = 3	−p = 20
		−exp = auto		−np = 8	−p = 8		−exp = auto		−np = 8	−p = 8		−n = 10
Contig statistics
No. of contigs (>200 bp)	6	56	44	283	199	31	181	164	650	348	4262	49015	18238
Average length (kb)	777.4	82606	107.6	22.3	22.8	394.7	67.9	75.3	23.1	35.0	30.2	2.9	6.6
Maximum length (kb)	2492.6	708.6	593.7	163.0	232.0	3519.6	856.1	851.0	297.3	468.7	403.5	65.2	155.8
Contig N50 size (kb)	2492.6	398.3	373.3	63.8	49.9	1487.7	273.0	226.8	80.1	99.8	62.4	5.3	13.0
Contig N90 size (kb)	2146.0	109.9	115.4	33.9	12.4	363.6	54.4	59.5	36.7	19.0	11.1	1.7	3.6
Coverage (%)	100.00	99.59	99.85	99.00	98.87	97.78	99.35	98.60	98.38	98.91	90.89	92.04	87.14
Evaluation
Large misassemblies	0	11	0	1	1	0	17	0	1	4	5	171	605
Segment maps (%)	99.68	94.74	99.18	93.31	93.66	96.42	94.44	96.83	92.72	94.31	86.17	63.61	32.2
Performance ^a
Total execution time (min)	21	10	227	43	5	101	40	734	98	11	748	N/A ^b	240
Peak memory usage (GB)	2.3	2.9	29.7	2.9	5.9	4.5	7.7	66	6	8.1	15.1	N/A ^b	48.0

^a Escherichia coli and S.pombe datasets were run using eight threads for PE-Assembler, SOAPdenovo and ABySS. The HG18 chr10 dataset was run using 20 threads for PE-Assembler and SOAPdenovo. For this dataset, ABySS was run across four nodes in a cluster, each running two separate threads.

^b Execution time and memory usage not available for ABySS. See Supplementary Material.

To demonstrate that PE-Assembler is scalable to large genomes, we simulated three paired-end read libraries of the aforementioned fragment sizes from chromosome 10 of HG18 and assembled them using PE-Assembler. PE-Assembler can cover 90% of the original chromosome with an N50 size exceeding 60 000 bp. ABySS and SOAPdenovo produce a large number of contigs with very low N50 values. We failed to execute both Allpaths2 and Velvet for this dataset due to their high memory usage.
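The N50 and N90 figures quoted throughout the results can be reproduced from a list of contig lengths. The following is a minimal sketch, assuming the common definition in which the statistic is taken relative to the total assembled length; the evaluation in this paper may instead normalize by the reference length. The function name `nxx` and the toy contig lengths are hypothetical and serve only as illustration.

```python
def nxx(contig_lengths, percent):
    """Return the Nxx statistic for a set of contigs.

    Nxx is the length L of the contig at which, going from the longest
    contig downwards, the running total first reaches `percent` % of the
    summed contig length (assumption: total assembly length, not the
    reference length, is the denominator).
    """
    if not contig_lengths:
        raise ValueError("no contigs given")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Integer arithmetic avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length
    return min(contig_lengths)  # reached only if percent > 100


# Hypothetical contig lengths (bp), chosen only for illustration.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print("N50:", nxx(toy_contigs, 50))
print("N90:", nxx(toy_contigs, 90))
```

A larger N50 at comparable coverage indicates a more contiguous assembly, which is how the contig statistics in the tables are meant to be read.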

Furthermore, we simulated four libraries of 500 bp fragment paired-end data of four different read lengths at 60× coverage to assess the impact of an increase in read length on PE-Assembler. The results (Table 3) show that PE-Assembler benefits from an increase in read length and compares favorably against Velvet for all read lengths.

Table 3.

Performance of PE-Assembler using different read lengths

	Escherichia coli
	35 bp reads		50 bp reads		75 bp reads		100 bp reads
	PA	Velvet	PA	Velvet	PA	Velvet	PA	Velvet
No. of contigs (>600 bp)	73	90	64	89	67	83	53	80
Contig N50 size (kb)	124.7	111.8	133.0	132.6	144.0	132.8	178.3	171.8
Contig N90 size (kb)	31.9	35.0	35.2	31.5	35.2	40.1	41.4	40.1
Large misassemblies	0	1	0	1	1	0	1	1
Coverage (%)	98.78	98.76	99.11	99.08	98.91	99.37	98.73	99.36
Execution time ^a (min)	7	6	7	6	6	6	6	7
Peak memory usage (GB)	1.4	1.7	1.4	2.1	1.3	2.7	1.3	3.5

^aPE-Assembler was run using twenty parallel threads.

Velvet was run with the following k-values, respectively: 23, 31, 43 and 47. –cov_cutoff and –exp_cov were set to auto.

3.2 Experimental data

To assess our approach against wet lab data, we used four datasets provided with Allpaths2. Each dataset contains two paired-end read libraries: one of approximate fragment length 200 bp and the other ranging from 3000 to 4500 bp (Table 4). The single reads were not used.

Table 4.

Details of the experimental datasets

Organism	Staphylococcus aureus		Escherichia coli		Schizosaccharomyces pombe		Neurospora crassa
No. of contigs/chromosomes	3		1		4		251
Genome length	2 903 107		4 638 902		12 554 318		39 225 835
Library (bp)	200	3000	200	3000	200	3000	200	3000
Read length (bp)	35	26	35	26	35	26	35	26
Average insert size (bp)	224	3845	210	3771	208	3658	210	3650
Insert size range (bp)	195–255	3175–4725	180–260	3026–4626	195–265	2935–4535	175–245	2875–4675
No. of paired reads (millions)	5.52	3.89	15.04	5.46	27.58	25.62	95.66	61.88
Approximate coverage	130×	35×	230×	60×	150×	110×	170×	80×
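The approximate coverage row can be related to the other rows of Table 4 with a short calculation. The sketch below assumes that coverage is estimated as 2 × (number of pairs) × (read length) / (genome length); the paper does not state how the figure was derived, so this formula and the function name `approx_coverage` are assumptions. For the E.coli libraries it gives roughly 227× and 61×, in line with the rounded 230× and 60× in the table.

```python
def approx_coverage(num_pairs, read_length_bp, genome_length_bp):
    """Estimate sequencing depth, counting both reads of every pair.

    Assumption: depth = 2 * pairs * read length / genome length.
    """
    return 2 * num_pairs * read_length_bp / genome_length_bp


# E.coli values taken from Table 4.
ecoli_genome_bp = 4_638_902
short_library = approx_coverage(15.04e6, 35, ecoli_genome_bp)  # 200 bp library
long_library = approx_coverage(5.46e6, 26, ecoli_genome_bp)    # 3000 bp library
print(f"200 bp library:  {short_library:.0f}x")
print(f"3000 bp library: {long_library:.0f}x")
```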

As the reference genome is provided for every dataset, the evaluation criteria remained the same as in Section 3.1. However, since the reference genome and the sequenced genome are not expected to be identical, some minor errors are expected and allowed when we map the assembled contigs onto the reference genome. The results are summarized in Table 5. They show that PE-Assembler is equally adept at handling experimental data. It records the highest contiguity in the form of N50 sizes across all four datasets.

Table 5.

Comparison of experimental data results

	Staphylococcus aureus				Escherichia coli
	PA	Velvet	Allpaths2	ABySS	PA	Velvet	Allpaths2	ABySS

Parameters	−minol = 25	−k = 23	−k = 21	−k = 25	−minol = 25	−k = 27	−k = 21	−k = 25
		−cov = auto		−j = 2 −n = 10		−cov = 12		−j = 2 −n = 10
		−exp = auto		−np = 8		−exp = auto		−np = 8
Contig statistics
No. of contigs (>200 bp)	24	60	14	187	21	121	25	277
Average length (kb)	119.8	48.0	205.0	18.3	176.8	37.5	184.1	21.4
Maximum length (kb)	949.9	475.6	1122.8	175.1	895.9	356.6	1015.3	160.4
Contig N50 size (kb)	685.8	314.9	477.2	63.8	428.8	105.6	337.1	55.2
Contig N90 size (kb)	107.5	37.79	84.0	31.9	143.1	25.4	81.7	31.8
Coverage (%)	99.45	98.99	99.24	98.28	99.56	99.19	99.63	98.96
Evaluation
Large misassemblies	0	5	0	1	0	4	0	1
Segment maps (%)	98.48	96.66	98.55	94.56	98.73	95.60	99.18	94.55
Performance ^a
Total execution time (min)	17	8	95	13	34	25	222	29
Peak memory usage (GB)	1.9	2.8	20	2.6	3.3	6.9	37.6	5.3

	Schizosaccharomyces pombe				Neurospora crassa
	PA	Velvet	Allpaths2	ABySS	PA	Velvet	Allpaths2	ABySS
Parameters	−minol = 25	−k = 25		−k = 25	−minol = 25	−k = 25		−k = 25
		−cov = 3		−j = 2 −n = 10		−cov = auto		−j = 2 −n = 10
		−exp = auto		−np = 8		−exp = auto		−np = 16
Contig statistics
No. of contigs (>200 bp)	169	362	353	1028	2708	5079	1687	9916
Average length (kb)	72.1	33.7	33.8	13.0	12.8	6.8	18.3	3.8
Maximum length (kb)	571.1	443.0	257.2	136.8	156.2	71.0	161.2	56.0
Contig N50 size (kb)	147.7	110.6	50.0	36.0	20.7	11.6	17.6	8.1
Contig N90 size (kb)	40.0	33.2	12.2	12.3	-	-	-	1.0
Coverage (%)	96.97	97.82	95.20	97.93	87.40	87.70	78.38	88.70
Evaluation
Large misassemblies	3	26	2	27	16	273	18	395
Segment maps (%)	95.51	94.26	92.60	91.08	82.06	77.44	74.66	71.28
Performance ^a
Total execution time (min)	364	125	4830 ^b	72	1416	266	5196 ^b	331
Peak memory usage (GB)	6.6	15	N/A	6.6	21	45	N/A	25.6

^a All experiments were run on an 8-core machine except for the N.crassa dataset, which was run using 16 cores.

^b Reported as in the Allpaths2 publication, where experiments were carried out on a 16-core machine.

For the two smaller genomes, the coverage statistics are nearly identical for all four approaches. Assemblies produced by Velvet and ABySS show several large misassemblies, whereas those of PE-Assembler and Allpaths2 are free of such errors. Performance-wise, PE-Assembler is more efficient in memory consumption than all other programs. Particularly noteworthy is the large amount of memory consumed by Allpaths2 to assemble even the smallest of genomes.

Repeated attempts to assemble the two larger datasets using Allpaths2 failed on our system. We suspect this is due to the high memory usage of Allpaths2. Therefore, the comparison is based on the output provided at the Allpaths website. The timing quoted here is that reported in the Allpaths2 publication.

For the highly repetitive S.pombe genome, PE-Assembler results in an assembly with N50 and N90 sizes far greater than those of Allpaths2, Velvet and ABySS. PE-Assembler also shows better coverage than Allpaths2. The high number of large misassemblies in the Velvet and ABySS assemblies demonstrates the susceptibility of the de Bruijn graph approach to misassembling genomes in the presence of short repeat regions. In contrast, PE-Assembler and Allpaths2 result in only three and two large misassemblies, respectively. Of the three 'misassembled' contigs in the PE-Assembler output, two can be properly aligned against other strains of S.pombe and are therefore likely due to differences between the assembled strain and the reference. PE-Assembler's assembly for S.pombe also results in the highest number of segment maps, a testament to both its coverage and accuracy.
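Why short repeats are troublesome for de Bruijn graph assemblers can be seen on a toy example. The sketch below is an illustration only, not code from Velvet or ABySS: it builds the graph directly from a hypothetical genome string (real assemblers build it from reads and also handle reverse complements) and lists the nodes with more than one successor. A repeat of at least k−1 bases collapses into shared nodes, so the graph branches and the assembler must guess which path to follow.

```python
from collections import defaultdict


def branching_nodes(sequence, k):
    """Return the (k-1)-mer nodes that have more than one successor
    in the de Bruijn graph built from all k-mers of `sequence`."""
    successors = defaultdict(set)
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        # Each k-mer is an edge from its (k-1)-prefix to its (k-1)-suffix.
        successors[kmer[:-1]].add(kmer[1:])
    return sorted(node for node, nxt in successors.items() if len(nxt) > 1)


# Hypothetical genome: the 7 bp repeat GATTACA occurs twice
# with different flanking bases.
toy_genome = "CC" + "GATTACA" + "TTGG" + "GATTACA" + "AA"

for k in (4, 8, 9):
    print(k, branching_nodes(toy_genome, k))
```

With k = 8 the whole repeat becomes a single branching node, while with k = 9 the nodes are longer than the repeat and the ambiguity disappears. In practice k is bounded by the read length, which is why short reads leave many such branches unresolved.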

For the relatively larger Neurospora crassa genome, PE-Assembler's result leads in terms of contiguity and coverage. Note that Allpaths2's assembly is of significantly lower coverage in comparison with the other assemblies. Also note that the N.crassa reference genome is unfinished and consists of many contigs. The number of 'large misassemblies' reported is therefore likely to be inflated.

The current version of SOAPdenovo ignores reads of length <35 bp for the scaffolding process. Therefore, we did not test SOAPdenovo against the experimental datasets, as it would not be a fair comparison.

3.3 Parallelization and running time

One of the most important aspects of our method is its ability to carry out the entire assembly process in parallel. We carried out a series of experiments to determine how parallelization affects the execution time of the assembler. The simulated E.coli dataset with 200 and 10 000 bp libraries was assembled using 1–8 separate threads on an 8-core-CPU machine. Each thread was executed on a separate CPU core.

Figure 7 shows that distributing each step across multiple CPU cores in parallel decreases the execution time proportionally to the number of CPUs utilized. However, unlike the implementation in Allpaths2, the parallel implementation does not come at an extra memory overhead, as the data structures are shared by each thread. In each of the experiments, the maximum memory utilization was constant at 1.3 GB.

Fig. 7.

Execution time with respect to number of threads/cores utilized. Utilizing multiple cores dramatically reduces execution time. Theoretically, the improvement should be linear with the number of parallel threads; however, this is masked by the fact that each step has a constant IO overhead which cannot be parallelized.
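The effect described in the caption, a fixed IO cost that limits the gain from extra threads, can be made concrete with a simple Amdahl-style model. This is a minimal sketch under stated assumptions: the run time is split into a constant IO part and a perfectly divisible compute part, and the minute values below are hypothetical, not measurements from the paper.

```python
def execution_time(threads, parallel_minutes, io_minutes):
    """Model run time as a fixed IO part plus a perfectly divisible part."""
    if threads < 1:
        raise ValueError("threads must be >= 1")
    return io_minutes + parallel_minutes / threads


def speedup(threads, parallel_minutes, io_minutes):
    """Speed-up relative to a single thread under the same model."""
    single = execution_time(1, parallel_minutes, io_minutes)
    return single / execution_time(threads, parallel_minutes, io_minutes)


# Hypothetical workload: 40 min of parallelizable work, 4 min of IO.
for n in (1, 2, 4, 8):
    t = execution_time(n, 40.0, 4.0)
    s = speedup(n, 40.0, 4.0)
    print(f"{n} threads: {t:.1f} min, speed-up {s:.2f}")
```

Under this model the speed-up always stays below the thread count, and the gap widens as more threads are added, because the IO term does not shrink.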

4 DISCUSSION

PE-Assembler has demonstrated that it is possible to obtain complete and highly accurate de novo genome assemblies using high-throughput sequencing data within reasonable time and memory constraints. The highlight of PE-Assembler is that it eschews the traditional graph-based approach in favor of a simple extension approach.

The advantages of this approach are numerous. Memory requirements of graph-based approaches seem to increase exponentially as genome and data size increase. This was highlighted by the inability of Velvet and Allpaths2 to cope with the simulated HG18 chr10 dataset. In contrast, PE-Assembler produced a very usable assembly within a realistic memory limit.

Our approach is fundamentally similar to other 3′ extension approaches such as SSAKE, SHARCGS and VCAKE, but distinguishes itself by its extensive use of paired-end reads. Not only does this make such an approach scalable to larger genomes' datasets by localizing data, it also contributes to its high accuracy. As evident from both simulated and experimental data results, PE-Assembler is the least prone of all algorithms to misassemble different regions of the genome into a continuous segment.

Perhaps the most important aspect of PE-Assembler is its ability to seamlessly parallelize the assembly process. Multiple threads can simultaneously assemble the genome at various positions across the genome, while a simple detection mechanism ensures that multiple assemblies of the same region are highly unlikely. Also noteworthy is that parallel assembly in PE-Assembler does not come at an extra cost in memory, as it does in other methods such as Allpaths2 or ABySS. Being able to massively parallelize the assembly process at no extra overhead, it will prove valuable in assembling mammalian genomes as well as in larger metagenomics projects. With minor modifications, this approach can be extended to run in a computer cluster across multiple nodes to further decrease the running time.
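The idea of several threads working from one shared data structure while avoiding duplicate work can be sketched as follows. This is a toy illustration of the general pattern, not PE-Assembler's actual detection mechanism: the class `RegionRegistry`, the 1 kb region windows and the seed positions are all hypothetical, and Python threads are used only to show the sharing pattern, not to gain real CPU parallelism.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class RegionRegistry:
    """Shared record of regions already claimed by some thread."""

    def __init__(self):
        self._claimed = set()
        self._lock = threading.Lock()

    def try_claim(self, region_id):
        """Atomically claim a region; return False if already taken."""
        with self._lock:
            if region_id in self._claimed:
                return False
            self._claimed.add(region_id)
            return True


def assemble_from_seed(seed_position, registry):
    # Hypothetical rule: seeds in the same 1 kb window belong to one region.
    region_id = seed_position // 1000
    if not registry.try_claim(region_id):
        return None  # another thread is already assembling this region
    return region_id  # stand-in for the contig that would be built here


def run(seed_positions, threads=4):
    registry = RegionRegistry()  # one instance shared by all threads
    with ThreadPoolExecutor(max_workers=threads) as pool:
        results = pool.map(lambda s: assemble_from_seed(s, registry),
                           seed_positions)
        return sorted(r for r in results if r is not None)


print(run([120, 950, 1500, 1999, 4200, 4300]))
```

Because every thread consults the same registry, no per-thread copy of the data is needed, which mirrors the claim that parallel assembly adds no memory overhead.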

ACKNOWLEDGEMENTS

The authors would like to extend their gratitude to Pauline Chen of Research Computing Group, GIS, for her help in the evaluation and testing process. We would further like to thank Daniel Zerbino for his help in running Velvet and the reviewers for their useful feedback and insight.

Funding: This research was supported by MOE AcRF Tier 2 funding R-252-000-444-112 and Agency for Science, Technology and Research (A*STAR).

Conflict of Interest: none declared.

REFERENCES

ALLPATHS: de novo assembly of whole-genome shotgun microreads. Genome Res., 2008, vol. 18 (pg. 810–820)

De novo fragment assembly with short mate-paired reads: Does the read length matter? Genome Res., 2009 (pg. 336–346)

SHARCGS, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing. Genome Res., 2007 (pg. 1697–1706)

Extending assembly of short DNA sequences to handle error. Bioinformatics, 2007 (pg. 2942–2944)

BLAT–the BLAST-like alignment tool. Genome Res., 2002 (pg. 656–664)

De novo assembly of human genomes with massively parallel short read sequencing. Genome Res., 2010 (pg. 265–272)

ALLPATHS 2: small genomes assembled accurately and with high continuity from short paired reads. Genome Biol., 2009, pg. R103

An Eulerian path approach to DNA fragment assembly. Proc. Natl Acad. Sci. USA, 2001 (pg. 9748–9753)

ABySS: a parallel assembler for short read sequence data. Genome Res., 2009 (pg. 1117–1123)

Assembling millions of short DNA sequences using SSAKE. Bioinformatics, 2007 (pg. 500–501)

Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res., 2008, vol. 18 (pg. 821–829)

Author notes

Associate Editor: John Quackenbush

Source: https://academic.oup.com/bioinformatics/article/27/2/167/284712
