downpore

command module

v0.1.1 Latest Latest Go to latest Published: Jan 18, 2018 License: MIT Imports: 4 Imported by: 0

Details

Valid go.mod file

The Go module system was introduced in Go 1.11 and is the official dependency management solution for Go.
Redistributable license

Redistributable licenses place minimal restrictions on how software can be used, modified, and redistributed.
Tagged version

Modules with tagged versions give importers more predictable builds.
Stable version

When a project reaches major version v1 it is considered stable.
Learn more about best practices

Repository

github.com/jteutenberg/downpore

Links

Open Source Insights

README ¶

downpore

A suite of tools for use in genome assembly and consensus. Work in progress.

Installation

Downpore is written in Go and requires no external libraries. It has been tested on Linux but should compile and run on Mac OS and Windows too.

Build from source

To build downpore from source:

install Go 1.8 or higher and git
create a directory $GOPATH/src/github.com/jteutenberg
from that directory, clone the downpore repository git clone https://github.com/jteutenberg/downpore.git
enter the new downpore directory and build it using go build downpore.go

Run precompiled binary

A stand-alone, pre-compiled binary for Linux x86-64 is available at https://github.com/jteutenberg/downpore/releases/latest

Usage

The general form of execution is downpore <command> [arguments] where arguments are provided with either a one- or two-hyphen switch followed by a space or equals sign and then its value.

Running downpore with no arguments gives a list of available commands.

To see the available arguments for a command, use downpore help <command>.

Read input format

Input reads must be a fasta/fastq file or a gzip (with .gz suffix) containing a fasta/fastq. Fasta files with multiple lines per sequence are not handled and will need to have end-lines removed.

Only DNA sequences composed of A,C,G and T are accepted. Other characters will be assigned one of these values.

Command: trim

The trim command is used to remove adapters or barcodes from the end of long reads. It broadly performs the same function as Porechop, and while downpore is less polished it is substantially faster.

Input reads are specified using the -i argument. Output reads are written to stdout and will be in the same format as the input reads.

Usage examples:

downpore trim -i reads.fastq -f ./data/adapters_front.fasta -b ./data/adapters_back.fasta > trimmed.fastq

downpore trim -i reads.fastq -f ./data/adapters_front.fasta -b ./data/adapters_back.fasta --himem true --num_workers 32 > trimmed.fastq

Trim overview

The trim command uses k-mer matching and chaining to find sub-sequences of reads that match any adapter/barcode from a list provided by the user. Adapters found in the middle of long reads cause that read to be split.

Example adapter lists can be found in data/adapters_front.fasta and data/adapters_back.fasta.

Where possible, arguments and their default values mirror those of Porechop. Important functional differences are that downpore trim does not pair up front and back adapters (all are treated as being independent); and there is no notion of good and bad sides of an internal adapter: both are conservatively treated as bad.

Matching criteria

Adapters are identified with high recall at the edges (first and last 150 bases) and with higher precision in the middle of a read.

By default, 6-mers are used in matching. At the edges, an adapter is identified as being present when at least 3 k-mers match in-order and at approximately the expected distance from one another. This means that at a minimum 8 contiguous matching bases are present.

Internal matches are made when the "identity" matching passes the threshold specified by -middle_threshold (default 85%). The identity value is taken as the percentage of bases in the adapter that are contained in at least one matching k-mer.

Sequence labels

By default, the adapter/barcode with the most bases present in a read has its name appended to the beginning of the read's label in the trimmed output. This can be turned off using -tag_adapters.

Adapters with names beginning "Barcode" are a special case. These take precedence and will always be used in the output label if found, though the trimming will still be based on all adapters present. If there exist two barcodes that are within 5% identity in a read (i.e. ambiguous barcodes) then no adapter labels will be written. Take care: this has not been thoroughly tested.

Trim arguments

input the input reads file
front_adapters fasta containing adapters found at the beginning of reads
back_adapters fasta containing adapters found at the end of reads
k the size of k-mers to use in matching
chunk_size the size to split long input reads into when indexing
middle_threshold percentage identity to match an internal adapter
discard_middle whether or not to keep the splits resulting from an internal match
check_reads number of reads to consider when determining adapters present in the data
adapter_threshold the identity to match an adapter when performing check_reads
extra_end_trim number of bases to trim around adapters at the front and back
extra_middle_trim number of bases to trim around internal adapters
tag_adapters whether or not to modify output labels
verbosity a level of logging output (written to stderr)
num_workers number of threads to use to build the search index for internal adapters
himem whether or not to cache input reads in memory

De-multiplexing

The trim command does not explicitly de-multiplex data. If this is required then the sequence names of the output (which include barcode labels added by downpore) can be used by an external script to partition them.

Porechop performance comparison

The main use case for the downpore trim command is for those situations in which Porechop is the bottleneck in your pipeline, or possibly when there are memory constraints. In terms of performance, downpore is I/O bound as it makes two passes through the input file, whereas Porechop is CPU bound. As such, all comparisons below are based on wall-clock time.

When the --himem argument is set the trim uses only a single pass through the input file and this has been tested separately.

Performance test data

Three data sets:

E.coli small: 500MB of E.coli, R9.4 from Birmingham Uni (Bham_20171116_1xRAD004_6000ng.pass.fastq)
E.coli: 1.5GB E.coli, R9.2 from Loman lab (E_coli_K12_1D_R9.2_SpotON_2.pass.fasta)
Human: 2GB of chromosome 20 (na12878.chr20ScrappieFiltered.fasta)

Results

Trimming was performed on a low spec 4-core machine with an SSD.

Data	Speedup	Speedup (HiMem)	downpore memory	downpore (HiMem) memory	Porechop memory
E.coli small	31x	34x	0.3GB	1.1GB	1.1GB
E.coli small .gz	6x	11x	0.3GB	1.1GB	1.1GB
E.coli	25x		1.7GB		3.6GB
E.coli .gz	11x		1.7GB		3.6GB
Human	31x		1.3GB		4.6GB
Human .gz	12x		1.3GB		4.6GB

In terms of adapters found, downpore typically finds a few percent more at the edges of reads. In the test examples, Porechop also applies a back adapter (which trimmed ~3-5% of reads) that is not clearly present in the first 10k reads but was included based on its association with a front adapter.

With downpore's middle_threshold set to 80 (rather than default 85) the two sets of splits are typically the same. Porechop reports a higher number of middle adapters as it includes some near the edges that downpore treats as a front/back trim instead.

For those repeating the above tests, also note that Porechop finds a number of false positive splits in E.coli. These come from an 86% identity match to the short back adapter that is actually present in the E.coli genome.