Tutorial

This page gives a brief tutorial on using MetaCortex to assemble a set of Illumina reads. The SRA Toolkit must be installed to obtain the reads used in this tutorial, and we will use trim-galore to trim the reads before assembly.

First, clone the MetaCortex repository into your chosen directory and build it from source. For this tutorial, we will use 31 as the maximum $k$-mer size:

git clone https://github.com/SR-Martin/metacortex.git
cd metacortex
make MAXK=31 metacortex

Mac users will need to include the MAC=1 flag. MetaCortex does not compile with LLVM (Clang), so Mac users will also need to install GCC and set CC to point to the GCC binary, e.g.:

export CC=/usr/local/bin/gcc-11
make MAXK=31 MAC=1 metacortex

If you wish, add the bin directory, which contains the binary metacortex_k31, to your ${PATH} variable. Next, create a new directory for the tutorial data and navigate to it. Download the reads using the fastq-dump command from the SRA Toolkit:

fastq-dump SRR961514

Next, trim the reads using adapter-trimming software (e.g. trim-galore):

trim_galore SRR961514.fastq
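
To sanity-check the trimming step, it can help to compare read counts before and after. The following is a minimal sketch, assuming uncompressed FASTQ files with exactly four lines per record; count_reads is a hypothetical helper defined here, not part of MetaCortex or trim-galore. The trimmed count may be slightly lower, since trimming tools can discard reads that become too short.

```shell
# count_reads FILE: prints the number of records in an uncompressed FASTQ file,
# assuming exactly four lines per record (hypothetical helper).
count_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}

# Report counts for the raw and trimmed files, skipping any that are absent.
for f in SRR961514.fastq SRR961514_trimmed.fq; do
    if [ -f "$f" ]; then
        echo "$f: $(count_reads "$f") reads"
    fi
done
```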

Note that MetaCortex does not make use of paired-end information, so there is no need to retain read pairing during trimming. Create a text file containing the name of the file that holds the trimmed reads:

echo SRR961514_trimmed.fq > reads.txt
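
Before starting the assembly, you may want to confirm that every file named in the list exists. The sketch below assumes one path per line, resolved relative to the current directory; check_file_list is a hypothetical helper, not a MetaCortex command.

```shell
# check_file_list LIST: returns non-zero if any path listed in LIST is missing.
# Blank lines are ignored (hypothetical helper).
check_file_list() {
    status=0
    while IFS= read -r path || [ -n "$path" ]; do
        [ -z "$path" ] && continue
        if [ ! -f "$path" ]; then
            echo "missing: $path" >&2
            status=1
        fi
    done < "$1"
    return "$status"
}

if [ -f reads.txt ]; then
    check_file_list reads.txt && echo "all input files found"
fi
```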

Now we are ready to run MetaCortex! The read set contains around 17 million unique 31-mers, so following the examples in Memory Usage we set $n=18$ and $b=100$. We will set the -M flag so that we get only one contig per connected component in the de Bruijn graph, set the minimum contig length to 500, and set the minimum coverage to 100. We will use the MCC algorithm to assemble this dataset, and also set the -G flag to obtain GFA and FASTG files:
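
As a rough check on the choice of $n$ and $b$, the hash table capacity can be computed with shell arithmetic. This sketch assumes that $n$ is the base-2 logarithm of the number of buckets and $b$ is the number of entries per bucket, so the table holds $2^n \times b$ $k$-mers; consult Memory Usage for the authoritative definition.

```shell
n=18            # -n: log2 of the number of hash table buckets (assumed meaning)
b=100           # -b: entries per bucket (assumed meaning)
kmers=17000000  # approximate number of unique 31-mers in this read set

# Capacity under the assumption above: 2^n buckets of b entries each.
capacity=$(( (1 << n) * b ))
echo "capacity: $capacity k-mers"

if [ "$capacity" -gt "$kmers" ]; then
    echo "hash table is large enough"
else
    echo "increase n or b"
fi
```

Under the same assumption, $n=17$ would give room for only about 13 million $k$-mers, which is why $n=18$ is the smallest suitable value for $b=100$.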

metacortex_k31 -k 31 -n 18 -b 100 -i reads.txt -t fastq -f contigs.fa -l log.txt -A MCC -G -M -C 100 -g 500

This is a relatively small dataset, so assembly should take only around 15 minutes on a modern machine. Once complete, you should find the file contigs.fa in the directory, containing a single FASTA sequence of length 9188 bp. The files contigs.fastg and contigs.gfa contain sequence graphs in their respective formats (note: GFA v2 is used), representing local variation along the contig.
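
To confirm the number and length of the assembled contigs, the FASTA output can be summarised with awk. This is a minimal sketch; fasta_lengths is a hypothetical helper, not part of MetaCortex, and it simply sums the lengths of the sequence lines under each header.

```shell
# fasta_lengths FILE: prints "<header><TAB><length>" for each FASTA record,
# summing bases across wrapped sequence lines (hypothetical helper).
fasta_lengths() {
    awk '
        /^>/ {
            if (name != "") print name "\t" len
            name = substr($0, 2)   # header text without the leading ">"
            len = 0
            next
        }
        { len += length($0) }
        END { if (name != "") print name "\t" len }
    ' "$1"
}

if [ -f contigs.fa ]; then
    fasta_lengths contigs.fa
fi
```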