Canu-Teaching
Category
Bioinformatics
Program On
Teaching
Version
1.6
Author / Distributor
Description
"Canu is a fork of the Celera Assembler, designed for high-noise single-molecule sequencing (such as the PacBio RS II or Oxford Nanopore MinION). Canu is a hierarchical assembly pipeline which runs in four steps:
Detect overlaps in high-noise sequences using MHAP
Generate corrected sequence consensus
Trim corrected sequences
Assemble trimmed corrected sequences"
More details are at canu
Running Program
The latest version of this application is installed at /usr/local/apps/eb/canu/1.6-foss-2016b
To use this version, please load the module with
ml canu/1.6-foss-2016b
Here is an example of a shell script, sub.sh, to run on the batch queue:
#!/bin/bash
#SBATCH --job-name=j_canu
#SBATCH --partition=batch
#SBATCH --mail-type=ALL
#SBATCH --mail-user=username@uga.edu
#SBATCH --ntasks=1
#SBATCH --mem=10gb
#SBATCH --time=08:00:00
#SBATCH --output=canu.%j.out
#SBATCH --error=canu.%j.err
cd "$SLURM_SUBMIT_DIR"
ml canu/1.6-foss-2016b
canu [options]
In a real submission script, at least the values shown above (such as the job name, e-mail address, memory, time limit, and the canu options) need to be reviewed and replaced with values appropriate for your job.
Please refer to Running_Jobs_on_the_teaching_cluster, Run X window Jobs and Run interactive Jobs for more details on running jobs on the teaching cluster.
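As an illustration of what the canu [options] line in sub.sh might look like once filled in, here is a minimal sketch. The prefix, directory, genome size, and read file name are hypothetical placeholders, and useGrid=false is an assumption that canu should run locally inside the single batch job rather than submit its own jobs. The RUN_CANU variable is only a guard introduced for this sketch; it is not a canu or Slurm setting.

```shell
#!/bin/bash
# Hypothetical placeholders: replace each with values for your own data.
prefix="ecoli"            # -p: prefix used for output file names
outdir="ecoli_run1"       # -d: assembly directory (created if needed)
genome_size="4.7m"        # best guess of the haploid genome size
reads="reads.fastq.gz"    # FASTA or FASTQ, optionally gz/bz2/xz compressed

# Build the command as an array so each argument stays intact.
# useGrid=false (run locally) is an assumption for a single-job setup.
canu_cmd=(canu -p "$prefix" -d "$outdir" "genomeSize=$genome_size" useGrid=false -pacbio-raw "$reads")

# Print the command for review before running anything.
echo "${canu_cmd[*]}"

# Run only when explicitly requested and canu is on the PATH.
if [ "${RUN_CANU:-no}" = "yes" ] && command -v canu >/dev/null 2>&1; then
    "${canu_cmd[@]}"
fi
```

Printing the command first makes it easy to check the options before committing cluster time; setting RUN_CANU=yes in the job script would then run the assembly.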
Here is an example of a job submission command:
sbatch ./sub.sh
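After submission, sbatch normally prints a confirmation line of the form "Submitted batch job <id>". Because sub.sh names its log files canu.%j.out and canu.%j.err, that job ID tells you which files to look at. The sketch below shows one way to derive the names; the helper function and the job ID 12345 are hypothetical and used only for illustration.

```shell
#!/bin/bash
# Extract the job ID from sbatch's confirmation line, assumed to read
# "Submitted batch job <id>" (the ID is the last word on the line).
job_id_from_sbatch() {
    local line="$1"
    echo "${line##* }"
}

# Hypothetical confirmation line; in practice use: line=$(sbatch ./sub.sh)
line="Submitted batch job 12345"
job_id=$(job_id_from_sbatch "$line")

# %j in the --output and --error patterns of sub.sh expands to the job ID.
echo "stdout log: canu.${job_id}.out"
echo "stderr log: canu.${job_id}.err"
```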
Documentation
ml canu/1.6-foss-2016b
canu -h

usage:   canu [-version] [-citation] \
              [-correct | -trim | -assemble | -trim-assemble] \
              [-s <assembly-specifications-file>] \
               -p <assembly-prefix> \
               -d <assembly-directory> \
               genomeSize=<number>[g|m|k] \
              [other-options] \
              [-pacbio-raw |
               -pacbio-corrected |
               -nanopore-raw |
               -nanopore-corrected] file1 file2 ...

example: canu -d run1 -p godzilla genomeSize=1g -nanopore-raw reads/*.fasta.gz

  To restrict canu to only a specific stage, use:
    -correct       - generate corrected reads
    -trim          - generate trimmed reads
    -assemble      - generate an assembly
    -trim-assemble - generate trimmed reads and then assemble them

  The assembly is computed in the -d <assembly-directory>, with output files named
  using the -p <assembly-prefix>.  This directory is created if needed.  It is not
  possible to run multiple assemblies in the same directory.

  The genome size should be your best guess of the haploid genome size of what is being
  assembled.  It is used primarily to estimate coverage in reads, NOT as the desired
  assembly size.  Fractional values are allowed: '4.7m' equals '4700k' equals '4700000'

  Some common options:
    useGrid=string
      - Run under grid control (true), locally (false), or set up for grid control
        but don't submit any jobs (remote)
    rawErrorRate=fraction-error
      - The allowed difference in an overlap between two raw uncorrected reads.  For lower
        quality reads, use a higher number.  The defaults are 0.300 for PacBio reads and
        0.500 for Nanopore reads.
    correctedErrorRate=fraction-error
      - The allowed difference in an overlap between two corrected reads.  Assemblies of
        low coverage or data with biological differences will benefit from a slight increase
        in this.  Defaults are 0.045 for PacBio reads and 0.144 for Nanopore reads.
    gridOptions=string
      - Pass string to the command used to submit jobs to the grid.  Can be used to set
        maximum run time limits.  Should NOT be used to set memory limits; Canu will do
        that for you.
    minReadLength=number
      - Ignore reads shorter than 'number' bases long.  Default: 1000.
    minOverlapLength=number
      - Ignore read-to-read overlaps shorter than 'number' bases long.  Default: 500.
  A full list of options can be printed with '-options'.  All options
  can be supplied in an optional sepc file with the -s option.

  Reads can be either FASTA or FASTQ format, uncompressed, or compressed
  with gz, bz2 or xz.  Reads are specified by the technology they were
  generated with:
    -pacbio-raw         <files>
    -pacbio-corrected   <files>
    -nanopore-raw       <files>
    -nanopore-corrected <files>

Complete documentation at http://canu.readthedocs.org/en/latest/
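The help text above notes that fractional genome sizes are allowed ('4.7m' equals '4700k' equals '4700000') and that genomeSize is used mainly to estimate read coverage. The following sketch illustrates both ideas with two hypothetical helper functions. It is not Canu's own code: the coverage formula (total read bases divided by genome size) is the usual back-of-the-envelope definition, and the helper assumes an uncompressed FASTA file.

```shell
#!/bin/bash
# Convert a genomeSize-style value (e.g. 4.7m, 4700k, 1g) to a base count.
# Hypothetical helper, assuming k/m/g mean thousand/million/billion bases.
genome_size_to_bases() {
    local size="$1"
    awk -v s="$size" 'BEGIN {
        unit = tolower(substr(s, length(s), 1))
        mult = 1
        if (unit == "k") mult = 1000
        else if (unit == "m") mult = 1000000
        else if (unit == "g") mult = 1000000000
        if (mult > 1) s = substr(s, 1, length(s) - 1)   # drop the suffix
        printf "%.0f\n", s * mult
    }'
}

# Rough coverage estimate: total bases in an uncompressed FASTA file
# divided by the genome size in bases. Header lines (">") are skipped.
estimate_coverage() {
    local fasta="$1" genome_bases="$2"
    awk -v g="$genome_bases" '
        /^>/ { next }
        { total += length($0) }
        END { if (g > 0) printf "%.2f\n", total / g }
    ' "$fasta"
}

# The three spellings from the help text give the same number of bases.
for size in 4.7m 4700k 4700000; do
    genome_size_to_bases "$size"    # each prints 4700000
done
```

For example, estimate_coverage reads.fasta "$(genome_size_to_bases 4.7m)" would print the approximate fold coverage of a hypothetical reads.fasta for a 4.7 Mb genome.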
Installation
Source code is obtained from canu
System
64-bit Linux