NovoAlign

From Research Computing Center Wiki

Category

Bioinformatics

Program On

zcluster

Version

2.07.18

Author / Distributor

Novocraft Technologies Sdn Bhd, Kuala Lumpur, Malaysia. Any published works where free versions of Novocraft products have been used in data analysis should include an acknowledgment and a link to www.novocraft.com

Description

An aligner for single-end and paired-end short reads. More details are available at the Novoalign website.

Running Program

Also refer to Running Jobs on zcluster

Version 2.07.18 is at /usr/local/novoalign/2.07.18/

/usr/local/novoalign/latest/ always points to the latest version.
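To see which version `latest` currently resolves to, something like the following may help. This is a minimal sketch that assumes `latest` is a symbolic link (the page does not say how it is implemented) and that GNU `readlink` is available, as it normally is on 64-bit Linux. The function name `resolve_install` is made up for illustration.

```shell
#!/bin/bash
# resolve_install PATH: print the real directory behind PATH after following symlinks.
# Assumption: "latest" is a symlink; this page does not confirm that.
resolve_install() {
    readlink -f "$1"
}

# On machines without this installation the lookup fails, so report that instead.
resolve_install /usr/local/novoalign/latest || echo "path not found on this machine"
```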

Example of a shell script, novo.sh, to run in the batch queue:

#!/bin/bash
cd working_directory
time /usr/local/novoalign/latest/novoindex [options] 

Example of submission to the queue:

 qsub -q queueName ./novo.sh 
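A fuller novo.sh for a paired-end alignment might look like the sketch below. It assumes a reference index has already been built with novoindex. The file names (ref.nix, reads_1.fastq, reads_2.fastq, aligned.sam), the WORKDIR variable, and the choice of -F ILM1.8 are placeholders for illustration, not site defaults. The flags -d, -f, -F and -o SAM are taken from the usage text on this page. The script prints the command and runs it only when the binary is present.

```shell
#!/bin/bash
# Sketch of a fuller novo.sh. All file names are hypothetical placeholders.
NOVO_DIR=/usr/local/novoalign/latest   # install path given on this page
WORKDIR=${WORKDIR:-.}                  # hypothetical: directory holding the reads
INDEX=ref.nix                          # hypothetical index produced by novoindex
READS1=reads_1.fastq                   # hypothetical read files (side 1 and side 2)
READS2=reads_2.fastq
OUT=aligned.sam                        # hypothetical output file

cd "$WORKDIR" || exit 1

# Build the command in the positional parameters so each argument stays intact.
set -- "$NOVO_DIR/novoalign" -d "$INDEX" -f "$READS1" "$READS2" -F ILM1.8 -o SAM

# Record the exact command in the job log.
cmd_line="$*"
echo "$cmd_line"

# Run only if the binary exists here; the report goes to standard output.
if [ -x "$1" ]; then
    time "$@" > "$OUT"
fi
```

Giving two files to -f selects paired-end mode; dropping READS2 would give a single-end run.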

Documentation

A wiki and FAQ are available at the Novocraft website.

/usr/local/novoalign/2.07.18/novoalign

Novoalign V2.07.18

Usage:
    novoalign options

Options:
    -d dbname      Full pathname of indexed reference sequence from novoindex
    --mmapOff      Turns off memory mapping for the index. By default the index
                   file is memory mapped allowing it to be shared by multiple
                   instances of Novoalign.
    --LockIdx      Use MAP_LOCKED flag when memory mapping the index.

Options for Read processing:
    -f read1 read2     Filenames for the read sequences for Side 1 & 2.
                   If only one file is specified then single end reads are processed.
                   If two files are specified then the program will operate in paired end mode.
    --hdrhd [9|off] Controls checking of identity between headers in paired end reads. 
                   Sets the Hamming Distance or disables the check. Default is a Hamming Distance of not 
                   more than 1. Processing will stop with appropriate error messages if 
                   Hamming Distance exceeds the limit.
    -F format      Specifies a read file format, refer to manual for full list of options.
                   For Fastq '_sequence.txt' files from Illumina
                       CASAVA 1.3 to 1.7 use -F ILMFQ.
                       CASAVA 1.8 and later use -F ILM1.8
                       Pre 1.3 use -F SLXFQ
                   QSEQ & ILM1.8 files include reads that have been flagged as low quality by the
                   base caller. Specify how these are processed with the following options:
    --ILQ_USE      Ignore QC flag and align the reads.
    --ILQ_SKIP     Skip the reads entirely (Default). They will not appear in the reports.
    --ILQ_QC       Do not align reads but include in report with QC flag.
    -H             Hard clip trailing bases with quality <= 2
    --Q2Off        Turns off treating Q=2 bases as 'Illumina Read Segment Quality Control Indicator'
    -l 99          Sets the minimum information content for a read in base pairs. Default log4(Ng) + 5
                   where Ng is the length of the reference genome. Measure uses base qualities
                   to determine information content of the read in bits and divides by 2 to get
                   effective length in bases.
    -n 99          Truncate reads to the specified length before alignment. Default is 150, maximum is 300.
    -p 99,99 [0.9,99]
                   Sets polyclonal filter thresholds.  The first pair of values (n,t) sets
                   the number of bases and threshold for the first 20 base pairs of each read.
                   If there are n or more bases with phred quality below t then the read is
                   flagged as polyclonal and will not be aligned. The alignment status is 'QC'.
                   The second pair of values applies to the entire read rather than just the first 20bp 
                   and is entered as fraction of bases in the read below the threshold.
                   Setting -p -1 disables the filter. Default -p -1,10 -1.00,10
    -a [read1 adapter] [read2 adapter]
                   Enables adapter stripping from 3' end of reads before aligning. The second
                   adapter is used for the second read in paired end mode.
                   Default adapter sequence for single end is TCGTATGCCGTCTTCTGCTTG.
                   Default adapter sequences for paired end reads are:
                           Read1: AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAG
                           Read2: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA
    -h 99 [99]     Sets homopolymer and optional dinucleotide filter score. Any read that
                   matches a homopolymer or dinuc with score less than or equal to this
                   threshold will be skipped (reported as QC).
                   Default 20 for homopolymer and 20 for dinucleotides.
                   Bi-Seq default 120 for homopolymer and 20 for dinucleotides.
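
The default for -l above is given as log4(Ng) + 5. As a rough illustration of what that means for a given genome, the formula can be evaluated with awk. This is only a sketch of the arithmetic as stated in the help text, not Novoalign's own code, and the function name `min_info_default` is invented here.

```shell
#!/bin/bash
# min_info_default NG: print log4(NG) + 5, the documented default for -l,
# where NG is the reference genome length in bases.
min_info_default() {
    # awk's log() is the natural log, so divide by log(4) to change base.
    awk -v ng="$1" 'BEGIN { printf "%.2f\n", log(ng) / log(4) + 5 }'
}

# Example: a genome of roughly 3 billion bases.
min_info_default 3000000000
```

Because the value grows with the logarithm of genome length, even large changes in reference size move the default by only a few bases.
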

Options for alignment scoring:
    -t 99          Sets the maximum alignment score acceptable for the best alignment. Default Automatic.
                   In automatic mode the threshold is set based on read length, genome size and other
                   factors (see manual).
                   For pairs the threshold applies to the fragment and includes both ends and the length penalty.
    -g 99          Sets the gap opening penalty. Default 40
    -x 99          Sets the gap extend penalty. Default 6
    -u 99          Penalty for unconverted CHG or CHH cytosine in bisulfite alignment mode. Default 0
                   For plants 6 may be a good value.
    -b mode        Sets Bisulphite alignment mode. Values for mode are:
                        4 - Aligns in 4 possible combinations of direction and index. (Default)
                        2 - Aligns reads in forward direction using CT index and in reverse complement using the GA index.
    --WCBoth       If a Bi-Seq read maps to both Watson & Crick strands at the same location, same score, and same read 
                   orientation then Novoalign default behaviour is to report the alignment against only one strand chosen
                   at random. This option turns off the random selection so that two alignments will be reported, one for 
                   each strand.
    -N 999         Sets the number of bp of source DNA that are not represented in the reference sequences (index). This
                   value is used in calculation of prior probability that the read originated in sequence that we cannot 
                   align to because it is not in the reference sequence set. By default we use the number of bases coded 
                   as N's in the reference genome. Set to zero to disable inclusion of this in quality calculations.

Options for reporting:
    -o format [readgroup]
                   Specifies the report format. Native, Pairwise, SAM. Default is Native. 
                   Optionally followed by SAM @RG record.
                   Refer to manual for details of additional options.
    -o Softclip    Turns on soft clipping of alignments. Default for SAM report format.
    -o FullNW      Turns off soft clipping of alignments. Default for Pairwise and Native report formats.
    -o Header text     Appends 'text' to every read header in the output report.
    -o IUBMatch    In Native report format a match between query and an IUB ambiguous code in the reference will be 
                   reported in the list of mismatched bases.
    --rOQ          If quality calibration is on then write original base qualities as SAM OQ:Z: tag
    --rNMOri       If a read is unmapped report original read and qualities before any hard clipping.
    -R 99          Sets score difference between best and second best alignment for calling a repeat. Default 5.
    -r strategy [limit]
                   Sets strategy for reporting repeats. 'None', 'Random', 'All', 'Exhaustive',
                   or a posterior probability limit. Default None.
                   For -rAll & -rEx you can also specify a limit on the maximum number of alignments reported per read.
    -Q 99          Sets lower limit on alignment quality for reporting. Default 0.
    -e 999         Sets a limit on number of alignments for a single read.
                   This limit applies to the number of alignments with score equal to that of the best
                   alignment. Alignment process will stop when the limit is reached.
                   Default 1000 in default report mode, off for other modes.
    -q 9           Sets number of decimal places for quality score in Native report format. Default zero.
    --3Prime       Report mapping location of 3' end of read. In SAM report format this is Z3
                   tag. In Native report format it is attribute immediately after 5' mapping
                   location.
    -K [file]      Collects mismatch statistics for quality calibration by position in the read 
                   and called base quality. Mismatch counts are written to the named file after 
                   all reads are processed. When used with -k option the mismatch counts include 
                   any read from the input quality calibration file.
    --HPStats <filename>  Collect Insert/Delete statistics for homopolymers and write to named
                          file.

Paired End Options:
    -i [mode] 99[-|,]99
    -i MP 99[-|,]99 99[-|,]99
                   Sets approximate fragment length and standard deviation (comma separator) or 
                   a range of fragment lengths (hyphen separator).
                   Mode is:  'PE', '+-', or not specified for paired end reads
                             'MP' or '++' for ABI SOLiD mate pairs.
                   The mode changes the expected orientation of the reads in a proper pair.
                   Default -i PE 250,50
    -v 99          Sets the structural variation penalty for chimeric fragments. Default 70
    -v 99 99       Sets the structural variation penalty for chimeric fragments.
                    1) Penalty for SVs within one sequence
                    2) Penalty for SVs across different sequences.
    -v 99 99 99 regex
                   Sets the structural variation penalty for chimeric fragments. The three values are for:
                    1) Penalty for SVs within a group of sequences as defined by the regular expression.
                    2) Penalty for SVs within a single sequence
                    3) Penalty for SVs across a different sequence and group.
                   regex defines a regular expression applied to headers of indexed sequences. The regular
                   expression should define one field that is used to define sequence groups.
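
The -i syntax above accepts either a mean and standard deviation (comma separator) or a range of fragment lengths (hyphen separator). The two spellings can be sketched with a small helper; `fragment_opt` and its `stats`/`range` keywords are hypothetical names used only for this illustration, and the helper does not cover the two-pair form of -i MP.

```shell
#!/bin/bash
# fragment_opt MODE stats MEAN SD  -> "-i MODE MEAN,SD"  (length and std deviation)
# fragment_opt MODE range MIN MAX  -> "-i MODE MIN-MAX"  (range of lengths)
fragment_opt() {
    mode=$1
    kind=$2
    a=$3
    b=$4
    case $kind in
        stats) echo "-i $mode $a,$b" ;;
        range) echo "-i $mode $a-$b" ;;
        *)     echo "unknown kind: $kind" >&2; return 1 ;;
    esac
}

fragment_opt PE stats 250 50    # same as the documented default
fragment_opt PE range 200 300   # illustrative range, not a recommended value
```
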

Single End Options:
    -m [99]        Sets miRNA mode. In this mode each alignment to a read is given an additional
                   score based on nearby alignment to the opposite strand of the read. Optional
                   parameter sets maximum distance in bp between alignment and its reverse complement, Default 100bp.
                   Setting miRNA mode changes the default report mode to 'All'.
    -s 9           Turns on read trimming and sets trimming step size. Default step size is 2bp.
                   Unaligned reads are trimmed until they align or fail the QC tests.

 (c) 2008, 2009, 2010 NovoCraft Technologies Sdn Bhd

Installation

Academic version, downloaded from http://www.novocraft.com/download.html

System

64-bit Linux