Canu-Teaching: Difference between revisions

From Research Computing Center Wiki
Latest revision as of 15:40, 15 August 2018

Category

Bioinformatics

Program On

Teaching

Version

1.7

Author / Distributor

canu

Description

"Canu is a fork of the Celera Assembler, designed for high-noise single-molecule sequencing (such as the PacBio RS II or Oxford Nanopore MinION). Canu is a hierarchical assembly pipeline which runs in four steps:

  - Detect overlaps in high-noise sequences using MHAP
  - Generate corrected sequence consensus
  - Trim corrected sequences
  - Assemble trimmed corrected sequences"

More details are at canu

Running Program

The latest version of this application is at /usr/local/apps/eb/canu/1.7-foss-2016b

To use this version, please load the module with

ml canu/1.7-foss-2016b 

Here is an example of a shell script, sub.sh, to run on the batch queue:

#!/bin/bash
#SBATCH --job-name=j_canu
#SBATCH --partition=batch
#SBATCH --mail-type=ALL
#SBATCH --mail-user=username@uga.edu
#SBATCH --ntasks=1
#SBATCH --mem=10gb
#SBATCH --time=08:00:00
#SBATCH --output=canu.%j.out
#SBATCH --error=canu.%j.err

cd $SLURM_SUBMIT_DIR
ml canu/1.7-foss-2016b
canu [options]

In a real submission script, review at least the underlined values above and replace them with values appropriate to your job.
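As an illustration of replacing the placeholders, the following sketch writes a filled-in sub.sh and checks its syntax. The prefix asm, the directory asm_run, the read file reads.fastq.gz and genomeSize=4.7m are hypothetical values, not site defaults. useGrid=false is taken from the canu help text and is assumed here so that canu runs inside the single allocated job instead of submitting further jobs.

```shell
# Write a filled-in submission script. The canu arguments below are
# hypothetical examples; adjust them to your data and to the cluster's limits.
cat > sub.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=j_canu
#SBATCH --partition=batch
#SBATCH --mail-type=ALL
#SBATCH --mail-user=username@uga.edu
#SBATCH --ntasks=1
#SBATCH --mem=10gb
#SBATCH --time=08:00:00
#SBATCH --output=canu.%j.out
#SBATCH --error=canu.%j.err

cd "$SLURM_SUBMIT_DIR"
ml canu/1.7-foss-2016b
# useGrid=false: run locally within this job (see the canu help text).
canu -p asm -d asm_run genomeSize=4.7m useGrid=false -pacbio-raw reads.fastq.gz
EOF

# Check the generated script for shell syntax errors without running it.
bash -n sub.sh && echo "sub.sh looks syntactically valid"
```

The memory and time requests are copied from the template above and may well be too small for a real assembly; treat them as starting points.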

Please refer to Running_Jobs_on_the_teaching_cluster, Run X window Jobs and Run interactive Jobs for more details on running jobs on the teaching cluster.


Here is an example of a job submission command:

sbatch ./sub.sh 
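After submission, sbatch prints a confirmation line of the form "Submitted batch job <id>", and the %j in the --output and --error settings expands to that job ID. The helper functions below are a small sketch (their names are made up for this example) that extracts the ID and lists the log files to look for.

```shell
# Extract the numeric job ID from sbatch's confirmation line,
# e.g. "Submitted batch job 12345" -> "12345".
job_id_from_sbatch() {
    sed -n 's/^Submitted batch job \([0-9][0-9]*\).*$/\1/p'
}

# Print the log file names that --output=canu.%j.out and
# --error=canu.%j.err produce for a given job ID.
log_files_for_job() {
    printf 'canu.%s.out\ncanu.%s.err\n' "$1" "$1"
}

# Demonstration with a made-up job ID. On the cluster, use instead:
#   job_id=$(sbatch ./sub.sh | job_id_from_sbatch)
job_id=$(echo "Submitted batch job 12345" | job_id_from_sbatch)
log_files_for_job "$job_id"
```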

Documentation

ml canu/1.7-foss-2016b 
canu -h

usage:   canu [-version] [-citation] \
              [-correct | -trim | -assemble | -trim-assemble] \
              [-s <assembly-specifications-file>] \
               -p <assembly-prefix> \
               -d <assembly-directory> \
               genomeSize=<number>[g|m|k] \
              [other-options] \
              [-pacbio-raw |
               -pacbio-corrected |
               -nanopore-raw |
               -nanopore-corrected] file1 file2 ...

example: canu -d run1 -p godzilla genomeSize=1g -nanopore-raw reads/*.fasta.gz 


  To restrict canu to only a specific stage, use:
    -correct       - generate corrected reads
    -trim          - generate trimmed reads
    -assemble      - generate an assembly
    -trim-assemble - generate trimmed reads and then assemble them

  The assembly is computed in the -d <assembly-directory>, with output files named
  using the -p <assembly-prefix>.  This directory is created if needed.  It is not
  possible to run multiple assemblies in the same directory.

  The genome size should be your best guess of the haploid genome size of what is being
  assembled.  It is used primarily to estimate coverage in reads, NOT as the desired
  assembly size.  Fractional values are allowed: '4.7m' equals '4700k' equals '4700000'

  Some common options:
    useGrid=string
      - Run under grid control (true), locally (false), or set up for grid control
        but don't submit any jobs (remote)
    rawErrorRate=fraction-error
      - The allowed difference in an overlap between two raw uncorrected reads.  For lower
        quality reads, use a higher number.  The defaults are 0.300 for PacBio reads and
        0.500 for Nanopore reads.
    correctedErrorRate=fraction-error
      - The allowed difference in an overlap between two corrected reads.  Assemblies of
        low coverage or data with biological differences will benefit from a slight increase
        in this.  Defaults are 0.045 for PacBio reads and 0.144 for Nanopore reads.
    gridOptions=string
      - Pass string to the command used to submit jobs to the grid.  Can be used to set
        maximum run time limits.  Should NOT be used to set memory limits; Canu will do
        that for you.
    minReadLength=number
      - Ignore reads shorter than 'number' bases long.  Default: 1000.
    minOverlapLength=number
      - Ignore read-to-read overlaps shorter than 'number' bases long.  Default: 500.
  A full list of options can be printed with '-options'.  All options can be supplied in
  an optional sepc file with the -s option.

  Reads can be either FASTA or FASTQ format, uncompressed, or compressed with gz, bz2 or xz.
  Reads are specified by the technology they were generated with, and any processing performed:
    -pacbio-raw         <files>      Reads are straight off the machine.
    -pacbio-corrected   <files>      Reads have been corrected.
    -nanopore-raw       <files>
    -nanopore-corrected <files>

Complete documentation at http://canu.readthedocs.org/en/latest/
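The help text states that genomeSize accepts fractional values with a g, m or k suffix and is used to estimate read coverage. The sketch below only illustrates that arithmetic; it is not canu's own parser, and the function names and the read-base total are hypothetical.

```shell
# Convert a genomeSize value such as 4.7m, 4700k or 4700000 to a base count.
# Illustrates the arithmetic only; this is not how canu itself parses the value.
genome_size_to_bases() {
    case "$1" in
        *g) mult=1000000000 ;;
        *m) mult=1000000 ;;
        *k) mult=1000 ;;
        *)  mult=1 ;;
    esac
    number=${1%[gmk]}
    awk -v n="$number" -v m="$mult" 'BEGIN { printf "%.0f\n", n * m }'
}

# Estimate read coverage as total read bases divided by genome size.
estimate_coverage() {
    awk -v bases="$1" -v size="$(genome_size_to_bases "$2")" \
        'BEGIN { printf "%.1f\n", bases / size }'
}

genome_size_to_bases 4.7m
genome_size_to_bases 4700k
# Hypothetical data set: 235,000,000 read bases against a 4.7 Mbp genome.
estimate_coverage 235000000 4.7m
```

Both conversions print the same number of bases, matching the statement that '4.7m' equals '4700k' equals '4700000'.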

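The help text mentions that options can be collected in a spec file passed with -s. The following sketch writes such a file, assuming it holds one option=value pair per line with # starting a comment; please confirm the format against the canu documentation. The file name canu.spec and the command-line values are hypothetical, and the option values are the defaults quoted in the help text.

```shell
# Write a hypothetical spec file. The three numeric values are the defaults
# quoted in the canu help text (correctedErrorRate shown for PacBio reads).
cat > canu.spec <<'EOF'
# Assumed format: one option=value pair per line.
useGrid=false
minReadLength=1000
minOverlapLength=500
correctedErrorRate=0.045
EOF

# Show the command that would use the spec file (printed, not executed).
echo "canu -s canu.spec -p asm -d asm_run genomeSize=4.7m -pacbio-raw reads.fastq.gz"
```

Keeping the options in a file makes it easier to rerun an assembly with the same settings or to compare settings between runs.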

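The read-type flag combines the sequencing technology with the processing state of the reads. As a sketch of how the pieces of the usage line fit together, the hypothetical helper below assembles a canu command from its parts and prints it without running it; it uses only flags that appear in the help text.

```shell
# Build (but do not run) a canu command line.
# Usage: canu_command <prefix> <dir> <genomeSize> <pacbio|nanopore> <raw|corrected> <files...>
canu_command() {
    prefix=$1 dir=$2 size=$3 tech=$4 state=$5
    shift 5
    case "$tech" in
        pacbio|nanopore) ;;
        *) echo "unknown technology: $tech" >&2; return 1 ;;
    esac
    case "$state" in
        raw|corrected) ;;
        *) echo "unknown read state: $state" >&2; return 1 ;;
    esac
    # The read-type flag is -<technology>-<state>, e.g. -nanopore-raw.
    echo "canu -d $dir -p $prefix genomeSize=$size -$tech-$state $*"
}

# Reproduces the shape of the example from the help text (file name is made up).
canu_command godzilla run1 1g nanopore raw reads/sample.fasta.gz
```

Because each assembly needs its own directory, passing a distinct dir argument per sample is one simple way to keep runs apart.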

Installation

Source code is obtained from canu

System

64-bit Linux