Meraculous-Sapelo2: Difference between revisions

From Research Computing Center Wiki
Jump to navigation Jump to search
(Created page with "Category:Sapelo2Category:SoftwareCategory:Other === Category === Other === Program On === Sapelo2 === Version === 2.25, 2.27-1-AVX, 2.27-1-AVX2, 2.27-1-AVX2-CUDA...")
 
No edit summary
 
(14 intermediate revisions by the same user not shown)
Line 1: Line 1:
[[Category:Sapelo2]][[Category:Software]][[Category:Other]]   
[[Category:Sapelo2]][[Category:Software]][[Category:Other]]   
=== Category ===
=== Category ===
Other
Bioinformatics
=== Program On ===
=== Program On ===
Sapelo2
Sapelo2
=== Version ===
=== Version ===
2.25, 2.27-1-AVX, 2.27-1-AVX2, 2.27-1-AVX2-CUDA-10
2.2.6
===Author / Distributor===
===Author / Distributor===
Please see http://magma.maths.usyd.edu.au/magma/: "Magma is distributed by the Computational Algebra Group at the University of Sydney."
Please see https://jgi.doe.gov/data-and-tools/software-tools/meraculous/: "Meraculous is a whole genome assembler for Next Generation Sequencing data geared for large genomes."
===Description===
===Description===
From http://magma.maths.usyd.edu.au/magma/: "Magma is a large, well-supported software package designed for computations in algebra, number theory, algebraic geometry and algebraic combinatorics. It provides a mathematically rigorous environment for defining and working with structures such as groups, rings, fields, modules, algebras, schemes, curves, graphs, designs, codes and many others. Magma also supports a number of databases designed to aid computational research in those areas of mathematics which are algebraic in nature. "
From https://sourceforge.net/projects/meraculous20/<nowiki/>, "Meraculous-2D is a whole genome assembler for NGS reads (Illumina) that is capable of assembling large, diploid genomes with modest computational requirements.
 
Features include:
 
- Efficient k-mer counting and deBruijn graph traversal
 
- Two modes of handling of diploid allelic variation
 
- Improved scaffolding that produces more complete assemblies without compromising scaffolding accuracy."
 
=== Running Program ===
=== Running Program ===
   
   
Line 15: Line 24:


For more information on Environment Modules on Sapelo2 please see the [[Lmod]] page.
For more information on Environment Modules on Sapelo2 please see the [[Lmod]] page.
*Version 2.2.6, installed as Conda virtual environment in /apps/eb/Meraculous/2.2.6
To use this version of Meraculous, please first load the module with
<pre class="gscript">
module load Meraculous/2.2.6
</pre>Please note:


* To run Meraculous, in your current job working folder, you need to prepare a configuration file that contains the parameters guiding the entire assembly process. This configuration file must be passed to the program with the -c <configuration file> argument.
* The assembly is driven by a perl pipeline which performs data fragmentation and load balancing, as well as submission and monitoring of multiple task arrays on a SLURM-type cluster (Sapelo2) or a standalone multi-core server.


*Version 2.25 (CPU version), installed in /apps/gb/Magma-AU/2.25


To use this version of magma, please first load the module with
<pre class="gscript">
module load Magma-AU/2.25
</pre>
*Version 2.27-1-AVX (CPU version), installed in /apps/gb/Magma-AU/2.27-1-AVX


To use this version of magma, please first load the module with
'''Example of how to run Meraculous in a standalone multi-core server on batch'''
<pre class="gscript">
module load Magma-AU/2.27-1-AVX
</pre>
*Version 2.27-1-AVX2 (CPU version), installed in /apps/gb/Magma-AU/2.27-1-AVX2


To use this version of magma, please first load the module with
1. Create a configuration file in your current working folder. In the example below this file is called meraculous.standalone.config and its content is
<pre class="gscript">
#Describe the libraries ( one line per library )
module load Magma-AU/2.27-1-AVX2
lib_seq /scratch/zhuofei/meraculous/OT1_CKDN220054653-1A_HF33VDSX5_L1_R1_paired.fq,/scratch/zhuofei/meraculous/OT1_CKDN220054653-1A_HF33VDSX5_L1_R2_paired.fq GERMAC1 200 20 150 0 0 1 1 1 0 0
</pre>Use the --constraint Slurm header with the value EDR to ensure that your job lands on a node that supports AVX2.
genome_size 2.15
mer_size 31
diploid_mode 2
num_prefix_blocks 4
min_depth_cutoff 3
use_cluster 0


*Version 2.27-1-AVX2-CUDA-10 (GPU version), installed in /apps/gb/Magma-AU/2.27-1-AVX2-CUDA-10


To use this version of magma, please first load the module with
2. Create a job submission script, called sub.sh in the example here, with the sample content:
<pre class="gscript">
#!/bin/bash
module load Magma-AU/2.27-1-AVX2-CUDA-10
#SBATCH --job-name=meraculoue_standalone
</pre>
#SBATCH --partition=batch
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=128gb
#SBATCH --time=7-00:00:00
#SBATCH --output=log.%j.out
#SBATCH --error=log.%j.err
cd $SLURM_SUBMIT_DIR
ml Meraculous/2.2.6
run_meraculous.sh -c meraculous.standalone.config -dir output -cleanup_level 1
The parameters of the job, such as the maximum wall clock time (--time), maximum memory (--mem), CPU cores (--cpus-per-task), and the job name (--job-name) need to be modified appropriately. In this example, the standard output and error of the run_meraculous.sh command will be saved into two files called log.JobID.out and log.JobID.err, respectively, where jobID is the jobid number.


Use the --constraint Slurm header with the value EDR to ensure that your job lands on a node that supports AVX2.
3. Submit the job to the queue with
<pre class="gcommand">
sbatch sub.sh
</pre>'''Example of how to run Meraculous cluster mode (multiple task arrays) on batch'''


When running Magma, please request the node feature '''magma''', as only nodes with this feature are included in our Magma license file. To do that, please add the Slurm option '''--constraint=magma''', as in the sample job submission script below.
1. Create a configuration file in your current working folder. In the example below this file is called meraculous.cluster.config and its content is
#Describe the libraries ( one line per library )
lib_seq /scratch/zhuofei/meraculous/OT1_CKDN220054653-1A_HF33VDSX5_L1_R1_paired.fq,/scratch/zhuofei/meraculous/OT1_CKDN220054653-1A_HF33VDSX5_L1_R2_paired.fq GERMAC1 200 20 150 0 0 1 1 1 0 0
genome_size 2.15
mer_size 31
diploid_mode 2
num_prefix_blocks 4
min_depth_cutoff 3
use_cluster 1
cluster_num_nodes 10
cluster_slots_per_task 8
cluster_ram_request 128
cluster_walltime 168:00:00
cluster_queue batch


where '''use_cluster 1''' specifies to use a cluster for job submissions.


'''Example of how to run magma in a batch job'''
'''cluster_num_nodes''' is specifies the number of available cluster compute nodes. This number can be approximate (mentioned in Meraculous user manual). In test, we observed some pipeline steps (e.g., gapCloser) can use more nodes (e.g. 15) than the number you specified here (e.g. 10).


1. Create a magma script. In the example below this script is called test.txt and its content is
'''cluster_slots_per_task''' specifies the maximum number of CPU cores to be allocated for multi-threaded elements in task arrays
<pre class="gscript">
print 3+5;
</pre>


2. Create a job submission script, called sub.sh in the example here, with the sample content
'''cluster_ram_request''' specifies the maximum amount of memory (GB) to be allocated for each element in task arrays; If elements of the task array are multi-threaded, Meraculous will automatically divide this number by the number of allocated CPU cores. In the example given above, at some pipeline steps, each CPU core can get 16 GB memory, that is (the number of cluster_ram_request (128)) / (the number of cluster_slots_per_task (8)) = 16.


<pre class="gscript">
'''cluster_walltime''' specifies the wall-clock time limit for cluster tasks. It must be specified as hh:mm:ss (168 hours is the upper limit for Sapelo2 batch partition).
#!/bin/bash
#SBATCH --job-name=testmagma       
#SBATCH --partition=batch         
#SBATCH --ntasks=1                 
#SBATCH --nodes=1
#SBATCH --mem=5gb                   
#SBATCH --time=12:00:00             
#SBATCH --output=%x.%j.out   
#SBATCH --error=%x.%j.err   
#SBATCH --constraint=magma,EDR    #request the node features magma & EDR


'''cluster_queue''' specifies the name of partition to which cluster jobs will be dispatched.


cd $SLURM_SUBMIT_DIR


module load Magma-AU/2.27-1-AVX2
2. Create a job submission script, called sub.sh in the example here, with the sample content:
#!/bin/bash
#SBATCH --job-name=meraculoue_cluster
#SBATCH --partition=batch
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=128gb
#SBATCH --time=168:00:00
#SBATCH --output=log.%j.out
#SBATCH --error=log.%j.err
cd $SLURM_SUBMIT_DIR
ml Meraculous/2.2.6
export SLURM_ROOT=/opt/apps/slurm/21.08.8
run_meraculous.sh -c meraculous.cluster.config -dir output -cleanup_level 1


magma test.txt > output_${SLURM_JOB_ID}.log
The parameters of the job, such as the maximum wall clock time (--time), maximum memory (--mem), CPU cores (--cpus-per-task), and the job name (--job-name) need to be modified appropriately. In this example, the standard output and error of the run_meraculous.sh command will be saved into two files called log.JobID.out and log.JobID.err, respectively, where jobID is the jobid number.
</pre>
where test.txt needs to be replaced by the name of your magma program.  Other parameters of the job, such as the maximum wall clock time, maximum memory, and the job name need to be modified appropriately as well. In this example, the standard output of the magma command will be saved into a file called "output_${SLURM_JOB_ID}.log", where ${SLURM_JOB_ID} will automatically be replaced by the jobid number.


   
   
Line 85: Line 136:


=== Documentation ===
=== Documentation ===
Tutorials and user guide are available at http://magma.maths.usyd.edu.au/magma/documentation/
User guide Manual.pdf is available at /apps/eb/Meraculous/2.2.6/share/doc/meraculous/Manual.pdf . Please feel free to download it to your local computer to browse.


=== Installation ===
=== Installation ===
*Binaries for Intel and AMD processors with AVX support downloaded from http://magma.maths.usyd.edu.au/magma/
*https://anaconda.org/bioconda/meraculous


=== System ===
=== System ===
64-bit Linux
64-bit Linux

Latest revision as of 13:57, 14 December 2022

Category

Bioinformatics

Program On

Sapelo2

Version

2.2.6

Author / Distributor

Please see https://jgi.doe.gov/data-and-tools/software-tools/meraculous/: "Meraculous is a whole genome assembler for Next Generation Sequencing data geared for large genomes."

Description

From https://sourceforge.net/projects/meraculous20/, "Meraculous-2D is a whole genome assembler for NGS reads (Illumina) that is capable of assembling large, diploid genomes with modest computational requirements.

Features include:

- Efficient k-mer counting and deBruijn graph traversal

- Two modes of handling of diploid allelic variation

- Improved scaffolding that produces more complete assemblies without compromising scaffolding accuracy."

Running Program

Also refer to Running Jobs on Sapelo2

For more information on Environment Modules on Sapelo2 please see the Lmod page.

  • Version 2.2.6, installed as Conda virtual environment in /apps/eb/Meraculous/2.2.6

To use this version of Meraculous, please first load the module with

module load Meraculous/2.2.6

Please note:

  • To run Meraculous, in your current job working folder, you need to prepare a configuration file that contains the parameters guiding the entire assembly process. This configuration file must be passed to the program with the -c <configuration file> argument.
  • The assembly is driven by a perl pipeline which performs data fragmentation and load balancing, as well as submission and monitoring of multiple task arrays on a SLURM-type cluster (Sapelo2) or a standalone multi-core server.


Example of how to run Meraculous in a standalone multi-core server on batch

1. Create a configuration file in your current working folder. In the example below this file is called meraculous.standalone.config and its content is

#Describe the libraries ( one line per library )
lib_seq /scratch/zhuofei/meraculous/OT1_CKDN220054653-1A_HF33VDSX5_L1_R1_paired.fq,/scratch/zhuofei/meraculous/OT1_CKDN220054653-1A_HF33VDSX5_L1_R2_paired.fq GERMAC1 200 20 150 0 0 1 1 1 0 0

genome_size 2.15
mer_size 31
diploid_mode 2
num_prefix_blocks 4
min_depth_cutoff 3

use_cluster 0


2. Create a job submission script, called sub.sh in the example here, with the sample content:

#!/bin/bash
#SBATCH --job-name=meraculoue_standalone
#SBATCH --partition=batch
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=128gb
#SBATCH --time=7-00:00:00
#SBATCH --output=log.%j.out
#SBATCH --error=log.%j.err

cd $SLURM_SUBMIT_DIR

ml Meraculous/2.2.6

run_meraculous.sh -c meraculous.standalone.config -dir output -cleanup_level 1

The parameters of the job, such as the maximum wall clock time (--time), maximum memory (--mem), CPU cores (--cpus-per-task), and the job name (--job-name) need to be modified appropriately. In this example, the standard output and error of the run_meraculous.sh command will be saved into two files called log.JobID.out and log.JobID.err, respectively, where jobID is the jobid number.


3. Submit the job to the queue with

sbatch sub.sh

Example of how to run Meraculous cluster mode (multiple task arrays) on batch

1. Create a configuration file in your current working folder. In the example below this file is called meraculous.cluster.config and its content is

#Describe the libraries ( one line per library )
lib_seq /scratch/zhuofei/meraculous/OT1_CKDN220054653-1A_HF33VDSX5_L1_R1_paired.fq,/scratch/zhuofei/meraculous/OT1_CKDN220054653-1A_HF33VDSX5_L1_R2_paired.fq GERMAC1 200 20 150 0 0 1 1 1 0 0

genome_size 2.15
mer_size 31
diploid_mode 2
num_prefix_blocks 4
min_depth_cutoff 3

use_cluster 1
cluster_num_nodes 10
cluster_slots_per_task 8
cluster_ram_request 128
cluster_walltime 168:00:00
cluster_queue batch

where use_cluster 1 specifies to use a cluster for job submissions.

cluster_num_nodes is specifies the number of available cluster compute nodes. This number can be approximate (mentioned in Meraculous user manual). In test, we observed some pipeline steps (e.g., gapCloser) can use more nodes (e.g. 15) than the number you specified here (e.g. 10).

cluster_slots_per_task specifies the maximum number of CPU cores to be allocated for multi-threaded elements in task arrays

cluster_ram_request specifies the maximum amount of memory (GB) to be allocated for each element in task arrays; If elements of the task array are multi-threaded, Meraculous will automatically divide this number by the number of allocated CPU cores. In the example given above, at some pipeline steps, each CPU core can get 16 GB memory, that is (the number of cluster_ram_request (128)) / (the number of cluster_slots_per_task (8)) = 16.

cluster_walltime specifies the wall-clock time limit for cluster tasks. It must be specified as hh:mm:ss (168 hours is the upper limit for Sapelo2 batch partition).

cluster_queue specifies the name of partition to which cluster jobs will be dispatched.


2. Create a job submission script, called sub.sh in the example here, with the sample content:

#!/bin/bash
#SBATCH --job-name=meraculoue_cluster
#SBATCH --partition=batch
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=128gb
#SBATCH --time=168:00:00
#SBATCH --output=log.%j.out
#SBATCH --error=log.%j.err

cd $SLURM_SUBMIT_DIR

ml Meraculous/2.2.6

export SLURM_ROOT=/opt/apps/slurm/21.08.8

run_meraculous.sh -c meraculous.cluster.config -dir output -cleanup_level 1

The parameters of the job, such as the maximum wall clock time (--time), maximum memory (--mem), CPU cores (--cpus-per-task), and the job name (--job-name) need to be modified appropriately. In this example, the standard output and error of the run_meraculous.sh command will be saved into two files called log.JobID.out and log.JobID.err, respectively, where jobID is the jobid number.


3. Submit the job to the queue with

sbatch sub.sh

Documentation

User guide Manual.pdf is available at /apps/eb/Meraculous/2.2.6/share/doc/meraculous/Manual.pdf . Please feel free to download it to your local computer to browse.

Installation

System

64-bit Linux