HpcGridRunner-Sapelo2



Category

Tools

Program On

Sapelo2

Version

1.0.2

Author / Distributor

Please see https://github.com/HpcGridRunner/HpcGridRunner

Description

This tool helps execute a file full of commands in parallel on a compute cluster by batching the commands into grid (Slurm) job submissions.
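
As a minimal sketch of this general mode (my_commands.txt is a placeholder file name, and the flags follow the HpcGridRunner README; check the script's usage output to confirm), the bundled hpc_cmds_GridRunner.pl takes a plain-text file with one shell command per line and farms the commands out as grid jobs:

# my_commands.txt holds one shell command per line, e.g.
#   gzip -c sample1.fq > sample1.fq.gz
#   gzip -c sample2.fq > sample2.fq.gz
hpc_cmds_GridRunner.pl \
--grid_conf $HPCGR_CONF_DIR/sapelo_1c_3g.conf \
--cmds_file my_commands.txt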

Running Program

For instructions on running HpcGridRunner in conjunction with Trinity, please see here.

Also refer to Running Jobs on Sapelo2.

The following versions are installed on Sapelo2:

  • HpcGridRunner/1.0.2-GCCcore-12.3.0
  • HpcGridRunner/1.0.2-GCCcore-14.2.0

To use HpcGridRunner, please first load the module with

ml HpcGridRunner/1.0.2-GCCcore-14.2.0


Running a blast job in parallel by splitting the fasta file

HpcGridRunner submission script parameters:

  • --grid_conf: The grid configuration file controls the resources requested for each grid job. The stock sapelo_1c_3g.conf requests one CPU core and 3GB of RAM; you can change this by copying the .conf file to your own directory and editing it. More instructions on the .conf file parameters are given below.
  • --cmd_template: The cmd_template can be changed to reflect your blast command. Leave the -query parameter set to the __QUERY_FILE__ placeholder and hpc_FASTA_GridRunner.pl will fill it in dynamically. You can also change the values for outfmt, evalue, max_target_seqs and db.
    • -num_threads: How many threads you want each blastp command to use. Be sure to request enough CPU cores in the .conf file (see the .conf file configuration instructions below).
  • --seqs_per_bin: The number of fasta sequences placed in each bin. The query fasta is split into bins of this size, and each bin is run through one blast command; a quick way to estimate the resulting number of bins is shown after this list.
  • --query_fasta: You will need to change YOUR_QUERY_FASTA.fasta to the name of your fasta file.
  • --out_dir: The blast results for each of the split groups will be in a separate file under the specified output directory. Each file name will end in .fa.OUT.
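
To gauge how many bins (and thus how much parallel work) a given --seqs_per_bin value produces, you can count the sequences in the query fasta; a quick sketch, assuming ">" appears only at the start of header lines:

# total sequences = number of fasta header lines
grep -c '^>' YOUR_QUERY_FASTA.fasta
# e.g. 10000 sequences with --seqs_per_bin 250 gives 40 bins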

Please request the same number of threads with "-num_threads" as CPU cores with "-c" (the short form of "--cpus-per-task") in the .conf file specified with "--grid_conf", not in the submission script's Slurm headers. The actual submission script only needs a single CPU core.

#!/bin/bash
#SBATCH --partition batch
#SBATCH --ntasks=1
#SBATCH --time=48:00:00
#SBATCH --mem=3gb

ml HpcGridRunner/1.0.2-GCCcore-14.2.0
ml BLAST+/2.13.0-gompi-2022a

hpc_FASTA_GridRunner.pl \
--grid_conf=$HPCGR_CONF_DIR/sapelo_1c_3g.conf \
--cmd_template "blastp -query __QUERY_FILE__ -db /db/uniprot/latest/uniprot_sprot  -max_target_seqs 1 -outfmt 6 -evalue 1e-5 -num_threads 4" \
--seqs_per_bin 250 \
--query_fasta YOUR_QUERY_FASTA.fasta \
--out_dir blast_result
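
The submission script only needs to run the master hpc_FASTA_GridRunner.pl process. Assuming it is saved as, say, hpcgr_blast.sh (a hypothetical name), you submit and monitor it like any other Slurm job:

sbatch hpcgr_blast.sh
squeue -u $USER    # shows the master job, plus the grid jobs it submits as they appear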

To gather the blast output, you can generally just concatenate the individual output files. The following command finds and concatenates all of them:

find OUTPUT_DIR -name '*.fa.OUT' | xargs cat > combined_blast_results.out
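
Before using the combined file, it can be worth checking that every bin produced an output file. A quick sketch, assuming the example's 250 sequences per bin and the blast_result output directory:

# expected bins = ceil(total sequences / 250)
EXPECTED=$(( ( $(grep -c '^>' YOUR_QUERY_FASTA.fasta) + 249 ) / 250 ))
FOUND=$(find blast_result -name '*.fa.OUT' | wc -l)
echo "expected $EXPECTED bins, found $FOUND output files"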

Configuration file contents:

  • cmd: The actual command used to submit each grid job; each grid job runs the command given by "--cmd_template" in the submission script. You will want to request enough resources here for a single blastp command.
    • --mem: Request enough memory for a single blastp command, based on how many seqs_per_bin you choose. This could be around 20-80gb.
    • -n: Leave this at 1.
    • -c: This is how many CPU cores you want. Set this equal to whatever you put for "-num_threads" in the "--cmd_template" in your submission script.
  • max_nodes: How many grid jobs you want running your blast commands at once. The more you request here, the faster HpcGridRunner works through your fasta file bins. You can request 80-100 or more if you like, but keep in mind that the more jobs you have running, the harder they can be to keep track of while troubleshooting.
  • cmds_per_node: How many blast cmds to run in each grid job. It is suggested to leave this equal to 1: the cmds in a job run sequentially (rather than concurrently), so increasing this value does not save time here.

This sample configuration file can be found at /apps/eb/HpcGridRunner/1.0.2-GCCcore-12.3.0/hpc_conf/sapelo_1c_3g.conf. You can copy it to your own directory to modify, or simply create a new text file and fill it in with the contents below. Note that the cmd line shown here has been edited to request 4 CPU cores and 16GB of memory, matching the -num_threads 4 blastp example above; the stock file requests one core and 3GB.

# This was adapted for Slurm from:
# /usr/local/apps/hpcgridrunner/1.0.2/hpc_conf/sapelo_1c_3G_AMD.conf
# which was on the old (Torque/Moab) Sapelo2

# grid type:
grid=SLURM

# template for a grid submission:
cmd=sbatch -p batch --mem 16gb -n 1 -c 4 -t 6:00:00

##########################################################################################
# settings below configure the Trinity job submission system, not tied to the grid itself.
##########################################################################################

# number of grid submissions to be maintained at steady state by the Trinity submission system
max_nodes=10

# number of commands that are batched into a single grid submission job.
cmds_per_node=1
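
To adapt this configuration, one approach (assuming $HPCGR_CONF_DIR is set by the module, as in the submission script above) is to copy the installed sample and edit the cmd line:

cp $HPCGR_CONF_DIR/sapelo_1c_3g.conf ./my_blast.conf
# edit the cmd= line (e.g. raise --mem or -c), then point the submission
# script at it with --grid_conf=./my_blast.conf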

Documentation

Please see http://hpcgridrunner.github.io/


Installation

Installed in /apps/eb/HpcGridRunner/1.0.2-GCCcore-12.3.0 and /apps/eb/HpcGridRunner/1.0.2-GCCcore-14.2.0

System

64-bit Linux