Cromwell-Sapelo2

From Research Computing Center Wiki

Category: Tools

Program On: Sapelo2

Version: 56

Author / Distributor: Broad Institute

Description: "Cromwell is a Workflow Management System geared towards scientific workflows. Cromwell is open sourced under the BSD 3-Clause license." (cromwell.readthedocs.io)

Running Program

Please also refer to Running Jobs on Sapelo2.

  • Cromwell 56 is installed for use with Java 11.
    • module load cromwell/56-Java-11

Requirements

To execute Cromwell as a job on Sapelo2, the following are required:

  1. Cromwell Configuration File (required)
    1. Defines how each step in the workflow should be initialized.
  2. WDL File (required)
    1. Defines the workflow itself.
  3. Inputs File (optional but recommended)
    1. Defines the inputs to the workflow.
  4. Options File (optional)
    1. Defines any additional options.
  5. Job Submission Script (required)

Example Requirements

Example Configuration File

Cromwell requires a configuration file that includes instructions for how to execute workflows.

The maintainers of Cromwell provide concise documentation and tutorials at cromwell.readthedocs.io to help users understand and write a Cromwell configuration file.

Reviewing that material can help in understanding the following Cromwell configuration file, which has been adapted for Sapelo2 from their SLURM example.

The following file can also be found at /usr/local/training/Cromwell/cromwell-gacrc.conf:

cromwell-gacrc.conf
backend {
  default = slurm

  providers {
    slurm {
      actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
      config {
        runtime-attributes = """
        String partition = "batch"
        Int ntasks = 1
        Int cpus_per_task = 8
        Int memory = 8000
        Int time = 10
        """
        submit = """
            sbatch \
                --job-name=${job_name} \
                --partition=${partition} \
                --ntasks=${ntasks} \
                --cpus-per-task=${cpus_per_task} \
                --mem=${memory} \
                --time=${time} \
                --output=${out} \
                --error=${err} \
                --chdir=${cwd} \
                --wrap "/usr/bin/env bash ${script}"
        """
        kill = "scancel ${job_id}"
        check-alive = "squeue -j ${job_id}"
        job-id-regex = "Submitted batch job (\\d+).*"
      }
    }
  }
}
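The submit, kill, and check-alive entries above wrap standard SLURM commands, and job-id-regex tells Cromwell how to extract the numeric job ID from sbatch's confirmation message. A simplified sketch of that extraction (the sample sbatch output and job ID below are illustrative, and grep stands in for Cromwell's internal regex matching):

```shell
# Simulated sbatch confirmation line (illustrative job ID)
sbatch_output="Submitted batch job 12345"

# Pull out the first run of digits, as Cromwell's job-id-regex
# "Submitted batch job (\d+).*" does with its capturing group
job_id=$(echo "$sbatch_output" | grep -oE '[0-9]+' | head -n1)
echo "$job_id"
```

Cromwell then passes this ID to the kill and check-alive commands (scancel and squeue) when it needs to cancel or poll the job.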
Example WDL (Workflow Description Language) File

Cromwell executes workflows written in WDL (Cromwell Language Support). The Cromwell maintainers provide an example WDL in their documentation.

The following workflow incorporates the same Bowtie2 example covered in the Sapelo2 training workshop, and can be found at /usr/local/training/Cromwell/cromwell-bowtie2.wdl:

cromwell-bowtie2.wdl
workflow CromwellBowtie2 {
    File input_fq
    File index_dir
    String index_name
    Int cpus_per_task

    call Bowtie2 {
        input:
            input_fq = input_fq,
            index_dir = index_dir,
            index_name = index_name,
            cpus_per_task = cpus_per_task,
    }
}

task Bowtie2 {
    File input_fq
    File index_dir
    String index_name
    Int cpus_per_task

    command {
        bowtie2 -p ${cpus_per_task} -x ${index_dir}/${index_name} -U ${input_fq} > alignments.output
    }
    output {
        File out = "alignments.output"
    }
}

A WDL file contains a task-by-task description of a workflow. The first block is the workflow block, wherein tasks are called. Each task is described in its own task block. To an extent, it can be helpful to consider the workflow block as analogous to the main function, and the task blocks as analogous to functions.
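As a loose illustration of that analogy (shell, not WDL syntax; the function name and file contents below are invented for the sketch), the same structure amounts to a function definition plus a top-level call:

```shell
# "task" block: a function that performs one unit of work
bowtie2_task() {
    local input_fq=$1
    # stand-in for the real bowtie2 invocation
    echo "aligned ${input_fq}" > alignments.output
}

# "workflow" block: the main body that calls the task
bowtie2_task "myreads.fq"
cat alignments.output
```

As in the WDL example, the "workflow" body only wires inputs into the "task"; the actual command lives inside the task definition.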

For more thorough information about WDL, refer to their language specification documentation. More WDL examples can be found here.

While the paths to input data can be written directly in the WDL file, it is considered best practice to supply them at runtime instead, for reusability. This is especially convenient when importing WDL files from other groups, as it removes the need to edit hardcoded values.

Example Workflow Input File

In Cromwell, Workflow Input Files are written in JSON. They are specified with the --inputs flag when Cromwell is executed at the command line. These files define the inputs to the workflow, such as input files or other input values. Specifying input values in a separate file avoids hardcoding them in the workflow file.

From the above Example WDL File, the CromwellBowtie2 workflow utilizes the following values:

workflow CromwellBowtie2 {
    File input_fq
    File index_dir
    String index_name
    Int cpus_per_task

The following JSON file provides definitions for each of these values, and can be found at /usr/local/training/Cromwell/inputs.json:

inputs.json
{   "CromwellBowtie2.input_fq": "myreads.fq",
    "CromwellBowtie2.index_dir": "index",
    "CromwellBowtie2.index_name": "lambda_virus",
    "CromwellBowtie2.cpus_per_task": 8
}

Usage: --inputs inputs.json

In an inputs file, variables are referenced using the workflow name, a period, and the variable name, e.g., CromwellBowtie2.input_fq.
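Because a malformed inputs file causes the workflow to fail at startup, it can be worth validating the JSON before submitting. One quick way, assuming python3 is available, is python3 -m json.tool (the heredoc below simply recreates the example inputs file):

```shell
# Recreate the example inputs file (contents copied from above)
cat > inputs.json <<'EOF'
{   "CromwellBowtie2.input_fq": "myreads.fq",
    "CromwellBowtie2.index_dir": "index",
    "CromwellBowtie2.index_name": "lambda_virus",
    "CromwellBowtie2.cpus_per_task": 8
}
EOF

# json.tool exits non-zero and prints an error if the JSON is malformed
python3 -m json.tool inputs.json > /dev/null && echo "inputs.json is valid JSON"
```

The same check works for the options file described below, since it is also plain JSON.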

The example data referenced in this JSON file can be found at the following locations:

  • /usr/local/training/Cromwell/index
  • /usr/local/training/Cromwell/myreads.fq
Example Workflow Options File

In Cromwell, Workflow Options Files are also written in JSON. They are specified with the --options flag when Cromwell is executed at the command line. These files describe the options to use during the execution of a workflow.

By default, the output of a workflow step is stored in that step's execution directory.

The following JSON file makes use of Cromwell's Output Copying capabilities to copy the output into a directory named output, and can be found at /usr/local/training/Cromwell/options.json:

options.json
{   "final_workflow_outputs_dir": "output",
    "use_relative_output_paths": true
}

Usage: --options options.json

Without specifying an alternative output directory, the output would be in a location similar to the following:

  • ./cromwell-executions/CromwellBowtie2/04d44744-2a84-4b2f-bea6-492985543ace/call-Bowtie2/execution/

Where cromwell-executions is a subdirectory of the working directory, the 04d44744-2a84-4b2f-bea6-492985543ace directory is named after a unique workflow ID generated at runtime, and execution is the working directory used while the Bowtie2 task runs.
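The effect of final_workflow_outputs_dir together with use_relative_output_paths can be simulated with plain shell commands. The directory names below mirror the example layout above, and the copy step is only an illustration of what Cromwell performs automatically after the workflow finishes:

```shell
# Recreate a minimal execution-directory layout (workflow ID from the example above)
exec_dir="cromwell-executions/CromwellBowtie2/04d44744-2a84-4b2f-bea6-492985543ace/call-Bowtie2/execution"
mkdir -p "$exec_dir"
echo "example alignments" > "$exec_dir/alignments.output"

# What output copying amounts to: place the output under ./output,
# keeping only the path relative to the execution directory
mkdir -p output
cp "$exec_dir/alignments.output" output/
ls output
```

With use_relative_output_paths set to true, the long cromwell-executions prefix is dropped, so the final file lands directly at output/alignments.output.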

Example Job Submission Script

The following is an example job submission script that utilizes the files described above, and can be found at /usr/local/training/Cromwell/cromwell-sub.sh:

cromwell-sub.sh
#!/bin/bash
#SBATCH --job-name=cromwell-bowtie2
#SBATCH --partition=batch
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=8gb
#SBATCH --time=00:10:00
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.err

module load cromwell/56-Java-11
module load Bowtie2/2.4.5-GCC-11.3.0

cd $SLURM_SUBMIT_DIR

java \
	-Xmx8g \
	-Dconfig.file=cromwell-gacrc.conf \
	-jar $EBROOTCROMWELL/cromwell.jar \
	run cromwell-bowtie2.wdl \
	--inputs inputs.json \
	--options options.json

Where:

  • -Xmx8g instructs the Java Virtual Machine to use a maximum heap of 8 GB, matching the amount of memory requested in the SLURM header (--mem=8gb).
  • -Dconfig.file=cromwell-gacrc.conf is the path to the configuration file.
  • -jar $EBROOTCROMWELL/cromwell.jar is the Java Archive to run, which in this case is the Cromwell executable.
  • run cromwell-bowtie2.wdl uses the run subcommand, which instructs Cromwell to execute the workflow in run (command-line) mode.
  • --inputs inputs.json specifies the workflow inputs are defined in the inputs.json file.
  • --options options.json specifies any additional workflow options are defined in the options.json file.

Running the example

To run the above example, create a working directory in scratch and copy the example files into it:

mkdir /scratch/$USER/cromwell-example
cd /scratch/$USER/cromwell-example
cp -r /usr/local/training/Cromwell/* ./

Once copied, the job can be submitted with sbatch:

sbatch cromwell-sub.sh

Installation

  • Version 56: Installed using EasyBuild.

System

  • 64-bit Linux