Cromwell-Sapelo2

From Research Computing Center Wiki

Latest revision as of 10:55, 27 February 2024

Category

Tools

Program On

Sapelo2

Version

56

Author / Distributor

Broad Institute

Description

"Cromwell is a Workflow Management System geared towards scientific workflows. Cromwell is open sourced under the BSD 3-Clause license." cromwell.readthedocs.io

Running Program

Please also refer to Running Jobs on Sapelo2.

  • Cromwell 56 is installed for use with Java 11.
    • module load cromwell/56-Java-11

Requirements

To execute Cromwell as a job on Sapelo2, the following are required:

  1. Cromwell Configuration File (required)
    1. Defines how each step in the workflow should be initialized.
  2. WDL File (required)
    1. Defines the workflow itself.
  3. Inputs File (optional but recommended)
    1. Defines the inputs to the workflow.
  4. Options File (optional)
    1. Defines any additional options.
  5. Job Submission Script (required)

Example Requirements

Example Configuration File

Cromwell requires a configuration file that includes instructions for how to execute workflows.

The maintainers of Cromwell provide short documentation and tutorials on understanding and writing a Cromwell configuration file at cromwell.readthedocs.io.

Reviewing that material can help to understand the following Cromwell configuration file, which has been adapted for Sapelo2 (based on the Cromwell SLURM example).

The following file can also be found at /usr/local/training/Cromwell/cromwell-gacrc.conf:

cromwell-gacrc.conf
backend {
  default = slurm

  providers {
    slurm {
      actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
      config {
        runtime-attributes = """
        String partition = "batch"
        Int ntasks = 1
        Int cpus_per_task = 8
        Int memory = 8000
        Int time = 10
        """
        submit = """
            sbatch \
                --job-name=${job_name} \
                --partition=${partition} \
                --ntasks=${ntasks} \
                --cpus-per-task=${cpus_per_task} \
                --mem=${memory} \
                --time=${time} \
                --output=${out} \
                --error=${err} \
                --chdir=${cwd} \
                --wrap "/usr/bin/env bash ${script}"
        """
        kill = "scancel ${job_id}"
        check-alive = "squeue -j ${job_id}"
        job-id-regex = "Submitted batch job (\\d+).*"
      }
    }
  }
}
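
To see how the runtime-attributes defaults feed the submit template, the substitution can be imitated outside Cromwell. The following is a minimal Python sketch, not Cromwell's own implementation: it relies on the standard-library string.Template, which happens to use the same ${name} placeholder syntax, and it assumes that values set per task take precedence over the defaults in the configuration file. The function render_submit and the job_name and script values are hypothetical stand-ins for what Cromwell supplies at runtime, and the template is shortened.

```python
from string import Template

# Defaults copied from the runtime-attributes block of cromwell-gacrc.conf.
RUNTIME_DEFAULTS = {
    "partition": "batch",
    "ntasks": 1,
    "cpus_per_task": 8,
    "memory": 8000,
    "time": 10,
}

# Shortened version of the submit template from the configuration file.
SUBMIT_TEMPLATE = Template(
    "sbatch --job-name=${job_name} --partition=${partition} "
    "--ntasks=${ntasks} --cpus-per-task=${cpus_per_task} "
    "--mem=${memory} --time=${time} "
    '--wrap "/usr/bin/env bash ${script}"'
)


def render_submit(task_runtime, job_values):
    """Merge defaults, per-task overrides, and job values, then fill the template."""
    values = {**RUNTIME_DEFAULTS, **task_runtime, **job_values}
    return SUBMIT_TEMPLATE.substitute(values)


# Hypothetical values standing in for what Cromwell fills in at runtime.
example_job = {"job_name": "cromwell_example_Bowtie2", "script": "execution/script"}
print(render_submit({"cpus_per_task": 4}, example_job))
```

Printing the rendered command this way is only a reading aid for the template; the real submission is always performed by Cromwell itself.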
Example WDL (Workflow Description Language) File

Cromwell executes workflows written in WDL (Cromwell Language Support). The Cromwell maintainers provide an example WDL in their documentation.

The following workflow incorporates the same Bowtie2 example covered in the Sapelo2 training workshop, and can be found at /usr/local/training/Cromwell/cromwell-bowtie2.wdl:

cromwell-bowtie2.wdl
workflow CromwellBowtie2 {
    File input_fq
    File index_dir
    String index_name
    Int cpus_per_task

    call Bowtie2 {
        input:
            input_fq = input_fq,
            index_dir = index_dir,
            index_name = index_name,
            cpus_per_task = cpus_per_task,
    }
}

task Bowtie2 {
    File input_fq
    File index_dir
    String index_name
    Int cpus_per_task

    command {
        bowtie2 -p ${cpus_per_task} -x ${index_dir}/${index_name} -U ${input_fq} > alignments.output
    }
    output {
        File out = "alignments.output"
    }
}

A WDL file contains a task-by-task description of a workflow. The first block is the workflow block, wherein tasks are called. Each task is described in its own task block. To an extent, it can be helpful to consider the workflow block as analogous to the main function, and the task blocks as analogous to functions.

For more thorough information about WDL, refer to their language specification documentation. More WDL examples can be found here.

While the paths to input data can be written in the WDL file directly, it is considered best practice to supply them at runtime instead for re-usability. This is a convenient feature when importing WDL files from other groups, as it removes the need to edit hardcoded values.

Example Workflow Input File

In Cromwell, Workflow Input Files are written in JSON and are passed with the --inputs flag when Cromwell is executed at the command line. These files supply the values the workflow requires, such as paths to input files and other parameters. Keeping these values in a separate file avoids hardcoding them in the workflow file.

From the above Example WDL File, the CromwellBowtie2 workflow utilizes the following values:

workflow CromwellBowtie2 {
    File input_fq
    File index_dir
    String index_name
    Int cpus_per_task

The following JSON file provides definitions for each of these values, and can be found at /usr/local/training/Cromwell/inputs.json:

inputs.json
{   "CromwellBowtie2.input_fq": "myreads.fq",
    "CromwellBowtie2.index_dir": "index",
    "CromwellBowtie2.index_name": "lambda_virus",
    "CromwellBowtie2.cpus_per_task": "8"
}

Usage: --inputs inputs.json

In an input file, each variable is referenced using the workflow name followed by a period and the variable name, for example CromwellBowtie2.input_fq.
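
The naming rule can be checked mechanically before a job is submitted. The following is a minimal Python sketch, offered as an illustration and not as part of Cromwell: the helper check_inputs is hypothetical, the set of declared inputs is typed in by hand from the workflow block, and the JSON is inlined so the sketch runs on its own.

```python
import json

# Inputs declared in the CromwellBowtie2 workflow block of the example WDL file.
DECLARED_INPUTS = {"input_fq", "index_dir", "index_name", "cpus_per_task"}


def check_inputs(inputs, workflow_name, declared):
    """Return (missing, unexpected) input names for the given workflow."""
    prefix = workflow_name + "."
    # Names supplied with the expected "<workflow name>." prefix.
    supplied = {key[len(prefix):] for key in inputs if key.startswith(prefix)}
    # Keys with a different prefix, or with a name the workflow does not declare.
    unexpected = {key for key in inputs if not key.startswith(prefix)}
    unexpected |= {prefix + name for name in supplied - declared}
    missing = declared - supplied
    return sorted(missing), sorted(unexpected)


# Same content as the example inputs.json, inlined for a self-contained run.
example = json.loads("""
{   "CromwellBowtie2.input_fq": "myreads.fq",
    "CromwellBowtie2.index_dir": "index",
    "CromwellBowtie2.index_name": "lambda_virus",
    "CromwellBowtie2.cpus_per_task": "8"
}
""")
print(check_inputs(example, "CromwellBowtie2", DECLARED_INPUTS))
```

Two empty lists mean every declared input is supplied and no key carries a wrong workflow name or an undeclared variable name.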

The example data referenced in this JSON file can be found at the following locations:

  • /usr/local/training/Cromwell/index
  • /usr/local/training/Cromwell/myreads.fq
Example Workflow Options File

In Cromwell, Workflow Options Files are also written in JSON. They are specified with the --options flag when Cromwell is executed at the command line. These files describe the options to use during the execution of a workflow.

By default, the output of a workflow step is stored in that step's execution directory.

The following JSON file makes use of Cromwell's Output Copying capabilities to copy the output into a directory named output, and can be found at /usr/local/training/Cromwell/options.json:

options.json
{   "final_workflow_outputs_dir": "output",
    "use_relative_output_paths": true
}

Usage: --options options.json

Without specifying an alternative output directory, the output would be in a location similar to the following:

  • ./cromwell-executions/CromwellBowtie2/04d44744-2a84-4b2f-bea6-492985543ace/call-Bowtie2/execution/

In this path, cromwell-executions is a subdirectory of the working directory, the 04d44744-2a84-4b2f-bea6-492985543ace directory is named at runtime, call-Bowtie2 corresponds to the task Bowtie2, and execution was the working directory during the execution of that task.
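
The structure of this default location can be written down as a small helper. The following is a minimal Python sketch that only reproduces the layout of the example path above; the function default_execution_dir is hypothetical, and other layouts that Cromwell may produce are not covered.

```python
from pathlib import PurePosixPath


def default_execution_dir(workflow_name, workflow_id, task_name,
                          root="cromwell-executions"):
    """Build the default execution directory for one task call."""
    return (
        PurePosixPath(root)
        / workflow_name
        / workflow_id
        / f"call-{task_name}"
        / "execution"
    )


# Values taken from the example path; the ID differs on every run.
path = default_execution_dir(
    "CromwellBowtie2", "04d44744-2a84-4b2f-bea6-492985543ace", "Bowtie2"
)
print(path)
```

Because the ID is only known once the workflow has started, a helper like this is mainly useful for locating results after a run when no output directory was set in the options file.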

Example Job Submission Script

The following is an example job submission script that utilizes the files described above, and can be found at /usr/local/training/Cromwell/cromwell-sub.sh:

cromwell-sub.sh
#!/bin/bash
#SBATCH --job-name=cromwell-bowtie2
#SBATCH --partition=batch
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=8gb
#SBATCH --time=00:10:00
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.err

module load cromwell/56-Java-11
module load Bowtie2/2.4.5-GCC-11.3.0

cd $SLURM_SUBMIT_DIR

java \
	-Xmx8g \
	-Dconfig.file=cromwell-gacrc.conf \
	-jar $EBROOTCROMWELL/cromwell.jar \
	run cromwell-bowtie2.wdl \
	--inputs inputs.json \
	--options options.json

Where:

  • -Xmx8g sets the maximum heap size of the Java Virtual Machine to 8 GB, which is equal to the amount of memory requested in the SLURM header (--mem=8gb).
  • -Dconfig.file=cromwell-gacrc.conf sets the path to the Cromwell configuration file.
  • -jar $EBROOTCROMWELL/cromwell.jar is the Java Archive to run, which in this case is the Cromwell executable.
  • run cromwell-bowtie2.wdl uses the run subcommand, which instructs Cromwell to run the given workflow in Command Line mode.
  • --inputs inputs.json specifies that the workflow inputs are defined in the inputs.json file.
  • --options options.json specifies that any additional workflow options are defined in the options.json file.
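
Since the submission script refers to four files by relative path, a quick check that they are all present in the working directory can save a failed job. The following is a minimal Python sketch under that assumption; the helper missing_files is hypothetical and only tests for file existence, not for valid contents.

```python
from pathlib import Path

# Files that cromwell-sub.sh passes to Cromwell, as named in the script.
REQUIRED_FILES = [
    "cromwell-gacrc.conf",
    "cromwell-bowtie2.wdl",
    "inputs.json",
    "options.json",
]


def missing_files(directory, names=REQUIRED_FILES):
    """Return the names from `names` that do not exist as files in `directory`."""
    base = Path(directory)
    return [name for name in names if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_files(".")
    if missing:
        print("Missing before submission:", ", ".join(missing))
    else:
        print("All files referenced by cromwell-sub.sh are present.")
```

Run from the directory that holds cromwell-sub.sh, it reports any file the java command line would fail to find.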

Running the example

To run the above example, create a working directory in scratch and copy the example files into it:

mkdir /scratch/$USER/cromwell-example
cd /scratch/$USER/cromwell-example
cp -r /usr/local/training/Cromwell/* ./

Once copied, the job can be submitted with sbatch:

sbatch cromwell-sub.sh

Installation

  • Version 56: Installed using EasyBuild.

System

  • 64-bit Linux