GEM-Sapelo2

From Research Computing Center Wiki
Revision as of 08:38, 22 March 2022 by Ben

Category

Bioinformatics

Program On

Sapelo2

Version

1.0, 1.1, 1.2, 1.4.1

Author / Distributor

Please see https://github.com/large-scale-gxe-methods/GEM/

Description

GEM (Gene-Environment interaction analysis for Millions of samples) is a software program for large-scale gene-environment interaction testing in samples from unrelated individuals. It enables genome-wide association studies in up to millions of samples while allowing for multiple exposures, control for genotype-covariate interactions, and robust inference.

Running Program

Also refer to Running Jobs on Sapelo2. For more information on Environment Modules on Sapelo2, please see the Lmod page.

When running GEM on the cluster, always include the --threads option and set its value to 1. Without it, GEM picks its own thread count (the default is ceiling(detected threads / 2)), which can create many threads per allocated CPU core, make the software run inefficiently, and overload the node it is running on. Here is an example submission script to run GEM in the batch partition:

#!/bin/bash
#SBATCH --partition=batch
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --time=30:00:00
#SBATCH --mem=10gb

ml GEM/1.4.1-foss-2019b

GEM --threads 1 [options]

where [options] must be replaced by the GEM command options you want to use. Other job parameters, such as the maximum wall clock time, the maximum memory, the number of cores, and the job name (set with #SBATCH --job-name), should be adjusted to suit your analysis as well.
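The GEM help text lists the --threads default as ceiling(detected threads / 2). As a rough check of why an explicit value matters, the sketch below compares that default with the cores Slurm allocated to the job. It assumes that "detected threads" means the logical CPUs on the node, approximated here with nproc --all; the helper name gem_default_threads is made up for this illustration and is not part of GEM.

```shell
#!/bin/bash
# Sketch only: estimate GEM's default thread count on this node.
# Assumption: "detected threads" = logical CPUs on the node (nproc --all).

# ceiling(n / 2) with integer arithmetic (hypothetical helper, not part of GEM)
gem_default_threads() {
    echo $(( ($1 + 1) / 2 ))
}

# Logical CPUs on the node; fall back to getconf if nproc is unavailable
detected=$(nproc --all 2>/dev/null || getconf _NPROCESSORS_ONLN)

# Slurm sets SLURM_CPUS_PER_TASK when --cpus-per-task is given; default to 1 otherwise
allocated=${SLURM_CPUS_PER_TASK:-1}

echo "Logical CPUs on node:           $detected"
echo "GEM default without --threads:  $(gem_default_threads "$detected")"
echo "Cores allocated to this job:    $allocated"
```

If the second number is larger than the third, running GEM without --threads would start more threads than the job has cores.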

Documentation

Please see https://large-scale-gxe-methods.github.io/GEM-website/index.html.

You can also run GEM with the -h option to see its usage. For example:

$ GEM -h

*********************************************************
Welcome to GEM v1.4
(C) 2018-2021 Liang Hong, Han Chen, Duy Pham 
GNU General Public License v3
*********************************************************
General Options: 
   --help 		 Prints available options and exits.
   --version 		 Prints the version of GEM and exits.


Input/Output File Options: 
   --pheno-file 	 Path to the phenotype file.
   --bgen 		 Path to the BGEN file.
   --sample 		 Path to the sample file. Required when the BGEN file does not contain sample identifiers.
   --pfile 		 Path and prefix to the .pgen, .pvar, and .psam files.
   --pgen 		 Path to the pgen file.
   --pvar 		 Path to the pvar file.
   --psam 		 Path to the psam file.
   --bfile 		 Path and prefix to the .bed, .bim and .fam files.
   --bed 		 Path to the bed file.
   --bim 		 Path to the bim file.
   --fam 		 Path to the fam file.
   --out 		 Full path and extension to where GEM output results. 
 			    Default: gem.out
   --output-style 	 Modifies the output of GEM. Must be one of the following: 
			    minimum: Output the summary statistics for only the GxE and marginal G terms. 
 			    meta: 'minimum' output plus additional fields for the main G and any GxCovariate terms 
 				  For a robust analysis, additional columns for the model-based summary statistics will be included.  
 			    full: 'meta' output plus additional fields needed for re-analyses of a subset of interactions 
 			    Default: minimum


Phenotype File Options: 
   --sampleid-name 	 Column name in the phenotype file that contains sample identifiers.
   --pheno-name 	 Column name in the phenotype file that contains the phenotype of interest.
 			   If the number of levels (unique observations) is 2, the phenotype is treated as binary;
 			   otherwise it is assumed to be continuous.
   --exposure-names 	 One or more column names in the phenotype file naming the exposure(s) to be included in interaction tests.
   --int-covar-names 	 Any column names in the phenotype file naming the covariate(s) for which interactions should
 			   be included for adjustment (mutually exclusive with --exposure-names).
   --covar-names 	 Any column names in the phenotype file naming the covariates for which only main effects should
 			   be included for adjustment (mutually exclusive with both --exposure-names and --int-covar-names).
   --robust 		 0 for model-based standard errors and 1 for robust standard errors. 
 			    Default: 0
   --tol 		 Convergence tolerance for logistic regression. 
 			    Default: 0.0000001
   --delim 		 Delimiter separating values in the phenotype file.
 			 Tab delimiter should be represented as \t and space delimiter as \0. 
 			    Default: , (comma-separated)
   --missing-value 	 Indicates how missing values in the phenotype file are stored. 
 			    Default: NA
   --center 		 0 for no centering to be done and 1 to center ALL exposures and covariates. 
 			    Default: 1
   --scale 		 0 for no scaling to be done and 1 to scale ALL exposures and covariates by the standard deviation. 
 			    Default: 0
   --categorical-names 	 Names of the exposure or interaction covariate that should be treated as categorical. 
 			    Default: None
   --cat-threshold 	 A cut-off to determine which exposure or interaction covariate not specified using --categorical-names
 			    should be automatically treated as categorical based on the number of levels (unique observations). 
 			    Default: 20


Filtering Options: 
   --maf 		 Threshold to filter variants based on the minor allele frequency.
 			    Default: 0.001
   --miss-geno-cutoff 	 Threshold to filter variants based on the missing genotype rate.
 			    Default: 0.05
   --include-snp-file 	 Path to file containing a subset of variants in the specified genotype file to be used for analysis. The first
 			   line in this file is the header that specifies which variant identifier in the genotype file is used for ID
 			   matching. This must be 'snpid' (PLINK or BGEN) or 'rsid' (BGEN only).
 			   There should be one variant identifier per line after the header.


Performance Options:
   --threads 		 Set number of compute threads 
 			    Default: ceiling(detected threads / 2)
   --stream-snps 	 Number of SNPs to analyze in a batch. Memory consumption will increase for larger values of stream-snps.
 			    Default: 1

Installation

Source code downloaded from: https://github.com/large-scale-gxe-methods/GEM/

System

64-bit Linux