GEM-Sapelo2
Category
Bioinformatics
Program On
Sapelo2
Version
1.5.1
Author / Distributor
Please see https://github.com/large-scale-gxe-methods/GEM/
Description
GEM (Gene-Environment interaction analysis for Millions of samples) is a software program for large-scale gene-environment interaction testing in samples from unrelated individuals. It enables genome-wide association studies in up to millions of samples while allowing for multiple exposures, control for genotype-covariate interactions, and robust inference.
Running Program
Also refer to Running Jobs on Sapelo2. For more information on Environment Modules on Sapelo2 please see the Lmod page.
When running GEM on the cluster, always include the --threads option and set its value to 1. Otherwise, GEM falls back to its default of ceiling(detected threads / 2) compute threads (see the help output under Documentation) and creates many threads per CPU core, which makes the software run inefficiently and overloads the node it is running on. Here is an example submission script to run GEM in the batch partition:
#!/bin/bash
#SBATCH --partition=batch
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --time=30:00:00
#SBATCH --mem=10gb

ml GEM/1.5.1-foss-2022a

GEM --threads 1 [options]
where [options] must be replaced with the GEM command options you want to use. Other job parameters, such as the maximum wall-clock time, the maximum memory, the number of cores per node, and the job name, should be modified as appropriate.
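As a rough illustration of what [options] might expand to, the sketch below builds a GEM command from options listed in the GEM help output. The file names (pheno.csv, genotypes.bgen, gem_bmi_smoking.out) and the column names (sampleid, bmi, smoking, age, sex) are hypothetical placeholders, not files or columns provided on Sapelo2. The lines are meant to replace the final GEM line of a submission script, after the module has been loaded.

```shell
#!/bin/bash
# Hypothetical input files; replace with your own paths.
PHENO_FILE="pheno.csv"        # comma-separated, which is GEM's default --delim
BGEN_FILE="genotypes.bgen"

# Collect the arguments in an array so each option stays on its own line.
# All column names below are hypothetical examples.
gem_args=(
    --threads 1
    --bgen "$BGEN_FILE"
    --pheno-file "$PHENO_FILE"
    --sampleid-name sampleid
    --pheno-name bmi
    --exposure-names smoking
    --covar-names age sex
    --robust 1
    --out gem_bmi_smoking.out
)

# Record the full command in the job log.
echo "GEM ${gem_args[*]}"

# Run GEM only if it is on PATH (that is, the GEM module is loaded),
# so the sketch can also be dry-run to check the command line.
if command -v GEM >/dev/null 2>&1; then
    GEM "${gem_args[@]}"
fi
```

If the BGEN file does not contain sample identifiers, the help output indicates that --sample is also required.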
Documentation
Please see https://large-scale-gxe-methods.github.io/GEM-website/index.html.
You can also run GEM with the -h option to see its usage. For example:
[cft07037@d2-13 ~]$ GEM -h

*********************************************************
Welcome to GEM v1.5.1
(C) 2018-2023 Liang Hong, Han Chen, Duy Pham, Cong Pan
GNU General Public License v3
*********************************************************

General Options:
   --help                 Prints available options and exits.
   --version              Prints the version of GEM and exits.

Input/Output File Options:
   --pheno-file           Path to the phenotype file.
   --bgen                 Path to the BGEN file.
   --sample               Path to the sample file. Required when the BGEN file
                          does not contain sample identifiers.
   --pfile                Path and prefix to the .pgen, .pvar, and .psam files.
   --pgen                 Path to the pgen file.
   --pvar                 Path to the pvar file.
   --psam                 Path to the psam file.
   --bfile                Path and prefix to the .bed, .bim and .fam files.
   --bed                  Path to the bed file.
   --bim                  Path to the bim file.
   --fam                  Path to the fam file.
   --out                  Full path and extension to where GEM output results.
                            Default: gem.out
   --output-style         Modifies the output of GEM. Must be one of the following:
                            minimum: Output the summary statistics for only the
                                     GxE and marginal G terms.
                            meta:    'minimum' output plus additional fields for the
                                     main G and any GxCovariate terms
                                     For a robust analysis, additional columns for the
                                     model-based summary statistics will be included.
                            full:    'meta' output plus additional fields needed for
                                     re-analyses of a subset of interactions
                            Default: minimum

Phenotype File Options:
   --sampleid-name        Column name in the phenotype file that contains sample
                          identifiers.
   --pheno-name           Column name in the phenotype file that contains the
                          phenotype of interest. If the number of levels (unique
                          observations) is 2, the phenotype is treated as binary;
                          otherwise it is assumed to be continuous.
   --exposure-names       One or more column names in the phenotype file naming
                          the exposure(s) to be included in interaction tests.
   --int-covar-names      Any column names in the phenotype file naming the
                          covariate(s) for which interactions should be included
                          for adjustment (mutually exclusive with --exposure-names).
   --covar-names          Any column names in the phenotype file naming the
                          covariates for which only main effects should be included
                          for adjustment (mutually exclusive with both
                          --exposure-names and --int-covar-names).
   --robust               0 for model-based standard errors and 1 for robust
                          standard errors.
                            Default: 0
   --tol                  Convergence tolerance for logistic regression.
                            Default: 0.0000001
   --delim                Delimiter separating values in the phenotype file.
                          Tab delimiter should be represented as \t and space
                          delimiter as \0.
                            Default: , (comma-separated)
   --missing-value        Indicates how missing values in the phenotype file are
                          stored.
                            Default: NA
   --center               0 for no centering to be done and 1 to center ALL
                          exposures and covariates, 2 to center interaction
                          covariates only.
                            Default: 2
   --scale                0 for no scaling to be done and 1 to scale ALL exposures
                          and covariates by the standard deviation.
                            Default: 0
   --categorical-names    Names of the exposure or interaction covariate that
                          should be treated as categorical.
                            Default: None
   --cat-threshold        A cut-off to determine which exposure or interaction
                          covariate not specified using --categorical-names should
                          be automatically treated as categorical based on the
                          number of levels (unique observations).
                            Default: 20

Filtering Options:
   --maf                  Threshold to filter variants based on the minor allele
                          frequency.
                            Default: 0.001
   --miss-geno-cutoff     Threshold to filter variants based on the missing
                          genotype rate.
                            Default: 0.05
   --include-snp-file     Path to file containing a subset of variants in the
                          specified genotype file to be used for analysis. The
                          first line in this file is the header that specifies
                          which variant identifier in the genotype file is used
                          for ID matching. This must be 'snpid' (PLINK or BGEN)
                          or 'rsid' (BGEN only). There should be one variant
                          identifier per line after the header.

Performance Options:
   --threads              Set number of compute threads
                            Default: ceiling(detected threads / 2)
   --stream-snps          Number of SNPs to analyze in a batch. Memory consumption
                          will increase for larger values of stream-snps.
                            Default: 1
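The --include-snp-file description above specifies a simple layout: a header line naming the identifier type, followed by one variant identifier per line. A minimal sketch of creating such a file is shown here. The file name include_snps.txt and the variant identifiers are hypothetical and must be replaced with identifiers that actually occur in your genotype file.

```shell
#!/bin/bash
# Hypothetical output file name, for illustration only.
SNP_LIST="include_snps.txt"

# Line 1 is the header ('snpid' for PLINK or BGEN input, 'rsid' for BGEN only).
# Every following line holds exactly one variant identifier.
# The identifiers below are made-up placeholders.
printf '%s\n' snpid var_0001 var_0002 var_0003 > "$SNP_LIST"

# Show the header so it can be checked before the job is submitted.
head -n 1 "$SNP_LIST"
```

The resulting file could then be passed to GEM as --include-snp-file include_snps.txt alongside the other options.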
Installation
Source code downloaded from: https://github.com/large-scale-gxe-methods/GEM/
System
64-bit Linux