GATK-Teaching
Revision as of 14:19, 10 August 2018
Category
Bioinformatics
Program On
Teaching
Version
3.4-0
Author / Distributor
Description
"The Genome Analysis Toolkit or GATK is a software package developed at the Broad Institute to analyse next-generation resequencing data. The toolkit offers a wide variety of tools, with a primary focus on variant discovery and genotyping as well as strong emphasis on data quality assurance. Its robust architecture, powerful processing engine and high-performance computing features make it capable of taking on projects of any size." More details are at GATK
Running Program
The latest version of this application is installed at /usr/local/apps/eb/GATK/3.4-0-Java-1.8.0_144
To use this version, please load the module with
ml GATK/3.4-0-Java-1.8.0_144
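The commands on this page locate the jar through the EBROOTGATK variable, which the module is expected to set. A minimal sketch of a sanity check, assuming that variable points at the installation directory after the module is loaded; the function name gatk_jar is hypothetical and not part of GATK or the module:

```shell
# Print the path to GenomeAnalysisTK.jar, or fail with a message on stderr.
# gatk_jar is a hypothetical helper used only for illustration.
gatk_jar() {
    if [ -z "${EBROOTGATK:-}" ]; then
        echo "EBROOTGATK is not set; run: ml GATK/3.4-0-Java-1.8.0_144" >&2
        return 1
    fi
    local jar="$EBROOTGATK/GenomeAnalysisTK.jar"
    if [ ! -f "$jar" ]; then
        echo "jar not found: $jar" >&2
        return 1
    fi
    echo "$jar"
}
```

With the module loaded, the helper could be used as java -jar "$(gatk_jar)" -h.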
Here is an example of a shell script, sub.sh, to run on the batch queue:
#!/bin/bash
#SBATCH --job-name=j_GATK
#SBATCH --partition=batch
#SBATCH --mail-type=ALL
#SBATCH --mail-user=username@uga.edu
#SBATCH --ntasks=1
#SBATCH --mem=10gb
#SBATCH --time=08:00:00
#SBATCH --output=GATK.%j.out
#SBATCH --error=GATK.%j.err
cd "$SLURM_SUBMIT_DIR"
ml GATK/3.4-0-Java-1.8.0_144
java -jar "$EBROOTGATK/GenomeAnalysisTK.jar" [options]
In a real submission script, review the #SBATCH values above (job name, e-mail address, memory, time limit) and replace placeholders such as username@uga.edu and [options] with values appropriate for your job.
Please refer to Running_Jobs_on_the_teaching_cluster, Run X window Jobs and Run interactive Jobs for more details on running jobs on the Teaching cluster.
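The example script requests 10gb from Slurm, but the java command does not set a heap size. One common approach is to pass the JVM option -Xmx with a value somewhat below the job's memory request. A minimal sketch, assuming Slurm exports SLURM_MEM_PER_NODE in megabytes when --mem is given; the helper name heap_mb and the 80% fraction are illustrative choices, not site recommendations:

```shell
# heap_mb is a hypothetical helper: print a Java heap size in MB that is
# 80% of the job's memory request.
# $1: total job memory in MB (optional). Falls back to SLURM_MEM_PER_NODE,
#     then to 10240 (the 10gb used in the example script).
heap_mb() {
    local total_mb="${1:-${SLURM_MEM_PER_NODE:-10240}}"
    echo $(( total_mb * 80 / 100 ))
}

# Possible use in sub.sh after loading the module (not run here):
#   java -Xmx"$(heap_mb)"m -jar "$EBROOTGATK/GenomeAnalysisTK.jar" [options]
```

Leaving some headroom matters because the JVM uses memory outside the heap, and a job that exceeds its --mem request may be killed by the scheduler.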
Here is an example job submission command:
sbatch ./sub.sh
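As an illustration of what could replace [options] in sub.sh, the sketch below assembles a command line for the CountReads tool listed in the help output, using only the -T, -R and -I arguments shown there. The file names are hypothetical, and the function only prints the command so it can be inspected before being pasted into the script:

```shell
# gatk_countreads_cmd is a hypothetical helper; it prints a command line
# and does not run GATK.
# $1: reference FASTA (hypothetical name), $2: input BAM (hypothetical name)
gatk_countreads_cmd() {
    local ref="$1" bam="$2"
    # Single quotes keep $EBROOTGATK literal so the printed line can be
    # copied into sub.sh unchanged.
    printf 'java -jar $EBROOTGATK/GenomeAnalysisTK.jar -T CountReads -R %s -I %s\n' "$ref" "$bam"
}

gatk_countreads_cmd reference.fasta sample.bam
```

GATK generally expects the reference to have a .fai index and a .dict sequence dictionary alongside it, and the BAM to be indexed, so those files would need to exist before the job is submitted.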
Documentation
ml GATK/3.4-0-Java-1.8.0_144
java -jar $EBROOTGATK/GenomeAnalysisTK.jar -h
--------------------------------------------------------------------------------
The Genome Analysis Toolkit (GATK) v3.4-0-g7e26428, Compiled 2015/05/15 03:25:41
Copyright (c) 2010 The Broad Institute
For support and documentation go to http://www.broadinstitute.org/gatk
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
usage: java -jar GenomeAnalysisTK.jar -T <analysis_type> [-args <arg_file>] [-I <input_file>] [--showFullBamList] [-rbs
<read_buffer_size>] [-et <phone_home>] [-K <gatk_key>] [-tag <tag>] [-rf <read_filter>] [-drf <disable_read_filter>] [-L
<intervals>] [-XL <excludeIntervals>] [-isr <interval_set_rule>] [-im <interval_merging>] [-ip <interval_padding>] [-R
<reference_sequence>] [-ndrs] [-maxRuntime <maxRuntime>] [-maxRuntimeUnits <maxRuntimeUnits>] [-dt <downsampling_type>]
[-dfrac <downsample_to_fraction>] [-dcov <downsample_to_coverage>] [-baq <baq>] [-baqGOP <baqGapOpenPenalty>] [-fixNDN]
[-fixMisencodedQuals] [-allowPotentiallyMisencodedQuals] [-OQ] [-DBQ <defaultBaseQualities>] [-PF <performanceLog>]
[-BQSR <BQSR>] [-qq <quantize_quals>] [-DIQ] [-EOQ] [-preserveQ <preserve_qscores_less_than>] [-globalQScorePrior
<globalQScorePrior>] [-S <validation_strictness>] [-rpr] [-kpr] [-sample_rename_mapping_file
<sample_rename_mapping_file>] [-U <unsafe>] [-disable_auto_index_creation_and_locking_when_reading_rods] [-sites_only]
[-writeFullFormat] [-compress <bam_compression>] [-simplifyBAM] [--disable_bam_indexing] [--generate_md5] [-nt
<num_threads>] [-nct <num_cpu_threads_per_data_thread>] [-mte] [-bfh <num_bam_file_handles>] [-rgbl
<read_group_black_list>] [-ped <pedigree>] [-pedString <pedigreeString>] [-pedValidationType <pedigreeValidationType>]
[-variant_index_type <variant_index_type>] [-variant_index_parameter <variant_index_parameter>] [-l <logging_level>]
[-log <log_to_file>] [-h] [-version]
-T,--analysis_type <analysis_type> Name of the tool to run
-args,--arg_file <arg_file> Reads arguments from the
specified file
-I,--input_file <input_file> Input file containing sequence
data (SAM or BAM)
--showFullBamList Emit a log entry (level INFO)
containing the full list of
sequence data files to be
included in the analysis
(including files inside
.bam.list files).
-rbs,--read_buffer_size <read_buffer_size> Number of reads per SAM file
to buffer in memory
-et,--phone_home <phone_home> Run reporting mode (NO_ET|AWS|
STDOUT)
-K,--gatk_key <gatk_key> GATK key file required to run
with -et NO_ET
-tag,--tag <tag> Tag to identify this GATK run
as part of a group of runs
-rf,--read_filter <read_filter> Filters to apply to reads
before analysis
-drf,--disable_read_filter <disable_read_filter> Read filters to disable
-L,--intervals <intervals> One or more genomic intervals
over which to operate
-XL,--excludeIntervals <excludeIntervals> One or more genomic intervals
to exclude from processing
-isr,--interval_set_rule <interval_set_rule> Set merging approach to use
for combining interval inputs
(UNION|INTERSECTION)
-im,--interval_merging <interval_merging> Interval merging rule for
abutting intervals (ALL|
OVERLAPPING_ONLY)
-ip,--interval_padding <interval_padding> Amount of padding (in bp) to
add to each interval
-R,--reference_sequence <reference_sequence> Reference sequence file
-ndrs,--nonDeterministicRandomSeed Use a non-deterministic random
seed
-maxRuntime,--maxRuntime <maxRuntime> Stop execution cleanly as soon
as maxRuntime has been reached
-maxRuntimeUnits,--maxRuntimeUnits <maxRuntimeUnits> Unit of time used by
maxRuntime (NANOSECONDS|
MICROSECONDS|MILLISECONDS|
SECONDS|MINUTES|HOURS|DAYS)
-dt,--downsampling_type <downsampling_type> Type of read downsampling to
employ at a given locus (NONE|
ALL_READS|BY_SAMPLE)
-dfrac,--downsample_to_fraction <downsample_to_fraction> Fraction of reads to
downsample to
-dcov,--downsample_to_coverage <downsample_to_coverage> Target coverage threshold for
downsampling to coverage
-baq,--baq <baq> Type of BAQ calculation to
apply in the engine (OFF|
CALCULATE_AS_NECESSARY|
RECALCULATE)
-baqGOP,--baqGapOpenPenalty <baqGapOpenPenalty> BAQ gap open penalty
-fixNDN,--refactor_NDN_cigar_string Reduce NDN elements in CIGAR
string
-fixMisencodedQuals,--fix_misencoded_quality_scores Fix mis-encoded base quality
scores
-allowPotentiallyMisencodedQuals,--allow_potentially_misencoded_quality_scores Ignore warnings about base
quality score encoding
-OQ,--useOriginalQualities Use the base quality scores
from the OQ tag
-DBQ,--defaultBaseQualities <defaultBaseQualities> Assign a default base quality
-PF,--performanceLog <performanceLog> Write GATK runtime performance
log to this file
-BQSR,--BQSR <BQSR> Input covariates table file
for on-the-fly base quality
score recalibration
-qq,--quantize_quals <quantize_quals> Quantize quality scores to a
given number of levels (with
-BQSR)
-DIQ,--disable_indel_quals Disable printing of base
insertion and deletion tags
(with -BQSR)
-EOQ,--emit_original_quals Emit the OQ tag with the
original base qualities (with
-BQSR)
-preserveQ,--preserve_qscores_less_than <preserve_qscores_less_than> Don't recalibrate bases with
quality scores less than this
threshold (with -BQSR)
-globalQScorePrior,--globalQScorePrior <globalQScorePrior> Global Qscore Bayesian prior
to use for BQSR
-S,--validation_strictness <validation_strictness> How strict should we be with
validation (STRICT|LENIENT|
SILENT)
-rpr,--remove_program_records Remove program records from
the SAM header
-kpr,--keep_program_records Keep program records in the
SAM header
-sample_rename_mapping_file,--sample_rename_mapping_file <sample_rename_mapping_file> Rename sample IDs on-the-fly
at runtime using the provided
mapping file
-U,--unsafe <unsafe> Enable unsafe operations:
nothing will be checked at
runtime (ALLOW_N_CIGAR_READS|
ALLOW_UNINDEXED_BAM|
ALLOW_UNSET_BAM_SORT_ORDER|
NO_READ_ORDER_VERIFICATION|
ALLOW_SEQ_DICT_INCOMPATIBILITY|
LENIENT_VCF_PROCESSING|ALL)
d_locking_when_reading_rods,--disable_auto_index_creation_and_locking_when_reading_rods Disable both auto-generation
of index files and index file
locking
-sites_only,--sites_only Just output sites without
genotypes (i.e. only the first
8 columns of the VCF)
-writeFullFormat,--never_trim_vcf_format_field Always output all the records
in VCF FORMAT fields, even if
some are missing
-compress,--bam_compression <bam_compression> Compression level to use for
writing BAM files (0 - 9,
higher is more compressed)
-simplifyBAM,--simplifyBAM If provided, output BAM files
will be simplified to include
just key reads for downstream
variation discovery analyses
(removing duplicates, PF-,
non-primary reads), as well
stripping all extended tags
from the kept reads except the
read group identifier
--disable_bam_indexing Turn off on-the-fly creation
of indices for output BAM
files.
--generate_md5 Enable on-the-fly creation of
md5s for output BAM files.
-nt,--num_threads <num_threads> Number of data threads to
allocate to this analysis
-nct,--num_cpu_threads_per_data_thread <num_cpu_threads_per_data_thread> Number of CPU threads to
allocate per data thread
-mte,--monitorThreadEfficiency Enable threading efficiency
monitoring
-bfh,--num_bam_file_handles <num_bam_file_handles> Total number of BAM file
handles to keep open
simultaneously
-rgbl,--read_group_black_list <read_group_black_list> Exclude read groups based on
tags
-ped,--pedigree <pedigree> Pedigree files for samples
-pedString,--pedigreeString <pedigreeString> Pedigree string for samples
-pedValidationType,--pedigreeValidationType <pedigreeValidationType> Validation strictness for
pedigree information (STRICT|
SILENT)
-variant_index_type,--variant_index_type <variant_index_type> Type of IndexCreator to use
for VCF/BCF indices
(DYNAMIC_SEEK|DYNAMIC_SIZE|
LINEAR|INTERVAL)
-variant_index_parameter,--variant_index_parameter <variant_index_parameter> Parameter to pass to the
VCF/BCF IndexCreator
-l,--logging_level <logging_level> Set the minimum level of
logging
-log,--log_to_file <log_to_file> Set the logging location
-h,--help Generate the help message
-version,--version Output version information
annotator
VariantAnnotator Annotate variant calls with context information
beagle
BeagleOutputToVCF Takes files produced by Beagle imputation engine and creates a vcf with modified
annotations.
ProduceBeagleInput Converts the input VCF into a format accepted by the Beagle imputation/analysis
program.
VariantsToBeagleUnphased Produces an input file to Beagle imputation engine, listing unphased, hard-called
genotypes for a single sample in input variant file.
bqsr
AnalyzeCovariates Create plots to visualize base recalibration results <p/> This tool generates plots
for visualizing the quality of a recalibration run.
BaseRecalibrator Generate base recalibration table to compensate for systematic errors
coverage
CallableLoci Collect statistics on callable, uncallable, poorly mapped, and other parts of the
genome
CompareCallableLoci Compare callability statistics
DepthOfCoverage Assess sequence coverage by a wide array of metrics, partitioned by sample, read group,
or library
GCContentByInterval Calculates the GC content of the reference sequence for each interval
diagnosetargets
DiagnoseTargets Analyze coverage distribution and validate read mates per interval and per sample
diagnostics
BaseCoverageDistribution Evaluate coverage distribution per base
CoveredByNSamplesSites Report well-covered intervals
ErrorRatePerCycle Compute the read error rate per position
FindCoveredIntervals Outputs a list of intervals that are covered above a given threshold
ReadGroupProperties Collect statistics about read groups and their properties
ReadLengthDistribution Collect read length statistics
examples
GATKPaperGenotyper A simple Bayesian genotyper, that outputs a text based call format.
fasta
FastaAlternateReferenceMaker Generate an alternative reference sequence over the specified interval
FastaReferenceMaker Create a subset of a FASTA reference sequence
FastaStats Calculate basic statistics about the reference sequence itself
filters
VariantFiltration Filter variant calls based on INFO and FORMAT annotations
genotyper
UnifiedGenotyper Call SNPs and indels on a per-locus basis
haplotypecaller
HaplotypeCaller Call SNPs and indels simultaneously via local re-assembly of haplotypes in an active
region
HaplotypeResolver Haplotype-based resolution of variants in separate callsets.
indels
IndelRealigner Perform local realignment of reads around indels
LeftAlignIndels Left-align indels within reads in a bam file
RealignerTargetCreator Define intervals to target for local realignment
missing
QualifyMissingIntervals Collect quality metrics for a set of intervals
phasing
PhaseByTransmission Compute the most likely genotype combination and phasing for trios and parent/child
pairs
ReadBackedPhasing Annotate physical phasing information
qc
CheckPileup Compare GATK's internal pileup to a reference Samtools pileup
CountBases Count the number of bases in a set of reads
CountIntervals Count contiguous regions in an interval list
CountLoci Count the total number of covered loci
CountMales Count the number of reads seen from male samples
CountReadEvents Count the number of read events
CountReads Count the number of reads
CountRODs Count the number of ROD objects encountered
CountRODsByRef Count the number of ROD objects encountered along the reference
CountTerminusEvent Count the number of reads ending in insertions, deletions or soft-clips
ErrorThrowing A walker that simply throws errors.
FlagStat Collect statistics about sequence reads based on their SAM flags
Pileup Print read alignments in Pileup-style format
PrintRODs Print out all of the RODs in the input data set
QCRef Quality control for the reference fasta
ReadClippingStats Collect read clipping statistics
readutils
ClipReads Read clipping based on quality, position or sequence matching
PrintReads Write out sequence read data (for filtering, merging, subsetting etc)
ReadAdaptorTrimmer Utility tool to blindly strip base adaptors
SplitSamFile Split a BAM file by sample
rnaseq
ASEReadCounter Calculate read counts per allele for allele-specific expression analysis
SplitNCigarReads Splits reads that contain Ns in their CIGAR string
simulatereads
SimulateReadsForVariants Generate simulated reads for variants
validation
GenotypeAndValidate Genotype and validate a dataset and the calls of another dataset using the Unified
Genotyper
validationsiteselector
ValidationSiteSelector Randomly select variant records according to specified options
varianteval
VariantEval General-purpose tool for variant evaluation (% in dbSNP, genotype concordance, Ti/Tv
ratios, and a lot more)
variantrecalibration
ApplyRecalibration Apply a score cutoff to filter variants based on a recalibration table
VariantRecalibrator Build a recalibration model to score variant quality for filtering purposes
variantutils
CalculateGenotypePosteriors Calculate genotype posterior likelihoods given panel data
CombineGVCFs Combine per-sample gVCF files produced by HaplotypeCaller into a multi-sample gVCF file
CombineVariants Combine variant records from different sources
FilterLiftedVariants Filters a lifted-over VCF file for reference bases that have been changed
GenotypeConcordance Genotype concordance between two callsets
GenotypeGVCFs Perform joint genotyping on gVCF files produced by HaplotypeCaller
LeftAlignAndTrimVariants Left-align indels in a variant callset
LiftoverVariants Lifts a VCF file over from one build to another
RandomlySplitVariants Randomly split variants into different sets
RegenotypeVariants Regenotypes the variants from a VCF containing PLs or GLs.
SelectHeaders Selects headers from a VCF source
SelectVariants Select a subset of variants from a larger callset
ValidateVariants Validate a VCF file with an extra strict set of criteria
VariantsToAllelicPrimitives Simplify multi-nucleotide variants (MNPs) into more basic/primitive alleles.
VariantsToBinaryPed Convert VCF to binary pedigree file
VariantsToTable Extract specific fields from a VCF file to a tab-delimited table
VariantsToVCF Convert variants from other file formats to VCF format
VariantValidationAssessor Annotate a validation VCF with QC metrics
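The -nt and -nct arguments in the help output above control threading. If a job requests several cores with #SBATCH --cpus-per-task, the thread count can be taken from SLURM_CPUS_PER_TASK rather than hard-coded. A minimal sketch; the helper nct_args is hypothetical, and not every GATK tool accepts -nct, so check the documentation of the specific tool first:

```shell
# nct_args is a hypothetical helper: print "-nct N" when more than one CPU
# is available to the task, and print nothing otherwise.
# $1: CPU count (optional). Falls back to SLURM_CPUS_PER_TASK, then to 1.
nct_args() {
    local cpus="${1:-${SLURM_CPUS_PER_TASK:-1}}"
    if [ "$cpus" -gt 1 ]; then
        echo "-nct $cpus"
    fi
}

# Possible use in sub.sh (not run here); the unquoted expansion is
# intentional so the flag and its value become two separate arguments:
#   java -jar "$EBROOTGATK/GenomeAnalysisTK.jar" [options] $(nct_args)
```

Tying the thread count to the scheduler's allocation avoids starting more threads than the cores the job was granted.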
Installation
Source code is obtained from GATK
System
64-bit Linux