RepeatMasker-Teaching
Category
Bioinformatics
Program On
Teaching
Version
4.0.7
Author / Distributor
Description
"RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences. The output of the program is a detailed annotation of the repeats that are present in the query sequence as well as a modified version of the query sequence in which all the annotated repeats have been masked (default: replaced by Ns). Currently over 56% of human genomic sequence is identified and masked by the program." More details are at RepeatMasker
Running Program
The latest version of this application is installed at /usr/local/apps/eb/RepeatMasker/4.0.7-foss-2016b
To use this version, please load the module with
ml RepeatMasker/4.0.7-foss-2016b
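After loading the module, it can be worth confirming that the program is actually available before submitting a job. The following is a minimal sketch, assuming only that the module places an executable named RepeatMasker on the PATH; the variable name rm_status is illustrative.

```shell
# Check whether an executable named RepeatMasker is on the PATH.
if command -v RepeatMasker >/dev/null 2>&1; then
    rm_status="found"
else
    rm_status="missing"
fi
echo "RepeatMasker executable: $rm_status"
```

If the result is "missing", the module was probably not loaded in the current shell.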
Here is an example of a shell script, sub.sh, to run on the batch queue:
#!/bin/bash
#SBATCH --job-name=j_RepeatMasker
#SBATCH --partition=batch
#SBATCH --mail-type=ALL
#SBATCH --mail-user=username@uga.edu
#SBATCH --ntasks=1
#SBATCH --mem=10gb
#SBATCH --time=08:00:00
#SBATCH --output=RepeatMasker.%j.out
#SBATCH --error=RepeatMasker.%j.err
cd "$SLURM_SUBMIT_DIR"
ml RepeatMasker/4.0.7-foss-2016b
RepeatMasker [options]
In a real submission script, review at least the placeholder values above (for example the job name, e-mail address, memory, time limit, and RepeatMasker options) and replace them with values appropriate for your job.
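As an illustration of what could replace the "RepeatMasker [options]" line, here is a hedged sketch that uses only options listed in the program's help text (-species, -pa, -xsmall, -gff, -dir). The input file genome.fa, the output directory rm_out, and the helper function run_repeatmasker are hypothetical names. Reading the processor count from SLURM_CPUS_PER_TASK assumes that a "#SBATCH --cpus-per-task=N" line has been added to the script; without it the sketch falls back to one processor.

```shell
# Hypothetical input and output names; replace with your own.
input_fasta="genome.fa"
out_dir="rm_out"

# Use the core count Slurm allocated, or 1 if the variable is unset
# (it is only set when --cpus-per-task is requested).
n_proc="${SLURM_CPUS_PER_TASK:-1}"

run_repeatmasker() {
    # -species human : repeat library for human sequence
    # -pa            : number of processors to use in parallel
    # -xsmall        : return repeats in lowercase instead of masking with Ns
    # -gff           : also write a GFF annotation file
    # -dir           : directory for output files
    RepeatMasker -species human -pa "$n_proc" -xsmall -gff \
        -dir "$out_dir" "$input_fasta"
}

# Run only when the program and a non-empty input file are present.
if command -v RepeatMasker >/dev/null 2>&1 && [ -s "$input_fasta" ]; then
    mkdir -p "$out_dir"
    run_repeatmasker
else
    echo "Skipping run: RepeatMasker or $input_fasta is not available" >&2
fi
```

Note that, according to the help text, -pa only has an effect for batch files or sequences over 50 kb.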
Please refer to Running_Jobs_on_the_teaching_cluster, Run X window Jobs and Run interactive Jobs for more details on running jobs on the teaching cluster.
Here is an example of a job submission command:
sbatch ./sub.sh
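To keep track of the submitted job, the job ID can be captured at submission time and passed to squeue. This is a sketch rather than a site-specific recipe: it assumes the standard Slurm commands sbatch (with its --parsable option) and squeue are available on the login node, and the variable name job_id is illustrative.

```shell
job_id=""
if command -v sbatch >/dev/null 2>&1 && [ -f ./sub.sh ]; then
    # --parsable makes sbatch print just the job ID (plus ";cluster" on
    # multi-cluster setups, which cut strips off here).
    job_id=$(sbatch --parsable ./sub.sh | cut -d';' -f1)
    echo "Submitted job $job_id"
    # Show the job's current state in the queue.
    squeue -j "$job_id"
else
    echo "sbatch or ./sub.sh not found; run this from the submission directory on the cluster" >&2
fi
```

Once the job finishes, the files named by the --output and --error lines of sub.sh (RepeatMasker.<jobid>.out and RepeatMasker.<jobid>.err) hold the program's messages.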
Documentation
ml RepeatMasker/4.0.7-foss-2016b
RepeatMasker -h
Option h is ambiguous (help, html)
RepeatMasker version open-4.0.7
NAME
    RepeatMasker - Mask repetitive DNA

SYNOPSIS
    RepeatMasker [-options] <seqfiles(s) in fasta format>

DESCRIPTION
    The options are:

    -h(elp)
        Detailed help

    Default settings are for masking all type of repeats in a primate sequence.

    -e(ngine) [crossmatch|wublast|abblast|ncbi|hmmer|decypher]
        Use an alternate search engine to the default.

    -pa(rallel) [number]
        The number of processors to use in parallel (only works for batch files or sequences over 50 kb)

    -s
        Slow search; 0-5% more sensitive, 2-3 times slower than default

    -q
        Quick search; 5-10% less sensitive, 2-5 times faster than default

    -qq
        Rush job; about 10% less sensitive, 4->10 times faster than default (quick searches are fine under most circumstances)

    repeat options

    -nolow /-low
        Does not mask low_complexity DNA or simple repeats

    -noint /-int
        Only masks low complex/simple repeats (no interspersed repeats)

    -norna
        Does not mask small RNA (pseudo) genes

    -alu
        Only masks Alus (and 7SLRNA, SVA and LTR5)(only for primate DNA)

    -div [number]
        Masks only those repeats < x percent diverged from consensus seq

    -lib [filename]
        Allows use of a custom library (e.g. from another species)

    -cutoff [number]
        Sets cutoff score for masking repeats when using -lib (default 225)

    -species <query species>
        Specify the species or clade of the input sequence. The species name must be a valid NCBI Taxonomy Database species name and be contained in the RepeatMasker repeat database. Some examples are:

            -species human
            -species mouse
            -species rattus
            -species "ciona savignyi"
            -species arabidopsis

        Other commonly used species:

            mammal, carnivore, rodentia, rat, cow, pig, cat, dog, chicken, fugu, danio, "ciona intestinalis" drosophila, anopheles, elegans, diatoaea, artiodactyl, arabidopsis, rice, wheat, and maize

    Contamination options

    -is_only
        Only clips E coli insertion elements out of fasta and .qual files

    -is_clip
        Clips IS elements before analysis (default: IS only reported)

    -no_is
        Skips bacterial insertion element check

    Running options

    -gc [number]
        Use matrices calculated for 'number' percentage background GC level

    -gccalc
        RepeatMasker calculates the GC content even for batch files/small seqs

    -frag [number]
        Maximum sequence length masked without fragmenting (default 60000, 300000 for DeCypher)

    -nocut
        Skips the steps in which repeats are excised

    -noisy
        Prints search engine progress report to screen (defaults to .stderr file)

    -nopost
        Do not postprocess the results of the run ( i.e. call ProcessRepeats ). NOTE: This options should only be used when ProcessRepeats will be run manually on the results.

    output options

    -dir [directory name]
        Writes output to this directory (default is query file directory, "-dir ." will write to current directory).

    -a(lignments)
        Writes alignments in .align output file

    -inv
        Alignments are presented in the orientation of the repeat (with option -a)

    -lcambig
        Outputs ambiguous DNA transposon fragments using a lower case name. All other repeats are listed in upper case. Ambiguous fragments match multiple repeat elements and can only be called based on flanking repeat information.

    -small
        Returns complete .masked sequence in lower case

    -xsmall
        Returns repetitive regions in lowercase (rest capitals) rather than masked

    -x
        Returns repetitive regions masked with Xs rather than Ns

    -poly
        Reports simple repeats that may be polymorphic (in file.poly)

    -source
        Includes for each annotation the HSP "evidence". Currently this option is only available with the "-html" output format listed below.

    -html
        Creates an additional output file in xhtml format.

    -ace
        Creates an additional output file in ACeDB format

    -gff
        Creates an additional Gene Feature Finding format output

    -u
        Creates an additional annotation file not processed by ProcessRepeats

    -xm
        Creates an additional output file in cross_match format (for parsing)

    -no_id
        Leaves out final column with unique ID for each element (was default)

    -e(xcln)
        Calculates repeat densities (in .tbl) excluding runs of >=20 N/Xs in the query

SEE ALSO
    Crossmatch, ProcessRepeats

COPYRIGHT
    Copyright 2007-2014 Arian Smit, Institute for Systems Biology

AUTHORS
    Arian Smit <asmit@systemsbiology.org>
    Robert Hubley <rhubley@systemsbiology.org>
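Because -xsmall returns repetitive regions in lowercase, one simple way to sanity-check such a run is to count the lowercase bases in the masked sequence file. The sketch below is illustrative only: the file example.fa.masked and its sequence are made up so the snippet can run on its own, and the variable name masked_summary is hypothetical. For a real run, point awk at the .masked file written for your input.

```shell
# Create a tiny soft-masked FASTA purely for illustration (made-up sequence).
cat > example.fa.masked <<'EOF'
>seq1
ACGTacgtACGT
acgtACGTNNNN
EOF

# Skip header lines, then count all bases and the lowercase ones.
# gsub() returns the number of characters it replaced on the current line.
masked_summary=$(awk '
    /^>/ { next }
    {
        total += length($0)
        lower += gsub(/[acgtnrykmswbdhv]/, "")
    }
    END { printf "%d %d", lower, total }
' example.fa.masked)

echo "lowercase_bases total_bases: $masked_summary"
```

For the made-up file above this reports 8 lowercase bases out of 24 in total. Runs that use the default masking (Ns) or -x (Xs) would need the character class adjusted accordingly.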
Installation
Source code is obtained from RepeatMasker
System
64-bit Linux