RepeatMasker-Teaching
Category
Bioinformatics
Program On
Teaching
Version
4.0.7
Author / Distributor
Description
"RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences. The output of the program is a detailed annotation of the repeats that are present in the query sequence as well as a modified version of the query sequence in which all the annotated repeats have been masked (default: replaced by Ns). Currently over 56% of human genomic sequence is identified and masked by the program." More details are at RepeatMasker
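To check how much of a sequence ended up masked, the masked bases in the output FASTA can be counted directly. The following is a minimal sketch, not part of RepeatMasker: the function name masked_fraction and the toy input are made up for illustration. It counts N and X characters (upper or lower case), so any N already present in the query, such as assembly gaps, is counted as well.

```shell
#!/bin/bash
# Report the percentage of bases in a FASTA file that are N or X.
# Header lines (starting with ">") are skipped.
masked_fraction() {
    awk '
        /^>/ { next }                       # skip FASTA headers
        {
            total  += length($0)            # count every base on the line first
            masked += gsub(/[NnXx]/, "")    # gsub returns the number of replacements
        }
        END {
            if (total == 0) { print "0.00"; exit }
            printf "%.2f\n", 100 * masked / total
        }
    ' "$1"
}

# Toy input standing in for a RepeatMasker .masked file (hypothetical data).
demo=$(mktemp)
printf '>seq1\nACGTNNNNNNACGTACGTAC\n' > "$demo"
masked_fraction "$demo"    # prints 30.00 (6 of 20 bases are N)
rm -f "$demo"
```

On real output, the function would be pointed at the .masked file written next to the query, or into the directory given with -dir.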
Running Program
The latest installed version of this application is at /usr/local/apps/eb/RepeatMasker/4.0.7-foss-2016b
To use this version, please load the module with
ml RepeatMasker/4.0.7-foss-2016b
Here is an example of a shell script, sub.sh, to run on the batch queue:
#!/bin/bash
#SBATCH --job-name=j_RepeatMasker
#SBATCH --partition=batch
#SBATCH --mail-type=ALL
#SBATCH --mail-user=username@uga.edu
#SBATCH --ntasks=1
#SBATCH --mem=10gb
#SBATCH --time=08:00:00
#SBATCH --output=RepeatMasker.%j.out
#SBATCH --error=RepeatMasker.%j.err
cd $SLURM_SUBMIT_DIR
ml RepeatMasker/4.0.7-foss-2016b
RepeatMasker [options]
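In a real submission script, at least the placeholder values shown above (for example the job name, email address, memory, time limit, and the RepeatMasker options) need to be reviewed and replaced with values appropriate for your job.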
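As one possible way to fill in the RepeatMasker [options] line, the sketch below builds a command from options listed in the program's help text (-species, -pa, -xsmall, -gff, -dir). The input file genome.fa, the output directory rm_out, and the choice of human as the species are assumptions for illustration only. The thread count is read from SLURM_CPUS_PER_TASK, which Slurm sets only when the job header also requests cores, for example with #SBATCH --cpus-per-task=4; otherwise it falls back to 1.

```shell
#!/bin/bash
# Hypothetical inputs: adjust to your own data.
input="genome.fa"          # assumed FASTA file in the submit directory
outdir="rm_out"            # assumed output directory
# Use the cores granted by Slurm (set when --cpus-per-task is requested), else 1.
threads="${SLURM_CPUS_PER_TASK:-1}"

# Build the command from options listed in "RepeatMasker -h".
cmd=(RepeatMasker -species human -pa "$threads" -xsmall -gff -dir "$outdir" "$input")

echo "Command: ${cmd[*]}"

# Run only where RepeatMasker is available and the input file exists.
if command -v RepeatMasker >/dev/null 2>&1 && [ -f "$input" ]; then
    mkdir -p "$outdir"
    "${cmd[@]}"
else
    echo "RepeatMasker or $input not found; nothing was run." >&2
fi
```

Keeping the command in an array makes it easy to print for the job log before running it, and it keeps file names with spaces intact.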
Please refer to Running_Jobs_on_the_teaching_cluster, Run X window Jobs and Run interactive Jobs for more details on running jobs on the teaching cluster.
Here is an example of a job submission command:
sbatch ./sub.sh
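After submission, sbatch normally replies with a line of the form "Submitted batch job <id>". The sketch below shows one way to pick out that ID so the job can be followed with squeue. The helper name job_id_from_sbatch is made up for this example, and the sample ID 12345 is used only when sbatch is not available.

```shell
#!/bin/bash
# Extract the numeric job ID from sbatch's confirmation line,
# which normally reads "Submitted batch job <id>".
job_id_from_sbatch() {
    awk '/^Submitted batch job/ { print $4 }' <<< "$1"
}

if command -v sbatch >/dev/null 2>&1 && [ -f ./sub.sh ]; then
    reply=$(sbatch ./sub.sh)
    jobid=$(job_id_from_sbatch "$reply")
    echo "Job $jobid submitted; check it with: squeue -j $jobid"
else
    # Off-cluster demonstration with a sample confirmation line.
    job_id_from_sbatch "Submitted batch job 12345"
fi
```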
Documentation
ml RepeatMasker/4.0.7-foss-2016b
RepeatMasker -h
Option h is ambiguous (help, html)
RepeatMasker version open-4.0.7
NAME
RepeatMasker - Mask repetitive DNA
SYNOPSIS
RepeatMasker [-options] <seqfiles(s) in fasta format>
DESCRIPTION
The options are:
-h(elp)
Detailed help
Default settings are for masking all type of repeats in a primate
sequence.
-e(ngine) [crossmatch|wublast|abblast|ncbi|hmmer|decypher]
Use an alternate search engine to the default.
-pa(rallel) [number]
The number of processors to use in parallel (only works for batch
files or sequences over 50 kb)
-s Slow search; 0-5% more sensitive, 2-3 times slower than default
-q Quick search; 5-10% less sensitive, 2-5 times faster than default
-qq Rush job; about 10% less sensitive, 4->10 times faster than default
(quick searches are fine under most circumstances)
Repeat options
-nolow /-low
Does not mask low_complexity DNA or simple repeats
-noint /-int
Only masks low complex/simple repeats (no interspersed repeats)
-norna
Does not mask small RNA (pseudo) genes
-alu
Only masks Alus (and 7SLRNA, SVA and LTR5)(only for primate DNA)
-div [number]
Masks only those repeats < x percent diverged from consensus seq
-lib [filename]
Allows use of a custom library (e.g. from another species)
-cutoff [number]
Sets cutoff score for masking repeats when using -lib (default 225)
-species <query species>
Specify the species or clade of the input sequence. The species name
must be a valid NCBI Taxonomy Database species name and be contained
in the RepeatMasker repeat database. Some examples are:
-species human
-species mouse
-species rattus
-species "ciona savignyi"
-species arabidopsis
Other commonly used species:
mammal, carnivore, rodentia, rat, cow, pig, cat, dog, chicken, fugu,
danio, "ciona intestinalis" drosophila, anopheles, elegans,
diatoaea, artiodactyl, arabidopsis, rice, wheat, and maize
Contamination options
-is_only
Only clips E coli insertion elements out of fasta and .qual files
-is_clip
Clips IS elements before analysis (default: IS only reported)
-no_is
Skips bacterial insertion element check
Running options
-gc [number]
Use matrices calculated for 'number' percentage background GC level
-gccalc
RepeatMasker calculates the GC content even for batch files/small
seqs
-frag [number]
Maximum sequence length masked without fragmenting (default 60000,
300000 for DeCypher)
-nocut
Skips the steps in which repeats are excised
-noisy
Prints search engine progress report to screen (defaults to .stderr
file)
-nopost
Do not postprocess the results of the run ( i.e. call ProcessRepeats
). NOTE: This options should only be used when ProcessRepeats will
be run manually on the results.
output options
-dir [directory name]
Writes output to this directory (default is query file directory,
"-dir ." will write to current directory).
-a(lignments)
Writes alignments in .align output file
-inv
Alignments are presented in the orientation of the repeat (with
option -a)
-lcambig
Outputs ambiguous DNA transposon fragments using a lower case name.
All other repeats are listed in upper case. Ambiguous fragments
match multiple repeat elements and can only be called based on
flanking repeat information.
-small
Returns complete .masked sequence in lower case
-xsmall
Returns repetitive regions in lowercase (rest capitals) rather than
masked
-x Returns repetitive regions masked with Xs rather than Ns
-poly
Reports simple repeats that may be polymorphic (in file.poly)
-source
Includes for each annotation the HSP "evidence". Currently this
option is only available with the "-html" output format listed
below.
-html
Creates an additional output file in xhtml format.
-ace
Creates an additional output file in ACeDB format
-gff
Creates an additional Gene Feature Finding format output
-u Creates an additional annotation file not processed by
ProcessRepeats
-xm Creates an additional output file in cross_match format (for
parsing)
-no_id
Leaves out final column with unique ID for each element (was
default)
-e(xcln)
Calculates repeat densities (in .tbl) excluding runs of >=20 N/Xs in
the query
SEE ALSO
Crossmatch, ProcessRepeats
COPYRIGHT
Copyright 2007-2014 Arian Smit, Institute for Systems Biology
AUTHORS
Arian Smit <asmit@systemsbiology.org>
Robert Hubley <rhubley@systemsbiology.org>
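The output options above offer three ways of marking repeats in the .masked file: the default (Ns), -x (Xs), and -xsmall (lower case). The toy sketch below applies each style to a made-up sequence and interval to show what the difference looks like. It only illustrates the output formats and is not how RepeatMasker itself works; the function mask_region and its arguments are hypothetical.

```shell
#!/bin/bash
# Toy illustration only: apply one masking style to a 1-based, inclusive
# interval of a sequence. This is not RepeatMasker's own code.
mask_region() {
    local seq="$1" start="$2" end="$3" style="$4"
    awk -v s="$seq" -v a="$start" -v b="$end" -v style="$style" 'BEGIN {
        head = substr(s, 1, a - 1)
        mid  = substr(s, a, b - a + 1)
        tail = substr(s, b + 1)
        if (style == "xsmall") {
            mid = tolower(mid)            # like -xsmall: soft masking
        } else {
            ch = (style == "x") ? "X" : "N"
            gsub(/./, ch, mid)            # like -x or the default: hard masking
        }
        print head mid tail
    }'
}

mask_region "ACGTACGTACGT" 5 8 default   # ACGTNNNNACGT
mask_region "ACGTACGTACGT" 5 8 x         # ACGTXXXXACGT
mask_region "ACGTACGTACGT" 5 8 xsmall    # ACGTacgtACGT
```

Soft masking (-xsmall) keeps the underlying bases, which is useful when a later tool needs the sequence itself while still recognizing the repeat regions.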
Installation
Source code is obtained from RepeatMasker
System
64-bit Linux