HTSeq-Teaching

From Research Computing Center Wiki


Category

Bioinformatics

Program On

Teaching

Version

0.9.1

Author / Distributor

Simon Anders

Description

A framework to process and analyze data from high-throughput sequencing (HTS) assays. More information: http://www-huber.embl.de/users/anders/HTSeq/

Running Program

  • Version 0.9.1, installed in /usr/local/apps/eb/HTSeq/0.9.1-foss-2016b-Python-2.7.14

To use this version of HTSeq, please first load the module with

ml HTSeq/0.9.1-foss-2016b-Python-2.7.14

Here is an example of a shell script, sub.sh, to run on the batch queue:

#!/bin/bash
#SBATCH --job-name=j_HTSeq
#SBATCH --partition=batch
#SBATCH --mail-type=ALL
#SBATCH --mail-user=username@uga.edu
#SBATCH --ntasks=1
#SBATCH --mem=10gb
#SBATCH --time=08:00:00
#SBATCH --output=HTSeq.%j.out
#SBATCH --error=HTSeq.%j.err

cd "$SLURM_SUBMIT_DIR"
ml HTSeq/0.9.1-foss-2016b-Python-2.7.14
htseq-count [options] alignment_file gff_file

In a real submission script, at least the job-specific values above (for example the job name, e-mail address, memory, time limit, and the htseq-count options and input files) need to be reviewed or replaced with proper values.
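As a minimal sketch of what the htseq-count line in sub.sh might look like once it is filled in, the block below counts reads from a position-sorted BAM file against a GTF annotation. The file names (sample.sorted.bam, annotation.gtf, sample.counts.txt) are hypothetical placeholders, and the option values are only one possible choice; all options used here are listed in the htseq-count help text in the Documentation section of this page. In particular, -s no assumes an unstranded library, which may not match your data.

```shell
#!/bin/bash
# Hypothetical file names; replace them with your own files.
BAM="sample.sorted.bam"    # assumed: alignment file in BAM format, sorted by position
GTF="annotation.gtf"       # assumed: Ensembl-style GTF annotation file
OUT="sample.counts.txt"    # assumed: file that will receive the counts table

# Build the command as an array so that every argument stays intact.
# -f bam : the alignment file is BAM (the default is sam)
# -r pos : paired-end data sorted by position (the default is name)
# -s no  : assumes an unstranded assay (the default is yes)
# -t exon, -i gene_id : the defaults, written out for clarity
cmd=(htseq-count -f bam -r pos -s no -t exon -i gene_id "$BAM" "$GTF")

# Print the command for the job log.
echo "${cmd[@]}"

# Run it only when the program and both input files are available.
# htseq-count writes the counts to standard output, so redirect them to a file.
if command -v htseq-count >/dev/null 2>&1 && [ -f "$BAM" ] && [ -f "$GTF" ]; then
    "${cmd[@]}" > "$OUT"
fi
```

Printing the command before running it leaves a record of the exact options in the job's output file, which can help when comparing runs.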

Please refer to Running_Jobs_on_the_teaching_cluster, Run X window Jobs and Run interactive Jobs for more details on running jobs on the teaching cluster.


Here is an example of a job submission command:

sbatch ./sub.sh
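If it is useful to keep track of the job after submission, the following sketch captures the job ID and checks the queue. It assumes the standard Slurm commands sbatch and squeue are available on the login node and that sub.sh is in the current directory; the variable names are illustrative only.

```shell
#!/bin/bash
SCRIPT="./sub.sh"    # the submission script described above

if command -v sbatch >/dev/null 2>&1 && [ -f "$SCRIPT" ]; then
    # --parsable makes sbatch print only the job ID (optionally followed by ";cluster").
    jobid=$(sbatch --parsable "$SCRIPT")
    jobid=${jobid%%;*}    # drop the cluster name if one was printed
    echo "Submitted job $jobid"

    # Show the state of this job in the queue.
    squeue -j "$jobid"

    # With the #SBATCH --output and --error lines from sub.sh, the job writes to:
    echo "Output: HTSeq.$jobid.out  Errors: HTSeq.$jobid.err"
    STATUS="submitted"
else
    echo "sbatch or $SCRIPT not found; nothing was submitted"
    STATUS="skipped"
fi
```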


Documentation

module load HTSeq/0.9.1-foss-2016b-Python-2.7.14
htseq-count -h
usage: htseq-count [options] alignment_file gff_file

This script takes one or more alignment files in SAM/BAM format and a feature
file in GFF format and calculates for each feature the number of reads mapping
to it. See http://htseq.readthedocs.io/en/master/count.html for details.

positional arguments:
  samfilenames          Path to the SAM/BAM files containing the mapped reads.
                        If '-' is selected, read from standard input
  featuresfilename      Path to the file containing the features

optional arguments:
  -h, --help            show this help message and exit
  -f {sam,bam}, --format {sam,bam}
                        type of <alignment_file> data, either 'sam' or 'bam'
                        (default: sam)
  -r {pos,name}, --order {pos,name}
                        'pos' or 'name'. Sorting order of <alignment_file>
                        (default: name). Paired-end sequencing data must be
                        sorted either by position or by read name, and the
                        sorting order must be specified. Ignored for single-
                        end data.
  --max-reads-in-buffer MAX_BUFFER_SIZE
                        When <alignment_file> is paired end sorted by
                        position, allow only so many reads to stay in memory
                        until the mates are found (raising this number will
                        use more memory). Has no effect for single end or
                        paired end sorted by name
  -s {yes,no,reverse}, --stranded {yes,no,reverse}
                        whether the data is from a strand-specific assay.
                        Specify 'yes', 'no', or 'reverse' (default: yes).
                        'reverse' means 'yes' with reversed strand
                        interpretation
  -a MINAQUAL, --minaqual MINAQUAL
                        skip all reads with alignment quality lower than the
                        given minimum value (default: 10)
  -t FEATURETYPE, --type FEATURETYPE
                        feature type (3rd column in GFF file) to be used, all
                        features of other type are ignored (default, suitable
                        for Ensembl GTF files: exon)
  -i IDATTR, --idattr IDATTR
                        GFF attribute to be used as feature ID (default,
                        suitable for Ensembl GTF files: gene_id)
  --additional-attr ADDITIONAL_ATTR [ADDITIONAL_ATTR ...]
                        Additional feature attributes (default: none, suitable
                        for Ensembl GTF files: gene_name)
  -m {union,intersection-strict,intersection-nonempty}, --mode {union,intersection-strict,intersection-nonempty}
                        mode to handle reads overlapping more than one feature
                        (choices: union, intersection-strict, intersection-
                        nonempty; default: union)
  --nonunique {none,all}
                        Whether to score reads that are not uniquely aligned
                        or ambiguously assigned to features
  --secondary-alignments {score,ignore}
                        Whether to score secondary alignments (0x100 flag)
  --supplementary-alignments {score,ignore}
                        Whether to score supplementary alignments (0x800 flag)
  -o SAMOUTS [SAMOUTS ...], --samout SAMOUTS [SAMOUTS ...]
                        write out all SAM alignment records into an output SAM
                        file called SAMOUT, annotating each line with its
                        feature assignment (as an optional field with tag
                        'XF')
  -q, --quiet           suppress progress report

Written by Simon Anders (sanders@fs.tum.de), European Molecular Biology
Laboratory (EMBL). (c) 2010. Released under the terms of the GNU General
Public License v3. Part of the 'HTSeq' framework, version 0.9.1.

htseq-qa -h
Usage: htseq-qa [options] read_file

This script take a file with high-throughput sequencing reads (supported
formats: SAM, Solexa _export.txt, FASTQ, Solexa _sequence.txt) and performs a
simply quality assessment by producing plots showing the distribution of
called bases and base-call quality scores by position within the reads. The
plots are output as a PDF file.

Options:
  -h, --help            show this help message and exit
  -t TYPE, --type=TYPE  type of read_file (one of: sam [default], bam, solexa-
                        export, fastq, solexa-fastq)
  -o OUTFILE, --outfile=OUTFILE
                        output filename (default is <read_file>.pdf)
  -r READLEN, --readlength=READLEN
                        the maximum read length (when not specified, the
                        script guesses from the file
  -g GAMMA, --gamma=GAMMA
                        the gamma factor for the contrast adjustment of the
                        quality score plot
  -n, --nosplit         do not split reads in unaligned and aligned ones
  -m MAXQUAL, --maxqual=MAXQUAL
                        the maximum quality score that appears in the data
                        (default: 41)

Written by Simon Anders (sanders@fs.tum.de), European Molecular Biology
Laboratory (EMBL). (c) 2010. Released under the terms of the GNU General
Public License v3. Part of the 'HTSeq' framework, version 0.9.1.
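
As an illustration of the htseq-qa options listed above, the sketch below produces a quality-assessment PDF from a FASTQ file. The file names reads.fastq and reads_qa.pdf are hypothetical placeholders; the -t and -o options are taken from the help text above.

```shell
#!/bin/bash
# Hypothetical file names; replace them with your own files.
READS="reads.fastq"    # assumed: input reads in FASTQ format
PDF="reads_qa.pdf"     # assumed: name for the output PDF

# -t fastq : the input is FASTQ (the default type is sam)
# -o       : output file name (the default is <read_file>.pdf)
qa_cmd=(htseq-qa -t fastq -o "$PDF" "$READS")

# Print the command for the job log.
echo "${qa_cmd[@]}"

# Run it only when the program and the input file are available.
if command -v htseq-qa >/dev/null 2>&1 && [ -f "$READS" ]; then
    "${qa_cmd[@]}"
fi
```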

Installation


System

64-bit Linux