SeqKit-Teaching

From Research Computing Center Wiki
Revision as of 12:10, 20 September 2019 by Shtsai (talk | contribs) (→‎Running Program)

Category

Bioinformatics

Program On

Teaching

Version

0.10.2

Author / Distributor

W Shen. More details at the SeqKit website (https://bioinf.shenwei.me/seqkit/).

Citation: W Shen, S Le, Y Li*, F Hu*. SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation. PLOS ONE. doi:10.1371/journal.pone.0163962.

Description

From Seqkit: "SeqKit - a cross-platform and ultrafast toolkit for FASTA/Q file manipulation."

Running Program

Also refer to Running Jobs on the teaching cluster

Version 0.10.2

  • Version 0.10.2, installed as a conda virtual environment in /usr/local/apps/gb/seqkit/0.10.2

To use seqkit v. 0.10.2, please first load the module with

module load seqkit/0.10.2_conda

This module will automatically load Miniconda3/4.4.10.


Example of a job submission script (sub.sh) to run seqkit v. 0.10.2 in the batch queue:

#!/bin/bash
#SBATCH --job-name=j_seqkit
#SBATCH --partition=batch
#SBATCH --mail-type=ALL
#SBATCH --mail-user=username@uga.edu
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=10gb
#SBATCH --time=08:00:00
#SBATCH --output=seqkit.%j.out
#SBATCH --error=seqkit.%j.err

cd $SLURM_SUBMIT_DIR

module load seqkit/0.10.2_conda

source activate ${SEQKITROOT}

seqkit -j 4 [options]

source deactivate

Here SEQKITROOT is an environment variable, defined in the seqkit/0.10.2_conda module file, that stores the installation path of the seqkit conda environment (/usr/local/apps/gb/seqkit/0.10.2). Replace [options] with the seqkit command and arguments you want to run. Adjust the job parameters, such as the job name, the maximum wall-clock time, the maximum memory, and the number of cores, to suit your job. If you use the seqkit -j option to set the number of threads, request the same number of cores with the Slurm --cpus-per-task option (4 in the sample script above).

In a real submission script, review every job-specific value above and replace it with a value appropriate for your job.
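One way to keep the seqkit thread count and the Slurm core request in step is to read the count from SLURM_CPUS_PER_TASK, which Slurm sets when --cpus-per-task is given, instead of typing the number twice. The lines below are a minimal sketch of what could stand in for the "seqkit -j 4 [options]" line of the sample script. The file names reads.fastq.gz and reads.stats.txt are hypothetical, and the choice of the stats command is only an illustration.

```shell
# Sketch only: take the thread count from Slurm; fall back to 1 outside a job.
THREADS="${SLURM_CPUS_PER_TASK:-1}"

# Hypothetical file names, for illustration only.
INPUT="reads.fastq.gz"
OUTPUT="reads.stats.txt"

# Log the command so it appears in the job's .out file.
echo "Command: seqkit stats -j ${THREADS} -o ${OUTPUT} ${INPUT}"

# Run only when seqkit is on the PATH (module loaded) and the input exists.
if command -v seqkit >/dev/null 2>&1 && [ -f "${INPUT}" ]; then
    seqkit stats -j "${THREADS}" -o "${OUTPUT}" "${INPUT}"
else
    echo "Skipping run: seqkit or ${INPUT} is not available"
fi
```

With this pattern, changing --cpus-per-task in the #SBATCH header is enough; the -j value follows it.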


Submit the job to the queue with

sbatch ./sub.sh
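If you want to check on the job right after submitting it, the following sketch captures the job ID using the sbatch --parsable option and passes it to squeue. It assumes the script is named sub.sh, as above, and that you are on a node where the Slurm commands are available; otherwise it only prints a message.

```shell
# Sketch only: submit sub.sh, remember the job ID, and show the job in the queue.
JOBID=""
if command -v sbatch >/dev/null 2>&1 && [ -f ./sub.sh ]; then
    # --parsable prints the job ID, optionally followed by ";cluster".
    JOBID=$(sbatch --parsable ./sub.sh)
    JOBID="${JOBID%%;*}"   # drop the cluster name if one is present
    echo "Submitted job ${JOBID}"
    squeue -j "${JOBID}"
else
    echo "sbatch or ./sub.sh not found; run this from the directory containing sub.sh on the cluster"
fi
```

With the sample script above, the job's output and error messages would then appear in seqkit.<jobid>.out and seqkit.<jobid>.err.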

Documentation

More details at https://bioinf.shenwei.me/seqkit/

ml seqkit/0.10.2_conda 

seqkit -h

SeqKit -- a cross-platform and ultrafast toolkit for FASTA/Q file manipulation

Version: 0.10.2

Author: Wei Shen <shenwei356@gmail.com>

Documents  : http://bioinf.shenwei.me/seqkit
Source code: https://github.com/shenwei356/seqkit
Please cite: https://doi.org/10.1371/journal.pone.0163962

Usage:
  seqkit [command]

Available Commands:
  common          find common sequences of multiple files by id/name/sequence
  concat          concatenate sequences with same ID from multiple files
  convert         convert FASTQ quality encoding between Sanger, Solexa and Illumina
  duplicate       duplicate sequences N times
  faidx           create FASTA index file and extract subsequence
  fq2fa           convert FASTQ to FASTA
  fx2tab          convert FASTA/Q to tabular format (with length/GC content/GC skew)
  genautocomplete generate shell autocompletion script
  grep            search sequences by ID/name/sequence/sequence motifs, mismatch allowed
  head            print first N FASTA/Q records
  help            Help about any command
  locate          locate subsequences/motifs, mismatch allowed
  mutate          edit sequence (point mutation, insertion, deletion)
  range           print FASTA/Q records in a range (start:end)
  rename          rename duplicated IDs
  replace         replace name/sequence by regular expression
  restart         reset start position for circular genome
  rmdup           remove duplicated sequences by id/name/sequence
  sample          sample sequences by number or proportion
  seq             transform sequences (revserse, complement, extract ID...)
  shuffle         shuffle sequences
  sliding         sliding sequences, circular genome supported
  sort            sort sequences by id/name/sequence/length
  split           split sequences into files by id/seq region/size/parts (mainly for FASTA)
  split2          split sequences into files by size/parts (FASTA, PE/SE FASTQ)
  stats           simple statistics of FASTA/Q files
  subseq          get subsequences by region/gtf/bed, including flanking sequences
  tab2fx          convert tabular format to FASTA/Q format
  translate       translate DNA/RNA to protein sequence (supporting ambiguous bases)
  version         print version information and check for update

Flags:
      --alphabet-guess-seq-length int   length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
  -h, --help                            help for seqkit
      --id-ncbi                         FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
      --id-regexp string                regular expression for parsing ID (default "^(\\S+)\\s?")
  -w, --line-width int                  line width when outputing FASTA format (0 for no wrap) (default 60)
  -o, --out-file string                 out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
      --quiet                           be quiet and do not show extra information
  -t, --seq-type string                 sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")
  -j, --threads int                     number of CPUs. (default value: 1 for single-CPU PC, 2 for others) (default 2)

Use "seqkit [command] --help" for more information about a command.
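To try a few of the commands listed above, the sketch below builds a tiny FASTA file in a scratch directory and runs stats, seq, grep, and head on it. The two sequences and all file names are made up for illustration. The flags shown (-n for seq, -p for grep, -n for head, and the global -o) should be confirmed against "seqkit [command] --help" for the installed version before you rely on them.

```shell
# Sketch only: work in a scratch directory so the current directory is untouched.
WORKDIR=$(mktemp -d)
DEMO="${WORKDIR}/demo.fa"

# A two-record toy FASTA file (made-up sequences).
cat > "${DEMO}" <<'EOF'
>seq1 first toy record
ACGTACGTAC
>seq2 second toy record
GGGCCCAAAT
EOF

if command -v seqkit >/dev/null 2>&1; then
    seqkit stats "${DEMO}"                  # simple statistics for the file
    seqkit seq -n "${DEMO}"                 # print only the record names
    seqkit grep -p seq2 "${DEMO}"           # keep the record whose ID is seq2
    # Write the first record to a gzipped file via the global -o flag.
    seqkit head -n 1 -o "${WORKDIR}/first.fa.gz" "${DEMO}"
else
    echo "seqkit not found; load the module first (module load seqkit/0.10.2_conda)"
fi
```

On small inputs like this the commands can be run interactively; for real data sets, put them in a job submission script as described under Running Program.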



Installation

  • Version 0.10.2 is installed as a conda virtual environment

System

64-bit Linux