SeqKit-Teaching
Category
Bioinformatics
Program On
Teaching
Version
0.10.2
Author / Distributor
W Shen. More details at Seqkit
Citation: W Shen, S Le, Y Li*, F Hu*. SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation. PLOS ONE. doi:10.1371/journal.pone.0163962.
Description
From Seqkit: "SeqKit - a cross-platform and ultrafast toolkit for FASTA/Q file manipulation."
Running Program
Also refer to Running Jobs on the teaching cluster
Version 0.10.2
- Version 0.10.2, installed as a conda virtual environment in /usr/local/apps/gb/seqkit/0.10.2
To use seqkit v. 0.10.2, please first load the module with
module load seqkit/0.10.2_conda
This module will automatically load Miniconda3/4.4.10.
Example of a job submission script (sub.sh) to run seqkit v. 0.10.2 in the batch queue:
#!/bin/bash
#SBATCH --job-name=j_seqkit
#SBATCH --partition=batch
#SBATCH --mail-type=ALL
#SBATCH --mail-user=username@uga.edu
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=10gb
#SBATCH --time=08:00:00
#SBATCH --output=seqkit.%j.out
#SBATCH --error=seqkit.%j.err
cd $SLURM_SUBMIT_DIR
module load seqkit/0.10.2_conda
conda activate ${SEQKITROOT}
seqkit -j 4 [options]
conda deactivate
Here SEQKITROOT is the environment variable that stores the installation path of the seqkit conda environment, i.e., /usr/local/apps/gb/seqkit/0.10.2; it is defined in the seqkit/0.10.2_conda module file. Replace [options] with the seqkit command and arguments you want to use. The job parameters, such as the maximum wall clock time, the maximum memory, the number of cores, and the job name, need to be modified appropriately as well.
If you use the seqkit -j option to specify the number of threads, request the same number of cores from the queueing system with the --cpus-per-task option (set to 4 in the sample script above).
In a real submission script, at least the job name, e-mail address, number of cores, memory, wall clock time, and [options] shown above need to be reviewed or replaced with proper values.
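One way to keep the seqkit thread count in step with the Slurm request is to read it from SLURM_CPUS_PER_TASK, which Slurm sets inside a job when --cpus-per-task is given. The lines below are a minimal sketch of that idea, not part of the official sample script. The helper name threads_for_job and the file name input.fasta are hypothetical, and the script only prints the command it would run:

```shell
#!/bin/bash
# Hypothetical helper: print the number of cores Slurm granted per task,
# or 1 when the variable is unset (e.g., when testing outside a job).
threads_for_job() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

THREADS="$(threads_for_job)"

# Build the command line; input.fasta is a placeholder file name.
SEQKIT_CMD="seqkit stats -j ${THREADS} input.fasta"
echo "Would run: ${SEQKIT_CMD}"
```

With this pattern, changing --cpus-per-task in the #SBATCH header is enough; the -j value follows automatically.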
Submit the job to the queue with
sbatch ./sub.sh
Documentation
More details at https://bioinf.shenwei.me/seqkit/
ml seqkit/0.10.2_conda
seqkit -h
SeqKit -- a cross-platform and ultrafast toolkit for FASTA/Q file manipulation

Version: 0.10.2

Author: Wei Shen <shenwei356@gmail.com>

Documents  : http://bioinf.shenwei.me/seqkit
Source code: https://github.com/shenwei356/seqkit
Please cite: https://doi.org/10.1371/journal.pone.0163962

Usage:
  seqkit [command]

Available Commands:
  common          find common sequences of multiple files by id/name/sequence
  concat          concatenate sequences with same ID from multiple files
  convert         convert FASTQ quality encoding between Sanger, Solexa and Illumina
  duplicate       duplicate sequences N times
  faidx           create FASTA index file and extract subsequence
  fq2fa           convert FASTQ to FASTA
  fx2tab          convert FASTA/Q to tabular format (with length/GC content/GC skew)
  genautocomplete generate shell autocompletion script
  grep            search sequences by ID/name/sequence/sequence motifs, mismatch allowed
  head            print first N FASTA/Q records
  help            Help about any command
  locate          locate subsequences/motifs, mismatch allowed
  mutate          edit sequence (point mutation, insertion, deletion)
  range           print FASTA/Q records in a range (start:end)
  rename          rename duplicated IDs
  replace         replace name/sequence by regular expression
  restart         reset start position for circular genome
  rmdup           remove duplicated sequences by id/name/sequence
  sample          sample sequences by number or proportion
  seq             transform sequences (revserse, complement, extract ID...)
  shuffle         shuffle sequences
  sliding         sliding sequences, circular genome supported
  sort            sort sequences by id/name/sequence/length
  split           split sequences into files by id/seq region/size/parts (mainly for FASTA)
  split2          split sequences into files by size/parts (FASTA, PE/SE FASTQ)
  stats           simple statistics of FASTA/Q files
  subseq          get subsequences by region/gtf/bed, including flanking sequences
  tab2fx          convert tabular format to FASTA/Q format
  translate       translate DNA/RNA to protein sequence (supporting ambiguous bases)
  version         print version information and check for update

Flags:
      --alphabet-guess-seq-length int   length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
  -h, --help                            help for seqkit
      --id-ncbi                         FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
      --id-regexp string                regular expression for parsing ID (default "^(\\S+)\\s?")
  -w, --line-width int                  line width when outputing FASTA format (0 for no wrap) (default 60)
  -o, --out-file string                 out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
      --quiet                           be quiet and do not show extra information
  -t, --seq-type string                 sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")
  -j, --threads int                     number of CPUs. (default value: 1 for single-CPU PC, 2 for others) (default 2)

Use "seqkit [command] --help" for more information about a command.
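As an illustration of the "seqkit [command]" usage pattern in the help output, the sketch below writes a tiny two-record FASTA file and, if seqkit is on the PATH, runs the stats command on it. The file name toy.fa and its sequences are made up for this example, and the seqkit call is skipped when the module has not been loaded:

```shell
#!/bin/bash
# Work in a scratch directory so no existing files are touched.
workdir="$(mktemp -d)"
fasta="${workdir}/toy.fa"   # hypothetical file name

# Write a two-record toy FASTA file.
cat > "${fasta}" <<'EOF'
>seq1 toy record
ACGTACGTAC
>seq2 toy record
GGGCCCAT
EOF

# Count header lines as a quick check that is independent of seqkit.
n_records="$(grep -c '^>' "${fasta}")"
echo "records in toy.fa: ${n_records}"

# Run seqkit only if it is available (i.e., after loading the module).
if command -v seqkit >/dev/null 2>&1; then
    seqkit stats "${fasta}"
else
    echo "seqkit not found; load the module with: module load seqkit/0.10.2_conda"
fi
```

Any other command from the list can be tried on the same file; "seqkit [command] --help" shows its arguments.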
Installation
- Version 0.10.2 is installed as a conda virtual environment
System
64-bit Linux