Useful Unix/Linux One-Liners for Bioinformatics

Useful Unix/Linux One-Liners for Bioinformatics by Stephen Turner.

From the post:

Much of the work that bioinformaticians do is munging and wrangling around massive amounts of text. While there are some “standardized” file formats (FASTQ, SAM, VCF, etc.) and some tools for manipulating them (fastx toolkit, samtools, vcftools, etc.), there are still times where knowing a little bit of Unix/Linux is extremely helpful, namely awk, sed, cut, grep, GNU parallel, and others.

This is by no means an exhaustive catalog, but I’ve put together a short list of examples using various Unix/Linux utilities for text manipulation, from the very basic (e.g., sum a column) to the very advanced (munge a FASTQ file and print the total number of reads, total number unique reads, percentage of unique reads, most abundant sequence, and its frequency). Most of these examples (with the exception of the SeqTK examples) use built-in utilities installed on nearly every Linux system. These examples are a combination of tactics I used every day and examples culled from other sources listed at the top of the page.
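To give a feel for the two ends of that range, here is a minimal sketch of my own, not copied from Turner's post. The function names sum_column and fastq_summary are hypothetical, and the FASTQ summary assumes strict four-line records (header, sequence, "+", qualities), so wrapped multi-line sequences would need a real parser. "Unique" here means the number of distinct sequences.

```shell
#!/usr/bin/env bash

# Sum one column of whitespace-delimited input read from stdin.
# Usage: sum_column [column_number]   (defaults to column 1)
sum_column() {
    awk -v col="${1:-1}" '{ total += $col } END { print total + 0 }'
}

# Summarize a FASTQ stream: total reads, distinct sequences, percentage
# distinct, and the most abundant sequence with its count.
# Assumes exactly four lines per record, so the sequence is line 2 of each.
# Reads the files given as arguments, or stdin if there are none.
fastq_summary() {
    awk 'NR % 4 == 2 {
             count[$0]++
             reads++
         }
         END {
             for (s in count) {
                 unique++
                 # Ties are broken arbitrarily: awk does not define for-in order.
                 if (count[s] > best_n) { best_n = count[s]; best = s }
             }
             pct = reads ? 100 * unique / reads : 0
             printf "reads=%d unique=%d pct_unique=%.1f top=%s top_count=%d\n",
                    reads, unique, pct, best, best_n
         }' "$@"
}
```

After sourcing these, something like `cut -f3 table.tsv | sum_column` or `fastq_summary reads.fastq` would be typical use; both file names are placeholders. For gzipped input, pipe through `gzip -dc` first.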

What one-liners do you have lying about?

For what data sets?
