Bioawk for handling bioinformatics formats

By | January 25, 2020

Today I found a new tool: bioawk, written by Heng Li, who also wrote samtools and bwa.

I first discovered it on this blog: bioawk-basics (Bioinformatics Workbooks)

There is also a short tutorial on GitHub:

I also found a recent Docker image. In fact, there are only 2 images on Docker Hub:

  • lbmc/bioawk, updated 2 months ago, with only 10 downloads.
  • The other was updated 2 years ago, has no downloads, and might not be functional.

So I tried lbmc/bioawk and it worked!

Download the image:

docker pull lbmc/bioawk:1.0

Run interactively, share the current directory as /data, and start the container in that directory:

docker run -it --rm -v "$(pwd)":/data -w /data lbmc/bioawk:1.0

Check OS:

uname -a
Linux 892a9a8d10b1 4.19.76-linuxkit #1 SMP Thu Oct 17 19:31:58 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Then I tested a FASTQ file I had in my directory, based on examples from the links above:

1. count number of records (NR)

bioawk -cfastx 'END{print NR}' ERR364233.subset.fastq


Indeed there are 200000 reads in this subset.
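
As a cross-check that does not need bioawk, the same count can be sketched with plain awk. This assumes a strict four-lines-per-record FASTQ (no wrapped sequence lines); the helper name count_fastq_records and the toy file are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of bioawk): count FASTQ records with plain awk.
# Assumption: every record is exactly 4 lines (header, sequence, "+", quality).
count_fastq_records() {
    awk 'END { print NR / 4 }' "$1"
}

# Toy two-record FASTQ, invented for this demonstration.
toy=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n' > "$toy"
count_fastq_records "$toy"   # prints 2
rm -f "$toy"
```

If this number and the bioawk NR disagree on a real file, the file probably has wrapped lines or a truncated record, which bioawk's parser handles and the simple division does not.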

2. make a tab-delimited table of names and sequence lengths, for example for the first 5 reads (4 lines per read, hence head -20)

head -20 ERR364233.subset.fastq | bioawk -cfastx '{print $name, length($seq)}'

ERR364233.2028526   246
ERR364233.576388    239
ERR364233.501486    54
ERR364233.1331889   233
ERR364233.1008347   148
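
To see what the $name and $seq fields save you from writing, here is a rough plain-awk equivalent of the table above. It is only a sketch under the same four-lines-per-record assumption; fastq_lengths and the toy reads are hypothetical names and data, not anything from bioawk or the tutorials.

```shell
#!/bin/sh
# Hypothetical helper: name/length table from a FASTQ file using plain awk.
# Assumption: strict 4-line records; the read name is the first word of the header.
fastq_lengths() {
    awk '
        NR % 4 == 1 { name = substr($1, 2) }        # header line: drop the leading "@"
        NR % 4 == 2 { print name "\t" length($0) }  # sequence line: emit name and length
    ' "$1"
}

# Toy FASTQ with two reads of length 4 and 6, invented for this demonstration.
toy=$(mktemp)
printf '@r1 desc\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n' > "$toy"
fastq_lengths "$toy"   # prints r1<TAB>4 then r2<TAB>6
rm -f "$toy"
```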

3. count how many sequences are shorter than 80 bp

bioawk -cfastx 'BEGIN{ shorter = 0} {if (length($seq) < 80) shorter += 1} END {print "shorter sequences", shorter}' ERR364233.subset.fastq

shorter sequences   12032
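
The 80 bp cutoff is hard-coded in the command above. A small sketch of how the threshold could be passed in with awk's standard -v option, written in plain awk so it runs without bioawk; count_shorter_than and the toy reads are made up for illustration, and the four-lines-per-record assumption applies again.

```shell
#!/bin/sh
# Hypothetical helper: count reads shorter than a given length with plain awk.
# Usage: count_shorter_than FILE MAXLEN
# Assumption: strict 4-line FASTQ records.
count_shorter_than() {
    awk -v max="$2" '
        NR % 4 == 2 && length($0) < max + 0 { n++ }  # sequence lines below the cutoff
        END { print n + 0 }                          # n + 0 prints 0 when nothing matched
    ' "$1"
}

# Toy FASTQ with reads of length 4, 6 and 3, invented for this demonstration.
toy=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n@r3\nACG\n+\nIII\n' > "$toy"
count_shorter_than "$toy" 5   # prints 2
rm -f "$toy"
```

The same -v trick should carry over to the bioawk command, since bioawk is an extension of awk, but I have only run the hard-coded version shown above.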

There are of course many other commands, detailed in the tutorials, mostly in the Bioinformatics Workbook one.


I was able to easily install bioawk on my Mac with brew, but having a Docker option is great for using it anywhere.

