Today I found a new tool: bioawk, written by Heng Li, who also wrote samtools and bwa.
I first discovered it on this blog: bioawk-basics (Bioinformatics Workbooks)
There is also a short tutorial on GitHub: github.com/vsbuffalo/bioawk-tutorial
I also found a recent Docker image. In fact, there are only 2 bioawk images on Docker Hub:
- lbmc/bioawk, updated 2 months ago, with only 10 downloads.
- The other was updated 2 years ago, has no downloads, and might not be functional.
So I tried lbmc/bioawk and it worked!
Download the image:
docker pull lbmc/bioawk:1.0
Then go to the directory that holds your data (it is mounted as /data inside the container) and start the image from there:
docker run -it --rm -v $(pwd):/data -w /data lbmc/bioawk:1.0
Check OS:
uname -a
Linux 892a9a8d10b1 4.19.76-linuxkit #1 SMP Thu Oct 17 19:31:58 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Then I tested it on a FASTQ file I had in my directory, based on examples from the links above:
1. Count the number of records (NR)
bioawk -cfastx 'END{print NR}' ERR364233.subset.fastq
200000
Indeed there are 200000 reads in this subset.
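As a cross-check that does not need bioawk, the same count can be sketched in plain awk. This assumes a simple FASTQ in which every record spans exactly four lines (no wrapped sequences). The toy file and the function name `count_fastq_records` are made up for illustration.

```shell
# Toy FASTQ with two records (made-up data, for illustration only)
tmp=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n' > "$tmp"

# Count records, assuming each record occupies exactly 4 lines
count_fastq_records() {
  awk 'END { print NR / 4 }' "$1"
}

count_fastq_records "$tmp"
```

The advantage of bioawk here is that NR already counts records rather than lines, so no division is needed and multi-line sequences are not a concern.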
2. Make a tab-delimited table of names and sequence lengths, for example for the first 5 reads (4 lines per read, hence head -20):
head -20 ERR364233.subset.fastq | bioawk -cfastx '{print $name, length($seq)}'
ERR364233.2028526 246
ERR364233.576388 239
ERR364233.501486 54
ERR364233.1331889 233
ERR364233.1008347 148
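For comparison, here is a rough plain-awk sketch of the same table, again assuming four lines per record. It takes the name to be the first word after the "@" on the header line, which is my reading of what $name holds. The toy file and the function name `fastq_lengths` are hypothetical.

```shell
# Toy FASTQ with two records (made-up data, for illustration only)
tmp=$(mktemp)
printf '@r1 extra comment\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n' > "$tmp"

# Print "name<TAB>length" for each record, assuming 4 lines per record
fastq_lengths() {
  awk '
    NR % 4 == 1 { name = substr($1, 2) }          # header line: drop the leading "@"
    NR % 4 == 2 { print name "\t" length($0) }    # sequence line: print name and length
  ' "$1"
}

fastq_lengths "$tmp"
```

With bioawk the line arithmetic disappears, since $name and $seq are already parsed fields.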
3. How many sequences are shorter than 80 bp?
bioawk -cfastx 'BEGIN{ shorter = 0} {if (length($seq) < 80) shorter += 1} END {print "shorter sequences", shorter}' ERR364233.subset.fastq
shorter sequences 12032
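The same filter can be sketched in plain awk with the cutoff passed in as a variable, which makes it easy to try other thresholds. As before, this assumes four lines per record. The toy file, the function name `count_shorter`, and the variable `max` are made up for illustration.

```shell
# Toy FASTQ with three records of length 4, 6 and 2 (made-up data)
tmp=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n@r3\nAC\n+\nII\n' > "$tmp"

# Count sequences strictly shorter than the given cutoff
# usage: count_shorter FILE CUTOFF
count_shorter() {
  awk -v max="$2" '
    NR % 4 == 2 && length($0) < max { shorter += 1 }
    END { print "shorter sequences", shorter + 0 }
  ' "$1"
}

count_shorter "$tmp" 5
```

The `shorter + 0` in the END block forces a numeric 0 when no sequence matches, which plays the same role as the explicit BEGIN initialisation in the bioawk command above.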
There are of course many other commands, detailed in both tutorials, with most of them in the Bioinformatics Workbook one.
Note
I was able to easily install bioawk on my Mac with brew, but having a Docker option is great for using it anywhere.