Down-sampling FASTQ.gz paired ends

By | December 16, 2019

Downsampling

I searched for a way to create a down-sampled data set from a large, real dataset. There are many creative suggestions on Biostars and other forums, but the most versatile and easy-to-use tool I found is one recommended there: seqtk, which is available on GitHub: github.com/lh3/seqtk

Quoting the README page: “Seqtk is a fast and lightweight tool for processing sequences in the FASTA or FASTQ format. It seamlessly parses both FASTA and FASTQ files which can also be optionally compressed by gzip.”

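Random sampling of paired-end files works by giving seqtk the same random “seed” for both files: with an identical seed it draws the same reads from each file, so the pairs stay together. In the README example the seed appears as -s100. From the README file: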

 

  • Subsample 10000 read pairs from two large paired FASTQ files (remember to use the same random seed to keep pairing):
      seqtk sample -s100 read1.fq 10000 > sub1.fq
      seqtk sample -s100 read2.fq 10000 > sub2.fq
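
To confirm that the two subsampled files really are still paired, one can compare their read names. The following is a minimal sketch, not part of seqtk: the function names read_names and check_pairing are made up for illustration, and it assumes plain 4-line FASTQ records (no wrapped sequence lines) with read names that differ at most by a trailing /1 or /2.

```shell
# Print the read name of each record (first of every 4 lines), without the
# leading "@", any comment after whitespace, or a trailing /1 or /2.
read_names() {
    awk 'NR % 4 == 1 { name = $1; sub(/^@/, "", name); sub(/\/[12]$/, "", name); print name }' "$1"
}

# Succeed only if both files list the same read names in the same order.
check_pairing() {
    if [ "$(read_names "$1")" = "$(read_names "$2")" ]; then
        echo "paired"
    else
        echo "NOT paired"
        return 1
    fi
}

# Usage, with the file names from the README example:
# check_pairing sub1.fq sub2.fq
```

If a different seed was used for the second file by mistake, the names will not line up and the check reports "NOT paired".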
    

I tested this with fastq.gz files as input and it worked. The sub*.fq files were written uncompressed, but that is a minor inconvenience.
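
The uncompressed output can be avoided by piping seqtk straight into gzip. Here is a small sketch of that idea: the wrapper name subsample_pair_gz and the output names sub1.fq.gz and sub2.fq.gz are my own choices, and the default seed of 100 simply mirrors the README example.

```shell
# Subsample a read pair with one shared seed and gzip the output on the fly.
# Arguments: R1 file, R2 file, number of read pairs, seed (default 100).
subsample_pair_gz() {
    r1=$1; r2=$2; n=$3; seed=${4:-100}
    # The same seed is passed to both calls so the pairing is kept.
    seqtk sample -s"$seed" "$r1" "$n" | gzip > sub1.fq.gz
    seqtk sample -s"$seed" "$r2" "$n" | gzip > sub2.fq.gz
}

# Usage (hypothetical input names):
# subsample_pair_gz read1.fq.gz read2.fq.gz 10000 100
```

Keeping both calls in one function makes it harder to give the two files different seeds by accident.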

Docker images

seqtk is also available as a Docker image, so there is no need to compile and install the program. It also means that it can run on a Mac or Windows system with Docker properly installed!

Images are available on Docker Hub. I have used and tested the one listed as “Most Popular”: https://hub.docker.com/r/dukegcb/seqtk
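
A possible way to call the containerized seqtk on files in the current directory is sketched below. This is an assumption-laden example rather than the image's documented usage: it assumes the seqtk binary is on the PATH inside the dukegcb/seqtk image, the mount point /data is arbitrary, and the wrapper name seqtk_docker is made up. The --entrypoint option is used so that the call does not depend on how the image defines its default command.

```shell
# Run seqtk from the Docker image against files in the current directory.
# Assumption: the image has `seqtk` on its PATH; /data is an arbitrary mount point.
seqtk_docker() {
    docker run --rm \
        --entrypoint seqtk \
        -v "$PWD":/data -w /data \
        dukegcb/seqtk "$@"
}

# Usage, same seed for both files as in the README example:
# seqtk_docker sample -s100 read1.fq.gz 10000 > sub1.fq
# seqtk_docker sample -s100 read2.fq.gz 10000 > sub2.fq
```

The output redirection happens on the host, so the sub*.fq files land in the current directory like they would with a locally installed seqtk.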

Note: Mac users can also install seqtk with Homebrew (brew).
