I searched for a way to create a down-sampled subset of a large sequencing dataset, and while there are many creative suggestions on BioStar and other forums, the most versatile and easy-to-use tool turned out to be one recommended on the forums:
seqtk, which is available on GitHub: github.com/lh3/seqtk
Quoting the readme page: “Seqtk is a fast and lightweight tool for processing sequences in the FASTA or FASTQ format. It seamlessly parses both FASTA and FASTQ files which can also be optionally compressed by gzip.”
The use of a “seed” allows consistent random sampling of paired-end files: simply provide the same seed to both commands. In their example it appears as -s100. From the Readme file:
- Subsample 10000 read pairs from two large paired FASTQ files (remember to use the same random seed to keep pairing):
seqtk sample -s100 read1.fq 10000 > sub1.fq
seqtk sample -s100 read2.fq 10000 > sub2.fq
I tested this with fastq.gz files as input and it worked; the sub*.fq output files were not compressed, but that is a minor inconvenience.
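If compressed output is wanted, seqtk writes to standard output, so the subsample can be piped straight through gzip. A minimal sketch, using the same file names as the Readme example:

```shell
# Subsample 10000 read pairs and compress the output on the fly.
# The identical seed (-s100) on both files keeps the pairs in sync.
seqtk sample -s100 read1.fq.gz 10000 | gzip > sub1.fq.gz
seqtk sample -s100 read2.fq.gz 10000 | gzip > sub2.fq.gz
```

This avoids writing the uncompressed intermediate files entirely.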
seqtk is also available as a Docker image, so there is no need to compile and install the program. It also means one can run it on a Mac or Windows system with a properly installed Docker!
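A sketch of running it through Docker, assuming a seqtk image is available; the image name staphb/seqtk below is one community-maintained build, so substitute whichever image you pull:

```shell
# Mount the current directory into the container so seqtk can see
# the input files; output redirection happens on the host side.
docker run --rm -v "$PWD":/data -w /data staphb/seqtk \
    seqtk sample -s100 read1.fq 10000 > sub1.fq
```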
Note: Mac users may also install seqtk with Homebrew (brew install seqtk).