Some language implementations have regex built in; some provide a regex library; some use a third-party regex library. Some – coincidentally – reduce this work to substring matching. (Remember Hennessy and Patterson's warning.)
Different libraries are very likely to implement different regex algorithms.
The work is to use the same simple regex patterns and actions to manipulate FASTA format data. Don't optimize away the work.
How to implement
We ask that contributed programs not only give the correct result, but also use the same algorithm to calculate that result.
Each program should:
read all of a redirected FASTA format file from stdin, and record the sequence length
use the same simple regex pattern match-replace to remove FASTA sequence descriptions and all linefeed characters, and record the sequence length
use the same simple regex patterns, representing DNA 8-mers and their reverse complement (with a wildcard in one position), and (one pattern at a time) count matches in the redirected file
write the regex pattern and count
use the same simple regex patterns to make IUB code alternatives explicit, and (one pattern at a time) match-replace the pattern in the redirected file, and record the sequence length
write the 3 recorded sequence lengths
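The steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `run`, the three entries in `VARIANTS`, and the output layout are assumptions made for the example, since the required pattern list is not reproduced in this section. The `IUB` table uses the standard nucleotide ambiguity codes.

```python
import re
import sys

# Assumed example patterns: a DNA 8-mer and its reverse complement,
# with a wildcard (character class) in one position.
VARIANTS = [
    "agggtaaa|tttaccct",
    "[cgt]gggtaaa|tttaccc[acg]",
    "a[act]ggtaaa|tttacc[agt]t",
]

# IUB ambiguity codes and the explicit alternatives they stand for.
IUB = [
    ("B", "(c|g|t)"), ("D", "(a|g|t)"), ("H", "(a|c|t)"), ("K", "(g|t)"),
    ("M", "(a|c)"), ("N", "(a|c|g|t)"), ("R", "(a|g)"), ("S", "(c|g)"),
    ("V", "(a|c|g)"), ("W", "(a|t)"), ("Y", "(c|t)"),
]


def run(text):
    """Return the output lines for one FASTA input held in memory."""
    lines = []
    initial_len = len(text)

    # Remove description lines (starting with '>') and all linefeeds.
    seq = re.sub(r">.*\n|\n", "", text)
    cleaned_len = len(seq)

    # Count matches one pattern at a time; write the pattern and its count.
    for pattern in VARIANTS:
        lines.append(f"{pattern} {len(re.findall(pattern, seq))}")

    # Make IUB code alternatives explicit, one pattern at a time.
    for code, alternatives in IUB:
        seq = re.sub(code, alternatives, seq)

    # Blank separator (assumed layout), then the 3 recorded lengths.
    lines.append("")
    lines.extend(str(n) for n in (initial_len, cleaned_len, len(seq)))
    return lines


if __name__ == "__main__":
    print("\n".join(run(sys.stdin.read())))
```

Run it with the input redirected to stdin. Each step goes through the regex engine on purpose, even where plain substring operations would give the same result.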
Before you contribute your program, diff its output for this 10KB input file (generated with the fasta program, N = 1000) against this output file to check that the output has the correct format.
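Where a command-line diff tool is not at hand, the same check can be sketched with Python's standard `difflib` module. The helper name `output_diff` is hypothetical; it takes the two outputs as strings, so reading the files is left to the caller.

```python
import difflib


def output_diff(actual, expected):
    """Return unified-diff lines; an empty list means the outputs match."""
    return list(difflib.unified_diff(
        expected.splitlines(), actual.splitlines(),
        fromfile="expected", tofile="actual", lineterm=""))
```

An empty result means the program output matches the expected output line for line; any other result lists the differing lines.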
Generate a larger input file (using one of the fasta programs with command line arguments: 5000000 > input5000000.txt) to check program performance.