Some language implementations have regex built-in; some provide a regex library; some use a third-party regex library.
The regex algorithm is very likely to differ from one library to another.
The work is to use the same simple regex patterns and actions to manipulate FASTA format data. Don't optimize away the work.
How to implement
We ask that contributed programs not only give the correct result, but also use the same algorithm to calculate that result.
Each program should:
read all of a redirected FASTA format file from stdin, and record the sequence length
use the same simple regex pattern match-replace to remove FASTA sequence descriptions and all linefeed characters, and record the sequence length
use the same simple regex patterns -
agggtaaa|tttaccct
[cgt]gggtaaa|tttaccc[acg]
a[act]ggtaaa|tttacc[agt]t
ag[act]gtaaa|tttac[agt]ct
agg[act]taaa|ttta[agt]cct
aggg[acg]aaa|ttt[cgt]ccct
agggt[cgt]aa|tt[acg]accct
agggta[cgt]a|t[acg]taccct
agggtaa[cgt]|[acg]ttaccct
- representing DNA 8-mers and their reverse complement (with a wildcard in one position), and (one pattern at a time) count matches in the redirected file
write the regex pattern and count
use the same magic regex patterns -
tHa[Nt]
aND|caN|Ha[DS]|WaS
a[NSt]|BY
<[^>]*>
\\|[^|][^|]*\\|
- to (one pattern at a time, in the same order) match-replace the pattern in the redirected file with -
<4>
<3>
<2>
|
-
- and record the sequence length
write the 3 recorded sequence lengths
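The steps above can be sketched in Python with the standard `re` module. This is a minimal illustration, not a tuned contribution. Several details are assumptions rather than part of the task text: the function names `regex_redux`, `format_report` and `main` are hypothetical, the clean-up pattern `>.*\n|\n` is one plausible way to remove descriptions and linefeeds, the doubled backslash in the last magic pattern is read as string-literal escaping for a literal `|`, and the exact output layout (pattern, space, count, then a blank line before the three lengths) should be checked against the reference output file.

```python
import re
import sys

# The nine 8-mer patterns, counted one pattern at a time.
VARIANTS = [
    "agggtaaa|tttaccct",
    "[cgt]gggtaaa|tttaccc[acg]",
    "a[act]ggtaaa|tttacc[agt]t",
    "ag[act]gtaaa|tttac[agt]ct",
    "agg[act]taaa|ttta[agt]cct",
    "aggg[acg]aaa|ttt[cgt]ccct",
    "agggt[cgt]aa|tt[acg]accct",
    "agggta[cgt]a|t[acg]taccct",
    "agggtaa[cgt]|[acg]ttaccct",
]

# The magic patterns and their replacements, applied in this order.
# Assumption: "\\|" in the task text denotes the regex \| (a literal pipe).
SUBSTITUTIONS = [
    (r"tHa[Nt]", "<4>"),
    (r"aND|caN|Ha[DS]|WaS", "<3>"),
    (r"a[NSt]|BY", "<2>"),
    (r"<[^>]*>", "|"),
    (r"\|[^|][^|]*\|", "-"),
]


def regex_redux(text):
    """Return (initial length, cleaned length, pattern counts, final length)."""
    initial_length = len(text)

    # Assumed clean-up pattern: drop each ">description" line and every linefeed.
    sequence = re.sub(r">.*\n|\n", "", text)
    cleaned_length = len(sequence)

    # Count non-overlapping matches, one pattern at a time.
    counts = [(p, len(re.findall(p, sequence))) for p in VARIANTS]

    # Match-replace, one pattern at a time, in the listed order.
    for pattern, replacement in SUBSTITUTIONS:
        sequence = re.sub(pattern, replacement, sequence)

    return initial_length, cleaned_length, counts, len(sequence)


def format_report(initial_length, cleaned_length, counts, final_length):
    """Build the report text; the layout here is an assumption."""
    lines = [f"{pattern} {count}" for pattern, count in counts]
    lines += ["", str(initial_length), str(cleaned_length), str(final_length)]
    return "\n".join(lines) + "\n"


def main():
    # Read all of the redirected input from stdin before doing any matching.
    sys.stdout.write(format_report(*regex_redux(sys.stdin.read())))
```

Calling `main()` from a script run as `python3 program.py < input.txt` (a hypothetical file name) would print the report. Each pattern is applied as a separate pass over the sequence, which keeps the sketch in line with the "one pattern at a time" rule rather than merging patterns into a single alternation.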
Before you contribute your program, diff its output for the 10KB input file (generated with the fasta program, N = 1000) against the reference output file to check that the output has the correct format.
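Where a command-line diff tool is not at hand, the same check can be sketched with Python's standard `difflib` module. The helper name `diff_output` and the labels `expected` and `actual` are hypothetical; the two arguments are assumed to be the full text of the reference output file and of your program's output.

```python
import difflib


def diff_output(expected_text, actual_text):
    """Return unified-diff lines; an empty list means the outputs match."""
    return list(
        difflib.unified_diff(
            expected_text.splitlines(keepends=True),
            actual_text.splitlines(keepends=True),
            fromfile="expected",
            tofile="actual",
        )
    )
```

An empty result suggests the format is correct; any returned lines show where the program output departs from the reference.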
Generate a larger input file (using one of the fasta programs with command line arguments: 5000000 > input5000000.txt) to check program performance.
Thanks to Jeremy Zerfas for insisting that the programs follow the "one pattern at a time" guideline, and developing the magic regex patterns. Thanks to Matt Brubeck for the good enough magic regex pattern.