Some language implementations have hash tables built-in; some provide a hash table as part of a collections library; some use a third-party hash table library. (For example, use either khash or CK_HT for C language k-nucleotide programs.) The hash table algorithm implemented is likely to be different in different libraries.
Please don't implement your own custom "hash table" - it will not be accepted.
The work is to use the built-in or library hash table implementation to accumulate count values - lookup the count for a key and update the count in the hash table. Don't optimize away the work.
Mapping the DNA letters to the bytes 0, 1, 2, 3, and using a hash function that concatenates those two-byte codes is an acceptable (but not a required) optimization.
How to implement
We ask that contributed programs not only give the correct result, but also use the same algorithm to calculate that result.
Each program should:
read line-by-line a redirected FASTA format file from stdin
extract DNA sequence THREE
define a procedure/function to update a hashtable of k-nucleotide keys and count values, for a particular reading-frame — even though we'll combine k-nucleotide counts for all reading-frames (grow the hashtable from a small default size)
use that procedure/function and hashtable to
count all the 1-nucleotide and 2-nucleotide sequences, and write the code and percentage frequency, sorted by descending frequency and then ascending k-nucleotide key
count all the 3- 4- 6- 12- and 18-nucleotide sequences, and write the count and code for the specific sequences GGT GGTA GGTATT GGTATTTTAATT GGTATTTTAATTTATAGT
diff program output for this 10KB input file (generated with the fasta program N = 1000) with this output file to check your program output has the correct format, before you contribute your program.
Generate a larger input file (using one of the fasta programs with command line arguments: 25000000 > input25000000.txt) to check program performance.