The Computer Language Benchmarks Game

regex-dna Python 3 #5 program

source code

# The Computer Language Benchmarks Game
# http://shootout.alioth.debian.org/
# contributed by Dominique Wahli
# 2to3
# modified by Justin Peel

from sys import stdin, stdout
from re import sub, findall

def main():
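    # Work on raw bytes throughout, so no text decoding or encoding is needed.
    seq = stdin.buffer.read()
    write = stdout.buffer.write
    ilen = len(seq)  # input length, including headers and newlines

    # Strip FASTA header lines (">...") and all newlines.
    seq = sub(b'>.*\n|\n', b'', seq)
    clen = len(seq)  # length of the cleaned sequence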

    variants = (
          b'agggtaaa|tttaccct',
          b'[cgt]gggtaaa|tttaccc[acg]',
          b'a[act]ggtaaa|tttacc[agt]t',
          b'ag[act]gtaaa|tttac[agt]ct',
          b'agg[act]taaa|ttta[agt]cct',
          b'aggg[acg]aaa|ttt[cgt]ccct',
          b'agggt[cgt]aa|tt[acg]accct',
          b'agggta[cgt]a|t[acg]taccct',
          b'agggtaa[cgt]|[acg]ttaccct')
    # Count the non-overlapping matches of each variant pattern.
    for f in variants:
        write(f + b' ' + bytes(str(len(findall(f, seq))), encoding='latin1') + b'\n')

    subst = {
          b'B' : b'(c|g|t)', b'D' : b'(a|g|t)',   b'H' : b'(a|c|t)', b'K' : b'(g|t)',
          b'M' : b'(a|c)',   b'N' : b'(a|c|g|t)', b'R' : b'(a|g)',   b'S' : b'(c|g)',
          b'V' : b'(a|c|g)', b'W' : b'(a|t)',     b'Y' : b'(c|t)'}
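    # Expand each IUPAC ambiguity code into an alternation of the bases it stands for.
    for f, r in subst.items():
        seq = sub(f, r, seq)
    # Report the input length, the cleaned length, and the length after substitution.
    write(b'\n')
    write(bytes(str(ilen), encoding='latin1') + b'\n')
    write(bytes(str(clen), encoding='latin1') + b'\n')
    write(bytes(str(len(seq)), encoding='latin1') + b'\n')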

main()
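
The cleaning and counting steps of the program above can be tried on a small in-memory input. The following is a minimal sketch, not part of the benchmark entry: the SAMPLE data and the helper names clean and count are invented for illustration, and only the two regular-expression calls mirror the program.

```python
from re import findall, sub

# Hypothetical miniature FASTA input, invented for illustration only.
SAMPLE = (b">ONE sample header\n"
          b"agggtaaacc\n"
          b"tttaccctgg\n"
          b">TWO another header\n"
          b"cagggtaaa\n")

def clean(data):
    """Drop FASTA header lines and all newlines, like the program's first sub()."""
    return sub(b'>.*\n|\n', b'', data)

def count(pattern, seq):
    """Count the non-overlapping matches of one variant pattern."""
    return len(findall(pattern, seq))

seq = clean(SAMPLE)
print(len(SAMPLE), len(seq))               # lengths before and after cleaning
print(seq)                                 # the three sequence lines, joined
print(count(b'agggtaaa|tttaccct', seq))    # exact 8-mer or its reverse complement
print(count(b'[cgt]gggtaaa|tttaccc[acg]', seq))
```

Because the header pattern is tried first at each position, a whole ">..." line is removed in one step, and any remaining newline is removed by the second alternative. Joining the lines before counting lets a pattern match across what were line boundaries in the input.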
    

notes, command-line, and program output

NOTES:
32-bit Ubuntu one core
Python 3.5.0 (default, Sep 14 2015, 09:36:50) 
[GCC 4.9.2] on linux


Tue, 15 Sep 2015 07:11:09 GMT

MAKE:
mv regexdna.python3-5.python3 regexdna.python3-5.py
0.03s to complete and log all make actions

COMMAND LINE:
/usr/local/src/Python-3.5.0/bin/python3.5 regexdna.python3-5.py 0 < regexdna-input5000000.txt

PROGRAM OUTPUT:
agggtaaa|tttaccct 356
[cgt]gggtaaa|tttaccc[acg] 1250
a[act]ggtaaa|tttacc[agt]t 4252
ag[act]gtaaa|tttac[agt]ct 2894
agg[act]taaa|ttta[agt]cct 5435
aggg[acg]aaa|ttt[cgt]ccct 1537
agggt[cgt]aa|tt[acg]accct 1431
agggta[cgt]a|t[acg]taccct 1608
agggtaa[cgt]|[acg]ttaccct 2178

50833411
50000000
66800214
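
The three trailing numbers are the input length, the length after headers and newlines are removed, and the length after the IUPAC substitutions. The sketch below relates them, assuming a made-up sample sequence and a hypothetical helper named expand. The substitution table is copied from the program, and the three totals are the ones printed above.

```python
from re import sub

# Substitution table copied from the program above.
SUBST = {
    b'B': b'(c|g|t)', b'D': b'(a|g|t)',   b'H': b'(a|c|t)', b'K': b'(g|t)',
    b'M': b'(a|c)',   b'N': b'(a|c|g|t)', b'R': b'(a|g)',   b'S': b'(c|g)',
    b'V': b'(a|c|g)', b'W': b'(a|t)',     b'Y': b'(c|t)'}

def expand(seq):
    """Replace each ambiguity code with its alternation, one pass per code."""
    for code, alternation in SUBST.items():
        seq = sub(code, alternation, seq)
    return seq

# Hypothetical sample, invented for illustration only.
sample = b'acgtNacgtBK'
expanded = expand(sample)
print(expanded, len(sample), len(expanded))

# Totals taken from the program output above.
ilen, clen, slen = 50833411, 50000000, 66800214
removed = ilen - clen   # bytes that were headers or newlines
added = slen - clen     # extra bytes introduced by the substitutions
print(removed, added)
```

The order of the passes does not affect the result here, since every pattern is an uppercase letter and every replacement contains only lowercase letters and punctuation, so no pass can create text that a later pass would match.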