Need a specific sort of string modification. Can someone help?

Chris Angelico rosuav at gmail.com
Sat Jan 5 10:57:29 EST 2013


On Sun, Jan 6, 2013 at 2:38 AM, Roy Smith <roy at panix.com> wrote:
> In article <mailman.121.1357398573.2939.python-list at python.org>,
>  Chris Angelico <rosuav at gmail.com> wrote:
>
>> it may or may not run faster than the explicit state machine,
>
> You got me by a factor of 3 or 4.  Not bad.

You miss my point, though. I went for simple Pythonic code and never
measured its performance, on the expectation that it would be "good
enough". Written in C, the state machine is probably WAY faster than
splitting and then iterating. My C++ MUD client uses code similar to
that to parse TELNET and ANSI codes from a stream of bytes on a
socket (one of its "states" is that there's no more data available,
so wait on the socket); the rewrite in a high-level language splits
the string on "\xFF" for TELNET and "\x1B" for ANSI, handles each
separately, and afterward splits on "\n" to divide the result into
lines. The code's much less convoluted, it's easier to test the
pieces in isolation (I can simply call the ANSI parser with a block
of text), and on a modern computer you can't see the performance
difference, since you spend most of your time waiting for socket data
anyway.
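
To make that concrete, here's a minimal sketch of the split-based
approach in Python 3 bytes. This is not the actual client code, and
it makes two simplifying assumptions a real client can't: every IAC
is followed by exactly two bytes (no subnegotiation), and every
escape is a CSI sequence terminated by a letter.

    def strip_telnet(data):
        # b"\xff" (IAC) introduces a TELNET command; drop the two
        # bytes that follow each one and keep everything else.
        chunks = data.split(b"\xff")
        return chunks[0] + b"".join(c[2:] for c in chunks[1:])

    def strip_ansi(data):
        # b"\x1b" starts an escape sequence; skip ahead to the first
        # letter, which terminates a CSI sequence, keep the rest.
        chunks = data.split(b"\x1b")
        out = [chunks[0]]
        for c in chunks[1:]:
            for i, byte in enumerate(c):
                if chr(byte).isalpha():
                    out.append(c[i + 1:])
                    break
            # if no letter was found, the sequence is incomplete; a
            # real client would buffer it until more data arrives
        return b"".join(out)

    raw = b"\xff\xfd\x01\x1b[1mHello\x1b[0m\nWorld\n"
    lines = strip_ansi(strip_telnet(raw)).split(b"\n")
    # -> [b'Hello', b'World', b'']

Each stage is a plain function over bytes, which is exactly what makes
it easy to test in isolation.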

But it's gratifying to know that the obvious and brief way to do
things is fast too :)

>> but IMHO it's a lot clearer to read a split() than something
>> that changes state when a particular character is found.
>
> Maybe.  But being familiar with state machines is still a handy skill.
> DNA sequence analysis has lots of problems like "find a start codon
> which is within about 50 bases of a binding site, and then copy
> everything up until you find a stop codon".  Things like that often map
> well to state machines.  Especially if you're trying to do it in
> parallel in all three reading frames.

Sure. And if you're working with petabytes of data, these
considerations become fairly important. When that happens, you start
rewriting your algorithms in C, or using Cython, or something; at the
very least, you start rewriting clear and simple algorithms into more
complex ones. But all this happens *after* the code has been tested
and proven. Each rewrite can be verified against its reference
implementation, and you can test one piece at a time as you change
it. It's ever so much easier to work that way.
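
Roy's codon example, for instance, maps onto just three states. A
rough sketch of what that machine might look like (the binding-site
motif, the 50-base window, and scanning a single reading frame are
all simplifying assumptions here, not real pipeline code):

    START, STOPS = "ATG", {"TAA", "TAG", "TGA"}

    def extract_orf(seq, site="TATAAT", window=50):
        state, copied, pos, limit = "seek_site", [], 0, 0
        while pos < len(seq) - 2:
            if state == "seek_site":
                hit = seq.find(site, pos)
                if hit == -1:
                    return None
                pos = hit + len(site)
                limit = pos + window
                state = "seek_start"
            elif state == "seek_start":
                if pos > limit:
                    state = "seek_site"   # no start codon in window
                elif seq[pos:pos + 3] == START:
                    state = "copying"
                else:
                    pos += 1
            else:  # copying: accumulate codons until a stop codon
                codon = seq[pos:pos + 3]
                if codon in STOPS:
                    return "".join(copied)
                copied.append(codon)
                pos += 3
        return None

    print(extract_orf("GGTATAATCCATGAAATTTTAA"))  # -> ATGAAATTT

Written that way, it's also the sort of function you can exercise
with tiny hand-made sequences before ever pointing it at real data.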

<anecdote>At work, we had one employee whose code was, shall we say,
less than stellar. At one point he embarked on a months-long rewrite
of one of his modules; meanwhile, I was unable to adequately test the
code of mine that called into it. Once the rewrite was finally
complete, I discovered myriad bugs in my own code, ones that would
have been found and fixed instantly if I'd had even a slow version of
his module to work against. Starting with something that runs,
however slowly, helps enormously, because debugging doesn't demand
mega-TPS throughput.</anecdote>

ChrisA


