Regex driving me crazy...

Patrick Maupin pmaupin at gmail.com
Wed Apr 7 23:04:58 EDT 2010


On Apr 7, 9:51 pm, Steven D'Aprano
<ste... at REMOVE.THIS.cybersource.com.au> wrote:
> On Wed, 07 Apr 2010 18:03:47 -0700, Patrick Maupin wrote:
> > BTW, although I find it annoying when people say "don't do that" when
> > "that" is a perfectly good thing to do, and although I also find it
> > annoying when people tell you what not to do without telling you what
> > *to* do,
>
> Grant did give a perfectly good solution.

Yeah, I noticed that later and apologized for it.  What he gave will
work perfectly if the only data that changes the number of words is the
data the OP is looking for.  That may or may not be true.  I don't know
anything about the program generating the data, but I did notice that
the OP's own attempted answer indicated he felt (rightly or wrongly)
that he needed to split on two or more spaces.
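
To make the ambiguity concrete: a field like "Completed without error"
contains single spaces of its own, so splitting on whitespace breaks it
into separate words, while splitting on runs of two or more spaces
(which is what the OP seemed to be after) keeps each column in one
piece.  A quick sketch, using the sample line from the thread:

    import re

    line = '# 1  Short offline       Completed without error       00%'

    # Splitting on any whitespace fragments the multi-word columns.
    print(line.split())

    # Splitting on runs of two or more spaces keeps each column intact.
    print(re.split(' {2,}', line))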

>
> > and although I find the regex solution to this problem to be
> > quite clean, the equivalent non-regex solution is not terrible, so I
> > will present it as well, for your viewing pleasure:
>
> > >>> [x for x in '# 1  Short offline       Completed without error       00%'.split('  ') if x.strip()]
> > ['# 1', 'Short offline', ' Completed without error', ' 00%']
>
> This is one of the reasons we're so often suspicious of re solutions:
>
> >>> s = '# 1  Short offline       Completed without error       00%'
> >>> tre = Timer("re.split(' {2,}', s)",
> ... "import re; from __main__ import s")
> >>> tsplit = Timer("[x for x in s.split('  ') if x.strip()]",
> ... "from __main__ import s")
>
> >>> re.split(' {2,}', s) == [x for x in s.split('  ') if x.strip()]
> True
>
> >>> min(tre.repeat(repeat=5))
> 6.1224789619445801
> >>> min(tsplit.repeat(repeat=5))
> 1.8338048458099365
>
> Even when they are correct and not unreadable line-noise, regexes tend to
> be slow. And they get worse as the size of the input increases:
>
> >>> s *= 1000
> >>> min(tre.repeat(repeat=5, number=1000))
> 2.3496899604797363
> >>> min(tsplit.repeat(repeat=5, number=1000))
> 0.41538596153259277
>
> >>> s *= 10
> >>> min(tre.repeat(repeat=5, number=1000))
> 23.739185094833374
> >>> min(tsplit.repeat(repeat=5, number=1000))
> 4.6444299221038818
>
> And this isn't even one of the pathological O(N**2) or O(2**N) regexes.
>
> Don't get me wrong -- regexes are a useful tool. But if your first
> instinct is to write a regex, you're doing it wrong.
>
>     [quote]
>     A related problem is Perl's over-reliance on regular expressions
>     that is exaggerated by advocating regex-based solution in almost
>     all O'Reilly books. The latter until recently were the most
>     authoritative source of published information about Perl.
>
>     While simple regular expression is a beautiful thing and can
>     simplify operations with string considerably, overcomplexity in
>     regular expressions is extremly dangerous: it cannot serve a basis
>     for serious, professional programming, it is fraught with pitfalls,
>     a big semantic mess as a result of outgrowing its primary purpose.
>     Diagnostic for errors in regular expressions is even weaker then
>     for the language itself and here many things are just go unnoticed.
>     [end quote]
>
> http://www.softpanorama.org/Scripting/Perlbook/Ch01/place_of_perl_among_other_lang.shtml
>
> Even Larry Wall has criticised Perl's regex culture:
>
> http://dev.perl.org/perl6/doc/design/apo/A05.html

Bravo!!! Good data, quotes, references, all good stuff!
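
For anyone who wants to re-run the comparison locally, here is a
self-contained version of Steven's setup, plus one extra variant (my
addition) that hoists re.compile out of the timed statement, just to
see how much of the gap is pattern-cache lookup.  Timings will
obviously vary by machine:

    import re
    from timeit import Timer

    s = '# 1  Short offline       Completed without error       00%'

    tre = Timer("re.split(' {2,}', s)",
                "import re; from __main__ import s")
    tcompiled = Timer("pat.split(s)",
                      "import re; from __main__ import s; "
                      "pat = re.compile(' {2,}')")
    tsplit = Timer("[x for x in s.split('  ') if x.strip()]",
                   "from __main__ import s")

    for name, t in [('re.split', tre), ('precompiled', tcompiled),
                    ('str.split', tsplit)]:
        print('%s: %s' % (name, min(t.repeat(repeat=5, number=100000))))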

I absolutely agree that a regex shouldn't always be the first thing you
reach for, but lately I've been reading way too much unsubstantiated
"this is bad.  Don't do it." on the subject.  In particular, people
keep saying "Don't use regex.  Use PyParsing!"  That may be good advice
in the right context, but it's a bit disingenuous not to mention that
PyParsing uses regex under the covers...
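
To be clear about what I mean by "under the covers": pyparsing's Regex
class is an explicit wrapper around the re module, and as I understand
it other building blocks (Word, etc.) compile re patterns internally
when they can.  Here's a rough pyparsing sketch of the same split -- the
four-column layout is just my assumption, not the OP's actual format:

    from pyparsing import Regex

    field = Regex(r'\S+(?: \S+)*')       # words separated by single spaces
    row = field + field + field + field  # assume four columns per line

    line = '# 1  Short offline       Completed without error       00%'
    print(row.parseString(line))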

Regards,
Pat



