[Python-ideas] This seems like a wart to me...

Fri Dec 12 08:51:23 CET 2008

Stephen J. Turnbull wrote:

> I don't understand this point of view at all.  True, regexps are a
> complex subject, with an unfortunately large number of dialects.  Is
> it the confusion of dialects problem, or do you really never use
> regexps in any language?

I have half-heartedly tried to learn regexps before, but always given  
up after reading about the basics. Obviously, this would be shameless  
behavior for a professional programmer, but I'm just a dilettante, and  
the famed saying of Jamie Zawinski ("Some people, when confronted with  
a problem, think 'I know, I'll use regular expressions.'  Now they  
have two problems.") is not highly motivating. :-D

> Anyway, for this purpose you only have to learn one idiom, that
>
>    longstring.splitonchars (["x", "y", "z"])
>
> is spelled
>
>    import re
>    re.split ("[xyz]", longstring)
>
> In fact, I personally would like to deprecate the with-argument
> implementation of string.split(), and have
>
>    def split (self, delimiter = None):
>        if delimiter is None:
>            return self.usual_magic_splitting ()
>        else:
>            import re
>            return re.split (delimiter, self)
>
> (of course, that's because that's precisely the way split-string works
> in Emacs).
>
> Then the idiom would be
>
>    longstring.split ("[xyz]")
>
> Would that work for you?

Wouldn't that subtly break the code of everyone who has written  
something like:

lines = bigtext.splitlines()
delimiter = lines[0]
del lines[0]
splitlines = [line.split(delimiter) for line in lines]

? Since suddenly if your delimiter uses one of the reserved regexp  
characters, such as brackets and parentheses, the code would stop  
working. (That's one of the things I dislike about regexps -- too many  
magical characters.)
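
To make the breakage concrete, here is a small sketch using today's re.split as a stand-in for the proposed regexp-based split. The sample row and the delimiters are made up for illustration; only re.split, re.escape and str.split are real library calls.

```python
import re

row = "a.b.c"
delimiter = "."  # hypothetical delimiter, e.g. read from the first line of a file

# Today: the delimiter is taken literally.
literal = row.split(delimiter)

# Regexp-based: "." matches every character, so only empty strings remain.
as_regex = re.split(delimiter, row)

# Escaping the delimiter restores the literal meaning.
escaped = re.split(re.escape(delimiter), row)

# An unbalanced parenthesis is not even a valid pattern.
try:
    re.split("(", "a(b(c")
    paren_accepted = True
except re.error:
    paren_accepted = False
```

So existing callers would have to wrap every delimiter in re.escape to keep their current results.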

Here's a backward compatible idea instead:

    def split (self, delimiter = None):
        if delimiter is None:
            return self.usual_magic_splitting ()
        elif isinstance(delimiter, str):
            return self.usual_delimiter_based_splitting()
        elif isinstance(delimiter, Sequence):
            return self.treat_delimiters_given_by_sequence_as_interchangable()
        else:
            raise TypeError("coercing to Unicode: need string or buffer or Sequence, "
                            + repr(type(delimiter)) + " found")

Since right now passing a list or tuple raises a TypeError, this would  
be backwards compatible. The idiom for doing re.split-like things  
would then be bigtext.split(list(" ;.,-!?")). It might even be a good
idea to add a keyword (only?) argument called "dropempty" to recreate the
magical behavior of passing None as the delimiter, where empty strings
are dropped. That would also solve skip's original problem: just call
text.split(None, dropempty=False).
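
A rough sketch of how this could behave, written as a free function because str itself cannot be patched. Everything here is an assumption made for illustration: the function name, the default for dropempty, the choice to let every single whitespace character act as a separator when dropempty is false, and the use of re.escape internally to treat the delimiters in a sequence as interchangeable.

```python
import re
from collections.abc import Sequence


def split(text, delimiter=None, dropempty=None):
    """Hypothetical stand-in for the proposed str.split."""
    if dropempty is None:
        # Assumption: keep today's defaults, where only the
        # no-delimiter form drops empty strings.
        dropempty = delimiter is None

    if delimiter is None:
        if dropempty:
            return text.split()
        # Assumption: without dropping, each whitespace character separates.
        return re.split(r"\s", text)
    elif isinstance(delimiter, str):
        pieces = text.split(delimiter)
    elif isinstance(delimiter, Sequence):
        if not delimiter or not all(isinstance(d, str) and d for d in delimiter):
            raise ValueError("need a non-empty sequence of non-empty strings")
        # Escape each delimiter so regexp metacharacters stay literal;
        # try longer delimiters first so "ab" wins over "a".
        ordered = sorted(delimiter, key=len, reverse=True)
        pattern = "|".join(re.escape(d) for d in ordered)
        pieces = re.split(pattern, text)
    else:
        raise TypeError("need string or Sequence, %r found" % type(delimiter))

    return [p for p in pieces if p] if dropempty else pieces
```

With this sketch, split("a;b,c", list(";,")) treats both characters as delimiters, a delimiter such as "(" stays literal, and split("a b  c", None, dropempty=False) keeps the empty string between the two spaces.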

-- Carl