[Python-ideas] This seems like a wart to me...
Carl Johnson
carl at carlsensei.com
Fri Dec 12 08:51:23 CET 2008
Stephen J. Turnbull wrote:
> I don't understand this point of view at all. True, regexps are a
> complex subject, with an unfortunately large number of dialects. Is
> it the confusion of dialects problem, or do you really never use
> regexps in any language?
I have half-heartedly tried to learn regexps before, but always given
up after reading about the basics. Obviously, this would be shameless
behavior for a professional programmer, but I'm just a dilettante, and
the famed saying of Jamie Zawinski ("Some people, when confronted with
a problem, think 'I know, I'll use regular expressions.' Now they
have two problems.") is not highly motivating. :-D
> Anyway, for this purpose you only have to learn one idiom, that
>
> longstring.splitonchars (["x", "y", "z"])
>
> is spelled
>
> import re
> re.split ("[xyz]", longstring)
>
> In fact, I personally would like to deprecate the with-argument
> implementation of string.split(), and have
>
> def split (self, delimiter = None):
>     if delimiter is None:
>         return self.usual_magic_splitting ()
>     else:
>         import re
>         return re.split (delimiter, self)
>
> (of course, that's because that's precisely the way split-string works
> in Emacs).
>
> Then the idiom would be
>
> longstring.split ("[xyz]")
>
> Would that work for you?
Wouldn't that subtly break the code of everyone who has written
something like this?
    lines = bigtext.splitlines()
    delimiter = lines[0]
    del lines[0]
    splitlines = [line.split(delimiter) for line in lines]
If the delimiter happens to contain one of the reserved regexp
characters, such as brackets or parentheses, that code would suddenly
stop working. (That's one of the things I dislike about regexps -- too many
magical characters.)
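To make the breakage concrete, here is a small sketch using today's str.split and re.split. The sample strings are made up for illustration; re.split stands in for what the proposed regexp-based split would do with the same delimiter.

```python
import re

line = "a(b(c"
delimiter = "("

# str.split treats the delimiter as literal text.
literal_parts = line.split(delimiter)

# Handing the same delimiter to re.split fails, because "(" opens a group.
try:
    re.split(delimiter, line)
    regex_error = None
except re.error as exc:
    regex_error = exc

# A "." delimiter is arguably worse: no error, just a different answer,
# since "." matches every character.
dotted_parts = re.split(".", "1.2.3")

# re.escape restores the literal meaning of the delimiter.
escaped_parts = re.split(re.escape("."), "1.2.3")
```

So existing callers would have to wrap every delimiter in re.escape to keep their current behavior.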
Here's a backward compatible idea instead:
    from collections.abc import Sequence
    def split (self, delimiter = None):
        if delimiter is None:
            return self.usual_magic_splitting ()
        elif isinstance(delimiter, str):
            return self.usual_delimiter_based_splitting(delimiter)
        elif isinstance(delimiter, Sequence):
            return self.treat_delimiters_given_by_sequence_as_interchangeable(delimiter)
        else:
            raise TypeError("coercing to Unicode: need string or buffer "
                            "or Sequence, " + repr(type(delimiter)) + " found")
Since right now passing a list or tuple raises a TypeError, this would
be backwards compatible. The idiom for doing re.split-like things
would then be bigtext.split(list(" ;.,-!?")). It might even be a good
idea to add a keyword (only?) argument called "dropempty" to recreate the
magical behavior of passing None as the delimiter, where empty strings
are dropped. That would also solve skip's original problem: just call
text.split(None, dropempty=False).
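As a rough sketch only, the idea could be prototyped as a free function. The name split_any is hypothetical, and so are two choices the proposal leaves open: dropempty defaults to True only when the delimiter is None, and the sequence case is implemented with escaped regexp alternatives.

```python
import re
from collections.abc import Sequence


def split_any(text, delimiter=None, dropempty=None):
    """Hypothetical prototype of the proposed str.split behavior."""
    # Assumption: empty strings are dropped by default only for None,
    # which mirrors what str.split() does today.
    if dropempty is None:
        dropempty = delimiter is None

    if delimiter is None:
        # Split on each single whitespace character; dropping the empty
        # pieces afterwards gives roughly what text.split() returns.
        parts = re.split(r"\s", text)
    elif isinstance(delimiter, str):
        # Ordinary literal splitting, unchanged from today.
        parts = text.split(delimiter)
    elif isinstance(delimiter, Sequence):
        if not delimiter or not all(isinstance(d, str) and d for d in delimiter):
            raise ValueError("need a non-empty sequence of non-empty strings")
        # Escape every delimiter so none is read as regexp syntax, and try
        # longer delimiters first so "ab" wins over "a".
        alternatives = sorted(delimiter, key=len, reverse=True)
        pattern = "|".join(re.escape(d) for d in alternatives)
        parts = re.split(pattern, text)
    else:
        raise TypeError("need string or Sequence, "
                        + repr(type(delimiter)) + " found")

    if dropempty:
        parts = [p for p in parts if p]
    return parts
```

With this sketch, split_any("one;two,three", list(";,")) treats both characters as interchangeable delimiters, and split_any(text, None, dropempty=False) keeps the empty strings between consecutive whitespace characters.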
-- Carl