[Python-ideas] Give regex operations more sugar

Wed Jun 13 07:06:09 EDT 2018

Hi all,

Regexes are really useful in many places, and to me it's sad that the
builtin "re" module requires the source string to be passed as an
argument. It would be much more elegant to simply write "s.search(pattern)"
than "re.search(pattern, s)".
I suggest building all regex operations into the str class itself, as well
as a new syntax for regular expressions.

Thus a "findall" for any lowercase letter in a string would look like this:

    >>> "1a3c5e7g9i".findall(!%[a-z]%)
    ['a', 'c', 'e', 'g', 'i']

A "findall" for any letter, case insensitive:

    >>> "1A3c5E7g9I".findall(!%[a-z]%i)
    ['A', 'c', 'E', 'g', 'I']

A substitution replacing every lowercase letter with the string " WOOF WOOF ":

    >>> "1a3c5e7g9i".sub(!%[a-z]% WOOF WOOF %)
    '1 WOOF WOOF 3 WOOF WOOF 5 WOOF WOOF 7 WOOF WOOF 9 WOOF WOOF '

A substitution replacing every letter, case insensitive, with the string "hovercraft":

    >>> "1A3c5E7g9I".sub(!%[a-z]%hovercraft%i)
    '1hovercraft3hovercraft5hovercraft7hovercraft9hovercraft'
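
For comparison, here is how the four examples above are written with the
existing "re" module today. The variable names are only for illustration;
the calls themselves are the standard re.findall and re.sub.

```python
import re

# findall for any lowercase letter
lower = re.findall(r"[a-z]", "1a3c5e7g9i")

# findall for any letter, case insensitive
any_case = re.findall(r"[a-z]", "1A3c5E7g9I", flags=re.IGNORECASE)

# replace every lowercase letter with " WOOF WOOF "
woof = re.sub(r"[a-z]", " WOOF WOOF ", "1a3c5e7g9i")

# replace every letter, case insensitive, with "hovercraft"
hover = re.sub(r"[a-z]", "hovercraft", "1A3c5E7g9I", flags=re.IGNORECASE)

print(lower)
print(any_case)
print(woof)
print(hover)
```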

You may wonder why I chose "!%" ... "%" [ ... "%" ] as the regex delimiters.
The choice of "%" was purely arbitrary; I just thought of it since PHP
allows "%" as a delimiter around regex patterns. The "!" is in front to
disambiguate it from the "%" modulo operator and the "%" string formatting
operator, and because "!" is currently not used on its own in Python (it
appears only as part of "!=" and in format conversions such as "!r").
Another potential idea is to simply use "!" to denote the start of a regex,
and use the character immediately following it to delimit the regex. Thus
all of the following would be regexes matching a single lowercase letter:

    !%[a-z]%
    !#[a-z]#
    !?[a-z]?
    !/[a-z]/

And all of the following would be substitution regexes replacing a single
case-insensitive letter with "@":

    !%[a-z]%@%i
    !#[a-z]#@#i
    !?[a-z]?@?i
    !/[a-z]/@/i
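
To make the "any delimiter after !" idea concrete, here is a rough sketch
of how such a literal could be split into its parts. It works on an ordinary
string, since the syntax does not exist. The function name
parse_regex_literal and the FLAG_LETTERS table are made up for this
illustration, only the "i" flag from the examples above is handled, and
there is no way to escape the delimiter inside the pattern.

```python
import re

# Hypothetical mapping from flag letters to re flags; only "i" appears
# in the proposal, so only "i" is supported here.
FLAG_LETTERS = {"i": re.IGNORECASE}


def parse_regex_literal(text):
    """Split '!<d>pattern<d>[replacement<d>]flags' into its parts.

    Returns (compiled_pattern, replacement), where replacement is None
    for a plain matching regex. Escaped delimiters are not supported.
    """
    if len(text) < 3 or text[0] != "!":
        raise ValueError("expected '!' followed by a delimiter")
    delim = text[1]
    parts = text[2:].split(delim)
    if len(parts) == 2:
        pattern, flag_text = parts
        replacement = None
    elif len(parts) == 3:
        pattern, replacement, flag_text = parts
    else:
        raise ValueError("wrong number of delimiters in %r" % text)
    flags = 0
    for letter in flag_text:
        if letter not in FLAG_LETTERS:
            raise ValueError("unknown flag %r" % letter)
        flags |= FLAG_LETTERS[letter]
    return re.compile(pattern, flags), replacement


regex, repl = parse_regex_literal("!/[a-z]/@/i")
print(regex.sub(repl, "1A3c5E7g9I"))
```

One consequence this sketch makes visible: with "?" as the delimiter, the
pattern itself cannot contain "?" unless some escaping rule is added.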

Some examples of how to use this:

    >>> "pneumonoultramicroscopicsilicovolcanokoniosis".findall(!%[aeiou]+%)
    ['eu', 'o', 'ou', 'a', 'i', 'o', 'o', 'i', 'i', 'i', 'o', 'o', 'a',
     'o', 'o', 'io', 'i']
    >>> "GMzKqtnnyGdqIQNlQSLidbDlqpdhoRbHrrUAgyhMgkZKYVhQuI".search(!%[^A-Z][A-Z]{3}([a-z])[A-Z]{3}[^A-Z]%)
    <regex_match; span=(11, 20); match='qIQNlQSLi'>
    >>> "My name is Joanne.".findall(!%[A-Z][a-z]+%)
    ['My', 'Joanne']
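
The method-call half of the idea can be approximated today with a str
subclass, without any new syntax. This is only a sketch: the class name
RegexStr is made up, patterns are passed as ordinary raw strings, and each
method simply delegates to the "re" module.

```python
import re


class RegexStr(str):
    """A str subclass whose regex methods take the pattern as the argument."""

    def findall(self, pattern, flags=0):
        return re.findall(pattern, self, flags)

    def search(self, pattern, flags=0):
        return re.search(pattern, self, flags)

    def sub(self, pattern, repl, flags=0):
        return re.sub(pattern, repl, self, flags=flags)


name = RegexStr("My name is Joanne.")
print(name.findall(r"[A-Z][a-z]+"))

blob = RegexStr("GMzKqtnnyGdqIQNlQSLidbDlqpdhoRbHrrUAgyhMgkZKYVhQuI")
match = blob.search(r"[^A-Z][A-Z]{3}([a-z])[A-Z]{3}[^A-Z]")
print(match.span(), match.group(0))
```

The drawback is that string literals are still plain str, so every string
has to be wrapped in RegexStr(...) first.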

Thoughts?
Sincerely,
Ken;