Expression can be simplified on list

Jussi Piitulainen jussi.piitulainen at helsinki.fi
Wed Sep 14 08:05:41 EDT 2016


Serhiy Storchaka writes:

> On 14.09.16 11:28, Steven D'Aprano wrote:

>> I'm not sure if it makes sense to talk about an "empty regular
>> expression", or if that is the same as re.compile(''). Presumably if
>> such a thing makes sense, it would match nothing, from any input at
>> all.
>
> Actually, it matches anything. re.compile('').match(x) returns
> non-false value for any valid x.

It matches the empty string. The .match method returns a match object if
the regex matches at the start of the input. See span:

re.compile('').match('foo')
===> <_sre.SRE_Match object; span=(0, 0), match=''>

There's an empty string at every position, same content, different span:

re.compile('').findall('foo')
===> ['', '', '', '']

More mathematical treatments of regular expressions usually include a
special expression that stands for the empty string, typically ε or λ,
I've also seen Λ, sometimes e -- my impression is that those sources
tend to use small alphabets, often just two or three letters, so they
can afford to give up e.

Some ruminations follow, following Rustom's mathematical ruminations,
related to extended regex software that has taken a rather different
direction from the usual matching engines.

Mathematical discussions tend to acknowledge only alternation (union),
concatenation and iteration (Kleene star) as operations, unless they
specifically focus on some other operations that can, in principle, be
expressed in terms of those three. There are many such.

Without further regex operations it makes sense to include a special
regular expression that matches nothing.  Otherwise a very simple NFA
(DFA) has no corresponding regex, which can be considered awkward.

There are powerful regex operations for which it may not be
computationally trivial, regarding time and space, to tell whether the
regex matches anything at all. Does one expect bool(o) to happen
quickly? in near-constant time? I think there tends to be some such
expectation. Then it might make practical sense to not interpret a regex
as a container of the strings that it matches, even if such an
interpretation makes some theoretical sense. It may make sense to keep
the regex and its interpretation as a set (or relation) separate in
theory, too, just for the clarity of thought.

My colleagues down the corridor deal with large finite-state automata
and transducers, applying a rich regex formalism to analyze the various
forms that words can take. Some of them used to be a challenge to
compile at all - a Greenlandish lexicon was an example. - It looks like
the move of the software to GitHub has happened: http://hfst.github.io/.



More information about the Python-list mailing list