What is wrong with this regex for matching emails?

Chris Angelico rosuav at gmail.com
Mon Dec 18 02:01:37 EST 2017


On Mon, Dec 18, 2017 at 5:43 PM, Random832 <random832 at fastmail.com> wrote:
> On Sun, Dec 17, 2017, at 10:46, Chris Angelico wrote:
>> But if you're trying to *validate* an email address - for instance, if
>> you receive a form submission and want to know if there was an email
>> address included - then my recommendation is simply DON'T. You can't
>> get all the edge cases right; it is actually impossible for a regex to
>> perfectly match every valid email address and no invalid addresses.
>
> That's not actually true. The thing that notoriously can't be matched in
> a regex, RFC822 "address", is basically most of the syntax of the To:
> header. The part that is *the address* as we speak of it normally is
> "addr-spec", and it is in fact a regular language, though a regex to match
> it goes on for a few hundred characters.

Hmm, is that true? I was under the impression that the quoting rules
were impossible to match with a regex. Or maybe it's just that they're
impossible to match with a *standard* regex, but the extended
implementations (including Python's, possibly) are able to match them?
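
For the quoting part on its own, here is a rough sketch of what I mean.
It assumes a simplified quoted local part: double quotes around the
text, with backslash escapes allowed, and it ignores the comments and
folding whitespace that the RFC grammar also permits. The names
QUOTED_LOCAL and is_quoted_local are made up for illustration. Nothing
beyond ordinary regex syntax is needed for this much:

```python
import re

# Hypothetical, simplified pattern: a double-quoted string in which any
# character may be backslash-escaped. Comments and folding whitespace
# from the RFC grammar are deliberately left out.
QUOTED_LOCAL = re.compile(r'"(?:[^"\\]|\\.)*"')


def is_quoted_local(text):
    """Return True if the whole string is a simplified quoted local part."""
    return QUOTED_LOCAL.fullmatch(text) is not None
```

This covers only the quoting rules, not the rest of the address syntax.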

Anyhow, it is FAR from simple; and also, for the purpose of "detect
email addresses in text documents", not desirable. Same as with URL
detection - it's better to have a handful of weird cases that don't
autolink correctly than to mis-detect any address that's at the end of
a sentence, for instance. For that purpose, it's better to ignore the
RFC and just craft a regex that matches *common* email address
formats.
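
Something along these lines is what I have in mind. It is only a
sketch, and the pattern and the names COMMON_EMAIL and find_emails are
my own illustration, not anything standard. It deliberately rejects
plenty of valid addresses, and it requires the local part and each
domain label to start and end with a letter or digit, so a full stop or
comma after the address is not swallowed:

```python
import re

# Illustrative pattern for *common* addresses only; it is not RFC-compliant.
# The local part and every domain label must begin and end with an
# alphanumeric character, which keeps trailing punctuation out of the match.
COMMON_EMAIL = re.compile(
    r"[A-Za-z0-9](?:[A-Za-z0-9._%+-]*[A-Za-z0-9])?"     # local part
    r"@"
    r"(?:[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?\.)+"  # domain labels
    r"[A-Za-z]{2,}"                                     # top-level domain
)


def find_emails(text):
    """Return every substring of text that looks like a common address."""
    return COMMON_EMAIL.findall(text)
```

Calling find_emails("Mail me at rosuav@gmail.com.") gives back the
address without the sentence's final full stop.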

ChrisA


