What is wrong with this regex for matching emails?

Sun Dec 17 15:57:27 EST 2017

Peng Yu <pengyu.ut at gmail.com> writes:

> Hi,
>
> I would like to extract "abc@efg.hij.xyz". But it only shows ".hij".

Others have addressed this question. I'll answer a separate one:

> Does anybody see what is wrong with it? Thanks.

One thing that's wrong with it is that it is far too restrictive.

> email_regex = re.compile('[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)')

This excludes a great many valid email addresses. Please don't write a
pattern so restrictive that it rejects addresses which are actually in
use.
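
As a rough illustration of how restrictive it is, here is the quoted
pattern applied with `re.fullmatch` to a few sample addresses. The
sample addresses are invented for this sketch, and the pattern is
written as a raw string only to avoid an escape-sequence warning.

```python
import re

# The pattern quoted above, unchanged apart from the raw-string prefix.
email_regex = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)')

# Hypothetical sample addresses, chosen for illustration only.
rejected = [
    "o'brien@example.com",      # apostrophe in the local part
    "user@mail.example.co.uk",  # domain with more than two labels
    "用户@example.com",          # non-ASCII local part
]

for address in rejected:
    # fullmatch() returns None when the whole string does not fit the pattern.
    print(address, email_regex.fullmatch(address) is not None)
```

Each of these prints False, while a simple two-label address such as
"abc@efg.hij" is accepted.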

For an authoritative guide to matching email addresses, see RFC 3696 §3
<URL:https://tools.ietf.org/html/rfc3696#section-3>.

A more correct match would boil down to:

* Match any printable Unicode characters (not just ASCII).

* Locate the *last* ‘@’ character. (An email address may contain more
  than one ‘@’ character; you should allow any printable ASCII character
  in the local part.)

* Match the domain part as the text after the last ‘@’ character. Match
  the local part as anything before that character. Reject an address
  that has either of these empty.

* Validate the domain by DNS request. Your program is not an authority
  for what domains are valid; the only authority for that is the DNS.

* Don't validate the local part at all. Your program is not an authority
  for what local parts are accepted by the destination host; the only
  authority for that is the destination mail host.
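
The steps above might be sketched in Python roughly as follows. This
is a minimal sketch, not a definitive implementation: the function
names `split_address` and `domain_resolves` are made up for this
example, and the DNS step uses an ordinary address lookup as a
stand-in, on the assumption that a real program would query MX
records with a DNS library instead.

```python
import socket


def split_address(address):
    """Split an address at its last '@' into (local_part, domain)."""
    # Accept any printable Unicode text; str.isprintable() rejects control
    # characters such as newlines.
    if not address.isprintable():
        raise ValueError("address contains non-printable characters")
    # rpartition() splits at the *last* '@', so any earlier '@' characters
    # stay in the local part.
    local_part, at_sign, domain = address.rpartition("@")
    if not at_sign or not local_part or not domain:
        raise ValueError("address needs a non-empty local part and domain")
    # The local part is deliberately not inspected any further.
    return local_part, domain


def domain_resolves(domain):
    """Ask the system resolver whether the domain name resolves.

    Assumption: an address lookup stands in for a proper MX query,
    which the standard library does not provide.
    """
    try:
        socket.getaddrinfo(domain, None)
    except (socket.gaierror, UnicodeError):
        return False
    return True
```

For example, split_address('abc@efg.hij.xyz') returns
('abc', 'efg.hij.xyz'), and an address with an empty local part or an
empty domain raises ValueError.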

-- 
 \     “Jealousy: The theory that some other fellow has just as little |
  `\                                         taste.” —Henry L. Mencken |
_o__)                                                                  |
Ben Finney