What is wrong with this regex for matching emails?

Sun Dec 17 10:46:55 EST 2017

On Mon, Dec 18, 2017 at 2:29 AM, Peng Yu <pengyu.ut at gmail.com> wrote:
> Hi,
>
> I would like to extract "abc@efg.hij.xyz". But it only shows ".hij".
> Does anybody see what is wrong with it? Thanks.
>
> $ cat main.py
> #!/usr/bin/env python
> # vim: set noexpandtab tabstop=2 shiftwidth=2 softtabstop=-1 fileencoding=utf-8:
>
> import re
> email_regex = re.compile('[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)')
> s = 'abc@efg.hij.xyz.'
> for email in re.findall(email_regex, s):
>     print email
>
> $ ./main.py
> .hij

What is the goal of your email address extraction? There are two
possible goals: one can't be done perfectly but doesn't need to be,
and the other can't be done perfectly and is thus virtually useless. If
you want to detect email addresses in text and turn them into mailto:
links, it's okay to miss out some edge cases, and for that, I would
recommend keeping your regex REALLY simple - something like you have
above, but maybe even simpler. (And I wouldn't have the parentheses in
there, which are what's tripping you up: when the pattern contains a
capturing group, re.findall returns the group's text rather than the
whole match.) But
if you're trying to *validate* an email address - for instance, if you
receive a form submission and want to know if there was an email
address included - then my recommendation is simply DON'T. You can't
get all the edge cases right; it is actually impossible for a regex to
perfectly match every valid email address and no invalid addresses.
And that's only counting *syntactically* valid - it doesn't take into
account the fact that "blah@junk.example.com" is not going to get
anywhere. So if you're trying to do validation, basically just don't.
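
For the detection case, here is a minimal sketch of what I mean. It
assumes you only want a rough match, not validation. The names
EMAIL_RE, find_emails and linkify are made up for illustration, and
linkify does no HTML escaping. The group is made non-capturing with
(?:...) so findall returns whole matches, and it is repeated with +
so every dot-separated part of the domain is consumed.

```python
import re

# The pattern from the question: one capturing group, matched once.
ORIGINAL_RE = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)')

# Same character classes, but the group is non-capturing and repeated.
EMAIL_RE = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)+')


def find_emails(text):
    """Return every substring of text that looks roughly like an address."""
    return EMAIL_RE.findall(text)


def linkify(text):
    """Wrap each detected address in a mailto: link (no HTML escaping)."""
    return EMAIL_RE.sub(
        lambda m: '<a href="mailto:{0}">{0}</a>'.format(m.group(0)), text)


s = 'abc@efg.hij.xyz.'
print(ORIGINAL_RE.findall(s))  # the group only: ['.hij']
print(find_emails(s))          # whole match: ['abc@efg.hij.xyz']
print(linkify(s))
```

The trailing full stop in s is left out of the match because the
repeated group needs at least one character after each dot. This will
still miss plenty of legal addresses, which is fine for making links
and is exactly why I wouldn't use it for validation.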

ChrisA