[New-bugs-announce] [issue31677] email.header uses re.IGNORECASE without re.ASCII

Tue Oct 3 08:58:02 EDT 2017

New submission from INADA Naoki <songofacandy at gmail.com>:

email.header has this pattern:

https://github.com/python/cpython/blob/85c0b8941f0c8ef3ed787c9d504712c6ad3eb5d3/Lib/email/header.py#L34-L43

# Match encoded-word strings in the form =?charset?q?Hello_World?=
ecre = re.compile(r'''
  =\?                   # literal =?
  (?P<charset>[^?]*?)   # non-greedy up to the next ? is the charset
  \?                    # literal ?
  (?P<encoding>[qb])    # either a "q" or a "b", case insensitive
  \?                    # literal ?
  (?P<encoded>.*?)      # non-greedy up to the next ?= is the encoded string
  \?=                   # literal ?=
  ''', re.VERBOSE | re.IGNORECASE | re.MULTILINE)

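Since only a few letters such as 's' and 'i' have non-ASCII case variants, and the only letters this pattern matches case-insensitively are 'q' and 'b', this is not a real bug.
But using re.ASCII would be safer.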
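
To illustrate the difference, here is a minimal sketch (not taken from the email package). It assumes the behaviour described in the re documentation for str patterns: without re.ASCII, [a-z] under re.IGNORECASE also matches a few non-ASCII letters. The names unicode_ci, ascii_ci and encoding_re are made up for this example.

```python
import re

# Two of the non-ASCII letters that the re docs list as matching [a-z]
# under IGNORECASE when the pattern is a str and re.ASCII is not set.
LONG_S = "\u017f"   # LATIN SMALL LETTER LONG S
KELVIN = "\u212a"   # KELVIN SIGN

unicode_ci = re.compile(r"[a-z]", re.IGNORECASE)
ascii_ci = re.compile(r"[a-z]", re.IGNORECASE | re.ASCII)

# Unicode-aware case folding accepts the non-ASCII letters...
unicode_hits = [bool(unicode_ci.fullmatch(ch)) for ch in (LONG_S, KELVIN)]
# ...while re.ASCII restricts case-insensitive matching to ASCII letters.
ascii_hits = [bool(ascii_ci.fullmatch(ch)) for ch in (LONG_S, KELVIN)]

# The encoding group of ecre, with re.ASCII added as proposed above.
# ASCII 'q', 'Q', 'b' and 'B' are all still accepted.
encoding_re = re.compile(r"[qb]", re.IGNORECASE | re.ASCII)
encoding_hits = [bool(encoding_re.fullmatch(ch)) for ch in "qQbB"]

print(unicode_hits, ascii_hits, encoding_hits)
```

For [qb] the two flag combinations should behave the same, which is why adding re.ASCII is a hardening step rather than a behaviour change.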

Additionally, email.utils has carried the same pattern for 10 years, and it is not used anywhere.
It should be removed.

----------
components: Regular Expressions
messages: 303612
nosy: ezio.melotti, inada.naoki, mrabarnett
priority: normal
severity: normal
status: open
title: email.header uses re.IGNORECASE without re.ASCII
versions: Python 3.7

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue31677>
_______________________________________