[issue14068] problem with re split

Wed Feb 22 03:01:12 CET 2012

Ezio Melotti <ezio.melotti at gmail.com> added the comment:

As long as you don't mix str and unicode, everything works.

With strings:
>>> s = '与清新。阿德莱'
>>> re.split('。', s)
['\xe4\xb8\x8e\xe6\xb8\x85\xe6\x96\xb0', '\xe9\x98\xbf\xe5\xbe\xb7\xe8\x8e\xb1']
>>> s.split('。')
['\xe4\xb8\x8e\xe6\xb8\x85\xe6\x96\xb0', '\xe9\x98\xbf\xe5\xbe\xb7\xe8\x8e\xb1']

With unicode:
>>> u = u'与清新。阿德莱'
>>> re.split(u'。', u)
[u'\u4e0e\u6e05\u65b0', u'\u963f\u5fb7\u83b1']
>>> u.split(u'。')
[u'\u4e0e\u6e05\u65b0', u'\u963f\u5fb7\u83b1']

Mixing str and unicode:
>>> re.split(u'。', s)
['\xe4\xb8\x8e\xe6\xb8\x85\xe6\x96\xb0\xe3\x80\x82\xe9\x98\xbf\xe5\xbe\xb7\xe8\x8e\xb1']
>>> re.split('。', u)
[u'\u4e0e\u6e05\u65b0\u3002\u963f\u5fb7\u83b1']
>>>
>>> s.split(u'。')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
>>> u.split('。')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0: ordinal not in range(128)
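
The two tracebacks name different bytes because each one reports the first byte of whichever operand is the str. A small sketch of the underlying representations, assuming the terminal encoding was UTF-8 (the byte values in the session are consistent with that); the variable names are only illustrative:

```python
# Illustrative names; assumes the str values in the session are UTF-8 encoded.
text = u'\u4e0e\u6e05\u65b0\u3002\u963f\u5fb7\u83b1'   # the unicode string u
sep = u'\u3002'                                        # the separator as unicode

data = text.encode('utf-8')        # the byte string s
sep_bytes = sep.encode('utf-8')    # the separator as a byte string

# bytearray indexing yields an int on both Python 2 and Python 3.
first_data_byte = bytearray(data)[0]       # first byte of s
first_sep_byte = bytearray(sep_bytes)[0]   # first byte of the str separator

# The separator is one code point as unicode but three bytes as UTF-8.
sep_lengths = (len(sep), len(sep_bytes))
```

Here first_data_byte is 0xe4 and first_sep_byte is 0xe3, matching the two error messages: str.split and unicode.split try to decode the str operand as ASCII and fail on its first byte. re attempts no such decode, so a three-byte pattern is compared against single code points (or the reverse) and never matches.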

The SyntaxError is raised for bytes literals (in Python 3) and can't be backported to 2.7.  Raising an error when str and unicode are mixed in re is not backward compatible, and mixing them in re does work as long as both are ASCII-only.  I'm therefore closing this as invalid.
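
One way to avoid the mix in user code is to decode any byte string to unicode before calling re. This is only a sketch: to_text and split_text are hypothetical helpers, not part of re, and the UTF-8 default is an assumption about the input.

```python
import re


def to_text(value, encoding='utf-8'):
    """Hypothetical helper: decode byte strings, pass unicode through."""
    if isinstance(value, bytes):   # bytes is an alias of str on Python 2
        return value.decode(encoding)
    return value


def split_text(pattern, string, encoding='utf-8'):
    """Hypothetical wrapper: call re.split with both arguments as unicode."""
    return re.split(to_text(pattern, encoding), to_text(string, encoding))


# A byte string like s in the session, with a unicode pattern.
s = u'\u4e0e\u6e05\u65b0\u3002\u963f\u5fb7\u83b1'.encode('utf-8')
parts = split_text(u'\u3002', s)
# parts == [u'\u4e0e\u6e05\u65b0', u'\u963f\u5fb7\u83b1']
```

The result is the same list as in the "With unicode" session, because both arguments reach re.split as unicode.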

----------
nosy: +mrabarnett
resolution:  -> invalid
stage:  -> committed/rejected
status: open -> closed

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue14068>
_______________________________________