python regex: misbehaviour with "\r" (0x0D) as Newline character in Unicode Mode

Arian Sanusi arian at sanusi.de
Sun Jan 27 06:30:30 EST 2008


Hi,

According to Unicode, "\n", "\r" and "\r\n" (0x000A, 0x000D and
0x000D+0x000A) should all be treated as newline characters.
At least this is how I understand it:
(http://en.wikipedia.org/wiki/Newline#Unicode)
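
For comparison, a minimal sketch of how str.splitlines() handles the same
three sequences. It assumes Python 3 (so the u"" prefix is dropped), and the
variable names are only illustrative:

```python
# One string containing all three line-break styles: CRLF, LF and a lone CR.
sample = "a\r\nb\nc\rd"

# str.splitlines() breaks on "\r\n", "\n" and "\r" alike,
# and treats "\r\n" as a single line break.
lines = sample.splitlines()
print(lines)
```

So at least this part of the standard library recognises a lone "\r" as a
line boundary.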

Apparently the re module does not care, and on Unix it only treats \n
as a newline character:

 >>> import re
 >>> a=re.compile(u"^a",re.U|re.M)
 >>> a.search(u"bc\ra")
 >>> a.search(u"bc\na")
<_sre.SRE_Match object at 0xb5908fa8>

The same thing happens for $:
 >>> b = re.compile(u"c$",re.U|re.M)
 >>> b.search(u"bc\r\n")
 >>> b.search(u"abc")
<_sre.SRE_Match object at 0xb5908f70>
 >>> b.search(u"bc\nde")
<_sre.SRE_Match object at 0xb5908fa8>
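
Two possible workarounds, sketched under the assumption of Python 3 (no u""
prefix needed). The names BOL, EOL and normalize_newlines are hypothetical
helpers made up for this example, not part of the re module:

```python
import re

# Hypothetical replacements for ^ and $ that also accept a lone "\r".
BOL = r"(?:\A|(?<=[\r\n]))"   # start of string, or just after CR or LF
EOL = r"(?=[\r\n]|\Z)"        # just before CR or LF, or at end of string

start_a = re.compile(BOL + "a")
end_c = re.compile("c" + EOL)


def normalize_newlines(text):
    """Convert CRLF and lone CR to LF so that re.M anchors see every break."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


# Explicit lookarounds: both of these now find a match.
print(start_a.search("bc\ra"))
print(end_c.search("bc\r\n"))

# Normalising first: the plain re.M pattern matches as well.
print(re.compile("^a", re.M).search(normalize_newlines("bc\ra")))
```

Note that BOL also matches between the "\r" and the "\n" of a "\r\n" pair,
so normalising the input first is probably the safer of the two if the
offsets into the original string do not matter.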

Is this a known bug in the re module? I couldn't find any matching issues
in the bug tracker.
Or is this just a user error on my part, and can you guys help me?

arian

P.S.: this appears in both Python 2.4 and 2.5


