[issue5610] email feedparser.py CRLFLF bug: $ vs \Z

Tony Nelson report at bugs.python.org
Mon Mar 30 19:59:37 CEST 2009


New submission from Tony Nelson <tony_nelson at users.sourceforge.net>:

feedparser.py does not parse mixed newlines properly.  NLCRE_eol, which
is used to search for the various newlines at End Of Line, uses $ to
match the end of string, but $ also matches just before a newline at the
end of the string, due to a wise long-ago patch by the Effbot.  This
causes feedparser to match '\r\n\n' at '\r\n', and then to remove the
last two characters, leaving '\r', thus eating up a line.  Such mixed
line endings can occur if a message with CRLF line endings is parsed,
written out, and then parsed again.
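
The behavior described above can be reproduced outside the email
package.  This is a minimal sketch, not the feedparser code itself:
strip_eol is a hypothetical helper that imitates removing as many
trailing characters as the match is long, and the pattern is written
as a raw string, which compiles to the same regex.

```python
import re

# The pattern as quoted in this report, written as a raw string.
NLCRE_eol = re.compile(r'(\r\n|\r|\n)$')

def strip_eol(text):
    """Hypothetical helper: drop len(match) characters from the end."""
    mo = NLCRE_eol.search(text)
    if mo is None:
        return text
    return text[:-len(mo.group(0))]

mixed = 'line\r\n\n'
match = NLCRE_eol.search(mixed)
matched_text = match.group(0)   # the '\r\n', not the final '\n'
stripped = strip_eol(mixed)     # two characters removed instead of one

print(repr(matched_text), match.span(), repr(stripped))
```

The match sits at the '\r\n' because $ is satisfied just before the
final '\n', so the length-based strip removes '\n\n' and leaves a bare
'\r' behind.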

When explicitly searching for various newlines, the \Z end-of-string
marker should be used instead.  There are two improper uses of $ in
feedparser.py.  I don't see any others in the email package.

NLCRE_eol = re.compile('(\r\n|\r|\n)$')

should be:

NLCRE_eol = re.compile('(\r\n|\r|\n)\Z')

and boundary_re also needs the fix.
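
As a rough check of the proposed change, the sketch below compares the
two patterns on a few line endings.  OLD, NEW, trailing_newline and
the sample strings are names made up for this illustration; only the
two regular expressions come from the report (written here as raw
strings).

```python
import re

OLD = re.compile(r'(\r\n|\r|\n)$')    # current pattern
NEW = re.compile(r'(\r\n|\r|\n)\Z')   # proposed pattern

def trailing_newline(pattern, text):
    """Return the newline text the pattern finds, or None if no match."""
    mo = pattern.search(text)
    return mo.group(0) if mo else None

samples = ['a\r\n', 'a\r', 'a\n', 'a', 'a\r\n\n']

# Map each sample to the (old, new) match text.
results = {s: (trailing_newline(OLD, s), trailing_newline(NEW, s))
           for s in samples}

for sample, (old, new) in results.items():
    print(repr(sample), repr(old), repr(new))
```

For the single line endings both patterns agree; they differ only on
the mixed 'a\r\n\n' case, where \Z matches just the final '\n'.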

I can write a test.  Where exactly should it be put?

----------
components: Library (Lib)
files: feedparser_crlflf.patch
keywords: patch
messages: 84595
nosy: barry, tony_nelson
severity: normal
status: open
title: email feedparser.py CRLFLF bug: $ vs \Z
versions: Python 2.6
Added file: http://bugs.python.org/file13476/feedparser_crlflf.patch

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue5610>
_______________________________________

