[issue6233] ElementTree (py3k) doesn't properly encode characters that can't be represented in the specified encoding

Tue Jun 23 07:37:27 CEST 2009

Jerry Chen <jerry at 3rdengine.com> added the comment:

The attached patch includes Neil's original additions to test_xml_etree.py.

I also noticed that _encode_entity wasn't being called in ElementTree in
py3k, with the important bit being the nested function
escape_entities(), in conjunction with _escape and _escape_map.

In 2.x, _encode_entity() is used after _encode() throws Unicode
exceptions [1], so I figured it would make sense to take the core
functionality of _escape_entities() and integrate it into _encode in the
same fashion -- when an exception is thrown.

Basically, I:
- changed _escape regexp from using "[\x0080-\uffff]" to "[\x80-xff]"
- extracted _encode_entity.escape_entities() and made it
_escape_entities of module scope
- removed _encode_entity()
- added UnicodeEncodeError exception in _encode()

I'm not sure what the expected outcome is supposed to be when the text
is not type bytes but str. With this patch, the output has
b"t&#195;&#163;t" rather than b"t&#227;t".

Hope this is a step in the right direction.

[1] ElementTree.py:814, ElementTree.py:829, python 2.7 HEAD r50941

----------
nosy: +jcsalterego
Added file: http://bugs.python.org/file14340/issue6233-escape_entities.diff

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue6233>
_______________________________________