[New-bugs-announce] [issue8922] Improve encoding shortcuts in PyUnicode_AsEncodedString()

Sun Jun 6 20:23:57 CEST 2010

New submission from STINNER Victor <victor.stinner at haypocalc.com>:

PyUnicode_Decode() and PyUnicode_AsEncodedString() call the built-in decoders/encoders directly for some known encodings (e.g. "utf-8") instead of taking the slow path (calling PyCodec_Decode() / PyCodec_Encode()).

PyUnicode_Decode() normalizes the encoding name first: it converts the name to lowercase and replaces "_" with "-", as normalizestring() does. PyUnicode_AsEncodedString(), however, doesn't normalize the encoding name; it just uses strcmp(). PyUnicode_Decode() also has a shortcut for ISO-8859-1, whereas PyUnicode_AsEncodedString() doesn't (only for "latin-1").

The attached patch creates a static helper function, normalize_encoding(), uses it in PyUnicode_Decode() and PyUnicode_AsEncodedString(), and adds a shortcut for ISO-8859-1 to PyUnicode_AsEncodedString().
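
For illustration, the normalization described above could look roughly like the following. This is only a minimal sketch and not the code from the attached patch: the signature of normalize_encoding(), the buffer size, and the helper is_latin1_shortcut() are assumptions made for this example, and tolower() is assumed to run in the "C" locale.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Sketch (assumed signature): copy `encoding` into `lower`, converting
   it to lowercase and replacing '_' with '-'.  Returns 1 on success,
   0 if the name does not fit into a buffer of `lower_len` bytes. */
static int
normalize_encoding(const char *encoding, char *lower, size_t lower_len)
{
    size_t i;

    if (lower_len == 0)
        return 0;
    for (i = 0; encoding[i] != '\0'; i++) {
        /* Keep one byte free for the terminating NUL. */
        if (i + 1 >= lower_len)
            return 0;
        if (encoding[i] == '_')
            lower[i] = '-';
        else
            lower[i] = (char)tolower((unsigned char)encoding[i]);
    }
    lower[i] = '\0';
    return 1;
}

/* Hypothetical helper: returns 1 if the name selects the Latin-1 fast
   path once normalized, 0 otherwise (the caller would then fall back
   to the slow codec lookup). */
static int
is_latin1_shortcut(const char *encoding)
{
    char lower[32];

    if (!normalize_encoding(encoding, lower, sizeof(lower)))
        return 0;
    return strcmp(lower, "latin-1") == 0
        || strcmp(lower, "iso-8859-1") == 0;
}
```

With the name normalized before the comparison, spellings such as "ISO_8859_1" or "Latin-1" would reach the same shortcut, while a name too long for the buffer simply takes the slow path.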

----------
components: Unicode
files: unicode_shortcuts.patch
keywords: patch
messages: 107203
nosy: haypo, pitrou
priority: normal
severity: normal
status: open
title: Improve encoding shortcuts in PyUnicode_AsEncodedString()
type: performance
versions: Python 3.2
Added file: http://bugs.python.org/file17574/unicode_shortcuts.patch

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue8922>
_______________________________________