[issue18059] Add multibyte encoding support to pyexpat

Fri Nov 22 23:53:40 CET 2013

Serhiy Storchaka added the comment:

> I'm not sure that multibyte encodings other than UTF-8 are used in the world.

I don't use any of them but I heard some of them are still widely used.

This issue was provoked by issue13612. See also related issue15877.

> pyexpat_encoding_create() looks like an heuristic. How many multibyte codecs can be used with your patch?

All codecs which can be supported by expat.

"""
   1. Every ASCII character that can appear in a well-formed XML document,
      other than the characters

      $@\^`{}~

      must be represented by a single byte, and that byte must be the
      same byte that represents that character in ASCII.

   2. No character may require more than 4 bytes to encode.

   3. All characters encoded must have Unicode scalar values <=
      0xFFFF, (i.e., characters that would be encoded by surrogates in
      UTF-16 are  not allowed).  Note that this restriction doesn't
      apply to the built-in support for UTF-8 and UTF-16.

   4. No Unicode character may be encoded by more than one distinct
      sequence of bytes.
"""

14 Python encodings satisfy these criteria: big5, big5hkscs, cp932, cp949, cp950, euc-jp, euc-jis-2004, euc-jisx0213, gb2312, gbk, johab, shift-jis, shift-jis-2004, shift-jisx0213.

> A whitelist of multibyte codecs may be less reliable. What do you think?

pyexpat_multibyte_encodings_4.patch implements this way. It hardcodes a list of supported encodings with minimal required tables.

pyexpat_multibyte_encodings_5.patch supports any encoding which satisfy expat criteria and builds all needed data at first access (tens kilobytes). After heavy start it works much faster than previous patch.

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue18059>
_______________________________________