A few questions about encoding

Chris Angelico rosuav at gmail.com
Wed Jun 12 22:01:55 EDT 2013


On Thu, Jun 13, 2013 at 11:40 AM, Steven D'Aprano
<steve+comp.lang.python at pearwood.info> wrote:
> The *mechanism* of UTF-8 can go up to 6 bytes (or even 7 perhaps?), but
> that's not UTF-8, that's UTF-8-plus-extra-codepoints.

And a proper UTF-8 decoder will reject "\xC0\x80" and "\xed\xa0\x80",
even though mechanically they would decode to U+0000 and U+D800
respectively: the first is an overlong encoding, the second an encoded
surrogate. The UTF-16 *mechanism* can't reach beyond U+10FFFF, which
is exactly where Unicode stops, but I'm left wondering if that's
actually the other way around - that the Unicode planes were deemed to
stop at the point where UTF-16 can't encode any more. Not that it
matters; with most of the current planes completely unallocated, it
seems unlikely we'll be needing more.
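
A quick way to check both points, as a rough sketch assuming Python 3
(Python 2's decoder was more lenient about surrogates). The helper
name is_valid_utf8 is made up for illustration; the arithmetic at the
end just works out the highest code point a surrogate pair can reach.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the strict UTF-8 codec accepts these bytes.

    Hypothetical helper for illustration; assumes Python 3.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Overlong two-byte form of U+0000: the bit pattern works, but it is not UTF-8.
overlong_nul = b"\xc0\x80"
# Three-byte form of U+D800, a surrogate: also not valid UTF-8.
encoded_surrogate = b"\xed\xa0\x80"

print(is_valid_utf8(overlong_nul))       # False
print(is_valid_utf8(encoded_surrogate))  # False
print(is_valid_utf8(b"\x00"))            # True: the shortest form of U+0000

# Highest code point a UTF-16 surrogate pair can express:
# each half of the pair carries 10 bits on top of the 0x10000 offset.
HIGH_MAX, LOW_MAX = 0xDBFF, 0xDFFF
utf16_max = 0x10000 + ((HIGH_MAX - 0xD800) << 10) + (LOW_MAX - 0xDC00)
print(hex(utf16_max))                    # 0x10ffff
```

That last figure is the top of plane 16, which is why UTF-16 and the
current Unicode range end at the same place.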

ChrisA


