A few questions about encoding

Thu Jun 13 06:02:38 EDT 2013

On Thu, 13 Jun 2013 12:01:55 +1000, Chris Angelico wrote:

> On Thu, Jun 13, 2013 at 11:40 AM, Steven D'Aprano
> <steve+comp.lang.python at pearwood.info> wrote:
>> The *mechanism* of UTF-8 can go up to 6 bytes (or even 7 perhaps?), but
>> that's not UTF-8, that's UTF-8-plus-extra-codepoints.
> 
> And a proper UTF-8 decoder will reject "\xC0\x80" and "\xed\xa0\x80", even
> though mathematically they would translate into U+0000 and U+D800
> respectively. The UTF-16 *mechanism* is limited to no more than Unicode
> has currently used, but I'm left wondering if that's actually the other
> way around - that Unicode planes were deemed to stop at the point where
> UTF-16 can't encode any more.

Indeed. 5-byte and 6-byte sequences were originally part of the UTF-8
specification, allowing for code points of up to 31 bits. Later revisions
of the standard restricted UTF-8 to at most four bytes, imposing the
UTF-16 limit of U+10FFFF on Unicode as a whole.
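
For anyone who wants to check this, here is a minimal sketch, assuming
Python 3 and its built-in strict "utf-8" codec (Python 2's codec was more
lenient about surrogates). The names `samples`, `strict_decode_ok`,
`utf16_max` and `six_byte_bits` are just made up for illustration. It tries
the two byte strings Chris mentioned plus an old-style 5-byte sequence, and
then works out where the 31-bit and U+10FFFF limits come from.

```python
# Byte strings that follow the UTF-8 bit pattern but are not valid UTF-8.
samples = {
    "overlong encoding of U+0000": b"\xc0\x80",
    "surrogate U+D800": b"\xed\xa0\x80",
    "old-style 5-byte sequence": b"\xf8\x88\x80\x80\x80",
}


def strict_decode_ok(data):
    """Return True if the strict UTF-8 codec accepts the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


results = {name: strict_decode_ok(raw) for name, raw in samples.items()}
for name, accepted in results.items():
    print(name, "accepted" if accepted else "rejected")

# Payload bits in the original 6-byte form: 1 bit in the lead byte
# (1111110x) plus 6 bits in each of the 5 continuation bytes (10xxxxxx).
six_byte_bits = 1 + 5 * 6

# Highest code point reachable with UTF-16 surrogate pairs: the BMP
# (0x10000 code points) plus 1024 high * 1024 low surrogate combinations.
utf16_max = 0x10000 + 0x400 * 0x400 - 1

print(six_byte_bits, hex(utf16_max))
```

All three samples should be rejected, and the top code point U+10FFFF fits
in four bytes, which is why the longer forms are no longer needed.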