Unicode utf-8 doesn't do back-and-forth?

Martin v. Löwis loewis at informatik.hu-berlin.de
Mon Jul 8 10:05:25 EDT 2002


Piet van Oostrum <piet at cs.uu.nl> writes:

> Well, I looked into the Unicode specs and it says that even if single
> surrogates appear in a string, the UTF-8 encoding should generate a valid
> UTF-8 byte sequence, which on decoding should give the same surrogate. So
> I would say this is a bug in the UTF-8 encoding.

Which Unicode specs did you look at? Unicode TR #28 (aka Unicode 3.2),

http://www.unicode.org/unicode/reports/tr28/

says

<quote>
The definition of transformation formats such as UTF-8 allowed
conformant processes to interpret certain sequences called irregular
sequences. These irregular sequences are those that would be produced
by transforming supplementary code points as if they were a sequence
of two surrogate code points.

To tighten the definitions, in Unicode 3.2 such irregular sequences
are now illegal.
</quote>

Table 3.1B of the same document explicitly lists the byte sequences
that would denote code points D800-DFFF as illegal.
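
To make the byte patterns concrete, here is a minimal Python 3 sketch. It
is an illustration only, not something taken from the TR, and the helper
name encode_3byte is hypothetical. It applies the generic three-byte UTF-8
bit layout to surrogate code points without any surrogate check, then
contrasts the irregular six-byte form of U+10000 with the regular
four-byte form.

```python
def encode_3byte(cp):
    """Apply the generic 3-byte UTF-8 layout (1110xxxx 10xxxxxx 10xxxxxx).

    Hypothetical helper for illustration: it deliberately performs no
    surrogate check, so it can produce the sequences the TR calls illegal.
    """
    if not 0x0800 <= cp <= 0xFFFF:
        raise ValueError("code point does not use the 3-byte form")
    return bytes([
        0xE0 | (cp >> 12),           # lead byte carries the top 4 bits
        0x80 | ((cp >> 6) & 0x3F),   # first trail byte, middle 6 bits
        0x80 | (cp & 0x3F),          # second trail byte, low 6 bits
    ])

# Edges of the surrogate range D800-DFFF in the 3-byte layout.
low_edge = encode_3byte(0xD800)
high_edge = encode_3byte(0xDFFF)

# U+10000 written as the surrogate pair D800 DC00, each half encoded
# separately: the "irregular" sequence described in the quote.
irregular = encode_3byte(0xD800) + encode_3byte(0xDC00)

# The same code point in its regular 4-byte form.
regular = "\U00010000".encode("utf-8")

print(low_edge.hex(), high_edge.hex(), irregular.hex(), regular.hex())
```

Every surrogate lands on a lead byte ED followed by a second byte in
A0..BF, which is the range a strict decoder has to reject.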

There is special permission given to recovery tools to deal with
irregular or illegal sequences without indicating an error, but the
standard Python UTF-8 codec certainly does not fall into this
category.
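
As an illustration of what a strict codec does, the following sketch
assumes a current Python 3 interpreter, not the Python version this
thread was written against, and its standard utf-8 codec. The
surrogatepass error handler used at the end is the explicit opt-in for
round-tripping lone surrogates; the variable names are only for
illustration.

```python
lone = "\ud800"  # a single high surrogate with no partner

# Strict encoding refuses the lone surrogate.
try:
    lone.encode("utf-8")
    encode_rejected = False
except UnicodeEncodeError:
    encode_rejected = True

# Strict decoding refuses the byte sequence that would denote U+D800.
try:
    b"\xed\xa0\x80".decode("utf-8")
    decode_rejected = False
except UnicodeDecodeError:
    decode_rejected = True

# Opting in explicitly gives the back-and-forth behaviour asked about.
raw = lone.encode("utf-8", "surrogatepass")
round_trip = raw.decode("utf-8", "surrogatepass")

print(encode_rejected, decode_rejected, raw.hex(), round_trip == lone)
```

Keeping the lenient behaviour behind a named error handler matches the
distinction drawn above: tolerance is something a tool asks for, not the
default of the standard codec.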

Regards,
Martin