unicode(s, enc).encode(enc) == s ?

mario mario at ruggier.org
Fri Dec 28 06:00:59 EST 2007


On Dec 27, 7:37 pm, "Martin v. Löwis" <mar... at v.loewis.de> wrote:
> Certainly. ISO-2022 is famous for having ambiguous encodings. Try
> these:
>
> unicode("Hallo","iso-2022-jp")
> unicode("\x1b(BHallo","iso-2022-jp")
> unicode("\x1b(JHallo","iso-2022-jp")
> unicode("\x1b(BHal\x1b(Jlo","iso-2022-jp")
>
> or likewise
>
> unicode("\x1b$@BB","iso-2022-jp")
> unicode("\x1b$BBB","iso-2022-jp")
>
> In iso-2022-jp-3, there are even more ways to encode the same string.

Wow, it's not easy to see why anyone would ever want that. Is there
any logic behind this?

In your samples, both unicode("\x1b(BHallo","iso-2022-jp") and
unicode("\x1b(JHallo","iso-2022-jp") give u"Hallo" -- does this mean
that the ignored/lost bytes in the original strings are not illegal
but *represent nothing* in this encoding?

I.e. in practice (in a context limited to the encoding in question),
should this be considered data loss, or should these strings be
considered "equivalent"?
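
To make the question concrete, here is a minimal sketch of the check I have
in mind. It assumes Python 3, where raw.decode(enc) plays the role of
unicode(s, enc); the helper name round_trips is my own, not anything from
the standard library. It only covers the four "Hallo" samples quoted above.

```python
# Sketch only: Python 3 equivalent of unicode(s, enc).encode(enc) == s.
ENC = "iso-2022-jp"

# The four byte strings from the quoted message.
samples = [
    b"Hallo",
    b"\x1b(BHallo",
    b"\x1b(JHallo",
    b"\x1b(BHal\x1b(Jlo",
]

def round_trips(raw, enc=ENC):
    """Return True if decoding and then re-encoding reproduces the same bytes."""
    return raw.decode(enc).encode(enc) == raw

decoded = [raw.decode(ENC) for raw in samples]
results = [round_trips(raw) for raw in samples]

for raw, text, ok in zip(samples, decoded, results):
    print(repr(raw), "->", repr(text), "round-trips:", ok)
```

If I read it right, all four decode to the same text, but only the version
without escape sequences survives the round trip, since the encoder has no
way of knowing which escape sequences the original bytes contained.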

Thanks!

mario


