unicode(s, enc).encode(enc) == s ?

mario mario at ruggier.org
Thu Jan 3 09:33:42 EST 2008


Thanks again. I will chunk my responses as your message has too much
in it for me to process all at once...

On Jan 2, 9:34 pm, "Martin v. Löwis" <mar... at v.loewis.de> wrote:
> > Thanks a lot Martin and Marc for the really great explanations! I was
> > wondering if it would be reasonable to imagine a utility that will
> > determine whether, for a given encoding, two byte strings would be
> > equivalent.
>
> But that is much easier to answer:
>
>   s1.decode(enc) == s2.decode(enc)
>
> Assuming Unicode's unification, for a single encoding, this should
> produce correct results in all cases I'm aware of.
>
> If you also have different encodings, you should add
>
>   import unicodedata
>
>   def normal_decode(s, enc):
>       return unicodedata.normalize("NFKD", s.decode(enc))
>
>   normal_decode(s1, enc) == normal_decode(s2, enc)
>
> This would flatten out compatibility characters, and ambiguities
> left in Unicode itself.

Hmm, true, it would be that easy.

I am now not sure why I needed that check, or how I would use this
version of it. I always start from a single byte string and decode it;
that may be lossy, in the sense that re-encoding the result need not
give back the original bytes. However, it is clear that the test above
will always pass when a string is compared with itself, so doing it
seems superfluous.
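
For reference, the round-trip check from the subject line could be
sketched roughly as below. This assumes Python 3, where
raw.decode(enc) plays the role of unicode(s, enc); the helper name
roundtrips is made up for illustration.

```python
def roundtrips(raw, enc):
    """Return True if decoding raw with enc and re-encoding yields raw again."""
    try:
        return raw.decode(enc).encode(enc) == raw
    except UnicodeError:
        # Bytes that cannot be decoded at all are treated as "not round-trippable".
        return False


# Plain UTF-8 survives the round trip unchanged.
print(roundtrips(b"caf\xc3\xa9", "utf-8"))

# "utf-8-sig" accepts input without a BOM but always writes one when
# encoding, so the re-encoded bytes differ from the original.
print(roundtrips(b"abc", "utf-8-sig"))
```

The second call is the kind of lossy case I had in mind: the decoded
text is fine, but the original bytes cannot be recovered from it.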

Thanks for the unicodedata.normalize() tip.
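
As I understand the tip, comparing two byte strings that may come from
different encodings would look something like the sketch below. It is
written for Python 3, and the equivalent() helper with its separate
enc1/enc2 arguments is my own assumption, not something from Martin's
message.

```python
import unicodedata


def normal_decode(raw, enc):
    """Decode raw bytes and flatten the text to NFKD form."""
    return unicodedata.normalize("NFKD", raw.decode(enc))


def equivalent(raw1, enc1, raw2, enc2):
    """Hypothetical helper: compare two byte strings after normalization."""
    return normal_decode(raw1, enc1) == normal_decode(raw2, enc2)


# Precomposed e-acute in Latin-1 vs. "e" + combining acute in UTF-8.
print(equivalent(b"\xe9", "latin-1", b"e\xcc\x81", "utf-8"))

# The "fi" ligature (U+FB01) is a compatibility character; NFKD expands it.
print(equivalent(b"\xef\xac\x81", "utf-8", b"fi", "ascii"))
```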

mario



