unicode(s, enc).encode(enc) == s ?

Thu Jan 3 09:33:42 EST 2008

Thanks again. I will chunk my responses as your message has too much
in it for me to process all at once...

On Jan 2, 9:34 pm, "Martin v. Löwis" <mar... at v.loewis.de> wrote:
> > Thanks a lot Martin and Marc for the really great explanations! I was
> > wondering if it would be reasonable to imagine a utility that will
> > determine whether, for a given encoding, two byte strings would be
> > equivalent.
>
> But that is much easier to answer:
>
>   s1.decode(enc) == s2.decode(enc)
>
> Assuming Unicode's unification, for a single encoding, this should
> produce correct results in all cases I'm aware of.
>
> If you also have different encodings, you should add
>
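>   import unicodedata
>
>   def normal_decode(s, enc):
>       # normalize() lives in the unicodedata module, not on the unicode type
>       return unicodedata.normalize("NFKD", s.decode(enc))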
>
>   normal_decode(s1, enc) == normal_decode(s2, enc)
>
> This would flatten out compatibility characters, and ambiguities
> left in Unicode itself.
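
A minimal sketch of the normalized comparison, written in Python 3 spelling (where `unicode` is `str` and byte strings are `bytes`). The helper name `equivalent` and the separate `enc1`/`enc2` parameters are assumptions added for illustration; they are not part of the quoted message.

```python
import unicodedata


def normal_decode(data, enc):
    """Decode a byte string and flatten it to NFKD form."""
    return unicodedata.normalize("NFKD", data.decode(enc))


def equivalent(b1, enc1, b2, enc2):
    """Hypothetical helper: compare two byte strings after normalization."""
    return normal_decode(b1, enc1) == normal_decode(b2, enc2)


# Precomposed e-acute in Latin-1 versus "e" + combining acute in UTF-8.
precomposed = "\u00e9".encode("latin-1")
decomposed = "e\u0301".encode("utf-8")

# A plain decode sees two different code point sequences...
plain_equal = precomposed.decode("latin-1") == decomposed.decode("utf-8")
# ...while the normalized comparison treats them as the same text.
normalized_equal = equivalent(precomposed, "latin-1", decomposed, "utf-8")

# Compatibility characters are flattened too: the "fi" ligature becomes "fi".
ligature_equal = equivalent("\ufb01".encode("utf-8"), "utf-8", b"fi", "ascii")
```

Note that NFKD is deliberately lossy: it is suited to asking "is this the same text?", not to preserving the original characters.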

Hmmn, true, it would be that easy.

I am now not sure why I needed that check, or how I would use this
version of it... I always start from a single byte string and decode
it; the round trip may be lossy when the result is re-encoded and
compared to the original. However, it is clear that the test above
will always pass in this case, so doing it seems superfluous.
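
To make the distinction concrete, here is a small sketch (Python 3 spelling, `bytes` and `str`) of the round-trip check from the subject line next to the decode comparison. The function name `round_trips` and the choice of `utf-8-sig` as the example codec are my own assumptions, picked only because that codec is a simple case where the re-encoded bytes differ from the input.

```python
def round_trips(data, enc):
    """Return True if decoding and re-encoding reproduces the same bytes."""
    try:
        return data.decode(enc).encode(enc) == data
    except UnicodeError:
        # Bytes that are not valid in this encoding cannot round-trip.
        return False


plain = b"abc"
with_bom = b"\xef\xbb\xbfabc"

# With the utf-8-sig codec both byte strings decode to the same text,
# so the decode comparison says they are equivalent...
same_text = plain.decode("utf-8-sig") == with_bom.decode("utf-8-sig")

# ...but only the one carrying the signature survives a round trip,
# because the encoder always writes the signature back out.
plain_ok = round_trips(plain, "utf-8-sig")
bom_ok = round_trips(with_bom, "utf-8-sig")

# Comparing one string's decoding with itself is trivially true,
# which is why that test is superfluous with a single input.
trivial = plain.decode("utf-8-sig") == plain.decode("utf-8-sig")
```

So the two checks answer different questions: the round trip asks whether the bytes are the canonical encoding of their text, while the decode comparison asks whether two byte strings carry the same text.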

Thanks for the unicodedata.normalize() tip.

mario