recycling internationalized garbage

aaronwmail-usenet at yahoo.com aaronwmail-usenet at yahoo.com
Tue Mar 14 10:18:06 EST 2006


Regarding cleaning of mixed string encodings in
the discography search engine

http://www.xfeedme.com/discs/discography.html

Following </F>'s suggestion I came up with this:

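import codecs

utf8enc = codecs.getencoder("utf8")
utf8dec = codecs.getdecoder("utf8")
iso88591dec = codecs.getdecoder("iso-8859-1")

def checkEncoding(s):
    # Try UTF-8 first; anything that is not valid UTF-8
    # is assumed to be iso-8859-1 (which accepts any byte).
    try:
        (uni, dummy) = utf8dec(s)
    except UnicodeError:
        (uni, dummy) = iso88591dec(s, 'ignore')
    # Re-encode so the output is always UTF-8.
    (out, dummy) = utf8enc(uni)
    return out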

This works nicely for Nordic text like
"björgvin halldórsson - gunnar Þórðarson",
but Russian seems to turn into garbage,
and I have no idea about Chinese.
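
One possible reason for the Russian garbage (a guess, since the
actual input bytes are not shown here): Russian text that is not
UTF-8 is often in a single-byte Cyrillic encoding such as cp1251
or KOI8-R. Decoding those bytes as iso-8859-1 never fails; it just
produces accented Latin letters. Below is a minimal sketch of trying
a list of candidate encodings instead of a single fallback. The names
recode_to_utf8 and CANDIDATE_ENCODINGS, and the order of the list,
are assumptions for illustration and not part of the code above.
It also assumes Python 3 with bytes input.

```python
# Hypothetical candidate list; the order is a guess, not a recommendation.
CANDIDATE_ENCODINGS = ("utf-8", "cp1251", "iso-8859-1")

def recode_to_utf8(raw, candidates=CANDIDATE_ENCODINGS):
    """Re-encode raw bytes as UTF-8 using the first candidate that decodes."""
    for name in candidates:
        try:
            text = raw.decode(name)
        except UnicodeDecodeError:
            # This candidate rejects the bytes; try the next one.
            continue
        return text.encode("utf-8")
    # No candidate decoded cleanly: fall back to iso-8859-1,
    # which maps every byte to some character.
    return raw.decode("iso-8859-1").encode("utf-8")
```

Caveat: cp1251 accepts nearly every byte string, so with this
ordering Nordic text stored as iso-8859-1 would come out as Cyrillic.
Telling single-byte encodings apart needs outside information
(a language hint, character statistics), which this sketch does
not attempt.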

Unless someone has any other ideas I'm
giving up now.
   -- Aaron Watters

===

In theory, theory is the same as practice.
In practice it's more complicated than that.
  -- folklore



