how can I convert invalid ASCII string to Unicode?

skip at pobox.com
Wed May 9 05:36:30 EDT 2001


    Tim> I'm afraid the docs aren't a lot of help here, either.  There's a
    Tim> very nice Grand Architecture that's been reduced to a handful of
    Tim> ambiguously defined builtin functions without any examples --
    Tim> painful.

Thanks.  I was beginning to think I missed a link in the docs somewhere.  I
will file a bug report against the unicode() doc string.  It would at least
be nice to know what the possible options are for the "errors" parameter
without reading the source...
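
For reference, the standard values for "errors" are "strict" (the default),
"ignore" and "replace".  Below is a minimal sketch of what each one does.  It
is written for present-day Python 3, where bytes.decode() plays the role that
unicode() plays in this thread, and the helper name decode_with is made up
purely for illustration:

```python
def decode_with(data, encoding, errors):
    """Decode bytes; return the text, or the name of the exception raised."""
    try:
        return data.decode(encoding, errors)
    except UnicodeDecodeError as exc:
        return type(exc).__name__


# Three ASCII letters followed by one byte that is not valid ASCII.
blob = b"abc\xf6"

# Try every error-handling mode against the same input.
results = {
    mode: decode_with(blob, "ascii", mode)
    for mode in ("strict", "ignore", "replace")
}

for mode, outcome in results.items():
    print(mode, repr(outcome))
```

"strict" raises, "ignore" silently drops the offending byte, and "replace"
substitutes U+FFFD for it.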

    Tim> If you believe your binary blobs were meant to be interpreted as
    Tim> Latin-1, then tell the unicode() function explicitly:

    >>> unicode("\xf6", "latin-1")
    u'\xf6'

This is what I missed.  I tried stuff like

    s.encode("UTF-8")
    s.encode("Latin-1")

which kept failing.
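
As far as I can tell, the direction was the problem: encode() goes from
Unicode to bytes, so calling it on a byte string forces an implicit ASCII
decode first, and that is the step that chokes on a non-ASCII byte.  A small
sketch of the two directions, again in present-day Python 3 terms (an
assumption; bytes.decode() stands in for unicode() here):

```python
# One byte that is not valid ASCII but is a valid Latin-1 character.
raw = b"\xf6"

# Decoding goes from bytes to text, so the source encoding must be named.
text = raw.decode("latin-1")

# Encoding goes the other way, from text to bytes.
utf8 = text.encode("utf-8")

print(repr(text), repr(utf8))
```

Decode with the encoding the data was written in, then encode to whatever
output encoding is wanted.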

Okay, so I'm set now, I think.  One further question.  As I was glancing
through unicodeobject.c just now (actually, it's large enough that "glance"
might not be the correct term ;-), I noticed that when testing for default
encodings a case-sensitive comparison against lowercase encoding names is
used.  This results in substantially slower conversions if you are silly
enough to use common printed representations of encodings (e.g.  "Latin-1"
instead of "latin-1" or "UTF-8" instead of "utf-8") as your encoding:

    >>> import time,string
    >>> def timeenc(enc, n):
    ...   t = time.time()
    ...   for i in xrange(n):
    ...     u = unicode(string.letters, enc)
    ...   return time.time()-t
    ... 
    >>> timeenc("Latin-1", 1000)
    0.090107917785644531
    >>> timeenc("latin-1", 1000)
    0.011034011840820312
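
The same measurement can be sketched with the timeit module for anyone who
wants to repeat it on a current interpreter.  This is a Python 3 adaptation,
not the code run above: bytes.decode() stands in for unicode(),
string.ascii_letters for string.letters, and time_decode is a hypothetical
helper name.  The gap may well look different, or vanish, on other versions:

```python
import string
import timeit


def time_decode(enc, n=1000):
    """Return the seconds taken to decode the ASCII letters n times."""
    data = string.ascii_letters.encode("ascii")
    return timeit.timeit(lambda: data.decode(enc), number=n)


# Compare the printed spelling of each encoding name with the lowercase one.
for name in ("Latin-1", "latin-1", "UTF-8", "utf-8"):
    print(name, time_decode(name))
```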

Shouldn't the tests for common encodings in PyUnicode_AsEncodedString and
PyUnicode_Decode use strcasecmp?
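
To make the suggestion concrete, here is a rough Python sketch of a
case-insensitive fast-path test.  It is not the code in unicodeobject.c; the
names FAST_PATHS, normalize_encoding_name and takes_fast_path are invented
for illustration, and the set of fast-pathed encodings is an assumption:

```python
# Hypothetical set of encodings that get a built-in shortcut.
FAST_PATHS = {"utf-8", "latin-1", "ascii"}


def normalize_encoding_name(name):
    """Fold case and underscores so that common spellings compare equal."""
    return name.strip().lower().replace("_", "-")


def takes_fast_path(name):
    """Return True if the name matches a shortcut encoding, ignoring case."""
    return normalize_encoding_name(name) in FAST_PATHS


for spelling in ("Latin-1", "latin-1", "UTF-8", "utf_8", "koi8-r"):
    print(spelling, takes_fast_path(spelling))
```

With a comparison like this, "Latin-1" and "latin-1" would land on the same
path instead of one of them falling through to the slower general lookup.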

Skip



