how can I convert invalid ASCII string to Unicode?
skip at pobox.com
Wed May 9 05:36:30 EDT 2001
Tim> I'm afraid the docs aren't a lot of help here, either. There's a
Tim> very nice Grand Architecture that's been reduced to a handful of
Tim> ambiguously defined builtin functions without any examples --
Tim> painful.
Thanks. I was beginning to think I missed a link in the docs somewhere. I
will file a bug report against the unicode() doc string. It would at least
be nice to know what the possible options are for the "errors" parameter
without reading the source...
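For reference, three widely supported values for the "errors" argument are "strict" (the default), "ignore" and "replace". The following is a minimal sketch of how they differ, not a statement of what the doc string should say. It assumes a later Python in which byte strings have a decode() method and a b"" prefix; in the interpreter discussed in this thread the equivalent spelling would be unicode(s, encoding, errors). The sample bytes and the try_decode helper are made up for illustration.

```python
# Hypothetical sample data: three ASCII letters plus one byte (0xf6)
# that is not valid ASCII.
raw = b"abc\xf6"

def try_decode(data, encoding, errors):
    """Illustrative helper: return the decoded text, or None on failure."""
    try:
        return data.decode(encoding, errors)
    except UnicodeError:
        return None

strict_result = try_decode(raw, "ascii", "strict")    # decoding fails -> None
ignore_result = try_decode(raw, "ascii", "ignore")    # offending byte is dropped
replace_result = try_decode(raw, "ascii", "replace")  # offending byte becomes U+FFFD
```

With "strict" the bad byte raises an error, with "ignore" it silently disappears, and with "replace" it is swapped for the Unicode replacement character.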
Tim> If you believe your binary blobs were meant to be interpreted as
Tim> Latin-1, then tell the unicode() function explicitly:
Tim>     >>> unicode("\xf6", "latin-1")
Tim>     u'\xf6'
This is what I missed. I tried stuff like
s.encode("UTF-8")
s.encode("Latin-1")
which kept failing.
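The direction of the two operations is easy to mix up: decoding turns a byte string into Unicode, and encoding turns Unicode back into bytes. Calling encode() on an 8-bit string with non-ASCII bytes presumably fails because the bytes first have to be interpreted as text, and the default ASCII codec cannot do that. Here is a small sketch of the round trip, again assuming a later Python with bytes.decode() and the b"" prefix; the variable names are only illustrative.

```python
# One Latin-1 byte: 0xf6 is LATIN SMALL LETTER O WITH DIAERESIS.
raw = b"\xf6"

# Step 1: decode. Say explicitly how the bytes should be interpreted.
text = raw.decode("latin-1")

# Step 2: encode. Only now can the text be written out in another encoding.
utf8_bytes = text.encode("utf-8")
```

After the round trip, text holds the single character U+00F6 and utf8_bytes holds its two-byte UTF-8 form.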
Okay, so I'm set now, I think. One further question. As I was glancing
through unicodeobject.c just now (actually, it's large enough that "glance"
might not be the correct term ;-), I noticed that when testing for default
encodings a case-sensitive comparison against lowercase encoding names is
used. This results in substantially slower conversions if you are silly
enough to use common printed representations of encodings (e.g. "Latin-1"
instead of "latin-1" or "UTF-8" instead of "utf-8") as your encoding:
>>> import time, string
>>> def timeenc(enc, n):
...     t = time.time()
...     for i in xrange(n):
...         u = unicode(string.letters, enc)
...     return time.time()-t
...
>>> timeenc("Latin-1", 1000)
0.090107917785644531
>>> timeenc("latin-1", 1000)
0.011034011840820312
Shouldn't the tests for common encodings in PyUnicode_AsEncodedString and
PyUnicode_Decode use strcasecmp?
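One way to picture the suggestion is to fold the encoding name to lowercase before comparing it against the shortcut names. The sketch below is Python rather than the C in unicodeobject.c, and the FAST_PATH_NAMES tuple and both helper functions are hypothetical names chosen for illustration, not anything taken from the interpreter source.

```python
# Assumed list of encodings that get a shortcut; the real set lives in C
# and may differ.
FAST_PATH_NAMES = ("utf-8", "latin-1", "ascii")

def takes_fast_path_case_sensitive(name):
    """Mimics the comparison described above: exact match on lowercase names."""
    return name in FAST_PATH_NAMES

def takes_fast_path_case_insensitive(name):
    """Proposed behaviour: ignore case, roughly what strcasecmp would give."""
    return name.lower() in FAST_PATH_NAMES

# "Latin-1" misses the shortcut with the exact comparison...
missed = takes_fast_path_case_sensitive("Latin-1")
# ...but hits it once case is ignored.
hit = takes_fast_path_case_insensitive("Latin-1")
```

Under this assumption, "Latin-1" and "UTF-8" would take the same quick route as their lowercase spellings instead of falling through to the general codec lookup.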
Skip