Unicode drives me crazy...

Fuzzyman fuzzyman at gmail.com
Mon Jul 4 08:50:00 EDT 2005



fowlertrainer at citromail.hu wrote:
> Hi !
>
> I want to get the WMI infos from Windows machines.
> I use Py from HU (iso-8859-2) charset.
>
> Then I wrote some utility for it, because I want to write it to an XML file.
>
> def ToHU(s,NoneStr='-'):
>     if s==None: s=NoneStr
>     if not (type(s) in [type(''),type(u'')]):
>        s=str(s)
>     if type(s)<>type(u''):
>        s=unicode(s)
>     s=s.replace(chr(0),' ');
>     s=s.encode('iso-8859-2')
>     return s
>
> This fn is working, but I have been got an error with this value:
> 'Kommunik\xe1ci\xf3s port (COM1)'
>
> This routine demonstrates the problem
>
> s='Kommunik\xe1ci\xf3s port (COM1)'
> print s
> print type(s)
> print type(u'aaa')
> s=unicode(s) # error !
>
> This is makes me mad.
> How to I convert every objects to string, and convert (encode) them to
> iso-8859-2 (if needed) ?
>

s is a 'byte string' - a sequence of bytes that encodes characters. (As
is every string on some level.) In order to convert that to a unicode
object, Python needs to know what encoding is used. In other words, it
needs to know what character each byte represents.

See this:

t = s.decode('iso-8859-2')
t
u'Kommunik\xe1ci\xf3s port (COM1)'
print t
Kommunikációs port (COM1)
print type(s)
<type 'str'>
print type(t)
<type 'unicode'>

The decode call converts s into a unicode string - where Python
knows what every character is. If you call unicode with no encoding
specified, Python falls back to the system default - which is *probably*
'ascii'. Your string contains bytes which have *no meaning* in the
ascii codec - so it reports an error.
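
To make the difference concrete, here is a minimal sketch of the two cases. It is written with b'' and u'' literals so that it should run on a modern Python 2 or Python 3, which is my assumption rather than something from your setup. The names raw, text and ascii_failed are only for illustration.

```python
# On Python 3, bytes plays the role of the old 'str' type
# and str plays the role of the old 'unicode' type.
raw = b'Kommunik\xe1ci\xf3s port (COM1)'

# Explicit encoding: each byte is mapped to the character it stands for.
text = raw.decode('iso-8859-2')

# Decoding as ascii fails, because the bytes 0xe1 and 0xf3
# are outside the ascii range (0x00 to 0x7f).
try:
    raw.decode('ascii')
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
```

The second decode is roughly what happens behind the scenes when unicode(s) is called without an encoding and the default is 'ascii'.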

Does this help ?

Once you 'get unicode', Python support for it is pretty easy. It's a
slightly complicated subject though. Basically you need to *know* what
encoding is being used, and whenever you convert between unicode and
byte-strings you need to specify it.
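
As a rough sketch of that rule applied to your helper, something like the following might work. The function name to_latin2 and the source_encoding parameter are made up for this example, and I am assuming that any byte strings you get from WMI are already in iso-8859-2. If that assumption is wrong, pass the real encoding instead.

```python
def to_latin2(value, none_str=u'-', source_encoding='iso-8859-2'):
    """Return value as an iso-8859-2 encoded byte string."""
    if value is None:
        value = none_str
    if isinstance(value, bytes):
        # Byte string: say explicitly which encoding it is in.
        text = value.decode(source_encoding)
    elif isinstance(value, type(u'')):
        # Already a unicode object, nothing to decode.
        text = value
    else:
        # Numbers and other objects: take their text representation.
        text = u'%s' % (value,)
    # Replace NUL characters with spaces, as the original helper did.
    text = text.replace(u'\x00', u' ')
    return text.encode('iso-8859-2')
```

Note that the final encode will still raise UnicodeEncodeError if the text contains a character that iso-8859-2 cannot represent. That is a real error in the data, not a default-encoding accident.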

What can complicate matters is that there are lots of times when an
*implicit* conversion can take place. Adding strings to unicode
objects, printing strings, or writing them to a file are the usual
times implicit conversion can happen. If you haven't specified an
encoding, then Python has to use the system default or the file object
default (sys.stdout often has a different default encoding than the one
returned by sys.getdefaultencoding()). It is these implicit conversions
that often cause the 'UnicodeDecodeError's and 'UnicodeEncodeError's.
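
One way to avoid the implicit conversion when writing a file is to open it with an explicit encoding, so the file object does the encoding for you. This is only a sketch: it assumes a Python that has io.open (2.6 or later, and Python 3), and the file name wmi.xml is made up.

```python
import io
import os
import tempfile

text = u'Kommunik\u00e1ci\u00f3s port (COM1)'

# Hypothetical output file in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'wmi.xml')

# The file object encodes on write, using the encoding we name.
with io.open(path, 'w', encoding='iso-8859-2') as handle:
    handle.write(text)

# Reading back in binary mode shows the bytes that were actually stored.
with io.open(path, 'rb') as handle:
    stored = handle.read()
```

With this approach you keep unicode objects everywhere inside the program and only deal with bytes at the edges, where the data comes in and goes out.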

HTH

Best Regards,

Fuzzy
http://www.voidspace.org.uk/python

> Please help me !
> 
> Thanx for help:
>  ft



