Unicode support in python

Diez B. Roggisch deets at nospam.web.de
Fri Oct 20 09:06:56 EDT 2006


sonald wrote:
> Fredrik Lundh wrote:
>>>    http://www.google.com/search?q=python+unicode
>> (and before anyone starts screaming about how they hate RTFM replies, look
>> at the search result)
>>
>> </F>
> Thanks!! but i have already tried this...

You may have tried it, but you certainly didn't understand it. So I 
suggest that you read it again.

> and let me tell you what i am trying now...
> 
> I have added the following line in the script
> 
> # -*- coding: utf-8 -*-

This will _only_ affect unicode literals inside the script itself - 
nothing else! No files read, no files written, and additionally the paths 
of sun, earth and moon are unaffected as well - just in case you wondered.

This is an example of what is affected now:


--------
# -*- coding: utf-8 -*-
# This is a byte string. It contains exactly the bytes that are
# stored in the file between the quotes, regardless of the
# encoding declared above.
some_string = "büchsenböller"

# This is a unicode literal (note the leading u).
# It will be _decoded_ using the encoding declared above,
# so make sure your file is actually saved in that encoding.
some_unicode_object = u"büchsenböller"
---------
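
To make the difference between the two kinds of string concrete, here is a small sketch written for a current Python, where the u prefix is still accepted. It uses \u escapes instead of literal umlauts so that it doesn't depend on how the script file is saved. The variable names are purely illustrative.

```python
# A unicode object: 13 characters, two of them non-ASCII (u-umlaut, o-umlaut).
text = u"b\u00fcchsenb\u00f6ller"

# Encoding turns characters into bytes. In utf-8 each umlaut takes two bytes.
raw = text.encode("utf-8")

print(len(text))                    # number of characters
print(len(raw))                     # number of bytes
print(raw.decode("utf-8") == text)  # decoding with the right codec round-trips
```

The character count and the byte count differ, which is exactly why a byte string and a unicode object must not be confused.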




> I have also modified the site.py in ./Python24/Lib as
> def setencoding():
>     """Set the string encoding used by the Unicode implementation.  The
>     default is 'ascii', but if you're willing to experiment, you can
>     change this."""
>     encoding = "utf-8" # Default value set by _PyUnicode_Init()
>     if 0:
>         # Enable to support locale aware default string encodings.
>         import locale
>         loc = locale.getdefaultlocale()
>         if loc[1]:
>             encoding = loc[1]
>     if 0:
>         # Enable to switch off string to Unicode coercion and implicit
>         # Unicode to string conversion.
>         encoding = "undefined"
>     if encoding != "ascii":
>         # On Non-Unicode builds this will raise an AttributeError...
>         sys.setdefaultencoding(encoding) # Needs Python Unicode build !
> 
> Now when I try to validate the data in the text file
> say abc.txt (saved as with utf-8 encoding) containing either english or
> russian text,
> 
> some junk character (box like) is added as the first character
> what must be the reason for this?
> and how do I handle it?

You shouldn't tamper with the site-wide default encoding: in the best 
case it will only mask errors you made, and in the worst case it will 
produce new ones.

And how do you think it would help you anyway? Python's unicode support 
would be stupid, to say the least, if it required changing the 
installation before dealing with files of different encodings - don't 
you think?

As you don't show us the code you actually use to read that file, I'm 
reduced to guessing here, but if you just read it as a plain byte string with

content = open("test.txt").read()

there won't be any magic decoding happening.
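
A cheap way to see what really is in the file is to look at the raw bytes before decoding anything. The sketch below does that for a throwaway sample file it creates itself, written for a current Python. The byte order mark check is only a guess at where a leading "box" could come from (some editors write one when saving as utf-8). Nothing in this thread confirms that is the cause. The file name and contents are made up for illustration.

```python
import codecs
import os
import tempfile

# Hypothetical sample file: a byte order mark followed by utf-8 encoded text.
path = os.path.join(tempfile.mkdtemp(), "test.txt")
sample = u"\u043f\u0440\u0438\u0432\u0435\u0442"  # six Cyrillic letters
with open(path, "wb") as f:
    f.write(codecs.BOM_UTF8 + sample.encode("utf-8"))

# Reading in binary mode returns the bytes exactly as stored; nothing is decoded.
with open(path, "rb") as f:
    raw = f.read()

has_bom = raw.startswith(codecs.BOM_UTF8)
print(repr(raw[:3]))  # the first three bytes of the file
print(has_bom)        # whether the file starts with a utf-8 byte order mark
print(len(raw))       # bytes, not characters
```

If has_bom turns out to be true for your file, those three bytes are part of the data and have to be dealt with when decoding.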

What you need to do instead is this (if you happen to know that test.txt 
is encoded in utf-8):

content = open("test.txt").read().decode("utf-8")


Then you have a unicode object. Now if you need that to be written to a 
terminal (or wherever your "boxes" appear - guessing here too, no code, 
you remember?), you need to make sure that

  - you know the terminal's encoding

  - you properly encode the unicode content to that encoding before 
printing, as otherwise the default encoding will be used


So, in case your terminal uses utf-8, you do

print content.encode("utf-8")
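
The whole round trip, bytes in, unicode inside, bytes out, can be sketched as follows for a current Python. The helper name encode_for_terminal, the utf-8 fallback and the "replace" error handler are my assumptions for illustration, not anything prescribed by Python. On a current Python, print already encodes text for the terminal, so the helper only makes the step explicit.

```python
import sys

def encode_for_terminal(text, encoding=None):
    """Encode text for output; hypothetical helper, not a library function."""
    # sys.stdout.encoding may be None (for example when output is piped),
    # so fall back to utf-8. That fallback is an assumption.
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    # "replace" substitutes "?" for characters the target encoding lacks.
    return text.encode(encoding, "replace")

raw = b"b\xc3\xbcchsenb\xc3\xb6ller"  # bytes as they would come from a utf-8 file
content = raw.decode("utf-8")         # decode once, right after reading

print(repr(encode_for_terminal(content)))
print(repr(encode_for_terminal(content, "ascii")))
```

The point of the design is to decode as early as possible, work with unicode objects in between, and encode only at the moment the data leaves the program.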


Diez


