Interpreting non-ascii characters.

John Machin sjmachin at lexicon.net
Tue Jul 17 18:29:58 EDT 2007


On 18/07/2007 4:11 AM, ddtl wrote:
> Hello everybody,
> 
> I want to create a script which reads files in the
> current directory and renames them according to some
> scheme. The file names are in Russian - sometimes
> the names are encoded as win-1251, sometimes as koi8-r etc.

You have a file system with 8-bit file names with no indication of 
'codepage' or 'encoding', either globally or per file? Which operating 
system are you using?

> I want to read in file name and convert it to list for 
> further processing.

Do you mean reading the file name from a text file, or getting it from 
e.g. glob.glob() or os.listdir()?

What do you mean by "convert it to list"? Do you mean 'foo.txt' -> ['f', 
'o', ....etc]? If so, why?

>  The problem is that Python treats 
> non-ascii characters as multibyte characters - for 
> example, hex code for "Small Character A" in koi8-r is 
> 0xc1, but Python interprets it as a sequence of 
> \xd0, \xb1 bytes.

Python is very unlikely to do that all by itself. Please show us the 
script or whatever evidence you have. I strongly suggest that 
immediately after "reading" a file name, you do
     print repr(file_name)
NOT
     print file_name
so that you can see *exactly* what you've got.
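
To illustrate the difference, here is a minimal sketch. It is written in 
Python 3 syntax (an assumption; the snippets in this message are Python 2), 
and raw_name is a made-up file name, not one taken from your directory:

```python
# Hypothetical file name: the koi8-r byte for Cyrillic small "a" plus ".txt".
raw_name = b'\xc1.txt'

# repr() shows every byte unambiguously, whatever the terminal encoding is,
# so nothing is silently re-encoded or replaced on the way to the screen.
shown = repr(raw_name)
print(shown)
```

In Python 3 on a POSIX system, os.listdir(b'.') returns the names as bytes, 
so the same check can be applied to real directory entries.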

Are you sure about the \xb1??? Consider this:

 >>> '\xc1'.decode('koi8-r')
u'\u0430'
 >>> '\xc1'.decode('koi8-r').encode('utf8')
'\xd0\xb0'
 >>>
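
For comparison, a small sketch of what the reported pair \xd0, \xb1 would 
mean if it were UTF-8. Python 3 syntax is assumed here, and the variable 
names are only illustrative:

```python
# The byte pair from the original question, treated as UTF-8.
reported = b'\xd0\xb1'
as_text = reported.decode('utf-8')

# Encode that character back to koi8-r to see which single byte it
# would have come from.
koi8_byte = as_text.encode('koi8-r')
print(ascii(as_text), koi8_byte)
```

This gives U+0431 and the koi8-r byte 0xc2, not 0xc1, which is why the 
\xb1 looks suspicious: 0xc1 should come out as \xd0\xb0.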

Also:
 >>> import sys; sys.stdout.encoding
'cp850' # Win XP Pro, command prompt
 >>>
What do you get when you do that?

> 
> What can I do so that Python interprets non-ascii 
> characters correctly?

Find out how your non-ascii characters are encoded, then tell Python 
explicitly by decoding the bytes with that encoding.
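
A rough sketch of why the encoding has to be known rather than guessed, 
again assuming Python 3 syntax. decode_name is a hypothetical helper, not 
part of any library:

```python
def decode_name(raw_name, encoding):
    """Decode raw file-name bytes using an encoding the caller already knows."""
    return raw_name.decode(encoding)

# The same single byte is a valid character in both encodings,
# but it is a different character in each.
raw = b'\xc1'
as_koi8 = decode_name(raw, 'koi8-r')    # Cyrillic small letter a
as_cp1251 = decode_name(raw, 'cp1251')  # Cyrillic capital letter be
print(ascii(as_koi8), ascii(as_cp1251))
```

Neither call raises an error, so trying each encoding in turn and keeping 
the first one that "works" cannot tell you which one is right.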

Read this:
http://www.amk.ca/python/howto/unicode

Hope this helps,
John


