Interpreting non-ascii characters.
John Machin
sjmachin at lexicon.net
Tue Jul 17 18:29:58 EDT 2007
On 18/07/2007 4:11 AM, ddtl wrote:
> Hello everybody,
>
> I want to create a script which reads files in a
> current directory and renames them according to some
> scheme. The file names are in Russian - sometimes
> the names encoded as win-1251, sometimes as koi8-r etc.
You have a file system with 8-bit file names with no indication of
'codepage' or 'encoding', either globally or per file? Which operating
system are you using?
> I want to read in file name and convert it to list for
> further processing.
Read the file name from a text file? Or do you mean using e.g. glob.glob()
or os.listdir()?
What do you mean by "convert it to list"? Do you mean 'foo.txt' -> ['f',
'o', 'o', ...]? If so, why?
> The problem is that Python treats
> non-ascii characters as multibyte characters - for
> example, hex code for "Small Character A" in koi8-r is
> 0xc1, but Python interprets it as a sequence of
> \xd0, \xb1 bytes.
Python is very unlikely to do that all by itself. Please show us the
script or whatever evidence you have. I strongly suggest that
immediately after "reading" a file name, you do
    print repr(file_name)
NOT
    print file_name
so that you can see *exactly* what you've got.
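As a rough illustration of why repr() matters here: the sketch below uses Python 3 syntax, and the two byte strings are made-up stand-ins for real file names, not anything taken from your system. Printed directly, both might show up as the same letter or as garbage, depending on the terminal. Through repr() the difference is unambiguous.

```python
# Hypothetical raw file names as bytes (illustrative values only).
koi8_name = b'\xc1.txt'        # Cyrillic small a + ".txt", encoded as koi8-r
utf8_name = b'\xd0\xb0.txt'    # the same text, encoded as UTF-8


def describe(raw_name):
    """Return an ASCII-only description of a raw file name."""
    return '%r (%d bytes)' % (raw_name, len(raw_name))


for raw in (koi8_name, utf8_name):
    print(describe(raw))
```

The byte counts differ (5 versus 6) even though both names stand for the same text, which is exactly the kind of detail a plain print hides.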
Are you sure about the \xb1??? Consider this:
>>> '\xc1'.decode('koi8-r')
u'\u0430'
>>> '\xc1'.decode('koi8-r').encode('utf8')
'\xd0\xb0'
>>>
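To check the other direction, here is a small sketch in Python 3 syntax (the helper name recode is mine, purely for illustration). If the bytes you see really are \xd0\xb1, they are the UTF-8 form of U+0431, which is a different letter from the U+0430 that 0xc1 stands for in koi8-r.

```python
def recode(raw, source, target):
    """Decode bytes from one encoding and re-encode them in another."""
    return raw.decode(source).encode(target)


# 0xc1 in koi8-r is U+0430; in UTF-8 that is the pair d0 b0.
print(recode(b'\xc1', 'koi8-r', 'utf-8'))

# The pair d0 b1 is U+0431, i.e. the *next* letter of the alphabet.
print(ascii(b'\xd0\xb1'.decode('utf-8')))

# Going back to koi8-r, U+0431 is the single byte 0xc2, not 0xc1.
print(recode(b'\xd0\xb1', 'utf-8', 'koi8-r'))
```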
Also:
>>> import sys; sys.stdout.encoding
'cp850' # Win XP Pro, command prompt
>>>
What do you get when you do that?
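If it helps, the following sketch (Python 3 syntax; report_encodings is a name I made up) collects the encodings Python believes are in play for the console, the file system and the locale. The values depend entirely on your machine, so treat the output as a diagnostic, not as an answer.

```python
import locale
import sys


def report_encodings():
    """Collect the encodings Python reports for this environment."""
    return {
        # May be None when stdout is not a real console.
        'stdout': getattr(sys.stdout, 'encoding', None),
        'filesystem': sys.getfilesystemencoding(),
        'locale': locale.getpreferredencoding(False),
    }


for name, value in sorted(report_encodings().items()):
    print('%-10s %r' % (name, value))
```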
>
> What can I do so that Python interprets non-ascii
> characters correctly?
Know how your non-ASCII characters are encoded, and tell Python explicitly
which encoding to use when decoding them.
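A minimal sketch of what "telling Python" could look like, in Python 3 syntax. The function name, its parameters and the choice of koi8-r as the fallback are my assumptions, not something from your script. It tries strict UTF-8 first, because invalid UTF-8 fails loudly, and only then falls back to a legacy encoding that you name.

```python
def decode_name(raw_name, encodings=('utf-8',), fallback='koi8-r'):
    """Decode a raw (bytes) file name; return (text, encoding_used).

    Assumption: the caller knows which legacy encoding to fall back to.
    """
    for encoding in encodings:
        try:
            return raw_name.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # not valid in this encoding, try the next one
    # koi8-r maps every byte value to some character, so this cannot fail.
    return raw_name.decode(fallback), fallback


print(ascii(decode_name(b'\xd0\xb0.txt')))  # valid UTF-8
print(ascii(decode_name(b'\xc1.txt')))      # not UTF-8, falls back
```

Note the limitation: trial decoding cannot reliably tell koi8-r from win-1251, since almost any byte sequence decodes without error in both. For that you need outside knowledge of where each file came from.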
Read this:
http://www.amk.ca/python/howto/unicode
Hope this helps,
John