Unicode support

Fri Aug 6 13:16:49 EDT 2004

Richy2004 wrote:

> code:
> import sys,codecs
> file = codecs.open("accountmgr_words_arb.txt", "r", "utf-16")
> print (file.readline())
> 
> output:
> File "./test.py", line 5, in ?
> print (file.readline())
> File "C:\Python23\lib\codecs.py", line 384, in readline
> return self.reader.readline(size)
> File "c:\Python23\lib\encodings\utf_16.py", line 57, in readline
> raise NotImplementedError, '.readline() is not implemented for
> UTF-16'
> NotImplementedError: .readline() is not implemented for UTF-16
> 
> ======================================================
> code:
> import sys, codecs
> file = codecs.open("accountmgr_words_arb.txt", "r", "utf-16")
> print (file.read())
> 
> output:
> Traceback (most recent call last):
> File "./test.py", line 5, in ?
> print (file.read())
> File "c:\Python23\lib\encodings\cp850.py", line 18, in encode
> return codecs.charmap_encode(input,errors,encoding_map)
> UnicodeEncodeError: 'charmap' codec can't encode characters in position
> 0-2: character maps to <undefined>
> 
> ======================================================
> code:
> import sys, codecs
> file = codecs.open("accountmgr_words_arb.txt", "rb", "utf-16")
> lines = file.readlines()
> print lines

> this works !, output:
> [u'\u0646\u0648\u0639 \u062d\u0633\u0627\u0628 \u062c\u062f\u064a\u062f
> \u0645\u062e\u062a\u0627\u0631.\r\n']

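You understand this is just one line, and not multiple lines? Just 
checking. The reason it works is that printing a list prints the repr() 
of each element, and the repr() of a unicode string contains only ASCII 
escape sequences, so nothing has to be encoded for the console.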
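
To illustrate the point about representations, here is a minimal sketch. It assumes a current Python 3 interpreter rather than 2.3, where ascii() produces the same kind of escaped text that Python 2.3 printed for a list of unicode strings. The sample string is just the first word from your file.

```python
# The first word of the line, written with escapes so this source stays ASCII.
line = "\u0646\u0648\u0639"

# ascii() returns an escaped, ASCII-only representation of the list,
# comparable to what "print lines" showed under Python 2.3.
escaped = ascii([line])

# Because the representation is pure ASCII, encoding it for a cp850
# console cannot fail, which is why printing the list "worked".
console_bytes = escaped.encode("cp850")
print(escaped)
```

The string itself is still three Arabic characters; only its printed representation is ASCII.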

> line = lines[0]
> tokens = line.split("\\u")
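This line doesn't make sense: the \u escapes appear only in the repr() 
of the string. The string itself contains no backslashes, so there is 
nothing to split on. Do you want to split the line into a list of 
individual characters, as in: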
 >> tokens = list(lines[0])
 >> print tokens
[u'\u0646', u'\u0648', u'\u0639', u'\u062d', u'\u0633', u'\u0627', 
u'\u0628', u'\u062c', u'\u062f', u'\u064a', u'\u062f', u'\u0645', 
u'\u062e', u'\u062a', u'\u0627', u'\u0631', u'.', u'\r', u'\n']
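
A small sketch of the difference, assuming Python 3 syntax and using only the first two words of the line as sample data. It shows that splitting on a literal backslash-u finds nothing, while list() and a plain split() give characters and words respectively.

```python
# First two words of the line; the escapes are decoded by the parser,
# so the resulting string holds Arabic characters and one space.
line = "\u0646\u0648\u0639 \u062d\u0633\u0627\u0628"

# "\\u" is a real backslash followed by "u". The string contains no
# backslash, so split() returns the whole string as a single token.
tokens_split = line.split("\\u")

# list() breaks the string into individual characters.
tokens_chars = list(line)

# split() without arguments breaks it into whitespace-separated words.
words = line.split()
```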

> print tokens[0]
> 
> I get this: :(
> Traceback (most recent call last):
> File "./test.py", line 8, in ?
> print tokens[0]
> File "c:\Python23\lib\encodings\cp850.py", line 18, in encode
> return codecs.charmap_encode(input,errors,encoding_map)
> UnicodeEncodeError: 'charmap' codec can't encode characters in position
> 0-2: character maps to <undefined>

Anyway, you are trying to print to the console window. AFAIK, Python 2.3 
guesses the console encoding, which in your case is cp850, and uses 
that single-byte encoding to encode your unicode characters before 
writing them to stdout. Unfortunately, you cannot print what I believe 
are Arabic characters to a CP850-encoded console (as a matter of fact, 
you can't print any of the so-called 'complex scripts' to any Windows 
console, but that is a different matter).
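
The failure can be reproduced without a console at all. The following is a sketch in Python 3 syntax; can_encode is a hypothetical helper written for this illustration, not part of any library.

```python
def can_encode(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# First word of the line from the file.
text = "\u0646\u0648\u0639"

# cp850 is a Western European code page and has no Arabic letters.
fits_cp850 = can_encode(text, "cp850")

# cp1256 is the Windows Arabic code page and does contain them.
fits_cp1256 = can_encode(text, "cp1256")

# With errors="replace" the encode succeeds, but every character
# that cannot be mapped is replaced by a question mark.
lossy = text.encode("cp850", "replace")
```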

If you run the same script in, let's say, IDLE, you won't have that 
problem. In other words, if you need to print these characters, you have 
to either print them as unicode characters to a unicode-savvy output, or 
encode them in an appropriate single-byte encoding (e.g. "cp1256") and 
send them to an output window that knows how to deal with it.
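
As a rough sketch of the second option, assuming Python 3 syntax: encode explicitly to cp1256 and hand the bytes to whatever will display them. An in-memory io.BytesIO buffer stands in here for the real output, which is an assumption made only so the example is self-contained.

```python
import io

# First two words of the line from the file.
text = "\u0646\u0648\u0639 \u062d\u0633\u0627\u0628"

# Encode explicitly instead of letting print pick the console encoding.
encoded = text.encode("cp1256")

# Stand-in for a file or window that expects cp1256 bytes.
output = io.BytesIO()
output.write(encoded)

# A reader that knows the encoding recovers the original text.
decoded = output.getvalue().decode("cp1256")
```

Since cp1256 is a single-byte encoding, each of the characters becomes exactly one byte.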

--
Vincent Wehren
> 
> Thanks,
> Richard
>