(Fucking) Unicode: console print statement and PythonWin: replacement for off-table chars HOWTO?

Robert kxroberto at googlemail.com
Tue Jan 10 14:28:07 EST 2006


(windows or linux console)

>>> print u'\u034a'
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "C:\PYTHON23\lib\encodings\cp850.py", line 18, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u034a' in position 0: character maps to <undefined>
>>>

How can I get replacement behaviour into Python's print statement in
general?

Should I hook sys.stdout/stderr?  sys.stdout.write(u) at least outputs
something, even if only random chars. So print apparently does the
encoding itself: it reads sys.stdout.encoding and encodes with
'strict'.  Where is a good and portable place to hook in, e.g. to get
behaviour similar to .encode(xy,'replace') or 'backslashreplace'?
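
One possible hook, sketched here under assumptions: wrap the byte stream
in a codec writer that uses a tolerant error mode instead of 'strict'.
The helper name tolerant_writer and the BytesIO stand-in for the console
are made up for illustration; only codecs.getwriter is a standard
library call.

```python
import codecs
import io


def tolerant_writer(byte_stream, encoding, errors="backslashreplace"):
    """Wrap a byte stream so unencodable chars are escaped, not fatal.

    Hypothetical helper: codecs.getwriter() returns the StreamWriter
    class for the codec, which accepts an error mode on construction.
    """
    return codecs.getwriter(encoding)(byte_stream, errors=errors)


# An in-memory buffer stands in for a cp850 console here.
buf = io.BytesIO()
out = tolerant_writer(buf, "cp850")
out.write(u"caf\u00e9 \u034a\n")  # e-acute exists in cp850, U+034A does not
result = buf.getvalue()
```

The same wrapper could in principle be assigned to sys.stdout once at
program start so that existing print statements stay untouched; whether
print passes unicode through unchanged depends on the Python version, so
treat that part as untested.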

Shouldn't 'replace' be the default behaviour for (tty-)output !?

Background: my file handling script fails on consoles that do not
support all filename chars. I want my apps to run automatically on
every platform as smoothly, smartly and tolerantly as possible, without
touching hundreds or thousands of print/output statements. (Input is a
separate issue, of course.)

2nd problem, with PythonWin output functions: the PythonWin/win32
functions (which obviously do not support wide unicode, neither
automatically nor through xxxW functions) apparently use the Python
default encoding, but try the default locale first (default locale,
then 'ascii'/site.encoding, then occasionally an exception!).
This can only be made tolerant of alien chars by hacking the encoding
in site.py/sitecustomize.py (very sad to have to do this on each Python
installation).
Or is there a PythonWin function to set the encoding?
sys.setdefaultencoding is removed completely; it is not even preserved
as sys._setdefaultencoding or the like.
 (I would set it to 'mbcs', not to the default locale (cp1252 on my
machine), because only mbcs is tolerant of foreign chars and converts
them to '?'.)
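
To illustrate the difference between a strict and a tolerant encode
without needing Windows (where 'mbcs' is available), here is a small
sketch using cp1252 with explicit error modes. The helper try_encode is
a made-up name; the '?' result comes from the 'replace' error mode, not
from mbcs itself.

```python
def try_encode(text, encoding, errors):
    """Return the encoded bytes, or None if the codec refuses the text."""
    try:
        return text.encode(encoding, errors)
    except UnicodeEncodeError:
        return None


alien = u"\u034a"  # not representable in cp1252

strict_result = try_encode(alien, "cp1252", "strict")
replaced = try_encode(alien, "cp1252", "replace")
escaped = try_encode(alien, "cp1252", "backslashreplace")
```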

The PythonWin scintilla-editor/interactive (obviously) is better: it
obviously uses  'mbcs' always.

I have now decided to put 'mbcs' in site.py on Windows. Isn't that by far
the best acceptable default solution? 'utf-8' in site.py would be
acceptable to get some idea about alien chars, but will

Thus on my Python/Pythonwin Windows default installation 4 encodings
are in action simultaneously !!!! :
* 'ascii' in site.py / str()
* 'mbcs' in PythonWin interactive/editor
* 'cp1252'+'ascii' in PythonWin/win32 Output functions
* 'cp850'  at console output
.. and all output is intolerant on alien chars ! (except 'mbcs' on the
primary _test_ field PythonWin Interactive only!! :-( )
Isn't that designed by the Python creators to drive developers crazy?
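
To see which encodings are actually in effect on a given machine, a
quick report like the following may help. This is only a sketch: the
function name report_encodings and the dictionary keys are invented
here, and it covers the interpreter default, the stdout encoding and the
locale's preferred encoding, not the PythonWin internals.

```python
import locale
import sys


def report_encodings():
    """Collect the encodings Python reports for this process."""
    return {
        # what str()/implicit conversions use
        "default": sys.getdefaultencoding(),
        # what the console stream claims; may be None when redirected
        "stdout": getattr(sys.stdout, "encoding", None),
        # what the locale settings suggest
        "locale": locale.getpreferredencoding(),
    }


encodings = report_encodings()
for name in sorted(encodings):
    print("%-8s %s" % (name, encodings[name]))
```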

Now, by setting the encoding in site.py to 'mbcs' (or 'utf-8'), the
problems in PythonWin are partly solved.  But so far I have no idea how
to get mbcs output for chars that exist in the code page and utf-8 or
backslashreplace for those that do not.
Also: is wide unicode output possible somehow with PythonWin, at least
in certain cases? By WM_SETTEXT, ...SETITEM... tricks?
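
For the "native encoding where possible, escape otherwise" wish, one
approach that might work is a custom codec error handler registered with
codecs.register_error. The handler name "codepoint_tag" and the
<U+XXXX> notation are invented for this sketch, and cp1252 is used in
place of 'mbcs' so it runs on any platform.

```python
import codecs


def codepoint_tag(error):
    """Replace each unencodable character with an ASCII tag like <U+034A>."""
    if not isinstance(error, UnicodeEncodeError):
        raise error
    bad = error.object[error.start:error.end]
    replacement = u"".join(u"<U+%04X>" % ord(ch) for ch in bad)
    # Resume encoding right after the offending characters.
    return replacement, error.end


codecs.register_error("codepoint_tag", codepoint_tag)

# Chars present in the code page are encoded normally; the rest are tagged.
encoded = u"na\u00efve \u034a".encode("cp1252", "codepoint_tag")
```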

On Linux  there is some improvement after setting
site.py/encoding='utf-8'. Still the locale sensitive encoding on tty's
should be tolerant/replace-mode by default.

Robert

PS:

this guy also is somewhat angry about the current situation:
http://blog.ianbicking.org/do-i-hate-unicode-or-do-i-hate-ascii.html

GvR felt safe with 'ascii' for "future improvements" like utf-8 :
http://mail.python.org/pipermail/python-dev/2002-March/020962.html

My suggestions:
* Win/Linux: guessing at least 'mbcs' on Win and 'utf-8' on Linux for
site.encoding would be well worth doing as an improvement step. Or
provide a prominent function (not the fragile sitexxxx.py interface) to
change it. (The current solution is very unportable and takes new
programmers a very long time to understand.)
And/or: make tty print somehow tolerant/char-replacing.
* PythonWin: always use 'mbcs' as the default encoding in win32
functions (mbcs_encode is tolerant/replacing in itself), or make the
encoding tolerant/char-replacing.
And: add xxxW functions or even automatic unicode switching for the
major output functions (SetWindowText, SetItem, DrawText, ...)



