[issue22746] cgitb html: wrong encoding for utf-8

Tue Oct 28 05:21:14 CET 2014

Wolfgang Rohdewald added the comment:

If you cannot offer a solution for arbitrary unicode, you have no solution at all. After all, that is what unicode is about: support ALL languages, not only your own.

I do not quite understand why you think this is not a bug.

If cgitb encodes non-ASCII characters as numeric character references such as &#xe4; the browser does not have to guess the encoding; it will always show the correct character. This works for all of unicode. See https://en.wikipedia.org/wiki/Unicode_and_HTML#Numeric_character_references
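
As a rough illustration of why such references do not depend on the page encoding, here is a minimal sketch using the standard library's html.unescape as a stand-in for what a browser does with them. The sample word is made up for the example.

```python
import html

# Hexadecimal and decimal references name the same code point (U+00E4).
print(html.unescape("K&#xe4;se"))
print(html.unescape("K&#228;se"))
assert html.unescape("&#xe4;") == html.unescape("&#228;") == "\u00e4"

# The reference itself is pure ASCII, so its bytes are identical in
# ASCII, Latin-1 and UTF-8. Nothing is left for the browser to guess.
ref = "&#xe4;"
assert ref.encode("ascii") == ref.encode("latin-1") == ref.encode("utf-8")
```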

So this bug is fixable, I am reopening it.

For Python3, the fix is actually very simple: do not write doc but str(doc.encode('ascii', 'xmlcharrefreplace')), as in the attached patch. This patch works for me, but there may be code paths it does not yet cover. My source file is encoded in utf-8; other source file encodings should be tested too. I do not know whether cgitb correctly honors a source file header like # -*- coding: utf-8 -*-
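
A minimal sketch of the encoding step, not the attached patch itself. The helper name ascii_safe_html and the sample document are hypothetical. One assumption to flag: in Python 3, calling str() on a bytes object returns its repr (with a b'...' wrapper), so this sketch decodes the bytes back to text instead.

```python
def ascii_safe_html(doc):
    """Return doc with every non-ASCII character replaced by a
    numeric character reference (hypothetical helper, for illustration)."""
    encoded = doc.encode("ascii", "xmlcharrefreplace")  # bytes, ASCII only
    return encoded.decode("ascii")                      # back to str

doc = "<p>ValueError: K\u00e4se</p>"

# Decoding yields plain text that can be written to the response.
print(ascii_safe_html(doc))

# str() on bytes yields the repr, including the b'...' wrapper.
print(str(doc.encode("ascii", "xmlcharrefreplace")))
```

The 'xmlcharrefreplace' error handler only touches characters that cannot be encoded as ASCII; it does not escape &, < or >.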

Fixing this for Python2 is certainly doable too but perhaps more difficult because a Python2 str() may have an unknown encoding.

----------
keywords: +patch
resolution: not a bug -> 
status: closed -> open
Added file: http://bugs.python.org/file37047/22746.patch

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue22746>
_______________________________________