[Python-Dev] codecs question

M.-A. Lemburg mal@lemburg.com
Fri, 29 Sep 2000 20:09:13 +0200


"Fred L. Drake, Jr." wrote:
> 
>   Jeremy was just playing with the xml.sax package, and decided to
> print the string returned from parsing "©" (the copyright
> symbol).  Sure enough, he got a traceback:
> 
> >>> print u'\251'
> 
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeError: ASCII encoding error: ordinal not in range(128)
> 
> and asked me about it.  I was a little surprised myself.  First, that
> anyone would use "print" in a SAX handler to start with, and second,
> that it was so painful.

That's a consequence of defaulting to ASCII for all platforms
instead of choosing the encoding depending on the current locale
(the site.py file has code which does the latter).
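
A minimal sketch of the difference, written in present-day Python 3
syntax rather than for the 2.0 interpreter discussed here (that
substitution, and the helper name try_encode, are assumptions made for
illustration only):

```python
import locale

text = "\u00a9"  # the copyright symbol, the same character as u'\251'


def try_encode(s, encoding):
    """Return the encoded bytes, or None if the codec cannot represent s."""
    try:
        return s.encode(encoding)
    except UnicodeError:
        return None


# A fixed ASCII default cannot represent ordinal 169, so this is None.
ascii_result = try_encode(text, "ascii")

# What a locale-based default would pick on this machine; the value
# varies by platform and user settings.
locale_encoding = locale.getpreferredencoding(False)
locale_result = try_encode(text, locale_encoding)
```

Whether locale_result holds bytes or None depends entirely on the
machine's locale, which is the trade-off between the two defaults.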

>   Now, I can chalk this up to not using a reasonable stdout that
> understands that Unicode needs to be translated to Latin-1 given my
> font selection.  So I looked at the codecs module to provide a usable
> output stream.  The EncodedFile class provides a nice wrapper around
> another file object, and supports encoding both ways.
>   Unfortunately, I can't see what "encoding" I should use if I want to
> read & write Unicode string objects to it.  ;(  (Marc-Andre, please
> tell me I've missed something!) 

That depends on what you want to see as output ;-) E.g. in
Europe you'd use Latin-1 (which also contains the copyright
symbol).
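
To illustrate how the choice of encoding decides what ends up in the
output, here is a small sketch in present-day Python 3 syntax (an
assumption; the thread itself concerns 2.0). UTF-8 is included only for
comparison and is not something proposed above:

```python
text = "\u00a9"  # the copyright symbol

# Latin-1 maps the character to the single byte 0xA9.
latin1_bytes = text.encode("latin-1")

# UTF-8 represents the same character as the two bytes 0xC2 0xA9.
utf8_bytes = text.encode("utf-8")
```

A terminal font set up for Latin-1 would show the first form correctly
and the second as two unrelated characters.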

> I also don't think I
> can use it with "print", extended or otherwise.
>   The PRINT_ITEM opcode calls PyFile_WriteObject() with whatever it
> gets, so that's fine.  Then it converts the object using
> PyObject_Str() or PyObject_Repr().  For Unicode objects, the tp_str
> handler attempts conversion to the default encoding ("ascii" in this
> case), and raises the traceback we see above.

Right.

>   Perhaps a little extra work is needed in PyFile_WriteObject() to
> allow Unicode objects to pass through if the file is merely file-like,
> and let the next layer handle the conversion?  This would probably
> break code, and therefore not be acceptable.
>   On the other hand, it's annoying that I can't create a file-object
> that takes Unicode strings from "print", and doesn't seem intuitive.

The problem is that the .write() method of a file-like object
will most probably only work with string objects. If
it uses "s#" or "t#" it's lucky, because then the argument
parser will apply the necessary magic to the input object
to get out some object ready for writing to the file. Otherwise
it will simply fail with a type error.
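
The failure mode can be sketched with a loose analogy in present-day
Python 3 (an assumption; this is not the 2.0 C-level behaviour, and the
"s#"/"t#" argument parsing is not reproduced). io.BytesIO stands in for
a file-like object whose .write() only accepts byte strings:

```python
import io

raw = io.BytesIO()  # byte-oriented file-like object

try:
    raw.write("\u00a9")  # hand it a Unicode string directly
    outcome = "accepted"
except TypeError:
    # No implicit conversion is attempted, so the call is rejected.
    outcome = "type error"

# Nothing was written because the call failed before touching the buffer.
written = raw.getvalue()
```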

Simply allowing PyObject_Str() to return Unicode objects too
is not an alternative either since that would certainly break
tons of code.

Implementing tp_print for Unicode wouldn't get us anything
either.

Perhaps we'll need to fix PyFile_WriteObject() to special
case Unicode and allow calling .write() with a Unicode
object and fix those .write() methods which don't do the
right thing ?!

This is a project for 2.1. In 2.0 only explicitly calling
the .write() method will do the trick and EncodedFile()
helps with this.
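
As a rough sketch of the explicit .write() route, in present-day
Python 3 syntax (an assumption): it uses codecs.getwriter() instead of
EncodedFile(), since that is the wrapper which accepts Unicode strings
directly in current Python, and io.BytesIO stands in for a
byte-oriented stdout:

```python
import codecs
import io

raw = io.BytesIO()  # stands in for a byte-oriented output stream

# StreamWriter that encodes every string written to it as Latin-1.
out = codecs.getwriter("latin-1")(raw)

# Call .write() explicitly with a Unicode string instead of using print.
out.write("Copyright \u00a9 2000\n")

encoded = raw.getvalue()  # the bytes that reached the underlying stream
```

The encoding step happens inside the wrapper, so the underlying stream
only ever sees byte strings.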

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/