[Tutor] Unicode issues (was: UnicodeDecodeError)

Thu Feb 24 11:27:13 CET 2005

On Wed, 23 Feb 2005 23:16:20 -0500
Kent Johnson <kent37 at tds.net> wrote:

> How about
>    n = self.nextfile
>    if not isinstance(n, unicode):
>      n = unicode(n, 'iso8859-1')
> ?
> 
> > At least this might explain why "A\xe4" worked and "\xe4" not as I mentioned in a previous post.
> > Now the problem arises how to determine if self.nextfile is unicode or a byte string?
> > Or maybe even better, make sure that self.nextfile is always a byte string so I can safely convert
> > it to unicode later on. But how to convert unicode user input into byte strings when I don't even
> > know the user's encoding ? I guess this will require some further research.
> 
> Why do you need to convert back to byte strings?
> 
> You can find out the console encoding from sys.stdin and stdout:
>   >>> import sys
>   >>> sys.stdout.encoding
> 'cp437'
>   >>> sys.stdin.encoding
> 'cp437'
> 

I *thought* I would have to convert the user input, which might be in any encoding, back into
a byte string first (remember, I got heavily confused because user input was sometimes unicode and
sometimes a byte string), so that I could convert it to "standard" unicode (utf-8) later on.
I've added this test to the file selection method, where "result" holds the filename the user chose:

    if isinstance(result, unicode):
        result = result.encode('iso8859-1')
    return result

Later on, self.nextfile is set to "result".
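
For comparison, the opposite direction from the quoted suggestion (decode byte strings, leave unicode objects alone) might look roughly like the sketch below. It is written for Python 3, where `str` plays the role of `unicode` and `bytes` the role of the byte string; the helper name `to_text` and the default encoding are assumptions made for illustration, not part of the original code.

```python
def to_text(value, encoding="iso8859-1"):
    """Return value as text, decoding only if it is a byte string.

    Hypothetical helper; the default encoding is an assumption.
    """
    if isinstance(value, bytes):
        # Byte string: decode it once, at the point where it enters the program
        return value.decode(encoding)
    # Already text: nothing to do
    return value


from_bytes = to_text(b"A\xe4")      # byte string input
from_text = to_text("A\u00e4")      # text input
```

Both calls end up with the same text object, so the rest of the program never needs to know which form the file dialog returned.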

The idea behind this test was that, if I could find out the user's encoding, I could do something like:

    if isinstance(result, unicode):
        result = result.encode(sys.stdin.encoding)
    result = unicode(result, 'utf-8')

to avoid problems with unicode objects that have different encodings. Or isn't this necessary at all?
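
One way to probe this question is to try the round trip on a sample value. The sketch below is a Python 3 approximation (`str` stands in for `unicode`, `bytes` for the byte string), and the filename is a made-up sample. It only illustrates what happens when text is encoded with one codec and decoded with another; it is not a statement about what the application should do.

```python
# Made-up sample filename containing a-umlaut (U+00E4)
name = "B\u00e4r.txt"

# Encode with one codec ...
raw = name.encode("iso8859-1")

# ... and try to decode the resulting bytes with a different one
try:
    raw.decode("utf-8")
    mixed_codecs_ok = True
except UnicodeDecodeError:
    # The lone byte 0xe4 is not a valid UTF-8 sequence
    mixed_codecs_ok = False

# Decoding with the same codec simply returns the original text
same_codec_roundtrip = raw.decode("iso8859-1") == name
```

With this sample the mixed-codec decode raises UnicodeDecodeError, while the same-codec round trip only reproduces the text that was there to begin with.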

I'm sorry if this is a dumb question, but I'm afraid I'm a complete encoding-idiot.

Thanks and best regards

Michael