UTF-8 and stdin/stdout?

Chris cwitts at gmail.com
Wed May 28 05:22:04 EDT 2008


On May 28, 11:08 am, dave_140... at hotmail.com wrote:
> Hi,
>
> I have problems getting my Python code to work with UTF-8 encoding
> when reading from stdin / writing to stdout.
>
> Say I have a file, utf8_input, that contains a single character, é,
> coded as UTF-8:
>
>         $ hexdump -C utf8_input
>         00000000  c3 a9
>         00000002
>
> If I read this file by opening it in this Python script:
>
>         $ cat utf8_from_file.py
>         import codecs
>         file = codecs.open('utf8_input', encoding='utf-8')
>         data = file.read()
>         print "length of data =", len(data)
>
> everything goes well:
>
>         $ python utf8_from_file.py
>         length of data = 1
>
> The contents of utf8_input is one character coded as two bytes, so
> UTF-8 decoding is working here.
>
> Now, I would like to do the same with standard input. Of course, this:
>
>         $ cat utf8_from_stdin.py
>         import sys
>         data = sys.stdin.read()
>         print "length of data =", len(data)
>
> does not work:
>
>         $ [/c/DiskCopy] python utf8_from_stdin.py < utf8_input
>         length of data = 2
>
> Here, the contents of utf8_input is not interpreted as UTF-8, so
> Python believes there are two separate characters.
>
> The question, then:
> How could one get utf8_from_stdin.py to work properly with UTF-8?
> (And same question for stdout.)
>
> I googled around, and found rather complex stuff (see, for example,
> http://blog.ianbicking.org/illusive-setdefaultencoding.html), but even
> that didn't work: I still get "length of data = 2" even after
> successfully calling sys.setdefaultencoding('utf-8').
>
> -- dave
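
For the quoted question about stdin/stdout: one approach that should work
on Python 2 is to wrap the raw byte streams with the stream wrappers from
the standard codecs module, so that reads decode from UTF-8 and writes
encode back to it. A minimal sketch (the reworked script below is just an
illustration, not Dave's actual file):

        # utf8_from_stdin.py, reworked to decode/encode at the stream level
        import codecs
        import sys

        # getreader/getwriter return stream wrapper classes for the codec;
        # calling them on a byte stream gives a unicode-aware stream.
        stdin = codecs.getreader('utf-8')(sys.stdin)
        stdout = codecs.getwriter('utf-8')(sys.stdout)

        data = stdin.read()
        stdout.write(u"length of data = %d\n" % len(data))

With utf8_input redirected in as before, this should print
"length of data = 1", since the two bytes c3 a9 decode to the single
character é.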

Weird thing is, 'c3 a9' is é on my side... and copy/pasting the é gives
me 'e9', with the first script giving a result of zero and the second
script giving me 1.
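
That looks like a Latin-1 (or cp1252) side effect rather than anything in
the scripts: copy/pasting the character é into a file on a Latin-1 system
stores it as the single byte e9, whereas Dave's file holds the two-byte
UTF-8 sequence c3 a9. A quick interpreter check (Python 2, just a sketch)
shows the difference:

        >>> '\xc3\xa9'.decode('utf-8')     # the bytes in utf8_input
        u'\xe9'
        >>> '\xc3\xa9'.decode('latin-1')   # same bytes misread as Latin-1
        u'\xc3\xa9'
        >>> u'\xe9'.encode('latin-1')      # é written out as Latin-1
        '\xe9'

So the differing byte counts most likely come from the two test files
being saved in different encodings.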


