Ascii Encoding Error with UTF-8 encoder

Tue Jun 27 19:44:28 EDT 2006

Thanks for the thorough explanation.

What I am doing is converting data for processing that will be tab-delimited 
(for columns) and newline-delimited (for rows). Some of the data contains 
tabs and newlines, so I have to convert them to something else to keep the 
file structure intact.

This was not my idea; I have just been left with the implementation.
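
A minimal sketch of that kind of conversion, written in Python 3 syntax 
rather than the Python 2.4 used in the session quoted below. The helper 
names sanitize_field and to_row, and the use of a space as the replacement, 
are assumptions for illustration only; the real replacement characters are 
whatever the downstream process expects.

```python
# Hypothetical helpers; names and default replacements are assumptions.
def sanitize_field(field, tab_repl=" ", newline_repl=" "):
    """Replace embedded tabs and newlines so they cannot be read as delimiters."""
    return field.replace("\t", tab_repl).replace("\n", newline_repl)


def to_row(fields):
    """Join sanitized fields with tabs and terminate the row with a newline."""
    return "\t".join(sanitize_field(f) for f in fields) + "\n"


rows = [["id", "note"], ["1", "has\ta tab"], ["2", "has\na newline"]]
data = "".join(to_row(r) for r in rows)

# Encode once, on text, at the point of output.
encoded = data.encode("utf-8")
```

The point of the design is that all replacement happens on text, and the 
UTF-8 encoding is applied only once, at the end.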

"John Machin" <sjmachin at lexicon.net> wrote in message 
news:44a1bbcb$1 at news.eftel.com...
> On 28/06/2006 7:46 AM, Mike Currie wrote:
>> Can anyone explain why I'm getting an ascii encoding error when I'm 
>> trying to write out using a UTF-8 encoder?
>>
>
>> >>> f = codecs.open('foo.txt', 'wU', 'utf-8')
>> >>> print filteredLine
>> thisêhasêàtabsêandêlineàbreaks
>> >>> f.write(filteredLine)
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in ?
>>   File "C:\Python24\lib\codecs.py", line 501, in write
>>     return self.writer.write(data)
>>   File "C:\Python24\lib\codecs.py", line 178, in write
>>     data, consumed = self.encode(object, self.errors)
>> UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 4:
>> ordinal not in range(128)
>>
>
> Your fundamental problem is that you are trying to encode an 8-bit string 
> as UTF-8. To do that, the codec first has to convert your string to 
> Unicode, using the default encoding (ascii), and that implicit decode is 
> what fails.
>
> Get this into your head:
> You encode Unicode (as ascii, latin1, cp1252, utf8, glagolitic, whatever) 
> into an 8-bit string.
> You decode an 8-bit string (from whatever) into Unicode.
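
For comparison, a minimal sketch of the same rule in Python 3 syntax. This 
is an assumption of the example, not what was run in this thread (which 
uses Python 2.4): in Python 3 the two types are str (Unicode text) and 
bytes (8-bit data), and the sample string is made up.

```python
text = "caf\u00e9"                 # str: a sequence of Unicode code points

encoded = text.encode("utf-8")     # encode: str -> bytes
decoded = encoded.decode("utf-8")  # decode: bytes -> str

# In Python 3, bytes objects have no .encode() method, so calling encode
# on 8-bit data fails at once instead of triggering an implicit ASCII decode.
has_encode = hasattr(b"\x88", "encode")
```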
>
> Here is a run-down on your problem, using just the encode/decode methods 
> instead of codecs for illustration purposes:
>
> (1) Equivalent to what you did.
> |>> '\x88'.encode('utf-8')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 0: 
> ordinal not in range(128)
>
> (2) Same thing, explicitly trying to decode your 8-bit string as ASCII.
> |>> '\x88'.decode('ascii').encode('utf-8')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 0: 
> ordinal not in range(128)
>
> (3) Encoding Unicode as UTF-8 works, as expected.
> |>> u'\x88'.encode('utf-8')
> '\xc2\x88'
>
> (4) But you need to know what your 8-bit data is supposed to be encoded 
> in, before you start.
> |>> '\x88'.decode('cp1252').encode('utf-8')
> '\xcb\x86'
> |>> '\x88'.decode('latin1').encode('utf-8')
> '\xc2\x88'
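
The same four cases can be sketched in Python 3 syntax (an assumption of 
this example; the session above is Python 2.4). The variable names are 
illustrative only. The byte is identical each time; only the encoding it 
is assumed to be in changes the result.

```python
raw = b"\x88"

as_cp1252 = raw.decode("cp1252")    # U+02C6 MODIFIER LETTER CIRCUMFLEX ACCENT
as_latin1 = raw.decode("latin-1")   # U+0088, a C1 control character

utf8_from_cp1252 = as_cp1252.encode("utf-8")
utf8_from_latin1 = as_latin1.encode("utf-8")

# Decoding the same byte as ASCII fails, as in cases (1) and (2).
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
```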
>
> I am rather puzzled as to what you are trying to achieve. You appear to 
> believe that you possess one or more 8-bit strings, encoded in latin1, 
> which contain the C0 controls \x09 (HT) and \x0a (LF) AND the 
> corresponding C1 controls \x88 (HTS) and \x85 (NEL). You want to change LF 
> to NEL, and NEL to LF and similarly with the other pair. Then you want to 
> write the result, encoded in UTF-8, to a file. The purpose behind that 
> baroque/byzantine capering would be .... what?
>
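
A rough sketch of the transformation as described in the paragraph above, 
in Python 3 syntax. It assumes the input really is latin1, which is 
exactly the point that needs confirming. The names SWAP and swap_controls 
and the sample data are hypothetical.

```python
# Swap each C0 control with its C1 counterpart: HT <-> HTS, LF <-> NEL.
# Assumption: the input bytes are latin1-encoded.
SWAP = {0x09: 0x88, 0x88: 0x09, 0x0A: 0x85, 0x85: 0x0A}


def swap_controls(raw, source_encoding="latin-1"):
    """Decode bytes to text, swap the control characters, encode as UTF-8."""
    text = raw.decode(source_encoding)
    return text.translate(SWAP).encode("utf-8")


raw = b"this\thas\ttabs\nand line breaks"
converted = swap_controls(raw)

# The mapping is its own inverse, so applying it again restores the input.
restored = converted.decode("utf-8").translate(SWAP).encode("latin-1")
```

Because the swap is done on decoded text, each C1 control ends up as a 
two-byte UTF-8 sequence in the output rather than as a single raw byte.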