Ascii Encoding Error with UTF-8 encoder
Mike Currie
dev at null.com
Tue Jun 27 19:44:28 EDT 2006
Thanks for the thorough explanation.
What I am doing is converting data for processing into a file that is
tab-delimited (columns) and newline-delimited (rows). Some of the data itself
contains tabs and newlines, so I have to convert those to something else to
keep the file's structure intact.
Not my idea, but I've been left with the implementation.
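
For anyone facing the same requirement, here is a minimal sketch of one common alternative: backslash-escaping tabs and newlines inside each field instead of substituting other characters. It is written for Python 3 (this thread uses Python 2.4), the names escape_field, unescape_field and format_row are made up for illustration, and it only helps if whatever reads the file can be taught to unescape. It is not what my current spec asks for.

```python
def escape_field(text):
    # Escape the backslash first so the later escapes stay unambiguous.
    return (text.replace("\\", "\\\\")
                .replace("\t", "\\t")
                .replace("\n", "\\n"))


def unescape_field(text):
    # Reverse escape_field by walking the string one character at a time.
    mapping = {"t": "\t", "n": "\n", "\\": "\\"}
    out = []
    chars = iter(text)
    for ch in chars:
        if ch == "\\":
            nxt = next(chars, "")
            # Unknown escapes are passed through unchanged.
            out.append(mapping.get(nxt, "\\" + nxt))
        else:
            out.append(ch)
    return "".join(out)


def format_row(fields):
    # Tabs separate columns, a newline ends the row.
    return "\t".join(escape_field(f) for f in fields) + "\n"


row = format_row(["a\tb", "c\nd"])
```

After escaping, the only literal tabs and newlines left in the row are the delimiters, so splitting on them is safe. Carriage returns are not handled in this sketch.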
"John Machin" <sjmachin at lexicon.net> wrote in message
news:44a1bbcb$1 at news.eftel.com...
> On 28/06/2006 7:46 AM, Mike Currie wrote:
>> Can anyone explain why I'm getting an ascii encoding error when I'm
>> trying to write out using a UTF-8 encoder?
>>
>
>>>>> f = codecs.open('foo.txt', 'wU', 'utf-8')
>>>>> print filteredLine
>> thisêhasêàtabsêandêlineàbreaks
>>>>> f.write(filteredLine)
>> Traceback (most recent call last):
>> File "<stdin>", line 1, in ?
>> File "C:\Python24\lib\codecs.py", line 501, in write
>> return self.writer.write(data)
>> File "C:\Python24\lib\codecs.py", line 178, in write
>> data, consumed = self.encode(object, self.errors)
>> UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 4:
>> ordinal not in range(128)
>>
>
> Your fundamental problem is that you are trying to encode an 8-bit string
> as UTF-8. Only Unicode can be encoded, so the codec first tries to convert
> your string to Unicode using the default encoding (ascii), which fails on
> any byte above 0x7f.
>
> Get this into your head:
> You encode Unicode as ascii, latin1, cp1252, utf8, glagolitic, whatever
> into an 8-bit string.
> You decode whatever from an 8-bit string into Unicode.
>
> Here is a run-down on your problem, using just the encode/decode methods
> instead of codecs for illustration purposes:
>
> (1) Equivalent to what you did.
> |>> '\x88'.encode('utf-8')
> Traceback (most recent call last):
> File "<stdin>", line 1, in ?
> UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 0:
> ordinal not in range(128)
>
> (2) Same thing, explicitly trying to decode your 8-bit string as ASCII.
> |>> '\x88'.decode('ascii').encode('utf-8')
> Traceback (most recent call last):
> File "<stdin>", line 1, in ?
> UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 0:
> ordinal not in range(128)
>
> (3) Encoding Unicode as UTF-8 works, as expected.
> |>> u'\x88'.encode('utf-8')
> '\xc2\x88'
>
> (4) But you need to know what your 8-bit data is supposed to be encoded
> in, before you start.
> |>> '\x88'.decode('cp1252').encode('utf-8')
> '\xcb\x86'
> |>> '\x88'.decode('latin1').encode('utf-8')
> '\xc2\x88'
>
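
For comparison, the same four cases can be sketched in Python 3, where the 8-bit type is bytes and text is always Unicode. This is an assumption-laden translation, not part of John's session: the variable names are mine, and in Python 3 mistake (1) cannot even be written, because bytes has no encode method.

```python
raw = b"\x88"  # one 8-bit byte, as in the examples above

# (1): bytes cannot be encoded at all in Python 3.
bytes_can_encode = hasattr(raw, "encode")

# (2): ASCII cannot decode a byte above 0x7f.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# (3): encoding text (Unicode) as UTF-8 works.
from_text = "\x88".encode("utf-8")

# (4): the result depends on which encoding the bytes were really in.
via_cp1252 = raw.decode("cp1252").encode("utf-8")
via_latin1 = raw.decode("latin-1").encode("utf-8")

print(bytes_can_encode, ascii_ok, from_text, via_cp1252, via_latin1)
```

The last two results differ because cp1252 maps byte 0x88 to U+02C6 while latin1 maps it to U+0088, which is exactly why the source encoding has to be known up front.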
> I am rather puzzled as to what you are trying to achieve. You appear to
> believe that you possess one or more 8-bit strings, encoded in latin1,
> which contain the C0 controls \x09 (HT) and \x0a (LF) AND the
> corresponding C1 controls \x88 (HTS) and \x85 (NEL). You want to change LF
> to NEL, and NEL to LF and similarly with the other pair. Then you want to
> write the result, encoded in UTF-8, to a file. The purpose behind that
> baroque/byzantine capering would be .... what?
>
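
As a rough sketch of the pipeline described above, assuming the input really is latin1 and that swapping HT with HTS and LF with NEL is what is wanted: decode first, swap on the Unicode text, and only then write it out as UTF-8. This is Python 3, not the Python 2.4 used in the thread, and SWAP, swap_controls and write_utf8 are hypothetical names chosen for the example.

```python
import os
import tempfile

# Exchange each C0 control with its C1 counterpart, in both directions.
SWAP = str.maketrans({"\t": "\x88", "\x88": "\t",
                      "\n": "\x85", "\x85": "\n"})


def swap_controls(raw, source_encoding="latin-1"):
    # Decode the 8-bit input to text before touching any characters.
    return raw.decode(source_encoding).translate(SWAP)


def write_utf8(path, text):
    # newline="" stops Python from translating "\n" on output.
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text)


raw = b"this\thas\ttabs\nand\nline breaks"
filtered = swap_controls(raw)

path = os.path.join(tempfile.mkdtemp(), "foo.txt")
write_utf8(path, filtered)
with open(path, "rb") as f:
    written = f.read()
```

Because the swap is symmetric, applying the same translation to the text read back from the file restores the original tabs and newlines.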