Ascii Encoding Error with UTF-8 encoder
John Machin
sjmachin at lexicon.net
Tue Jun 27 19:14:21 EDT 2006
On 28/06/2006 7:46 AM, Mike Currie wrote:
> Can anyone explain why I'm getting an ascii encoding error when I'm trying
> to write out using a UTF-8 encoder?
>
>>>> f = codecs.open('foo.txt', 'wU', 'utf-8')
>>>> print filteredLine
> thisêhasêàtabsêandêlineàbreaks
>>>> f.write(filteredLine)
> Traceback (most recent call last):
> File "<stdin>", line 1, in ?
> File "C:\Python24\lib\codecs.py", line 501, in write
> return self.writer.write(data)
> File "C:\Python24\lib\codecs.py", line 178, in write
> data, consumed = self.encode(object, self.errors)
> UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 4: ordinal not in range(128)
>
Your fundamental problem is that you are trying to encode an 8-bit
string as UTF-8. Encoding applies to Unicode, so the codec first tries
to convert your string to Unicode using the default encoding (ascii),
which fails on the byte 0x88.
Get this into your head:
You encode Unicode as ascii, latin1, cp1252, utf8, glagolitic, whatever
into an 8-bit string.
You decode whatever from an 8-bit string into Unicode.
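The two directions can be sketched as follows. This is only an illustration, and it assumes Python 3 (not the Python 2.4 in the traceback above), where str is the Unicode type and bytes is the 8-bit string; the variable names are made up for the example.

```python
# Python 3 sketch: str holds Unicode text, bytes holds 8-bit data.
text = "\u02c6"                      # one Unicode character (U+02C6)

# Encode: Unicode -> 8-bit string.
raw = text.encode("utf-8")
print(repr(raw))

# Decode: 8-bit string -> Unicode.
round_trip = raw.decode("utf-8")
print(round_trip == text)

# Python 3 enforces the direction: bytes cannot be encoded and
# str cannot be decoded, so the implicit ascii step cannot happen.
bytes_can_encode = hasattr(b"\x88", "encode")
str_can_decode = hasattr(text, "decode")
print(bytes_can_encode, str_can_decode)
```

In Python 2 the 8-bit str type does have an encode method, which is why the mistake surfaces as a confusing ascii decode error rather than an attribute error.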
Here is a run-down on your problem, using just the encode/decode methods
instead of codecs for illustration purposes:
(1) Equivalent to what you did.
|>> '\x88'.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 0: ordinal not in range(128)
(2) Same thing, explicitly trying to decode your 8-bit string as ASCII.
|>> '\x88'.decode('ascii').encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 0: ordinal not in range(128)
(3) Encoding Unicode as UTF-8 works, as expected.
|>> u'\x88'.encode('utf-8')
'\xc2\x88'
(4) But you need to know what your 8-bit data is supposed to be encoded
in, before you start.
|>> '\x88'.decode('cp1252').encode('utf-8')
'\xcb\x86'
|>> '\x88'.decode('latin1').encode('utf-8')
'\xc2\x88'
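The same point, that one byte means different characters under different source encodings, can be sketched in Python 3 terms (an assumption; the session above is Python 2, and the variable names here are illustrative only):

```python
# Python 3 sketch: the byte 0x88 decodes to different characters
# depending on which encoding you claim the data is in.
raw = b"\x88"

as_cp1252 = raw.decode("cp1252")    # U+02C6 MODIFIER LETTER CIRCUMFLEX ACCENT
as_latin1 = raw.decode("latin-1")   # U+0088, a C1 control character

utf8_from_cp1252 = as_cp1252.encode("utf-8")
utf8_from_latin1 = as_latin1.encode("utf-8")
print(repr(utf8_from_cp1252))
print(repr(utf8_from_latin1))

# Claiming the data is ASCII fails, as in cases (1) and (2).
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
print(ascii_failed)
```

The UTF-8 results differ because the decoded code points differ; nothing in the byte itself says which reading is the right one.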
I am rather puzzled as to what you are trying to achieve. You appear to
believe that you possess one or more 8-bit strings, encoded in latin1,
which contain the C0 controls \x09 (HT) and \x0a (LF) AND the
corresponding C1 controls \x88 (HTS) and \x85 (NEL). You want to change
LF to NEL, and NEL to LF and similarly with the other pair. Then you
want to write the result, encoded in UTF-8, to a file. The purpose
behind that baroque/byzantine capering would be .... what?
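For what it is worth, the steps as described above could be sketched like this. It is a guess at the intent, not a recommendation, and it assumes Python 3 and latin1 input; SWAP, swap_controls and write_utf8 are hypothetical names invented for the sketch.

```python
import os
import tempfile

# Hypothetical mapping: swap each C0 control with its C1 counterpart.
SWAP = {
    0x09: 0x88, 0x88: 0x09,   # HT  <-> HTS
    0x0A: 0x85, 0x85: 0x0A,   # LF  <-> NEL
}

def swap_controls(raw, source_encoding="latin-1"):
    """Decode 8-bit data to Unicode, then swap the control pairs."""
    return raw.decode(source_encoding).translate(SWAP)

def write_utf8(path, text):
    """Encode explicitly and write bytes, so no newline translation occurs."""
    with open(path, "wb") as f:
        f.write(text.encode("utf-8"))

# Sample input with C1 controls where tabs and a line break are wanted.
raw = b"this\x88has\x88\x85tabs"
text = swap_controls(raw)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "foo.txt")
    write_utf8(path, text)
    with open(path, "rb") as f:
        written = f.read()

print(repr(written))
```

The decode step comes first and names the source encoding explicitly; only the resulting Unicode text is encoded as UTF-8.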