Python UTF-8 and codecs

Tue Jun 27 17:22:34 EDT 2006

Okay,

Here is a sample of what I'm doing:

Python 2.4.3 (#69, Mar 29 2006, 17:35:34) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> filterMap = {}
>>> for i in range(256):
...     filterMap[chr(i)] = chr(i)
...
>>> filterMap[chr(9)] = chr(136)
>>> filterMap[chr(10)] = chr(133)
>>> filterMap[chr(136)] = chr(9)
>>> filterMap[chr(133)] = chr(10)
>>> line = '''this      has
... tabs        and     line
... breaks'''
>>> filteredLine = ''.join([ filterMap[a] for a in line])
>>> import codecs
>>> f = codecs.open('foo.txt', 'wU', 'utf-8')
>>> print filteredLine
thisêhasêàtabsêandêlineàbreaks
>>> f.write(filteredLine)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "C:\Python24\lib\codecs.py", line 501, in write
    return self.writer.write(data)
  File "C:\Python24\lib\codecs.py", line 178, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 4: ordinal not in range(128)
>>>
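
For what it's worth, the traceback looks consistent with the codec writer first trying to turn the byte string into Unicode with the default ASCII codec, which fails on byte 0x88. Here is a minimal sketch of just that step. It is written for Python 3 (an assumption; the session above is Python 2.4), where the decode has to be spelled out explicitly, and the variable names are only for illustration:

```python
# The start of the filtered line as raw bytes: the tab after "this"
# has already been replaced by byte 0x88.
data = b'this\x88has'

try:
    data.decode('ascii')      # ASCII only covers bytes 0x00-0x7F
    error = None
except UnicodeDecodeError as exc:
    error = exc

# error.start is the offset of the first byte the codec rejected.
print(error.start, hex(data[error.start]))   # prints: 4 0x88
```

The offset and byte value match the "byte 0x88 in position 4" part of the message above.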

"Mike Currie" <dev at null.com> wrote in message 
news:5Hgog.627$Gv.173 at fed1read09...
> I did make a mistake; it should have been 'wU'.
>
> The starting data is ASCII.
>
> What I'm doing is data processing on files with newline and tab 
> characters inside quoted fields.  The idea is to convert all the newline 
> and tab characters to 0x85 and 0x88 respectively, then process the files. 
> Finally, right before importing them into a database, convert them back to 
> newlines and tabs, thus preserving the field values.
>
> Will Python not handle the control characters correctly?
>
>
> "Serge Orlov" <serge.orlov at gmail.com> wrote in message 
> news:mailman.7516.1151440194.27775.python-list at python.org...
>> On 6/27/06, Mike Currie <dev at null.com> wrote:
>>> I'm trying to write out files that have UTF-8 characters 0x85 and 0x08 
>>> in them.  With every configuration I try, I get a UnicodeError: ascii 
>>> codec can't decode byte 0x85 in position 255: ordinal not in range(128)
>>>
>>> I've tried using codecs.open('foo.txt', 'rU', 'utf-8', errors='strict'), 
>>> and that doesn't work. I've also tried wrapping the file in a 
>>> utf8_writer using codecs.lookup('utf8').
>>>
>>> Any clues?
>>
>> Use unicode strings for non-ascii characters. The following program 
>> "works":
>>
>> import codecs
>>
>> c1 = unichr(0x85)
>> f = codecs.open('foo.txt', 'wU', 'utf-8')
>> f.write(c1)
>> f.close()
>>
>> But unichr(0x85) is a control character; are you sure you want it?
>> What is the encoding of your data?
>
>
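
Following Serge's suggestion to use Unicode strings, here is a rough sketch of the whole round trip: swap tabs and newlines for U+0088 and U+0085, write the result as UTF-8, then read it back and swap again. It is written for Python 3 (an assumption; the thread is Python 2.4), where str is already Unicode and the built-in open() takes an encoding. The names SWAP and swap_controls and the temporary file location are made up for illustration.

```python
import os
import tempfile

# Swap table keyed by code point: tab <-> U+0088, newline <-> U+0085.
# Applying it twice gives back the original text.
SWAP = {0x09: '\x88', 0x0A: '\x85', 0x88: '\t', 0x85: '\n'}

def swap_controls(text):
    """Exchange tabs/newlines with U+0088/U+0085 in a Unicode string."""
    return text.translate(SWAP)

line = 'this\thas\ntabs\tand\tline\nbreaks'
filtered = swap_controls(line)

# Write the filtered text as UTF-8; newline='' turns off newline translation.
path = os.path.join(tempfile.mkdtemp(), 'foo.txt')
with open(path, 'w', encoding='utf-8', newline='') as f:
    f.write(filtered)

# On disk each swapped character is a two-byte UTF-8 sequence.
with open(path, 'rb') as f:
    raw = f.read()

# Read it back and undo the swap right before the import step.
with open(path, 'r', encoding='utf-8', newline='') as f:
    restored = swap_controls(f.read())
```

Here U+0088 ends up on disk as the bytes C2 88 rather than a single 0x88 byte, so anything that processes the intermediate files has to read them as UTF-8 too.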