codecs, csv issues

Fri Aug 22 11:44:50 EDT 2008

George Sakkis wrote:

> I'm trying to use codecs.open() and I see two issues when I pass
> encoding='utf8':
> 
> 1) Newlines are hardcoded to LINEFEED (ascii 10) instead of the
> platform-specific byte(s).
> 
>     import codecs
>     f = codecs.open('tmp.txt', 'w', encoding='utf8')
>     s = u'\u0391\u03b8\u03ae\u03bd\u03b1'
>     print >> f, s
>     print >> f, s
>     f.close()
> 
> This doesn't happen for the default encoding (=None).
> 
> 2) csv.writer doesn't seem to work as expected when being passed a
> codecs object; it treats it as if encoding is ascii:
> 
>     import codecs, csv
>     f = codecs.open('tmp.txt', 'w', encoding='utf8')
>     s = u'\u0391\u03b8\u03ae\u03bd\u03b1'
>     # this works fine
>     print >> f, s
>     # this doesn't
>     csv.writer(f).writerow([s])
>     f.close()
> 
> Traceback (most recent call last):
> ...
>     csv.writer(f).writerow([s])
> UnicodeEncodeError: 'ascii' codec can't encode character u'\u0391' in
> position 0: ordinal not in range(128)
> 
> Is this the expected behavior or are these bugs?

Looking into the documentation of the csv module

"""
Note: This version of the csv module doesn't support Unicode input. Also,
there are currently some issues regarding ASCII NUL characters.
Accordingly, all input should be UTF-8 or printable ASCII to be safe; see
the examples in section 9.1.5. These restrictions will be removed in the
future. 
"""

and into the source code of codecs.open()

    if encoding is not None and \
       'b' not in mode:
        # Force opening of the file in binary mode
        mode = mode + 'b'

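I'd be willing to say that both are implementation limitations rather than
bugs: the csv module does not accept unicode input yet, and codecs.open()
always opens the file in binary mode when an encoding is given, so no
newline translation takes place.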
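
Both limitations can probably be worked around by hand. What follows is only
a minimal sketch, not a definitive recipe: write_lines() and write_csv_row()
are hypothetical helper names, not part of codecs or csv. The first appends
os.linesep explicitly, since the file is opened in binary mode; the second
encodes every cell to UTF-8 before csv sees it, as the documentation note
suggests. The non-Python-2 branch assumes an interpreter whose csv module
accepts text directly, as Python 3 does.

```python
import codecs
import csv
import os
import sys

PY2 = sys.version_info[0] == 2


def write_lines(path, lines):
    # codecs.open() adds 'b' to the mode, so '\n' would be written as-is.
    # Appending os.linesep gives the platform-specific line ending.
    f = codecs.open(path, 'w', encoding='utf8')
    try:
        for line in lines:
            f.write(line + os.linesep)
    finally:
        f.close()


def write_csv_row(path, row):
    if PY2:
        # Python 2: csv only handles byte strings, so encode each cell
        # to UTF-8 and write to a plain binary file.
        f = open(path, 'wb')
        try:
            csv.writer(f).writerow([cell.encode('utf8') for cell in row])
        finally:
            f.close()
    else:
        # Assumption: an interpreter where csv accepts text (Python 3).
        # newline='' leaves the line terminator to the csv writer.
        f = open(path, 'w', encoding='utf8', newline='')
        try:
            csv.writer(f).writerow(row)
        finally:
            f.close()
```

With s = u'\u0391\u03b8\u03ae\u03bd\u03b1' as in the original post,
write_lines('tmp.txt', [s, s]) and write_csv_row('tmp.csv', [s]) should both
produce UTF-8 encoded files without raising UnicodeEncodeError.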

Peter