codecs.StreamRecoder not doing what I expected.

Peter Otten __peter__ at web.de
Sat Dec 12 15:35:36 EST 2015


D'Arcy J.M. Cain wrote:

> More Unicode bafflement.  What I am trying to do is pretty simple I
> think.  I have a bunch of files that I am pretty sure are either utf-8
> or iso-8859-1.  I try utf-8 and fall back to iso-8859-1 if it throws a
> UnicodeError.  Here is my test.
> 
> #! /usr/pkg/bin/python3.4
> # Running on a NetBSD 7.0 server
> # Installed with pkgsrc
> 
> import codecs
> test_file = "StreamRecoder.txt"
> 
> def read_file(fn):
>     try: return open(fn, "r", encoding='utf-8').read()
>     except UnicodeError:
>         return codecs.StreamRecoder(open(fn),

A recoder converts bytes to bytes, so you have to open the file in binary 
mode. However, ...

>             codecs.getencoder('utf-8'),
>             codecs.getdecoder('utf-8'),
>             codecs.getreader('iso-8859-1'),
>             codecs.getwriter('iso-8859-1'), "r").read()
> 
> # plain ASCII
> open(test_file, 'wb').write(b'abc - cents\n')
> print(read_file(test_file))
> 
> # utf-8
> open(test_file, 'wb').write(b'abc - \xc2\xa2\n')
> print(read_file(test_file))
> 
> # iso-8859-1
> open(test_file, 'wb').write(b'abc - \xa2\n')
> print(read_file(test_file))

...when the recoder kicks in, read_file() will return bytes, which is probably 
not what you want. Why not just try the two encodings in turn, as in

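def read_file(filename):
    # Try the strict encoding first; iso-8859-1 maps every byte value,
    # so the second attempt cannot raise UnicodeDecodeError.
    for encoding in ["utf-8", "iso-8859-1"]:
        try:
            with open(filename, encoding=encoding) as f:
                return f.read()
        except UnicodeDecodeError:
            pass
    raise AssertionError("unreachable")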
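
To check the fallback against the three test cases from the original post, something like the following should do. This is only a sketch: the temporary directory and the names samples and results are my own additions, not part of the original test script.

```python
import os
import tempfile


def read_file(filename):
    # Same fallback as above: strict utf-8 first, then iso-8859-1.
    for encoding in ["utf-8", "iso-8859-1"]:
        try:
            with open(filename, encoding=encoding) as f:
                return f.read()
        except UnicodeDecodeError:
            pass
    raise AssertionError("unreachable")


# The three byte strings written by the original test script.
samples = [
    b"abc - cents\n",        # plain ASCII
    b"abc - \xc2\xa2\n",     # utf-8
    b"abc - \xa2\n",         # iso-8859-1
]

results = []
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "StreamRecoder.txt")
    for data in samples:
        with open(path, "wb") as f:
            f.write(data)
        results.append(read_file(path))

# ascii() keeps the output printable even on an ASCII-only terminal.
for text in results:
    print(type(text).__name__, ascii(text))
```

All three calls return str, and the utf-8 and iso-8859-1 files decode to the same text.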

> 
> I expected all three to return UTF-8 strings but here is my output:
> 
> abc - cents
> 
> abc - ¢
> 
> Traceback (most recent call last):
>   File "./StreamRecoder_test", line 9, in read_file
>     try: return open(fn, "r", encoding='utf-8').read()
>   File "/usr/pkg/lib/python3.4/codecs.py", line 319, in decode
>     (result, consumed) = self._buffer_decode(data, self.errors, final)
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa2 in position 6:
> invalid start byte
> 
> During handling of the above exception, another exception occurred:
> 
> Traceback (most recent call last):
>   File "./StreamRecoder_test", line 27, in <module>
>     print(read_file(test_file))
>   File "./StreamRecoder_test", line 15, in read_file
>     codecs.getwriter('iso-8859-1'), "r").read()
>   File "/usr/pkg/lib/python3.4/codecs.py", line 798, in read
>     data = self.reader.read(size)
>   File "/usr/pkg/lib/python3.4/codecs.py", line 489, in read
>     newdata = self.stream.read()
>   File "/usr/pkg/lib/python3.4/encodings/ascii.py", line 26, in decode
>     return codecs.ascii_decode(input, self.errors)[0]
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xa2 in position 6:
> ordinal not in range(128)
> 
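
The second traceback passes through encodings/ascii.py, which suggests that open(fn) without an explicit encoding picked up an ASCII locale default on that machine, so the read failed before the recoder ever saw any bytes. Note also that the sixth positional parameter of codecs.StreamRecoder is errors, not a file mode, so "r" does not mean "read" there. If you do want to see the recoder doing its bytes-to-bytes job, here is a minimal sketch; the io.BytesIO object is my stand-in for a file opened with "rb".

```python
import codecs
import io

# ISO-8859-1 encoded input; BytesIO stands in for open(fn, "rb").
raw = io.BytesIO(b"abc - \xa2\n")

recoder = codecs.StreamRecoder(
    raw,
    codecs.getencoder("utf-8"),      # frontend: what read() hands back
    codecs.getdecoder("utf-8"),      # frontend: what write() accepts
    codecs.getreader("iso-8859-1"),  # backend: how the stream is decoded
    codecs.getwriter("iso-8859-1"),  # backend: how the stream is encoded
)

recoded = recoder.read()         # still bytes, now UTF-8 encoded
text = recoded.decode("utf-8")   # an explicit decode is needed to get str

print(type(recoded).__name__, type(text).__name__)
```

So even when used correctly, the recoder leaves you with bytes to decode yourself, which is why opening the file with the right encoding is simpler.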




