catch UnicodeDecodeError

Thu Jul 26 04:28:46 EDT 2012

Jaroslav Dobrek, 26.07.2012 09:46:
> My problem is solved. What I need to do is explicitly decode text when
> reading it. Then I can catch exceptions. I might do this in future
> programs.

Yes, that's the standard procedure. Decode on the way in, encode on the way
out, use Unicode everywhere in between.
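
As a minimal sketch of that pattern (the byte string and the variable names
are made up for illustration, they are not from your program):

```python
# Bytes as they might arrive from a file or a socket: UTF-8 encoded "Grüße".
raw_in = b"Gr\xc3\xbc\xc3\x9fe\n"

# Decode once, at the input boundary.
text = raw_in.decode("utf-8")

# In between, work on text (Unicode) only.
processed = text.strip()

# Encode once, at the output boundary.
raw_out = processed.encode("utf-8")
```

Everything between the decode() and the encode() never has to care about
encodings at all.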

> What I dislike about this solution is that it complicates most programs
> unnecessarily. In programs that open, read and process many files I
> don't want to explicitly decode and encode characters all the time. I
> just want to write:
> 
> for line in f:

And the cool thing is: you can! :)

In Python 2.6 and later, the new Py3 open() function is a bit more hidden,
but it's still available:

    from io import open

    filename = "somefile.txt"
    try:
        with open(filename, encoding="utf-8") as f:
            for line in f:
                process_line(line)  # actually, I'd use "process_file(f)"
    except IOError as e:
        print("Reading file %s failed: %s" % (filename, e))
    except UnicodeDecodeError as e:
        print("Some error occurred decoding file %s: %s" % (filename, e))

Ok, maybe with a better way to handle the errors than "print" ...

For older Python versions, you'd use "codecs.open()" instead. It only looks
a bit messy by comparison, because file I/O was finally cleaned up for
Python 3.
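
Roughly like this, as a sketch only: the temporary file and its content are
invented so that the example is self-contained, and I use try/finally instead
of "with" because this is meant for old interpreters:

```python
import codecs
import os
import tempfile

# Create a small UTF-8 sample file ("café") so the sketch can run on its own.
fd, path = tempfile.mkstemp()
os.write(fd, b"caf\xc3\xa9\n")
os.close(fd)

# codecs.open() returns a file-like object that decodes while reading.
f = codecs.open(path, "r", encoding="utf-8")
try:
    lines = [line for line in f]  # each line is already decoded text
finally:
    f.close()
    os.remove(path)
```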

> or something like that. Yet, writing this means *implicitly* decoding
> text. And, because the decoding is implicit, you cannot say
> 
> try:
>     for line in f: # here text is decoded implicitly
>        do_something()
> except UnicodeDecodeError():
>     do_something_different()
> 
> This isn't possible for syntactic reasons.

Well, you need to leave out the parentheses after the exception type (the
except clause wants the exception class, not an instance), but otherwise,
that's perfectly valid Python code. That's how these things work.
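
Here is a small sketch you can try. The sample file is invented and contains
the byte 0xff, which is never valid in UTF-8; "seen" and "failed" are just
names I picked for the demonstration:

```python
import io
import os
import tempfile

# Write a sample file whose second line contains an invalid UTF-8 byte.
fd, path = tempfile.mkstemp()
os.write(fd, b"good line\nbad \xff line\n")
os.close(fd)

seen = []
failed = False
try:
    with io.open(path, encoding="utf-8") as f:
        for line in f:  # the text is decoded implicitly here
            seen.append(line)
except UnicodeDecodeError:  # the class itself, no parentheses
    failed = True
finally:
    os.remove(path)
```

Note that reading is buffered, so the error can surface before earlier,
valid lines have been handed to the loop body. Don't rely on "seen" holding
everything up to the bad character.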

> The problem is that the vast majority of the thousands of files that I
> process are correctly encoded. But then, suddenly, there is a bad
> character in a new file. (This is so because most files today are
> generated by people who don't know that there is such a thing as
> encodings.) And then I need to rewrite my very complex program just
> because of one single character in one single file.

Why would that be the case? The places to change should be very local in
your code.
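
For example, if all file reading goes through one small helper, that helper
is the only place that needs to know about broken input. This is only a
sketch: "read_lines" is a hypothetical name, and falling back to
errors="replace" is just one possible policy (skipping or logging the file
would be others). The sample file is again invented:

```python
import io
import os
import tempfile


def read_lines(filename, encoding="utf-8"):
    """Return the decoded lines of a file (hypothetical helper).

    Files that fail strict decoding are read again with
    errors="replace", which turns bad bytes into U+FFFD.
    """
    try:
        with io.open(filename, encoding=encoding) as f:
            return f.readlines()
    except UnicodeDecodeError:
        with io.open(filename, encoding=encoding, errors="replace") as f:
            return f.readlines()


# Demonstration with a file that contains one bad byte.
fd, path = tempfile.mkstemp()
os.write(fd, b"ok\nbad \xff byte\n")
os.close(fd)
try:
    result = read_lines(path)
finally:
    os.remove(path)
```

The rest of the program just calls read_lines() and never sees the
exception.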

Stefan