catch UnicodeDecodeError

Robert Miles robertmiles at teranews.com
Thu Aug 30 01:50:04 EDT 2012


On 7/26/2012 5:51 AM, Jaroslav Dobrek wrote:
>> And the cool thing is: you can! :)
>>
>> In Python 2.6 and later, the new Py3 open() function is a bit more hidden,
>> but it's still available:
>>
>>      from io import open
>>
>>      filename = "somefile.txt"
>>      try:
>>          with open(filename, encoding="utf-8") as f:
>>              for line in f:
>>                  process_line(line)  # actually, I'd use "process_file(f)"
>>      except IOError as e:
>>          print("Reading file %s failed: %s" % (filename, e))
>>      except UnicodeDecodeError as e:
>>          print("Some error occurred decoding file %s: %s" % (filename, e))
>
> Thanks. I might use this in the future.
>
>>> try:
>>>      for line in f: # here text is decoded implicitly
>>>         do_something()
>>> except UnicodeDecodeError():
>>>      do_something_different()
>>
>>> This isn't possible for syntactic reasons.
>>
>> Well, you'd normally want to leave out the parentheses after the exception
>> type, but otherwise, that's perfectly valid Python code. That's how these
>> things work.
>
> You are right. Of course this is syntactically possible. I was too
> rash, sorry. I confused it with some other construction I once
> tried. I can't remember it right now.
>
> But the code above (without the brackets) is semantically bad: The
> exception is not caught.
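
[A minimal sketch, not from the original post, of the distinction being discussed: the bare class name in the except clause catches the error, whereas `except UnicodeDecodeError():` matches a freshly created instance and lets the exception through.]

```python
# The handler below runs because the except clause names the
# exception class itself, not an instance of it.
caught = False
try:
    b"\xff".decode("utf-8")  # 0xff is an invalid UTF-8 start byte
except UnicodeDecodeError:
    caught = True
```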
>
>
>>> The problem is that vast majority of the thousands of files that I
>>> process are correctly encoded. But then, suddenly, there is a bad
>>> character in a new file. (This is so because most files today are
>>> generated by people who don't know that there is such a thing as
>>> encodings.) And then I need to rewrite my very complex program just
>>> because of one single character in one single file.
>>
>> Why would that be the case? The places to change should be very local in
>> your code.
>
> This is the case in a program that has many different functions which
> open and parse different
> types of files. When I read and parse a directory with such different
> types of files, a program that
> uses
>
> for line in f:
>
> will not exit with any hint as to where the error occurred. It just
> exits with a UnicodeDecodeError. That
> means I have to look at all functions that have some variant of
>
> for line in f:
>
> in them. And it is not sufficient to replace the "for line in f" part.
> I would have to transform many functions that
> work in terms of lines into functions that work in terms of decoded
> bytes.
>
> That is why I usually solve the problem by moving files around until
> I find the bad file. Then I recode or repair
> the bad file manually.


Would it be reasonable to use pieces of the old program to write a
new program that prints the name of each input file, then searches
that input file for bad characters?  If it doesn't find any, it can
then go on to the next input file, or show a message saying that no
bad characters were found.

