Capturing the bad codes that raise UnicodeError exceptions during decoding

Matt Ruffalo matt.ruffalo at gmail.com
Thu Aug 4 16:11:26 EDT 2016


On 2016-08-04 15:45, Random832 wrote:
> On Thu, Aug 4, 2016, at 15:22, Malcolm Greene wrote:
>> Hi Chris,
>>
>> Thanks for your suggestions. I would like to capture the specific bad
>> codes *before* they get replaced. So if a line of text has 10 bad codes
>> (each one raising UnicodeError), I would like to track each exception's
>> bad code but still return a valid decoded line when finished.
> Look into writing your own error handler - there's enough information
> provided to do this.
>
> https://docs.python.org/3/library/codecs.html
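For what it's worth, a custom handler along those lines, registered with
codecs.register_error(), might look roughly like this (untested sketch;
the handler name 'record' and the U+FFFD replacement are just placeholders):

import codecs

bad_bytes = []

def record_and_replace(exc):
    if isinstance(exc, UnicodeDecodeError):
        # Remember every byte in the failing span (as ints)...
        bad_bytes.extend(exc.object[exc.start:exc.end])
        # ...then substitute U+FFFD and resume decoding after the span.
        return ('\ufffd' * (exc.end - exc.start), exc.end)
    raise exc

codecs.register_error('record', record_and_replace)

line = b'Some \xf9 odd \x84 bytes \xc2 here'.decode('utf-8', errors='record')
# 'line' is now a valid str and 'bad_bytes' holds the undecodable bytes.
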
You could also use the 'surrogateescape' error handler and count the
surrogate code points it produces (each one stands in for a byte that
couldn't be decoded under the specified encoding). This gives you a
text string, which you can then post-process to find or replace code
points in the range U+DC80 - U+DCFF (inclusive).

"""
In [1]: bad_byte_string = b'Some \xf9 odd \x84 bytes \xc2 here'

In [2]: decoded = bad_byte_string.decode(errors='surrogateescape')

In [3]: decoded
Out[3]: 'Some \udcf9 odd \udc84 bytes \udcc2 here'

In [4]: surrogate_escape_range = range(0xdc80, 0xdd00)

In [5]: sum(ord(char) in surrogate_escape_range for char in decoded)
Out[5]: 3

In [6]: from collections import Counter

In [7]: from typing import Iterable

In [8]: def get_bad_bytes(string: str) -> Iterable[bytes]:
    ...:     for char in string:
    ...:         if ord(char) in surrogate_escape_range:
    ...:             yield char.encode(errors='surrogateescape')
    ...:

In [9]: bad_byte_counts = Counter(get_bad_bytes(decoded))

In [10]: bad_byte_counts
Out[10]: Counter({b'\x84': 1, b'\xc2': 1, b'\xf9': 1})
"""

MMR...



