Processing text data with different encodings

Chris Angelico rosuav at gmail.com
Tue Jun 28 06:25:44 EDT 2016


On Tue, Jun 28, 2016 at 6:30 PM, Peter Otten <__peter__ at web.de> wrote:
> Does chardet ever return an encoding that fails to decode
> the line? Only in that case would the "ignore" error handler make sense.

Assuming the module the OP is using is functionally identical to the
one I use from the command line (which is implemented in Python), yes
it can. Usually what happens is that it detects something as an
ISO-8859-* when it's actually the corresponding Windows codepage; if
you try to decode it that way, you end up with a handful of byte
values that don't correctly decode. I have a "cdless" command that
does a chardet, decodes the file, re-encodes as UTF-8, and pipes the
result into less(1); great way to figure out what encoding something
is (if it gets it wrong, it's usually really obvious to a human). It
has a magic second parameter "win" to switch from ISO-8859 to Windows
encoding - ISO-8859-1 becomes Windows-1252, -2 becomes 1250, etc.
Additionally, chardet often returns "MacCyrillic" for files that are
actually encoded in Windows-1256 (Arabic). So, yes, it's definitely
possible for chardet to pick an encoding that you can't actually
decode with.
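A minimal sketch of that ISO-to-Windows "win" switch (the mapping table
and the helper name here are mine, not part of chardet or of cdless; the
Windows codepages agree with their ISO counterparts everywhere except
the 0x80-0x9F range, which they reassign to printable characters):

```python
# Rough ISO-8859-* to Windows-codepage correspondences. Given a guess
# from chardet.detect(data)['encoding'], upgrade it to the Windows
# encoding that is effectively a superset of it.
ISO_TO_WINDOWS = {
    'iso-8859-1': 'cp1252',  # Western European
    'iso-8859-2': 'cp1250',  # Central European
    'iso-8859-5': 'cp1251',  # Cyrillic
    'iso-8859-7': 'cp1253',  # Greek
    'iso-8859-9': 'cp1254',  # Turkish
}

def upgrade_encoding(detected):
    """Map e.g. 'ISO-8859-1' to 'cp1252'; pass anything else through."""
    return ISO_TO_WINDOWS.get(detected.lower(), detected)
```

With that in hand, a cdless-style pipeline is just: detect, upgrade if
the second parameter says "win", decode, re-encode as UTF-8, pipe to
less(1).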

For the OP's situation, frankly, I doubt there'll be anything other
than UTF-8, Latin-1, and CP-1252. The chance that someone casually
mixes CP-1252 with (say) CP-1254 is vanishingly small. So the
simple decode of "UTF-8, or failing that, 1252" is probably going to
give correct results for most of the content. The trick is figuring
out a correct boundary for the check; line-by-line may be sufficient,
or it may not.
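A line-by-line version of that fallback might look like this
(hypothetical helper, not the OP's code; Latin-1 goes last because it
maps every byte value, so the chain can never fail, while CP-1252
leaves a few bytes such as 0x81 undefined):

```python
def decode_line(raw):
    """Decode one line of bytes: try UTF-8, then CP-1252, then Latin-1."""
    for encoding in ('utf-8', 'cp1252'):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Latin-1 maps all 256 byte values, so this always succeeds.
    return raw.decode('latin-1')
```

The risk with any per-line check is a line of CP-1252 text that happens
to also be valid UTF-8 - rare, but that's exactly the boundary question
above.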

ChrisA



More information about the Python-list mailing list