Python 2.x or 3.x, which is faster?

Chris Angelico rosuav at gmail.com
Wed Mar 9 10:58:34 EST 2016


On Thu, Mar 10, 2016 at 2:33 AM, Steven D'Aprano <steve at pearwood.info> wrote:
> On Thu, 10 Mar 2016 01:54 am, Chris Angelico wrote:
>
>> I have a source of occasional text files that basically just dumps
>> stuff on me without any metadata, and I have to figure out (a) what
>> the encoding is, and (b) what language the text is in.
>
> https://pypi.python.org/pypi/chardet
>
>> then I have two levels of heuristics to try to guess a
>> most-likely encoding
>
> I'm curious, what do you do?

Collect subtitle files from random internet contributors and
determine whether they add to the existing corpus of material. The
first heuristic level is chardet, as mentioned; but with the specific
files that I'm processing, it has some semi-consistent errors, so I
scripted around that - e.g. "if chardet says ISO-8859-2, and these byte
patterns exist, it's probably actually codepage 1250". IIRC the second
level is entirely translating from an ISO-8859 to the
nearest-equivalent Windows codepage.
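
Roughly, the two levels look something like this in code - a minimal
sketch only; the specific byte check and the ISO-to-Windows mapping
below are illustrative placeholders, not the actual rules in my script:

import chardet

# ISO-8859 charsets mapped to the Windows codepage that is a superset
# of (or closest to) them - the "nearest-equivalent" translation step.
ISO_TO_WINDOWS = {
    "ISO-8859-1": "windows-1252",
    "ISO-8859-2": "windows-1250",
    "ISO-8859-5": "windows-1251",
    "ISO-8859-7": "windows-1253",
    "ISO-8859-9": "windows-1254",
}

def guess_encoding(raw: bytes) -> str:
    guess = chardet.detect(raw)["encoding"] or "latin-1"
    # Level 1: correct a known semi-consistent misdetection.
    # (Hypothetical byte pattern; the real check keys on patterns
    # seen in these particular subtitle files.)
    if guess.upper() == "ISO-8859-2" and b"\x9e" in raw:
        return "windows-1250"
    # Level 2: promote an ISO-8859 guess to its nearest Windows codepage.
    return ISO_TO_WINDOWS.get(guess.upper(), guess)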

> (I stress that trying to guess the character set or encoding from the text
> itself is a second-to-last-ditch tactic, for when you really don't know and
> can't find out what the encoding is. The final, last-ditch tactic is to
> just say "bugger it, I'll pretend it's Latin-1" and get a mess of
> mojibake, but at least the ASCII characters will decode alright, and as an
> English speaker, that's all that's important to me :-)

What I do is attempt to guess, *and then hand it to the user*. I have
a little "cdless" script that does a chardet on a file, decodes
accordingly, and pipes the result into 'less' [1]. The most powerful
character encoding detection tool in my arsenal is 'less'.
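
As a sketch (not the real cdless, but the shape of it), assuming
chardet is installed:

#!/usr/bin/env python3
# cdless, roughly: chardet the file, decode accordingly, then hand
# UTF-8 bytes to less - pipes carry bytes, not text (see [1]).
import subprocess, sys
import chardet

raw = open(sys.argv[1], "rb").read()
enc = chardet.detect(raw)["encoding"] or "latin-1"
text = raw.decode(enc, errors="replace")
subprocess.run(["less"], input=text.encode("utf-8"))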

Pretending that text is Latin-1 is actually a pretty good start. If I
didn't have chardet, I'd be mainly using this:

https://github.com/Rosuav/shed/blob/master/charconv.py

With no args, this will take the beginning of the file (it tries to
get one paragraph of up to 1KB) and decode it using all the ISO-8859-*
encodings, displaying the results for human analysis. That's
surprisingly effective for a manual job. A large number of European
languages use mostly ASCII letters, with each language's own
distinct non-ASCII characters in between; the only truly confusable
encodings are the ones for scripts that are entirely non-ASCII (Cyrillic,
Arabic, Greek, Hebrew - ISO-8859-5 through 8), and mis-decoding one as another
usually results in complete nonsense (words with impossible
vowel/consonant combinations, for instance). It does take *linguistic*
analysis (as opposed to purely mathematical/charcode), but it isn't
too hard.
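
A stripped-down version of that idea (the 1KB read and blank-line
split here are a crude stand-in for how charconv.py actually grabs a
paragraph):

# Show the start of the file decoded under each ISO-8859 part listed
# below, so a human can spot which decoding produces plausible words.
ISO_PARTS = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 13, 14, 15, 16)

def show_candidates(path):
    raw = open(path, "rb").read(1024)
    sample = raw.split(b"\n\n")[0] or raw  # rough "one paragraph"
    for n in ISO_PARTS:
        enc = "iso-8859-%d" % n
        try:
            print("%-12s %s" % (enc, sample.decode(enc)))
        except UnicodeDecodeError:
            # Some parts (e.g. 3, 6, 7, 8) have unassigned bytes.
            print("%-12s <undecodable>" % enc)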

ChrisA

[1] ... and since Unix pipes carry bytes, not text, this involves
encoding it as UTF-8. But that's an implementation detail between
cdless and less.


