[Chicago] Chardet help

Martin Maney maney at two14.net
Sun Mar 10 16:51:39 CET 2013


On Sun, Mar 10, 2013 at 10:00:43AM -0500, Tathagata Dasgupta wrote:
> def getEncoding(infile):
>     import chardet
>     rawdata = open(infile, "r").read()
>     result = chardet.detect(rawdata)
>     charenc = result['encoding']
>     print charenc
> 
> That gives me ISO-8859-2.

That may be the problem.  Why would Italian text be encoded in the
Central European character set?  From a quick look at the raw data in
the browser, 8859-2 is obviously incorrect.  8859-1 looks better!  In
fact, it looks better than 8859-3, the Southern European variant.

Guessing what encoding a text is in is always a pain.  I don't know
what that chardet is, but from the results it appears to be less than
reliable.

Caveat: my guess is based on which encodings leave "unknown code point"
blobs and/or accent marks which I'm fairly sure Italian doesn't use. 
But I have no Italian, myself.
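
For what it's worth, the kind of check described above can be sketched
in Python roughly as follows.  This is only an illustration, assuming
Python 3; the helper name non_ascii_report, the shortlist of candidate
encodings, and the sample sentence are made up for the example and are
not taken from the original data.  It decodes the same bytes under each
candidate and lists the non-ASCII characters that come out, so that
replacement blobs or accents foreign to the language stand out.

```python
# Sketch only: decode the same bytes under several candidate encodings
# and list the non-ASCII characters each decoding produces.
import unicodedata

# Assumed shortlist of candidates; adjust for the text at hand.
CANDIDATES = ["iso-8859-1", "iso-8859-2", "iso-8859-3"]


def non_ascii_report(rawdata, encodings=CANDIDATES):
    """Map each encoding to the sorted non-ASCII characters it yields."""
    report = {}
    for enc in encodings:
        # errors="replace" turns undefined bytes into U+FFFD, the
        # "unknown code point" blob, instead of raising an exception.
        text = rawdata.decode(enc, errors="replace")
        report[enc] = sorted({ch for ch in text if ord(ch) > 127})
    return report


# Hypothetical sample: Italian text stored as ISO-8859-1 bytes.
sample = "Perché la città è più bella".encode("iso-8859-1")

for enc, chars in non_ascii_report(sample).items():
    names = [unicodedata.name(ch, "UNKNOWN") for ch in chars]
    print(enc, names)
```

Reading the Unicode names is easier than squinting at glyphs: carons
and rings on letters in supposedly Italian text are a strong hint that
the candidate encoding is wrong.  Note that two encodings can decode a
short sample identically, so a check like this only narrows the field.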

-- 
The dualist evades the frame problem - but only because
dualism draws the veil of mystery and obfuscation
over all the tough how-questions  -- Daniel C. Dennett


