[Chicago] Chardet help

Tathagata Dasgupta tathagatadg at gmail.com
Sun Mar 10 18:39:56 CET 2013


On Sun, Mar 10, 2013 at 10:51 AM, Martin Maney <maney at two14.net> wrote:

> On Sun, Mar 10, 2013 at 10:00:43AM -0500, Tathagata Dasgupta wrote:
> > import chardet
> >
> > def getEncoding(infile):
> >     # Read raw bytes; chardet works on undecoded data.
> >     with open(infile, "rb") as f:
> >         rawdata = f.read()
> >     result = chardet.detect(rawdata)
> >     charenc = result['encoding']
> >     print(charenc)
> >     return charenc
> >
> > That gives me ISO-8859-2.
>
> That may be the problem.  Why would Italian text be encoded in the
> Central European character set?  From a quick look at the raw data in
>
Hmm, sorry, no clue about that. I got the files from another research group.
However, what makes me sad is that all the other tools (on Windows, under
Cygwin) like sort, uniq, and the various editors happily detect and operate
on them!
Also, I couldn't find any encoding specific to Italian in
http://docs.python.org/2/library/codecs.html.
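
For what it's worth, the codecs in that table are organized by character set rather than by language, so there is no entry named after Italian. One rough way to narrow things down is to check which codecs can represent the accented vowels Italian commonly uses. A minimal sketch follows; SAMPLE, CANDIDATES, and the function name are my own illustrative assumptions, not anything taken from the data set:

```python
# Sketch: which codecs can represent common Italian accented vowels?
# SAMPLE and CANDIDATES are illustrative assumptions.
SAMPLE = u"\u00e0\u00e8\u00e9\u00ec\u00f2\u00f9"  # a-grave, e-grave, e-acute, i-grave, o-grave, u-grave
CANDIDATES = ["ascii", "iso8859-1", "iso8859-2", "iso8859-3",
              "iso8859-9", "iso8859-15", "cp1252", "utf-8"]


def codecs_covering(text, names):
    """Return the codec names that can encode every character of text."""
    covering = []
    for name in names:
        try:
            text.encode(name)
        except UnicodeEncodeError:
            continue  # at least one character is missing from this codec
        covering.append(name)
    return covering


if __name__ == "__main__":
    print(codecs_covering(SAMPLE, CANDIDATES))
```

A codec that passes this check is only a candidate: it says the character set could hold Italian text, not that the files were actually written with it.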


> the browser, 8859-2 is obviously incorrect.  8859-1 looks better!  In
> fact, it looks better than 8859-3, the Southern European variant.
>

I was referring to
http://scratchpad.wikia.com/wiki/Character_Encoding_Recommendation_for_Languages
and tried 8859-1, 8859-3, 8859-9, and 8859-15; all of them give similar
results.
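
One likely reason they behave alike: the single-byte 8859 tables map nearly every byte value to some character, so decoding rarely raises an error and only the resulting letters differ. A quick way to compare is to decode the same bytes under each candidate and look at the output. This is just a sketch; SAMPLE_BYTES is a made-up Latin-1 fragment rather than data from the real files, and the candidate list is an assumption:

```python
# Sketch: decode the same bytes under several candidate encodings and
# compare the results by eye. SAMPLE_BYTES is a made-up example
# (an Italian-looking phrase stored as Latin-1 bytes).
SAMPLE_BYTES = b"perch\xe9 \xe8 cos\xec"
CANDIDATES = ["utf-8", "iso8859-1", "iso8859-2", "iso8859-3",
              "iso8859-9", "iso8859-15"]


def decode_all(raw, names):
    """Map each codec name to the decoded text, or None if decoding fails."""
    results = {}
    for name in names:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


if __name__ == "__main__":
    for name, text in sorted(decode_all(SAMPLE_BYTES, CANDIDATES).items()):
        print("%-12s %r" % (name, text))
```

A candidate that fails outright (None) can be ruled out; the ones that succeed still have to be judged by whether the decoded letters make sense.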


> Guessing what encoding a text is in is always a pain.  I don't know
> what that chardet is, but from the results it appears to be less than
> reliable.
>
https://pypi.python.org/pypi/chardet - found via Stack Overflow

>
> Caveat: my guess is based on which encodings leave "unknown code point"
> blobs and/or accent marks which I'm fairly sure Italian doesn't use.
> But I have no Italian, myself.
>
Yeah me neither :P !
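
The eyeball test described in the caveat can be roughed out in code: decode under each candidate and count the characters that fall outside what Italian text would normally contain. This is only a sketch. The EXPECTED set, the function names, and the sample bytes are all assumptions of mine, and a real corpus would need a better notion of "expected":

```python
import string

# Assumption: ASCII printables plus the accented vowels common in Italian.
ITALIAN_ACCENTS = u"\u00e0\u00e8\u00e9\u00ec\u00f2\u00f9"
EXPECTED = (set(string.printable)
            | set(ITALIAN_ACCENTS)
            | set(ITALIAN_ACCENTS.upper()))


def suspicious_count(raw, encoding):
    """Count decoded characters outside EXPECTED; None if decoding fails."""
    try:
        text = raw.decode(encoding)
    except UnicodeDecodeError:
        return None
    return sum(1 for ch in text if ch not in EXPECTED)


def rank_encodings(raw, names):
    """Return (count, name) pairs for decodable candidates, lowest count first."""
    scored = []
    for name in names:
        count = suspicious_count(raw, name)
        if count is not None:
            scored.append((count, name))
    return sorted(scored)


if __name__ == "__main__":
    sample = b"perch\xe9 \xe8 cos\xec"  # made-up Latin-1 bytes
    print(rank_encodings(sample, ["utf-8", "iso8859-1", "iso8859-2"]))
```

Expect ties: as far as I know 8859-1, 8859-9, and 8859-15 agree on these vowels, so a count like this narrows the field but cannot separate them.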

>
> --
> The dualist evades the frame problem - but only because
> dualism draws the veil of mystery and obfuscation
> over all the tough how-questions  -- Daniel C. Dennett
>



-- 
Cheers,
T