[Chicago] Chardet help
Tathagata Dasgupta
tathagatadg at gmail.com
Sun Mar 10 18:39:56 CET 2013
On Sun, Mar 10, 2013 at 10:51 AM, Martin Maney <maney at two14.net> wrote:
> On Sun, Mar 10, 2013 at 10:00:43AM -0500, Tathagata Dasgupta wrote:
> > import chardet
> >
> > def getEncoding(infile):
> >     # Read raw bytes; chardet works on undecoded data.
> >     with open(infile, "rb") as f:
> >         rawdata = f.read()
> >     result = chardet.detect(rawdata)
> >     charenc = result['encoding']
> >     print(charenc)
> >
> > That gives me ISO-8859-2.
>
> That may be the problem. Why would Italian text be encoded in the
> Central European character set? From a quick look at the raw data in
>
Hmm, sorry, no clue about that - I got them from another research group.
However, what makes me sad is that all the other tools (on Windows, under
Cygwin) like sort, uniq, and the various editors happily detect and operate
on them!
Also, I couldn't find any specific encoding for Italian in
http://docs.python.org/2/library/codecs.html.
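
For what it's worth, the codecs in that table are named after character sets rather than languages, so there is no "Italian" entry to find. A small sketch to check which labels Python recognizes (canonical_name is a helper name made up for this example, not part of the codecs module):

```python
import codecs

def canonical_name(label):
    """Return Python's canonical codec name for a label, or None if unknown."""
    try:
        return codecs.lookup(label).name
    except LookupError:
        return None

# "italian" is included on purpose to show that language names are not codecs.
for label in ["latin-1", "iso-8859-1", "iso-8859-15", "cp1252", "italian"]:
    print("%s -> %s" % (label, canonical_name(label)))
```

Labels such as "latin-1" and "iso-8859-1" resolve to the same codec, while "italian" is reported as unknown.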
> the browser, 8859-2 is obviously incorrect. 8859-1 looks better! In
> fact, it looks better than 8859-3, the Southern European variant.
>
I was referring to
http://scratchpad.wikia.com/wiki/Character_Encoding_Recommendation_for_Languages
and tried 8859-1, 8859-3, 8859-9, and 8859-15; all of them behave
similarly.
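
One rough way to compare the candidates is to decode the same bytes with each and count the non-ASCII characters that Italian normally would not use. This is only a sketch: the ITALIAN_ACCENTED set and the helper names count_unexpected and rank_encodings are assumptions made up for this example, and the heuristic says nothing certain about the real encoding.

```python
# Accented letters assumed to be normal in Italian text (an assumption for
# this sketch): a-grave, e-grave, e-acute, i-grave, o-grave, u-grave,
# in lower and upper case.
ITALIAN_ACCENTED = set(u"\xe0\xe8\xe9\xec\xf2\xf9\xc0\xc8\xc9\xcc\xd2\xd9")

CANDIDATES = ["iso-8859-1", "iso-8859-2", "iso-8859-3",
              "iso-8859-9", "iso-8859-15"]

def count_unexpected(raw, encoding):
    """Decode raw bytes and count non-ASCII characters outside ITALIAN_ACCENTED.

    Returns None if the bytes cannot be decoded with this encoding at all.
    """
    try:
        text = raw.decode(encoding)
    except UnicodeDecodeError:
        return None
    return sum(1 for ch in text
               if ord(ch) > 127 and ch not in ITALIAN_ACCENTED)

def rank_encodings(raw, candidates=CANDIDATES):
    """Return (encoding, unexpected_count) pairs, fewest unexpected first."""
    scored = [(enc, count_unexpected(raw, enc)) for enc in candidates]
    decodable = [(enc, n) for enc, n in scored if n is not None]
    return sorted(decodable, key=lambda pair: pair[1])
```

For example, the bytes b"citt\xe0" decode to "città" under ISO-8859-1 but to "cittŕ" under ISO-8859-2, so 8859-2 gets a worse score. Several Latin encodings share the same accented positions and will tie, so the ranking narrows the choice rather than settling it.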
> Guessing what encoding a text is in is always a pain. I don't know
> what that chardet is, but from the results it appears to be less than
> reliable.
>
https://pypi.python.org/pypi/chardet - found via Stack Overflow
>
> Caveat: my guess is based on which encodings leave "unknown code point"
> blobs and/or accent marks which I'm fairly sure Italian doesn't use.
> But I have no Italian, myself.
>
Yeah me neither :P !
>
> --
> The dualist evades the frame problem - but only because
> dualism draws the veil of mystery and obfuscation
> over all the tough how-questions -- Daniel C. Dennett
>
> _______________________________________________
> Chicago mailing list
> Chicago at python.org
> http://mail.python.org/mailman/listinfo/chicago
>
--
Cheers,
T