[Tutor] Is there a package to "un-mangle" characters?
eryksun
eryksun at gmail.com
Fri Nov 22 03:29:35 CET 2013
On Thu, Nov 21, 2013 at 3:04 PM, Albert-Jan Roskam <fomcl at yahoo.com> wrote:
>
> Today I had a csv file in utf-8 encoding, but part of the accented
> characters were mangled. The data were scraped from a website and it
> turned out that at least some of the data were mangled on the website
> already. Bits of the text were actually cp1252 (or cp850), I think,
> even though the webpage was in utf-8. Is there any package that helps
> to correct such issues?
The links in the Wikipedia article may help:
http://en.wikipedia.org/wiki/Charset_detection
International Components for Unicode (ICU) does charset detection:
http://userguide.icu-project.org/conversion/detection
Python wrapper:
http://pypi.python.org/pypi/PyICU
http://packages.debian.org/wheezy/python-pyicu
Example:
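import icu

# Sample text, encoded to bytes so the detector has something to guess at.
russian_text = u'Здесь некий текст на русском языке.'
encoded_text = russian_text.encode('windows-1251')

cd = icu.CharsetDetector()
cd.setText(encoded_text)
match = cd.detect()       # single best guess
matches = cd.detectAll()  # all plausible candidates, best match first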
>>> match.getName()
'windows-1251'
>>> match.getConfidence()
33
>>> match.getLanguage()
'ru'
>>> [m.getName() for m in matches]
['windows-1251', 'ISO-8859-6', 'ISO-8859-8-I', 'ISO-8859-8']
>>> [m.getConfidence() for m in matches]
[33, 13, 8, 8]
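Once you have a list of candidate names, the next step is usually to try decoding with them. Here is a minimal sketch of that step using only the standard library (Python 3 syntax). The helper name decode_with_candidates and the placeholder name 'x-not-a-real-codec' are made up for illustration, and the candidate list is written out by hand rather than taken from a live detectAll() call. Not every name a detector reports is necessarily a codec Python knows, so unknown names are skipped:

```python
def decode_with_candidates(data, candidates):
    """Return (text, name) for the first candidate that decodes data cleanly.

    Hypothetical helper: 'candidates' is assumed to be a list of charset
    names ordered from most to least likely, e.g. the names reported by
    a charset detector.
    """
    for name in candidates:
        try:
            return data.decode(name), name
        except LookupError:
            # Python has no codec registered under this name; skip it.
            continue
        except UnicodeDecodeError:
            # The bytes are not valid in this encoding; try the next one.
            continue
    raise ValueError("no candidate encoding could decode the data")


russian_text = 'Здесь некий текст на русском языке.'
encoded_text = russian_text.encode('windows-1251')

# Hand-written candidate list; the first entry is a deliberately bogus
# name to show that unknown codecs are skipped.
candidates = ['x-not-a-real-codec', 'windows-1251', 'ISO-8859-6']

text, used = decode_with_candidates(encoded_text, candidates)
print(used)
print(text == russian_text)
```

Keep in mind that a clean decode does not prove the guess was right: single-byte encodings will decode almost any byte string without raising an error, so the order of the candidates (and the detector's confidence) still matters.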