[Tutor] Is there a package to "un-mangle" characters?

Fri Nov 22 03:29:35 CET 2013

On Thu, Nov 21, 2013 at 3:04 PM, Albert-Jan Roskam <fomcl at yahoo.com> wrote:
>
> Today I had a csv file in utf-8 encoding, but part of the accented
> characters were mangled. The data were scraped from a website and it
> turned out that at least some of the data were mangled on the website
> already. Bits of the text were actually cp1252 (or cp850), I think,
> even though the webpage was in utf-8. Is there any package that helps
> to correct such issues?

The links in the Wikipedia article may help:

    http://en.wikipedia.org/wiki/Charset_detection

International Components for Unicode (ICU) does charset detection:

    http://userguide.icu-project.org/conversion/detection

Python wrapper:

    http://pypi.python.org/pypi/PyICU
    http://packages.debian.org/wheezy/python-pyicu

Example:

    import icu

    russian_text = u'Здесь некий текст на русском языке.'
    encoded_text = russian_text.encode('windows-1251')

    cd = icu.CharsetDetector()
    cd.setText(encoded_text)
    match = cd.detect()        # single best guess
    matches = cd.detectAll()   # all candidates, highest confidence first

    >>> match.getName()
    'windows-1251'
    >>> match.getConfidence()
    33
    >>> match.getLanguage()
    'ru'

    >>> [m.getName() for m in matches]
    ['windows-1251', 'ISO-8859-6', 'ISO-8859-8-I', 'ISO-8859-8']
    >>> [m.getConfidence() for m in matches]
    [33, 13, 8, 8]
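
Detection works on a whole byte string, so a file that mixes encodings may
need to be handled piece by piece. If the file really is UTF-8 with some
stray cp1252 bytes, a rough standard-library sketch (Python 3) is to decode
line by line and fall back to cp1252 only where UTF-8 fails. The function
name decode_mixed and the choice of cp1252 as the fallback are assumptions
for illustration, not part of PyICU or of the original data:

```python
def decode_mixed(raw, fallback='cp1252'):
    """Decode bytes that are mostly UTF-8 with some lines in `fallback`.

    Hypothetical helper: each line is tried as UTF-8 first; only lines
    that are not valid UTF-8 are decoded with the fallback codec.
    """
    decoded = []
    for line in raw.splitlines(keepends=True):
        try:
            decoded.append(line.decode('utf-8'))
        except UnicodeDecodeError:
            # Bytes undefined in the fallback codec become U+FFFD
            # instead of raising a second error.
            decoded.append(line.decode(fallback, errors='replace'))
    return ''.join(decoded)
```

For example, decode_mixed('café\n'.encode('utf-8') + 'naïve\n'.encode('cp1252'))
gives back 'café\nnaïve\n'. This only helps when the wrong bytes are still in
the file. Text that was already decoded with the wrong codec and then saved
again as UTF-8 is valid UTF-8, so it passes through unchanged and needs a
different repair.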