hard_decoding

Thu Feb 10 08:19:04 EST 2005

On Wed, Feb 09, 2005 at 05:22:12PM -0700, Tamas Hegedus wrote:
> Hi!
> 
> Do you have a convenient, easy way to remove special characters from 
> u'strings'?
> 
> Replacing:
> ÀÁÂÃÄÅ 	=> A
> èéêë	=> e
> etc.
> 'L0xe1szl0xf3' => Laszlo
> or something like that:
> 'L\xc3\xa1szl\xc3\xb3' => Laszlo

for the examples you have given, this works:

    from unicodedata import normalize

    def strip_composition(unichar):
        """
        Return the first character from the canonical decomposition of
        a unicode character. This will typically be the unaccented
        version of the character passed in (in Latin regions, at
        least).
        """
        return normalize('NFD', unichar)[0]

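    def remove_special_chars(anystr):
        """
        strip_composition of the whole string; byte strings are
        assumed to be UTF-8 encoded
        """
        if isinstance(anystr, str):
            # decode explicitly instead of relying on the default encoding
            anystr = anystr.decode('utf-8')
        return ''.join(map(strip_composition, anystr))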

    for i in ('ÀÁÂÃÄÅ', 'èéêë',
              u'L\xe1szl\xf3',
              'L\xc3\xa1szl\xc3\xb3'):
        print i, '->', remove_special_chars(i)

produces:

    ÀÁÂÃÄÅ -> AAAAAA
    èéêë -> eeee
    László -> Laszlo
    László -> Laszlo
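
Under Python 3, where every str is already unicode, the same idea might
look roughly like the sketch below. It is a loose port rather than the
code above: the function names are reused only for illustration, and
decoding byte strings as UTF-8 by default is an assumption about the
input.

```python
from unicodedata import normalize


def strip_composition(unichar):
    """Return the first character of the canonical decomposition."""
    return normalize('NFD', unichar)[0]


def remove_special_chars(anystr, encoding='utf-8'):
    """Apply strip_composition to every character of anystr.

    Byte strings are decoded first; UTF-8 is an assumed default.
    """
    if isinstance(anystr, bytes):
        anystr = anystr.decode(encoding)
    return ''.join(map(strip_composition, anystr))


for s in ('ÀÁÂÃÄÅ', 'èéêë', 'L\xe1szl\xf3', b'L\xc3\xa1szl\xc3\xb3'):
    # ascii() keeps the input printable on any terminal
    print(ascii(s), '->', remove_special_chars(s))
```

The explicit decode step is there because Python 3 never converts bytes
to str implicitly, so the encoding has to be stated somewhere.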

Building a translation mapping is, in general, faster, though. You
could use strip_composition to build that map automatically, like this:

    def build_translation(sample, table=None):
        """
        Return a translation table that strips composition characters
        out of a sample unicode string. If a table is supplied, it
        will be updated.
        """
        assert isinstance(sample, unicode), 'sample must be unicode'
        if table is None:
            table = {}
        # table keys are code points, so compare ordinals, not characters
        for code in set(map(ord, sample)) - set(table):
            table[code] = ord(strip_composition(unichr(code)))
        return table
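
As a rough illustration of how such a table gets used, here is a
Python 3 sketch (str.translate there accepts the same kind of
ordinal-to-ordinal dict). The alphabet string and the sample names are
made up for the example, and the function is a port, not the exact code
above.

```python
from unicodedata import normalize


def build_translation(sample, table=None):
    """Map each code point in sample to the first code point of its NFD form."""
    if table is None:
        table = {}
    # only look at code points the table doesn't cover yet
    for code in set(map(ord, sample)) - set(table):
        table[code] = ord(normalize('NFD', chr(code))[0])
    return table


# hypothetical alphabet, known in advance, so the table is built once
table = build_translation('ÀÁÂÃÄÅèéêëáó')

names = ['László', 'Ángel', 'Hélène']  # made-up sample data
stripped = [name.translate(table) for name in names]
print(stripped)
```

Once the table exists, each string costs a single translate call, with
no per-character normalization.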

Translating with a table is much faster on larger strings, or if you
have many strings but know the alphabet (so you compute the table
once). You might also try to build the table incrementally,

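    table = {}
    stripped = []
    # strings: any iterable of unicode objects
    for i in strings:
        i = i.translate(table)
        try:
            i.encode('ascii')
        except UnicodeEncodeError:
            # found characters the table doesn't cover yet
            table = build_translation(i, table)
            i = i.translate(table)
        stripped.append(i)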

of course this won't give you pure ascii if your strings contain chars
that are non-ascii but non-composite (u'\xf8' or u'\xdf', say): they
have no decomposition, so they pass through unchanged.
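
To see that limitation concretely, here is a small Python 3 sketch.
The helper name has_decomposition and the fallback map are hypothetical,
added only for illustration; the replacements in the map are one possible
choice, not a standard.

```python
from unicodedata import normalize


def has_decomposition(char):
    """True if NFD turns char into something other than itself."""
    return normalize('NFD', char) != char


for char in ('\xe1', '\xf8', '\xdf'):  # a-acute, o-slash, sharp s
    print(ascii(char), has_decomposition(char))

# characters without a decomposition need a hand-written mapping;
# these particular replacements are only an example
fallback = {ord('\xf8'): 'o', ord('\xdf'): 'ss'}
print('S\xf8ren Stra\xdfe'.translate(fallback))
```

In Python 3 the values of a translate table may be strings of any
length, which is what lets a single character expand to 'ss' here.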

-- 
John Lenton (john at grulic.org.ar) -- Random fortune:
El que está en la aceña, muele; que el otro va y viene. 