Least-lossy string.encode to us-ascii?

Thu Sep 13 18:00:45 EDT 2012

On 13.09.2012 23:26, Tim Chase wrote:
> I've got a bunch of text in Portuguese and to transmit them, need to
> have them in us-ascii (7-bit).  I'd like to keep as much information
> as possible, just stripping accents, cedillas, tildes, etc.  So
> "serviço móvil" becomes "servico movil".  Is there anything stock
> that I've missed?  I can do mystring.encode('us-ascii', 'replace')
> but that doesn't keep as much information as I'd hope.

The unidecode [1] package contains a large mapping of Unicode characters
to ASCII. It even supports cool stuff like Chinese to ASCII:

>>> import unidecode
>>> print u"\u5317\u4EB0"
北亰
>>> print unidecode.unidecode(u"\u5317\u4EB0")
Bei Jing
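
For the accent-stripping case in the question, a stdlib-only approach is
also possible. The following is a minimal sketch, assuming Python 3; the
helper name strip_accents is made up for illustration. It decomposes each
character with unicodedata.normalize, drops the combining marks, and
discards whatever is still non-ASCII, so characters that have no
decomposition (for example "ß" or "ø") are lost rather than
transliterated, which is where unidecode does better.

```python
import unicodedata


def strip_accents(text):
    """Hypothetical helper: best-effort reduction of text to 7-bit ASCII."""
    # NFKD splits e.g. U+00E7 into "c" followed by a combining cedilla.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks (accents, cedillas, tildes, ...).
    no_marks = "".join(ch for ch in decomposed
                       if not unicodedata.combining(ch))
    # Anything still outside ASCII has no decomposition and is discarded.
    return no_marks.encode("ascii", "ignore").decode("ascii")


print(strip_accents("servi\u00e7o m\u00f3vil"))  # prints: servico movil
```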

icu4c and pyicu [2] may contain more methods for conversion, but they
require binary extensions. By the way, ICU can do a lot of cool stuff, too:

>>> import icu
>>> rbf = icu.RuleBasedNumberFormat(icu.URBNFRuleSetTag.SPELLOUT,
...                                 icu.Locale.getUS())
>>> rbf.format(23)
u'twenty-three'
>>> rbf.format(100000)
u'one hundred thousand'

Regards,
Christian

[1] http://pypi.python.org/pypi/Unidecode/0.04.9
[2] http://pypi.python.org/pypi/PyICU/1.4