Simple character translation problem

Steffen Ries steffen.ries at sympatico.ca
Sat Sep 22 09:41:17 EDT 2001


Martin von Loewis <loewis at informatik.hu-berlin.de> writes:

> David Eppstein <eppstein at ics.uci.edu> writes:
> 
> > I have: user input text, in Mac character set encoding
> > I want: ASCII with HTML-entities coding the accented characters.
> > 
> > E.g. "café" should become "caf&eacute;".
> > Is there code already in Python to do this easily?
... 
> Or, you could try to use external entities where possible. For that,
> please have a look at htmlentitydefs.entitydefs. Using that is not
> straightforward: you have to invert the dictionary, and you have to
> convert the keys into Unicode keys. For the keys that are
> single-character strings (e.g. '\306'), you can use the Unicode
> character with the same ordinal. For characters above 255, you have to
> convert between the character entity and a Unicode character.

Ok, I'll bite:
--8<--
_u2html = {}   # unicode to html mapping

def _make_u2html():
    from htmlentitydefs import entitydefs

    def c2u(c):
        # single Latin-1 character: same ordinal in Unicode
        if len(c) == 1:
            return unicode(c, 'latin1')
        # numeric character reference, e.g. '&#338;'
        if c.startswith('&#'):
            return unichr(int(c[2:-1]))

    for entity,val in entitydefs.items():
        _u2html[c2u(val)] = "&%s;" % entity

def htmlentityEncode(s):
    """
    convert unicode string s to ascii, replace non-ascii characters with
    html entitydef or "?"
    """

    if not _u2html:
        _make_u2html()

    l = [_u2html.get(c, c) for c in s]

    return ''.join(l).encode('ascii', 'replace')
--8<--

>>> htmlentityEncode(u"café")
'caf&eacute;'
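
For readers on Python 3, where htmlentitydefs became html.entities and
unicode()/unichr() no longer exist, the same idea might be sketched roughly
as below. This is only an illustration under assumptions: the function name
html_entity_encode is made up here, the "mac_roman" codec is assumed to be
the right one for the Mac-encoded input in the original question, and
unmapped non-ASCII characters still become "?" as in the code above.

```python
# html.entities is the Python 3 home of the old htmlentitydefs tables;
# codepoint2name already maps code points to entity names, so no
# dictionary inversion is needed.
from html.entities import codepoint2name


def html_entity_encode(text):
    """Return an ASCII-only str: named entities where one is defined,
    '?' for any other non-ASCII character (hypothetical helper)."""
    pieces = []
    for ch in text:
        name = codepoint2name.get(ord(ch))
        # Note: '&', '<', '>' and '"' have entity names too, so they
        # are replaced as well, just like with the inverted entitydefs.
        pieces.append(("&%s;" % name) if name else ch)
    return "".join(pieces).encode("ascii", "replace").decode("ascii")


# Assumed input path: raw bytes in Mac Roman, decoded to str first.
mac_bytes = b"caf\x8e"                 # 0x8E is e-acute in Mac Roman
print(html_entity_encode(mac_bytes.decode("mac_roman")))
```

If numeric references are preferred over "?" for characters without a named
entity, the "xmlcharrefreplace" error handler can be passed to encode()
instead of "replace".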

/steffen
-- 
steffen.ries at sympatico.ca	<> Gravity is a myth -- the Earth sucks!


