mapping HTML entities to Unicode (was: Re: Unicode name questions)

Martin v. Löwis loewis at informatik.hu-berlin.de
Wed Apr 17 13:22:13 EDT 2002


Skip Montanaro <skip at pobox.com> writes:

> Is this of use to anyone besides me?  ...
> I'd also be happy to dump this in favor of a preexisting solution.

I always thought that htmlentitydefs exists for that purpose. If it
is missing entities that are defined in any HTML version, that would
be a bug.

I think it is also unfortunate that htmlentitydefs maps entity names
to either single-byte strings (if the character fits in Latin-1) or
decimal character references (otherwise). Instead, it should map to
Unicode strings.

So I imagine something like

unicode = {
  'AElig': u'\u00c6',  # U+00C6 (octal 306), not U+0132
  # ...
}

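entitydefs = {}
reverse = {}
for k,v in unicode.items():
  v = ord(v)
  if v < 256:
    # fits in Latin-1: keep the single-byte string
    entitydefs[k] = chr(v)
  else:
    # otherwise: fall back to a decimal character reference
    entitydefs[k] = '&#%d;' % v
  # reverse table: ordinal -> entity name
  reverse[v] = k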
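
To make the result of that loop concrete, here is a minimal runnable
version of the sketch. It assumes a current Python 3 interpreter, where
chr() returns a one-character text string rather than a byte string.
The table name unicode_map and the choice of the two sample entries
(AElig and euro) are illustrative assumptions, not part of the module.

```python
# Sample source table: entity name -> one-character Unicode string.
# Only two entries are listed; the selection is purely illustrative.
unicode_map = {
    'AElig': '\u00c6',  # LATIN CAPITAL LETTER AE, fits in Latin-1
    'euro': '\u20ac',   # EURO SIGN, outside Latin-1
}

entitydefs = {}
reverse = {}
for name, char in unicode_map.items():
    code = ord(char)
    if code < 256:
        # Latin-1 range: store the character itself
        entitydefs[name] = chr(code)
    else:
        # outside Latin-1: store a decimal character reference
        entitydefs[name] = '&#%d;' % code
    # reverse table keyed by ordinal, as in the sketch above
    reverse[code] = name

print(entitydefs)
print(reverse)
```

With these two entries, AElig maps to the character itself, euro maps
to the reference &#8364;, and reverse maps 198 and 8364 back to the
entity names.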

I think this would be quite useful, except that it is debatable
whether the keys of htmlentitydefs.reverse should be Unicode characters
or ordinals (the sketch above uses ordinals).
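
The difference between the two key choices shows up at the lookup
site. The following is a rough sketch only, assuming Python 3; the
names reverse_by_ordinal, reverse_by_char, escape_by_ordinal and
escape_by_char are hypothetical and exist only for this comparison.

```python
# Hypothetical reverse tables; only two entries are listed for illustration.
reverse_by_ordinal = {0xC6: 'AElig', 0x20AC: 'euro'}
reverse_by_char = {chr(code): name for code, name in reverse_by_ordinal.items()}


def escape_by_ordinal(text):
    """Replace known characters with entity references, looking up ord(ch)."""
    return ''.join(
        '&%s;' % reverse_by_ordinal[ord(ch)] if ord(ch) in reverse_by_ordinal else ch
        for ch in text
    )


def escape_by_char(text):
    """Same replacement, but the table is indexed by the character itself."""
    return ''.join(
        '&%s;' % reverse_by_char[ch] if ch in reverse_by_char else ch
        for ch in text
    )


sample = 'C\u00c6SAR 5\u20ac'
print(escape_by_ordinal(sample))
print(escape_by_char(sample))
```

Both functions produce the same text. Ordinal keys require an ord()
call per character at lookup time, while character keys can be used
directly when iterating over a string.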

Regards,
Martin



