mapping HTML entities to Unicode (was: Re: Unicode name questions)
Martin v. Löwis
loewis at informatik.hu-berlin.de
Wed Apr 17 13:22:13 EDT 2002
Skip Montanaro <skip at pobox.com> writes:
> Is this of use to anyone besides me? ...
> I'd also be happy to dump this in favor of a preexisting solution.
I always thought that htmlentitydefs exists for that purpose. If it
is missing entities that are defined in any HTML version, that would
be a bug.
I think it is also unfortunate that htmlentitydefs maps from entity
names to either single-byte strings (if the character fits in Latin-1)
or decimal character references (otherwise). Instead, it should map to
unicode strings.
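To illustrate the two value formats described above, here is a minimal sketch in present-day Python 3. The function name legacy_value_to_text and the two-entry sample table are hypothetical, and bytes objects stand in for the old single-byte strings; this is not code from htmlentitydefs itself.

```python
def legacy_value_to_text(value: bytes) -> str:
    """Convert an old-style entity value to a one-character str."""
    if value.startswith(b"&#") and value.endswith(b";"):
        # Decimal character reference, e.g. b"&#945;" -> code point 945.
        return chr(int(value[2:-1].decode("ascii")))
    # Otherwise assume a single Latin-1 byte, e.g. b"\xc6".
    return value.decode("latin-1")


# Hypothetical sample in the mixed legacy format (not the full table).
legacy_entitydefs = {"AElig": b"\xc6", "alpha": b"&#945;"}

# Normalize every value to a Unicode string.
as_text = {name: legacy_value_to_text(v) for name, v in legacy_entitydefs.items()}
```

Every caller of the mixed format needs a branch like this, which is the argument for storing unicode strings in the first place.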
So I imagine something like
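unicode = {
    'AElig': u'\u00c6',  # U+00C6 LATIN CAPITAL LETTER AE
    ...
}

entitydefs = {}
reverse = {}
for k, v in unicode.items():
    v = ord(v)
    if v < 256:
        # fits in Latin-1: keep the single-byte string
        entitydefs[k] = chr(v)
    else:
        # otherwise keep the decimal character reference
        entitydefs[k] = '&#%d;' % v
    reverse[v] = k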
I think this would be quite useful, except that it might be debatable
whether the keys of htmlentitydefs.reverse should be unicode characters
or ordinals.
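The two choices for the reverse table's keys can be compared side by side. The following is only a sketch in present-day Python 3, under stated assumptions: the three-entry table and the names name2char, reverse_by_char, reverse_by_ordinal and escape_non_ascii are hypothetical, chosen for illustration.

```python
# Hypothetical sample; a real table would cover every HTML entity.
name2char = {
    "AElig": "\u00c6",   # LATIN CAPITAL LETTER AE
    "eacute": "\u00e9",  # LATIN SMALL LETTER E WITH ACUTE
    "alpha": "\u03b1",   # GREEK SMALL LETTER ALPHA
}

# Option 1: reverse table keyed by the character itself.
reverse_by_char = {char: name for name, char in name2char.items()}

# Option 2: reverse table keyed by the integer code point (ordinal).
reverse_by_ordinal = {ord(char): name for name, char in name2char.items()}


def escape_non_ascii(text):
    """Replace each non-ASCII character with a named entity when one is
    known, otherwise with a decimal character reference."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        elif ch in reverse_by_char:
            out.append("&%s;" % reverse_by_char[ch])
        else:
            out.append("&#%d;" % ord(ch))
    return "".join(out)
```

Keying by character lets a loop over text look characters up directly, while keying by ordinal needs one ord() call per lookup but matches the numbers used in character references. For comparison, the html.entities module in Python 3 offers codepoint2name, which is keyed by ordinal.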
Regards,
Martin