Convert from unicode chars to HTML entities

"Martin v. Löwis" martin at v.loewis.de
Mon Jan 29 16:11:51 EST 2007


Steven D'Aprano wrote:
> A few issues:
> 
> (1) It doesn't seem to be reversible:
> 
>>>> '&#169; and many more...'.decode('latin-1')
> u'&#169; and many more...'
> 
> What should I do instead?

For reverse processing, you need to parse it with an
SGML/XML parser.
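
A minimal sketch of that reverse step, assuming a current Python 3
(this thread predates it) and using the standard-library ElementTree
parser. The wrapper element name "r" and the variable names are only
illustrative. html.unescape is shown as a lighter alternative for
HTML text; it is not a full parser.

```python
import xml.etree.ElementTree as ET
from html import unescape

encoded = "&#169; and many more..."

# Wrap the fragment in a dummy root element (hypothetical name "r")
# so that the XML parser sees a well-formed document.
root = ET.fromstring("<r>" + encoded + "</r>")
decoded_xml = root.text

# For HTML text, unescape resolves both character references and
# the named HTML entity references such as &copy;.
decoded_html = unescape("&#169; and &copy;")

print(decoded_xml)
print(decoded_html)
```

Note that a plain XML parser will reject &copy; unless a DTD declares
it, whereas numeric character references always work.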

> (2) Are XML entities guaranteed to be the same as HTML entities?

Please distinguish between the terms "entity", "entity
reference", and "character reference".

An (external parsed) entity is a named piece of text, such
as the copyright character. An entity reference is a reference
to such a thing, e.g. &copy;.

A character reference is a reference to a character, not to
an entity. xmlcharrefreplace generates character references,
not entity references (let alone generating entities).
Character references in XML and HTML both refer to characters
by Unicode ordinal, so in that respect they are the same.
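
To make the distinction concrete, here is a small sketch, assuming
Python 3. It shows that xmlcharrefreplace emits a numeric character
reference built from the Unicode ordinal, and, for contrast, builds
the corresponding HTML entity reference from the standard
html.entities table. The variable names are only illustrative.

```python
from html.entities import codepoint2name

text = "\u00a9 and many more..."

# The error handler replaces each unencodable character with a
# numeric character reference of the form &#<ordinal>;
char_ref = text.encode("ascii", "xmlcharrefreplace")

# The number in the reference is just the Unicode ordinal.
number = ord("\u00a9")

# An entity reference uses a name instead; HTML defines "copy"
# for this code point.
entity_ref = "&" + codepoint2name[number] + ";"

print(char_ref, number, entity_ref)
```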

> (3) Is there a way to find out at runtime what encoders/decoders/error
> handlers are available, and what they do? 

Not through Python code. In C code, you can look at the
codec_error_registry field of the interpreter object.
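
Although the registry cannot be enumerated, a codec or error handler
can be looked up by name if you already know it. A minimal sketch,
assuming Python 3 and the standard codecs module; the name
"no-such-handler" is made up to show the failure case.

```python
import codecs

# Look up a known error handler by name; the result is a callable.
handler = codecs.lookup_error("xmlcharrefreplace")

# An unknown name (hypothetical here) raises LookupError.
try:
    codecs.lookup_error("no-such-handler")
    found = True
except LookupError:
    found = False

# Codecs can be looked up by name in the same way.
info = codecs.lookup("latin-1")

print(callable(handler), found, info.name)
```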

Regards,
Martin


