[XML-SIG] HTML<->UTF-8 'codec'?

M.-A. Lemburg mal@lemburg.com
Fri, 19 Oct 2001 20:08:48 +0200


"Fred L. Drake, Jr." wrote:
> 
> Bill Janssen writes:
>  > First off, this seems like an obvious thing to do, so has someone
>  > already done it?  Or is there some obvious flaw in the idea which
>  > I just haven't seen?
> 
>   I haven't seen it, either, but it would be really nice.  Most people
> don't want to end up with &#...; character references; they'd rather
> have the general entity references.

I've written one of these for a customer, but I can't release it.

Note that even though humans tend to like named entities a lot,
numeric character references are usually much easier to handle and
parse (just think of the hoops a parser has to jump through to
resolve named entities correctly in XML...).
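
To illustrate the difference, here is a minimal sketch in current
Python 3 (not the customer codec mentioned above, and not something
that existed in this form in 2001). The helper names to_numeric_refs
and from_refs are made up for this example. It assumes the standard
html.entities table is an acceptable source of entity names, and it
does not escape a literal "&" in the input.

```python
import re
from html.entities import name2codepoint

def to_numeric_refs(text):
    """Replace every non-ASCII character with a decimal &#N; reference."""
    # The 'xmlcharrefreplace' error handler does all the work here.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

# Matches &#123; or &#x7B; or &name;
_REF = re.compile(r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")

def from_refs(text):
    """Expand numeric references and the entity names HTML defines."""
    def expand(match):
        body = match.group(1)
        if body[:2] in ("#x", "#X"):
            return chr(int(body[2:], 16))      # hexadecimal reference
        if body.startswith("#"):
            return chr(int(body[1:]))          # decimal reference
        if body in name2codepoint:
            return chr(name2codepoint[body])   # named entity, needs a table
        return match.group(0)                  # unknown name: leave untouched
    return _REF.sub(expand, text)
```

The numeric branches need nothing but int() and chr(), while the
named branch depends on a lookup table that both sides have to agree
on, which is the point made above.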
 
>  > Secondly, is there any documentation on the _codecs module, which
>  > seems full of interesting and useful stuff for this purpose?
> 
>   No.  There is limited documentation on the codecs module, though.
> If you'd like to extend that while you're at it, I'd certainly
> appreciate it!

The _codecs module is basically just a helper to make the internal
codecs available. All of these are documented in great detail 
in the C API reference and the unicodeobject.h header file.
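
As a rough sketch of how the helper surfaces from Python code, the
following uses only the public codecs module as it exists in current
Python 3 (an assumption; the details differed in 2001). The variable
names are just for illustration.

```python
import codecs

# Look up the codec registered under a given encoding name.
info = codecs.lookup("utf-8")

# Encoders and decoders return a (result, length consumed) pair.
encoded, consumed = info.encode("\u20ac")
decoded, used = info.decode(encoded)

# The built-in UTF-8 encoder from _codecs is re-exported by codecs.
same = codecs.utf_8_encode("\u20ac")
```

Here consumed counts input characters and used counts input bytes,
which is why the two numbers differ for a non-ASCII character.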
 
>  > Thirdly, what's the equivalent of chr() for Unicode characters?
> 
>   unichr() is a built-in function which does this; see the docs if you
> need more information.
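
A short sketch of the round trip. Note that in Python 3 the separate
unichr() built-in no longer exists and chr() covers the whole Unicode
range, so the fallback below is only there to let the same lines run
on either version.

```python
try:
    unichr              # Python 2: separate built-in for Unicode
except NameError:
    unichr = chr        # Python 3: chr() already returns Unicode characters

euro = unichr(0x20AC)   # code point -> one-character string
code = ord(euro)        # and back to the integer code point
```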

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Consulting & Company:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/