XML / Unicode / SAX question

Wed Jul 4 02:18:03 EDT 2007

IamIan wrote:
> I am using SAX to parse XML that has numeric html entities I need to
> convert and feed to JavaScript as part of a CGI. I can get the
> characters to print correctly, but not without being surrounded by
> linebreaks:
>
>   def characters(self, ch):
>     if self.isNews:
>       ch = unescape(ch)
>       print ch

The print statement appends a line break to whatever it prints. Use

    print ch,

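instead. (The trailing comma makes print separate items with a space rather
than a newline; sys.stdout.write(ch) adds nothing at all.) Note that you have
to merge character sequences yourself in SAX. There is no guarantee about how
many chunks the text content of a single element is split into before it is
passed to the characters() SAX method.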
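
A minimal sketch of that merging, in case it helps. The element name "news",
the class NewsHandler and the function parse_news are made up for
illustration, and the code is written against a current Python's xml.sax, not
tested on 2.3. It collects the chunks in characters() and joins them in
endElement(), so your own unescape() could then be applied once to the
joined string instead of to each chunk:

```python
import xml.sax
from xml.sax.handler import ContentHandler


class NewsHandler(ContentHandler):
    """Collects the complete text of each <news> element (hypothetical tag)."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.isNews = False   # True while inside a <news> element
        self.chunks = []      # text pieces seen for the current element
        self.items = []       # one merged string per <news> element

    def startElement(self, name, attrs):
        if name == "news":
            self.isNews = True
            self.chunks = []

    def characters(self, ch):
        # May be called several times for one element, so only buffer here.
        if self.isNews:
            self.chunks.append(ch)

    def endElement(self, name):
        if name == "news":
            # Merge the buffered chunks into a single string.
            self.items.append("".join(self.chunks))
            self.isNews = False


def parse_news(xml_bytes):
    """Parse a byte string and return the merged text of every <news>."""
    handler = NewsHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.items
```

The result no longer depends on where the parser happens to split the text,
and you can write each merged string out in one go.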

> For a line like 'Mark à Capbreton'
> my results print as:
> 'Mark
> à
> Capbreton'
> 
> Is this another SAX quirk? I've already had to hack my way around SAX
> not being able to split results on a colon. No matter if I try strip,
> etc the results are always the same: newlines surrounding the html
> entities. I'm using version 2.3.5 and need to stick to the standard
> libraries. Thanks.

Too bad. If an external library were acceptable (Python 2.3 is ok), I would
have proposed lxml, maybe lxml.html (which will be in lxml 2.0), or the Atom
implementation built on top of lxml.etree.

http://codespeak.net/lxml
http://codespeak.net/svn/lxml/branch/html/
https://svn.openplans.org/svn/TaggerClient/trunk/taggerclient/atom.py

Hope it helps,
Stefan