[XML-SIG] Character entities (XHTML)

Thomas B. Passin tpassin@comcast.net
Tue, 07 May 2002 19:49:39 -0400


I'm sure others have told you the same thing.  When an xml parser
parses your file, it replaces any character references and entities
with their corresponding characters.  There is no memory of how they
came to be.
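
A minimal sketch of that behaviour, assuming a current Python 3 standard
library (xml.etree.ElementTree) rather than the PyXML modules in the
session quoted below; the sample markup is made up for illustration:

```python
import xml.etree.ElementTree as ET

# Two different spellings of the same character (U+00F3) in the source.
source = "<p>initialisaci&#243;n, initialisaci&#xF3;n</p>"

elem = ET.fromstring(source)

# After parsing, both references are just the character itself; nothing
# records that one was written in decimal and the other in hex.
text = elem.text
print(repr(text))

# The serializer decides how to write the character back out, based on
# the output encoding, not on what the source file happened to use.
as_unicode = ET.tostring(elem, encoding="unicode")
as_ascii = ET.tostring(elem, encoding="us-ascii")
print(as_unicode)
print(as_ascii)
```

With an ASCII output encoding the serializer has to fall back to a
numeric reference of its own choosing, which is why the original
spelling cannot be expected to survive a round trip.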

There are really only two ways to get the entities back into the
output.  Either you use or write a serializer to replace certain
characters with your entities (and which ones to replace will depend on
the encoding), or you do some preprocessing to replace the entities with
some encoded version, then convert them back with postprocessing.
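
Both approaches can be sketched roughly as follows.  This assumes the
Python 3 standard library, not PyXML; the function names and the
"[[ent:name]]" marker are hypothetical choices for illustration, and the
marker only works if that text never occurs in the real document.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import codepoint2name
from xml.sax.saxutils import escape


def escape_with_entities(text):
    """Approach 1: a serializer step that writes named entities.

    Escapes &, < and > as usual, then replaces every non-ASCII character
    that has an HTML entity name with &name;.
    """
    out = []
    for ch in escape(text):
        name = codepoint2name.get(ord(ch)) if ord(ch) > 127 else None
        out.append("&%s;" % name if name else ch)
    return "".join(out)


# Approach 2: hide the entities from the parser, restore them afterwards.
XML_PREDEFINED = {"lt", "gt", "amp", "quot", "apos"}
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")
MARKER_RE = re.compile(r"\[\[ent:([A-Za-z][A-Za-z0-9]*)\]\]")


def protect_entities(source):
    """Preprocessing: turn &name; into [[ent:name]] (hypothetical marker)."""
    def repl(match):
        name = match.group(1)
        if name in XML_PREDEFINED:
            return match.group(0)  # leave the parser's own entities alone
        return "[[ent:%s]]" % name
    return ENTITY_RE.sub(repl, source)


def restore_entities(serialized):
    """Postprocessing: turn [[ent:name]] back into &name;."""
    return MARKER_RE.sub(r"&\1;", serialized)


escaped = escape_with_entities("\u00a1Hola! initialisaci\u00f3n, a < b")
print(escaped)

source = '<a href="init">initialisaci&oacute;n &lt; x</a>'
elem = ET.fromstring(protect_entities(source))
round_trip = restore_entities(ET.tostring(elem, encoding="unicode"))
print(round_trip)
```

The first approach needs a decision about which characters to replace;
here it is simply "anything outside ASCII that has a name", which is
only one possible policy.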

But here you seem to be running HTML Tidy, which may not even be
handling the source file as xml, depending on how you have configured
it.  Otherwise, how did the Tidy "meta" element get into it when you
don't show it in the source file?  In fact, without a DTD your file
cannot be processed by an ordinary xml parser because the values of the
entities cannot be known.  So chances are you ran it in xhtml mode, not
xml mode, before you fed the result to the Python modules.  This list
probably isn't going to be able to help you with the idiosyncrasies of
Tidy.
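
The point about the missing DTD can be checked with a small sketch,
again assuming the Python 3 standard library instead of PyXML; the
one-line documents are invented for the test:

```python
import xml.etree.ElementTree as ET

# No DOCTYPE, so the parser has no definition for &oacute;.
source = "<p>initialisaci&oacute;n</p>"

try:
    ET.fromstring(source)
    parsed = True
except ET.ParseError as err:
    parsed = False
    print("parse failed:", err)

# The five predefined entities (lt, gt, amp, quot, apos) need no DTD.
predefined = ET.fromstring("<p>a &lt; b</p>").text
print(repr(predefined))
```

That difference would also be consistent with &lt; surviving in the
quoted output while oacute and iexcl do not.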

Maybe you didn't do that, but you need to explain what you really did,
because the way you show it, the Python program couldn't have completed
and there would be no "meta" element.

Cheers,

Tom P

[Andrew Cooke]

In the example below I am losing the XHTML character entities.  How do
I avoid this?
I've also posted to the ng - apologies to people seeing this same
question twice.

Andrew

Input file:
<html>
  <head>
    <link type="text/css" rel="stylesheet" href="basic.css" />
    <title>Index</title>
  </head>
  <body>
  <h1>¡Hola!</h1>
<a href="init">initialisaci&oacute;n</a>
  </body>
</html>

And when this is processed, I see (note that the SGML entity &lt; does
appear, but oacute and iexcl don't):
F:\home\Andrew\multi\src\xhtml>python
Python 2.2.1 (#34, Apr  9 2002, 19:34:33) [MSC 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from xml.dom.ext.reader.Sax2 import FromXmlFile
>>> from xml.dom.ext import PrettyPrint
>>> PrettyPrint(FromXmlFile("index.xhtml"))
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE html>
<html xmlns='http://www.w3.org/1999/xhtml'>
  <head>
    <meta content='HTML Tidy for Cygwin (vers 1st April 2002), see www.w3.org' name='generator'/>
    <link href='basic.css' rel='stylesheet' type='text/css'/>
    <title>Index</title>
  </head>
  <body>
    <h1>&lt;Hola!</h1>
    <a href='init'>initialisacin</a>
  </body>
</html>
>>>






