nonstandard XML character entities?

Paul Rubin http
Sat Apr 14 15:44:43 EDT 2007


"Martin v. Löwis" <martin at v.loewis.de> writes:
> If they contain such things, and do not contain a document type
> definition, they are not well-formed XML files (i.e. can't be
> called "XML" in a meaningful sense).

The documents do have a DTD; however, the DTD file doesn't say anything
about these entities.

> It would have been helpful if you had given an example of such
> a document.

I can't post a whole document because these docs are very large
and I'm not sure that the data is public.  It does look like the DTD
is public: the document begins with

   <?xml version="1.0"  encoding="ISO-8859-1"?>
   <!DOCTYPE ONIXmessage SYSTEM "http://www.editeur.org/onix/2.1/short/onix-international.dtd">
   <ONIXmessage release="2.1">
   ...

and that URL points to the DTD, which is online.

Basically the doc has elements like

   <b036>Diana Montan&eacute;</b036>
 
and both ElementTree and xmllint complain about the character entities
(and there are a lot of them).

> If there is a document type declaration in the document, the best
> way is to parse it in a mode where the parser downloads the DTD
> when parsing it, and resolves the entity references itself.

Hmm, ok, I see there are a lot of <!ENTITY ...> directives in the
DTD but nothing about those character entities--am I looking in
the right place?
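
One way to check is to list every <!ENTITY ...> declaration in the DTD
text and separate ordinary (general) entities from parameter entities.
An external parameter entity is one place where character entity
definitions can live in a separate file. The sketch below assumes
nothing about the real ONIX DTD: SAMPLE_DTD, the file name charents.ent
and the helper list_entities are all hypothetical, and the regular
expression is deliberately naive (it would break on a ">" inside a
quoted value).

```python
import re

# Hypothetical DTD fragment for illustration only; NOT copied from the ONIX DTD.
SAMPLE_DTD = '''
<!ENTITY % charents SYSTEM "charents.ent">
%charents;
<!ENTITY copy "&#169;">
<!ELEMENT b036 (#PCDATA)>
'''

# Naive pattern: optional "%" marks a parameter entity, then the name,
# then everything up to the closing ">" as the raw definition.
ENTITY_DECL = re.compile(r'<!ENTITY\s+(%\s+)?([^\s>]+)\s+([^>]*)>')


def list_entities(dtd_text):
    """Return (general, parameter) dicts mapping entity name -> raw definition."""
    general, parameter = {}, {}
    for is_param, name, definition in ENTITY_DECL.findall(dtd_text):
        target = parameter if is_param else general
        target[name] = definition.strip()
    return general, parameter


general, parameter = list_entities(SAMPLE_DTD)
print("general entities:  ", general)
print("parameter entities:", parameter)
```

If the parameter dict shows entries with a SYSTEM identifier, those
external files would be the next place to look for the missing names.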

> In ElementTree, the XMLTreeBuilder has an attribute entity
> which is a dictionary used to map entity names in entity references
> to their definitions. Whether you can make the parser download
> the DTD itself, I don't know.

Chuck Rhode posted some code for something like this so I'll try it
on Monday.
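
For reference, the entity-dictionary approach quoted above might look
roughly like the sketch below. It makes several assumptions: a current
Python 3, where xml.etree.ElementTree.XMLParser plays the role of
XMLTreeBuilder and still exposes the entity dictionary; that the
undefined names are the usual HTML ones (eacute and friends), taken here
from html.entities; and a tiny made-up document (SAMPLE_DOC) assembled
from the fragments shown earlier. The DTD is not downloaded.

```python
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# Made-up sample built from the fragments in this message (an assumption,
# not real ONIX data). The DOCTYPE with a SYSTEM identifier matters: with
# it, expat hands unknown entity references to ElementTree instead of
# failing immediately.
SAMPLE_DOC = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>\n'
    b'<!DOCTYPE ONIXmessage SYSTEM '
    b'"http://www.editeur.org/onix/2.1/short/onix-international.dtd">\n'
    b'<ONIXmessage release="2.1">'
    b'<b036>Diana Montan&eacute;</b036>'
    b'</ONIXmessage>'
)

parser = ET.XMLParser()
# Map each HTML entity name to its character, e.g. "eacute" -> "\xe9".
# Assumption: the document only uses names from this HTML set.
for name, codepoint in name2codepoint.items():
    parser.entity[name] = chr(codepoint)

root = ET.fromstring(SAMPLE_DOC, parser=parser)
author = root.findtext("b036")
print(repr(author))
```

Any entity name missing from the dictionary would still raise a parse
error, so the mapping has to cover every name the documents use.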

Thanks!


