lxml precaching DTD for document verification.

Stefan Behnel stefan_ml at behnel.de
Mon Nov 28 02:38:12 EST 2011


Gelonida N, 27.11.2011 18:57:
> I'd like to verify some (x)html / html5 / xml documents from a server.
>
> These documents have a very limited number of different doc types / DTDs.
>
> So what I would like to do is to build a small DTD cache and some code
> that would avoid fetching the DTDs from the net over and over.
>
> What would be the best way to do this?

Configure your XML catalogues.


> I guess that
> the fields of an ElementTree that I have to look at are
> docinfo.public_id
> docinfo.system_url

Yes, catalogue lookups generally happen through the public ID.
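
As a rough sketch of where those values come from, the snippet below parses a small made-up XHTML document from memory and prints its DOCTYPE identifiers. The sample document is only an illustration. Note that lxml spells the system identifier attribute `system_url`. The default parser does not load the DTD, so nothing is fetched from the net here.

```python
import io

from lxml import etree

# Made-up sample document; only the DOCTYPE matters for this sketch.
doc = b"""<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>sample</title></head>
  <body/>
</html>"""

# The default parser neither loads the DTD nor touches the network.
tree = etree.parse(io.BytesIO(doc))
info = tree.docinfo

# The public ID is the key that a catalogue lookup would normally use.
print(info.public_id)
print(info.system_url)
```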


> There's also mention of a catalogue, but I don't know how to
> use a catalogue and how to know what is inside my catalogue
> and what isn't.

Does this help?

http://lxml.de/resolvers.html#xml-catalogs

http://xmlsoft.org/catalog.html

They should normally come pre-configured on Linux distributions, but you 
may have to install additional packages with the respective DTDs. Look for 
any packages with "dtd" and "html" in their name, for example.
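
If no suitable package is available, a private catalogue can be written by hand. The following is a minimal sketch, not a tested recipe: it writes an OASIS XML catalogue that maps one public ID to a local DTD file and points libxml2 at it through the XML_CATALOG_FILES environment variable. The cache directory, the DTD file name and the helper `write_catalog` are hypothetical. You would still have to put the DTD file there yourself, and the variable should be set before lxml parses anything, since libxml2 may read its catalogue configuration only once.

```python
import os
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

CATALOG_NS = "urn:oasis:names:tc:entity:xmlns:xml:catalog"


def write_catalog(path, entries):
    """Write an OASIS XML catalogue mapping public IDs to local DTD URIs."""
    # Serialise the catalogue namespace as the default namespace.
    ET.register_namespace("", CATALOG_NS)
    root = ET.Element("{%s}catalog" % CATALOG_NS)
    for public_id, uri in entries.items():
        ET.SubElement(
            root,
            "{%s}public" % CATALOG_NS,
            {"publicId": public_id, "uri": uri},
        )
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)
    return path


# Hypothetical cache location; the DTD file itself is NOT created here.
cache_dir = tempfile.mkdtemp(prefix="dtd-cache-")
local_dtd = Path(cache_dir, "xhtml1-strict.dtd")

catalog_path = write_catalog(
    os.path.join(cache_dir, "catalog.xml"),
    {"-//W3C//DTD XHTML 1.0 Strict//EN": local_dtd.as_uri()},
)

# XML_CATALOG_FILES is a space-separated list, so paths containing
# spaces would need extra care.
os.environ["XML_CATALOG_FILES"] = catalog_path
```

A parser created with load_dtd=True or dtd_validation=True should then find the DTD through the catalogue instead of going to the network.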

Stefan



