validating XML

Thu Jun 14 15:19:31 EDT 2012

andrea crotti <andrea.crotti.0 at gmail.com> writes:
> ...
> The reason is that it has to work on many platforms and without any c module
> installed, the reason of that

If you are looking for a pure Python solution, you might have a look at "PyXB".

It has not been designed to validate XML instances against XML-Schema
(but to map between XML instances and Python objects based on
an XML-Schema description) but it detects many problems in the
XML instances. It does not introduce its own C extensions
(but relies on an XML parser shipped with Python).

> Anyway in a sense it's also quite interesting, and I don't need to implement
> the whole XML, so it should be fine.

The XML is the lesser problem. The big problem is XML-Schema: it is
*very* complex with structure definitions (elements, attributes and
"#PCDATA"), inheritance, redefinition, grouping, scoping rules, inclusion,
data types with restrictions and extensions.

Thus, if you want to implement a reliable algorithm which, for a
given XML-Schema and XML instance, checks whether the instance is
valid with respect to the schema, then you have a really big task.

Maybe you have a fixed (and quite simple) schema. Then
you may be able to implement a validator (for the fixed schema).
But I do not understand why you would want such a validation.

If you generate the XML instances, then thoroughly test your
generation process (using any available validator) and then trust it.

If the XML instances come from somewhere else and must be interpreted
by your application, then the important thing is that they are
understood by your application, not that they are valid.
If you get a complaint that your application cannot handle a specific
XML instance, then you validate it in your development environment
(again with any validator available) and if the validation fails,
you have good arguments.

> What I haven't found yet is an explanation of a possible algorithm to use for
> the validation, that I could then implement..

You parse the XML (and get a tree) and then recursively check
that the elements, attributes and text nodes in the tree
conform to the schema. In an abstract sense,
the schema is a collection of content models for the various elements;
each content model tells you what the element content and attributes
should look like.
For a simple schema, this is straightforward. If the schema starts
to include foreign schemas, uses extensions, restrictions or "redefine"s,
then it gets considerably more difficult.
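
For the simple, fixed-schema case, the recursive check could be sketched
roughly as follows. This is only an illustration, not real XML-Schema
processing: the SCHEMA dictionary, its layout and the validate() function
are made up for this example, child order is ignored, and namespaces,
data types, inclusion and "redefine" are not handled at all. It uses only
the ElementTree parser shipped with Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical hand-written "schema": one content model per element name.
# attrs:    attribute name -> True if required, False if optional
# children: child element name -> (min, max) occurrences, max None = unbounded
# text:     whether non-whitespace text content is allowed
SCHEMA = {
    "library": {"attrs": {}, "children": {"book": (1, None)}, "text": False},
    "book": {"attrs": {"id": True, "lang": False},
             "children": {"title": (1, 1), "author": (1, None)},
             "text": False},
    "title": {"attrs": {}, "children": {}, "text": True},
    "author": {"attrs": {}, "children": {}, "text": True},
}


def validate(elem, schema, path=""):
    """Return a list of error strings for elem and everything below it."""
    path = f"{path}/{elem.tag}"
    model = schema.get(elem.tag)
    if model is None:
        return [f"{path}: unknown element"]
    errors = []

    # Attributes: every required one present, no undeclared ones.
    for name, required in model["attrs"].items():
        if required and name not in elem.attrib:
            errors.append(f"{path}: missing attribute '{name}'")
    for name in elem.attrib:
        if name not in model["attrs"]:
            errors.append(f"{path}: unexpected attribute '{name}'")

    # Text nodes: non-whitespace text only where the model allows it.
    texts = [elem.text] + [child.tail for child in elem]
    if not model["text"] and any(t and t.strip() for t in texts):
        errors.append(f"{path}: unexpected text content")

    # Children: count occurrences and recurse into each declared child.
    counts = {}
    for child in elem:
        counts[child.tag] = counts.get(child.tag, 0) + 1
        if child.tag in model["children"]:
            errors.extend(validate(child, schema, path))
        else:
            errors.append(f"{path}: unexpected child '{child.tag}'")
    for name, (low, high) in model["children"].items():
        n = counts.get(name, 0)
        if n < low or (high is not None and n > high):
            limit = "unbounded" if high is None else high
            errors.append(
                f"{path}: {n} '{name}' children, expected {low}..{limit}")
    return errors


if __name__ == "__main__":
    doc = ET.fromstring(
        '<library><book id="b1"><title>T</title>'
        '<author>A</author></book></library>')
    for problem in validate(doc, SCHEMA):
        print(problem)
```

Returning a list of messages instead of raising on the first problem
is a design choice of the sketch: it lets you report everything that
is wrong with an instance in one pass.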

--
Dieter