extract occurrence of regular expression from elements of XML documents

Tue Mar 16 03:50:30 EDT 2010

Martin Schmidt, 15.03.2010 18:16:
> I have just started to use Python a few weeks ago and until last week I had
> no knowledge of XML.
> Obviously my programming knowledge is pretty basic.
> Now I would like to use Python in combination with ca. 2000 XML documents
> (about 30 kb each) to search for certain regular expression within specific
> elements of these documents.

2000 * 30 KB isn't a huge problem, that's just 60 MB in total. If you just have 
to do it once, drop your performance concerns and just get a solution 
going. If you have to do it once a day, take care to use a tool that is not 
too resource consuming. If you have strict requirements to do it once a 
minute, use a fast machine with a couple of cores and do it in parallel. If 
you have a huge request workload and want to reverse index the XML to do 
all sorts of sophisticated queries on it, use a database instead.

> I would then like to record the number of occurrences of the regular
> expression within these elements.
> Moreover I would like to count the total number of words contained within
> these,

len(text.split()) will give you those.
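
For illustration, here is a minimal sketch that covers both counts for one
chunk of text. The helper name, the pattern and the sample sentence are made
up for the example; "word" here simply means a whitespace-separated token.

```python
import re

def count_words_and_matches(text, pattern):
    """Return (word_count, match_count) for one chunk of element text."""
    text = text or ""  # element.text is None for empty elements
    word_count = len(text.split())  # whitespace-separated tokens
    match_count = len(pattern.findall(text))  # non-overlapping matches
    return word_count, match_count

# Hypothetical pattern and sample text, purely for illustration.
pattern = re.compile(r"\bprofit\w*", re.IGNORECASE)
words, hits = count_words_and_matches("Profit rose; profits were up again.", pattern)
```

Note that findall() counts non-overlapping matches only, which is usually
what you want when counting occurrences.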

BTW, is it document-style XML (with mixed content as in HTML) or is the 
text always within a leaf element?

> and record the attribute of a higher level element that contains
> them.

An example would certainly help here.

> I was trying to figure out the best way how to do this, but got overwhelmed
> by the available information (e.g. posts using different approaches based on
> dom, sax, xpath, elementtree, expat).
> The outcome should be a file that lists the extracted attribute, the number
> of occurrences of the regular expression, and the total number of words.
> I did not find a post that addresses my problem.

Funny that you say that after stating that you were overwhelmed by the 
available information.

> If someone could help me with this I would really appreciate it.

Most likely, the solution with the best simplicity/performance trade-off 
is to use xml.etree.cElementTree's iterparse(), intercept each element 
with an interesting tag name, and search its text/tail using the regexp. 
That's doable in a couple of lines.

But unless you provide more information, it's hard to give better advice.
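
As a rough sketch of the iterparse() approach, and only under assumptions I
had to invent: the tag names ("para", "section"), the attribute ("id"), the
pattern and the sample document are all hypothetical, and the enclosing
elements are assumed not to be nested in each other. It uses
xml.etree.ElementTree, which provides the same iterparse() API, and
itertext() so that mixed content inside the element is included.

```python
import io
import re
import xml.etree.ElementTree as ET

def scan_document(source, tag, attr_tag, attr_name, pattern):
    """Yield (attribute, match_count, word_count) for each <tag> element."""
    current_attr = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            # Attributes are already available when the element opens.
            if elem.tag == attr_tag:
                current_attr = elem.get(attr_name)
        elif elem.tag == tag:
            # Join the element's own text with that of its descendants.
            text = "".join(elem.itertext())
            yield current_attr, len(pattern.findall(text)), len(text.split())
            elem.clear()  # free memory for content we are done with
        elif elem.tag == attr_tag:
            current_attr = None  # left the enclosing element
            elem.clear()

# Hypothetical sample document, purely for illustration.
sample = b"""<report>
  <section id="A1"><para>Profit rose. <b>Profits</b> were up.</para></section>
  <section id="B2"><para>No change.</para></section>
</report>"""
rows = list(scan_document(io.BytesIO(sample), "para", "section", "id",
                          re.compile(r"profit\w*", re.IGNORECASE)))
```

Each row holds the attribute, the number of matches and the number of words,
so writing the output file is a matter of looping over the 2000 documents
and passing the rows to something like csv.writer.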

Stefan