Bottleneck? More efficient regular expression?

Wed Sep 24 15:37:23 EDT 2003

Tina Li:
> Yes, I've shortened the XML snippet considerably as
> the data was obscurely long ...

But since there wasn't a problem with the code you posted
it's hard to figure out what's wrong.  (And since it wasn't
in XML it made it even harder for people to help.)

If it's too long (and a threaded alignment which includes the
full PDB structure is too long), put it on a web site somewhere
and give a URL to it.

> Andrew: Thanks for the XML code. I've written up something
> similar using xml.parsers.expat, but it's conceivably slower
> than regexp.

Conceivable, yes.  But 1) did you test it, and 2) would it make a
difference?

> And thanks for suggesting biopython.org. But I need to have it up
> quick so maybe cobbling something
> together myself is faster than reading through their documentation.

Not only that, but I don't know of anything in biopython for
handling that code.  It's a generic XML parsing question, so a
generic tool is the best, like Fredrick's ElementTree.

Here's a question for you.  When is it easier to read the
documentation for existing code (which has been tested)
then it is to write and debug your own code?

You can write the regex in a way that's easier to read

pattern = re.compile (
   r'<pdbcode>(?P<pdbcode>.*?)</pdbcode>.*?'
   r'<targetSeq name="(?P<targetName>.*?)">.*?'
   r'<target>(?P<target>.*?)</target>.*?'
   r'<align>(?P<align>.*?)</align>.*?'
   r'<template>(?P<template> .*?)</template>.*?'
   r'<pdbBackbone>(?P<pdbBackbone>.*?)</pdbBackbone>.*?'
   r'<queryGaps>(?P<queryGaps>.*?)</queryGaps>.*?'
   r'<templateGaps>(?P<templateGaps>.*?)</templateGaps>',
                                   re.S)

You can also make the pattern a bit less ambiguous,
eg, use [^>]* instead of .*? when you are inside an element,
which turns

   r'<pdbcode>(?P<pdbcode>.*?)</pdbcode>.*?'

into

   r'<pdbcode>(?P<pdbcode>[^<]*)</pdbcode>.*?'

(and use [^"]* instead of .*? for getting the text of an
attribute.)

You can get rid of the other ambiguity (skipping characters
until the start of the next tag) by using something like

  ([^<]|<(?!queryGaps))*<queryGaps

instead of

  .*?<queryGaps

And you can get rid of all ambiguities by having your
pattern match the text completely, that is, capturing all
fields even if ignore a few.  In that way you don't have to
write code which skips over tags.  That's easily done
with code which generated your pattern, as in

def match_tag(tagname, groupname):
  return "<%s>(?P<%s>[^<]*)</%s>" % (tagname, groupname)

pattern = "[^<]*".join(match_tag("pdbcode", "pdbcode"), ...)

You'll need to extend it to support matching fields with
attributes.

In any case, what you have won't support XML like

<pdbcode   >2PLV</pdbcode>

(I put extra spaces in the open tag.)

Instead, you are writing parsing code for the specific XML
subset your threading code produces, which may change in
the future.  That's why you should use an XML parser.

And if you want to support only this specific format, it's
still easier to write a traditional line-oriented parser.

def readcheck(infile, start):
  line = infile.readline()
  assert line.startswith(start)
  return line

def simpletag(line, convert = str):
  i = line.find(">")
  j = line.find("<", i)
  return convert(line[i:j])

infile = open("threading.xml")

line = readcheck(infile, "<threading")
_, name, _, source, _, template, _ = line.split('"')

line = readcheck(infile, "<pdbcode")
pdbcode = simpletag(line)
pdbchain = simpletag(readcheck(infile, "<pdbchain"))
templateName = simpletag(readcheck(infile, "<templateName"))

# don't save the settings
while 1:
  line = infile.readline()
  assert line
  if line.startswith("</settings>"):
    break

...

This may be clumsier or more tedious, but it's easy to understand
and the ways it fails are much easier to diagnose than regexps.

That said, I like parsing with regexps.  But this isn't the right
place to use them.

                    Andrew
                    dalke at dalkescientific.com