regular expression to extract text

Lonnie Princehouse fnord at u.washington.edu
Thu Nov 20 14:55:45 EST 2003


One of the beautiful things about Python is that,
while there is usually one obvious and reasonable
way to do something, there are many many ridiculous
ways to do it as well.  This is especially true when
regular expressions are involved.

I'd do it like this:  (Note that this wants the whole file as 
one string, so use read() instead of readline())


data = """
Using unit cell orientation matrix from collect.rmat
NOTICE: Performing automatic cell standardization
The following database entries have similar unit cells:
Refcode     Sumformula
      <Conventional cell parameters>
------------------------------------------
QEXZUO     C26 H31 N1 O3
         6.164   15.892   22.551    90.00    90.00    90.00
------------------------------------------
ARQTYD     C19 H23 N1 O5
         6.001   15.227   22.558    90.00    90.00    90.00
------------------------------------------
NHDIIS     C45 H40 Cl2
         6.532   15.147   22.453    90.00    90.00    90.00 """

import re

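# Raw strings keep the backslashes intact for the regex engine.
# r1: a dashed rule, then the refcode and everything up to the next dash (or end of text)
r1 = re.compile(r'-+\n([A-Z]+)(.*?)(?:-|$)', re.DOTALL)
# r2: element symbol plus count, e.g. C26 or Cl2 (re.I lets the lowercase "l" match)
r2 = re.compile(r'([A-Z]+\d+)', re.I)
# r3: decimal numbers such as 6.164
r3 = re.compile(r'(\d+\.\d+)')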

results = dict([ (name, {
            'isotopes': r2.findall(body), 
            'values': [float(x) for x in r3.findall(body)]
        }) for name, body in r1.findall(data) ])



I assume that you want the numbers as floats instead of strings; 
if you're just going to print them out again, don't call float().

I also assume (perhaps wrongly) that the order of entries isn't 
important.  Don't do the dict() conversion if that assumption's wrong.

This yields:

{'ARQTYD': {'isotopes': ['C19', 'H23', 'N1', 'O5'],
            'values': [6.0010000000000003, 
                       15.227, 
                       22.558, 
                       90.0, 
                       90.0, 
                       90.0]},
 'NHDIIS': {'isotopes': ['C45', 'H40', 'Cl2'],
            'values': [6.532, 
                       15.147, 
                       22.452999999999999, 
                       90.0, 
                       90.0, 
                       90.0]},
 'QEXZUO': {'isotopes': ['C26', 'H31', 'N1', 'O3'],
            'values': [6.1639999999999997,
                       15.891999999999999,
                       22.550999999999998,
                       90.0,
                       90.0,
                       90.0]}}
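
If the order of entries does matter, something along these lines 
should work.  It is only a sketch: it reuses the same three patterns 
but keeps the (name, info) pairs in a list instead of passing them to 
dict().  The names parse_entries and sample are made up for 
illustration, and the sample is a shortened copy of the data above.

```python
import re

# Same three patterns as above, written as raw strings.
ENTRY = re.compile(r'-+\n([A-Z]+)(.*?)(?:-|$)', re.DOTALL)
ATOM = re.compile(r'([A-Z]+\d+)', re.I)
NUMBER = re.compile(r'(\d+\.\d+)')


def parse_entries(text):
    """Return a list of (refcode, info) pairs in the order they appear."""
    return [(name, {'isotopes': ATOM.findall(body),
                    'values': [float(x) for x in NUMBER.findall(body)]})
            for name, body in ENTRY.findall(text)]


# Shortened sample (hypothetical stand-in for the real file contents).
sample = (
    "------\n"
    "QEXZUO     C26 H31 N1 O3\n"
    "         6.164   15.892   22.551    90.00    90.00    90.00\n"
    "------\n"
    "ARQTYD     C19 H23 N1 O5\n"
    "         6.001   15.227   22.558    90.00    90.00    90.00\n"
)

entries = parse_entries(sample)
for name, info in entries:
    print(name, info['isotopes'], info['values'])
```

For a real file, pass the whole contents in one go, e.g. 
parse_entries(open(path).read()), where path is whatever your 
file is called.  A list of pairs can still be turned into a dict 
later with dict(entries) if lookup by refcode is needed.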



