regular expression to extract text

Thu Nov 20 10:52:04 EST 2003

"Peter Hansen" <peter at engcorp.com> wrote in message
news:3FBCDFB3.E01417E1 at engcorp.com...
> Mark Light wrote:
> >
> > Hi, I have a file read in as a string that looks like the text below.
> > What I want to do is pull out the bits of information to eventually
> > put in an HTML table. For the 1st example the 3 bits are:
> > 1. QEXZUO
> > 2. C26 H31 N1 O3
> > 3. 6.164   15.892   22.551    90.00    90.00    90.00
> >
> > Any ideas on the best way to do this? I was trying regular expressions
> > but not getting very far.
> >
> > Thanks,
> >
> > Mark.
> >
> > """
> > Using unit cell orientation matrix from collect.rmat
> > NOTICE: Performing automatic cell standardization
> > The following database entries have similar unit cells:
> > Refcode     Sumformula
> >       <Conventional cell parameters>
> > ------------------------------------------
> > QEXZUO     C26 H31 N1 O3
> >          6.164   15.892   22.551    90.00    90.00    90.00
> > ------------------------------------------
> > ARQTYD     C19 H23 N1 O5
> >          6.001   15.227   22.558    90.00    90.00    90.00
> > ------------------------------------------
> > NHDIIS     C45 H40 Cl2
> >          6.532   15.147   22.453    90.00    90.00    90.00 """
>
> I don't think you've given enough information here.  Are those
> "bits" supposed to be kept intact, complete with internal spacing,
> or are you doing more manipulation of them?  What is the definition
> of the "bits"?  Specifically, is bit 1 "the first non-space token
> after a line of hyphens"?  Is bit 2 "everything on the line after
> bit 1, with leading and trailing spaces stripped"?  Is bit 3
> "everything on the following line, with leading/trailing spaces
> stripped"?
>
> Those definitions roughly fit what you describe, and if that's
> all you need, the solution should be pretty trivial, without
> having to use regular expressions, which would be overkill in this
> case.

Sorry for being inexact - the definitions you proposed do fit the bill.

Mark.
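
For reference, the three definitions agreed on above can be sketched in
Python with plain string methods and no regular expressions. This is only
a minimal sketch under stated assumptions: the names parse_entries and
SAMPLE are made up for illustration, the sample text is the listing from
the original post with the quote markers removed, and each entry is
assumed to be a line of hyphens followed by exactly two data lines.

```python
# Hypothetical sample: the listing from the original post, unquoted.
SAMPLE = """\
Using unit cell orientation matrix from collect.rmat
NOTICE: Performing automatic cell standardization
The following database entries have similar unit cells:
Refcode     Sumformula
      <Conventional cell parameters>
------------------------------------------
QEXZUO     C26 H31 N1 O3
         6.164   15.892   22.551    90.00    90.00    90.00
------------------------------------------
ARQTYD     C19 H23 N1 O5
         6.001   15.227   22.558    90.00    90.00    90.00
------------------------------------------
NHDIIS     C45 H40 Cl2
         6.532   15.147   22.453    90.00    90.00    90.00 """


def parse_entries(text):
    """Return a list of (refcode, formula, cell) tuples.

    Illustrative helper, not from the thread. It assumes each entry is
    a line of hyphens followed by two data lines.
    """
    lines = text.splitlines()
    entries = []
    # Stop two lines early so lines[i + 2] always exists.
    for i, line in enumerate(lines[:-2]):
        stripped = line.strip()
        # A separator is a non-empty line made up only of hyphens.
        if not stripped or stripped.strip("-"):
            continue
        # Bit 1: first non-space token on the line after the hyphens.
        # Bit 2: the rest of that line, leading/trailing spaces stripped.
        fields = lines[i + 1].split(None, 1)
        if not fields:
            continue
        refcode = fields[0]
        formula = fields[1].strip() if len(fields) > 1 else ""
        # Bit 3: the following line, stripped, internal spacing kept.
        cell = lines[i + 2].strip()
        entries.append((refcode, formula, cell))
    return entries


if __name__ == "__main__":
    for refcode, formula, cell in parse_entries(SAMPLE):
        print(refcode, "|", formula, "|", cell)
```

Each tuple maps directly onto one row of an HTML table. If the cell
parameters are needed as separate columns, calling cell.split() on the
third item breaks them into six strings.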