Regular Expression

Tue Oct 23 18:13:37 EDT 2007

This is related to my last post (see:
http://groups.google.com/group/comp.lang.python/browse_thread/thread/c333cbbb5d496584/998af2bb2ca10e88#998af2bb2ca10e88)

I have a text file with an EINECS number, a CAS number, a Chemical
Name, and a Chemical Formula, always in this order.  However, I
realized as I ran my script that I had entries like

274-989-4
70892-58-9
diazotovaná kyselina 4-
aminobenzénsulfónová, kopulovaná s
farbiarskym morušovým (Chlorophora
tinctoria) extraktom, komplexy so
železom
komplexy železa s produktami
kopulácie diazotovanej kyseliny 4-
aminobenzénsulfónovej s látkou
registrovanou v Indexe farieb pod
identifikačným číslom Indexu farieb,
C.I. 75240.

which become

274-989-4|70892-58-9|diazotovaná kyselina 4- aminobenzénsulfónová,
kopulovaná s farbiarskym morušovým (Chlorophora tinctoria) extraktom,
komplexy so železom komplexy železa s produktami kopulácie
diazotovanej kyseliny 4- aminobenzénsulfónovej s látkou registrovanou
v Indexe farieb pod identifikačným číslom Indexu farieb, C.I.|75240.

Here "C.I. 75240." is the end of the chemical name, not a chemical
formula; this entry has no formula at all.  So I want a regular
expression that recognizes a chemical formula, for use in an if
statement that moves on when an entry has none.  However, I must be
getting confused by the regular expression tutorials I've been
reading.

Any ideas?

Original Code:

#For text files in a directory...
#Analyzes a randomly organized UTF-8 document with EINECS, CAS,
#Chemical, and Chemical Formula into a document structured as
#EINECS|CAS|Chemical|Chemical Formula.

import os
import codecs
import re

path = "C:\\text_samples\\text"           #folder with all text files
path2 = "C:\\text_samples\\text\\output"  #output of all text files

NR_RE = re.compile(r'^\d+-\d+-\d+$')      #pattern for EINECS number

def iter_elements(tokens):
    product = []
    for tok in tokens:
        if NR_RE.match(tok) and len(product) >= 4:
            product[2:-1] = [' '.join(product[2:-1])]
            yield product
            product = []
        product.append(tok)
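    #the last entry needs the same name-joining as the others
    if len(product) >= 4:
        product[2:-1] = [' '.join(product[2:-1])]
    if product:
        yield product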

for text in os.listdir(path):
    input_text = os.path.join(path, text)
    if not os.path.isfile(input_text):
        continue                          #skip the output folder
    output_text = os.path.join(path2, text)
    input = codecs.open(input_text, 'r', 'utf8')
    output = codecs.open(output_text, 'w', 'utf8')
    tokens = input.read().split()
    for element in iter_elements(tokens):
        #print '|'.join(element)
        output.write('|'.join(element))
        output.write("\r\n")
    #close inside the loop so every file is closed, not just the last
    input.close()
    output.close()
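
One way the "is the last token a formula?" check might look.  This is
only a sketch: it assumes a formula is a run of element symbols (an
upper-case letter, an optional lower-case letter, an optional count),
with further groups joined by ".".  FORMULA_RE and split_entry are
hypothetical names, and the first sample entry uses made-up
placeholder numbers.

```python
import re

# Assumed shape of a formula token: one or more element symbols, each
# with an optional count, then optional "."-joined groups (an optional
# leading count on a group covers things like hydrates).
FORMULA_RE = re.compile(
    r'^(?:[A-Z][a-z]?\d*)+(?:\.\d*(?:[A-Z][a-z]?\d*)+)*$')

def split_entry(tokens):
    """Hypothetical helper: turn one entry's tokens into
    [EINECS, CAS, name, formula]; formula is '' when there is none."""
    einecs, cas, rest = tokens[0], tokens[1], tokens[2:]
    if len(rest) > 1 and FORMULA_RE.match(rest[-1]):
        name, formula = rest[:-1], rest[-1]
    else:
        # Last token is not a formula, so it belongs to the name.
        name, formula = rest, ''
    return [einecs, cas, ' '.join(name), formula]

# Placeholder numbers, formula taken from the earlier post.
with_formula = split_entry(
    ['000-000-0', '00-00-0', 'some', 'name', 'CH5N3.ClH'])
# Tail of the problem entry: "75240." must stay in the name.
without_formula = split_entry(
    ['274-989-4', '70892-58-9', 'farieb,', 'C.I.', '75240.'])
```

The pattern is deliberately loose: a short all-caps word such as "CO"
or "I" at the end of a name would still be taken for a formula, so the
output would need a spot check.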

On Oct 23, 5:03 pm, Paul McGuire <pt... at austin.rr.com> wrote:
> On Oct 22, 5:29 pm, patrick.wa... at gmail.com wrote:
>
>
>
> > Hi,
>
> > I'm trying to learn regular expressions, but I am having trouble with
> > this.  I want to search a document that has mixed data; however, the
> > last line of every entry has something like C5H4N4O3 or CH5N3.ClH.
> > All of the letters are upper case and there will always be numbers and
> > possibly one .
>
> > However below only gave me none.
>
> > import os, codecs, re
>
> > text = 'C:\\text_samples\\sample.txt'
> > text = codecs.open(text,'r','utf-8')
>
> > test = re.compile('\u+\d+\.')
>
> > for line in text:
> >     print test.search(line)
>
> If those are chemical symbols, then I guarantee that there will be
> lower case letters in the expression (like the "l" in "ClH").
>
> -- Paul
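
To illustrate the point about lower-case letters: Python's re module
has no "\u" shorthand for an upper-case letter, so the character class
has to be spelled out.  A minimal sketch comparing two assumed classes
(both deliberately loose, and both names hypothetical):

```python
import re

UPPER_ONLY = re.compile(r'^[A-Z0-9.]+$')     # the "all upper case" assumption
WITH_LOWER = re.compile(r'^[A-Za-z0-9.]+$')  # also accepts symbols like "Cl"

samples = ['C5H4N4O3', 'CH5N3.ClH']

# Only the first sample survives the upper-case-only class, because of
# the lower-case "l" in "ClH"; both survive once [a-z] is allowed.
upper_hits = [s for s in samples if UPPER_ONLY.match(s)]
lower_hits = [s for s in samples if WITH_LOWER.match(s)]
```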