Regular Expression
patrick.waldo at gmail.com
Tue Oct 23 18:13:37 EDT 2007
This is related to my last post (see:
http://groups.google.com/group/comp.lang.python/browse_thread/thread/c333cbbb5d496584/998af2bb2ca10e88#998af2bb2ca10e88)
I have a text file with an EINECS number, a CAS number, a Chemical
Name, and a Chemical Formula, always in this order. However, I
realized as I ran my script that I had entries like
274-989-4
70892-58-9
diazotovaná kyselina 4-
aminobenzénsulfónová, kopulovaná s
farbiarskym morušovým (Chlorophora
tinctoria) extraktom, komplexy so
železom
komplexy železa s produktami
kopulácie diazotovanej kyseliny 4-
aminobenzénsulfónovej s látkou
registrovanou v Indexe farieb pod
identifikačným číslom Indexu farieb,
C.I. 75240.
which become
274-989-4|70892-58-9|diazotovaná kyselina 4- aminobenzénsulfónová,
kopulovaná s farbiarskym morušovým (Chlorophora tinctoria) extraktom,
komplexy so železom komplexy železa s produktami kopulácie
diazotovanej kyseliny 4- aminobenzénsulfónovej s látkou registrovanou
v Indexe farieb pod identifikačným číslom Indexu farieb, C.I.|75240.
The "C.I. 75240." at the end is not a chemical formula; this entry
simply has none. So I want to add a regular expression for the
chemical formula and use it in an if statement, so that the script
moves on when an entry has no chemical formula. However, I must be
getting confused by the regular expression tutorials I've been reading.
Any ideas?
Original Code:
# For text files in a directory...
# Analyzes a randomly organized UTF8 document with EINECS, CAS,
# Chemical, and Chemical Formula
# into a document structured as EINECS|CAS|Chemical|Chemical Formula.
import os
import codecs
import re

path = "C:\\text_samples\\text"            # folder with all text files
path2 = "C:\\text_samples\\text\\output"   # output of all text files
NR_RE = re.compile(r'^\d+-\d+-\d+$')       # pattern for EINECS number

def iter_elements(tokens):
    product = []
    for tok in tokens:
        if NR_RE.match(tok) and len(product) >= 4:
            product[2:-1] = [' '.join(product[2:-1])]
            yield product
            product = []
        product.append(tok)
    yield product

for text in os.listdir(path):
    input_text = os.path.join(path,text)
    output_text = os.path.join(path2,text)
    input = codecs.open(input_text, 'r','utf8')
    output = codecs.open(output_text, 'w', 'utf8')
    tokens = input.read().split()
    for element in iter_elements(tokens):
        #print '|'.join(element)
        output.write('|'.join(element))
        output.write("\r\n")
    input.close()
    output.close()
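
For what it's worth, here is a rough sketch of the kind of check I have in mind. It is only a guess at a workable pattern, not a tested solution: it assumes a formula is made of element symbols (a capital letter, an optional lower case letter, an optional count) in groups separated by single dots. FORMULA_RE and finish_product are names made up for this sketch, and the second sample entry uses placeholder numbers.

```python
import re

# Assumption: a formula is one or more groups of element symbols
# (capital letter, optional lowercase letter, optional count),
# with the groups separated by single dots, e.g. CH5N3.ClH.
FORMULA_RE = re.compile(r'^(?:[A-Z][a-z]?\d*)+(?:\.(?:[A-Z][a-z]?\d*)+)*$')

def finish_product(product):
    """Return [EINECS, CAS, name, formula]; formula is '' when absent.

    Hypothetical helper, not part of the original script.
    """
    numbers, rest = product[:2], product[2:]
    if rest and FORMULA_RE.match(rest[-1]):
        name_tokens, formula = rest[:-1], rest[-1]
    else:
        # No formula found: every remaining token belongs to the name.
        name_tokens, formula = rest, ''
    return numbers + [' '.join(name_tokens), formula]

# Tokens from the end of the entry quoted above (name shortened).
no_formula = ['274-989-4', '70892-58-9', 'farieb,', 'C.I.', '75240.']
print('|'.join(finish_product(no_formula)))

# Made-up entry with placeholder numbers and a formula from my last post.
with_formula = ['000-000-0', '00-00-0', 'example', 'name', 'C5H4N4O3']
print('|'.join(finish_product(with_formula)))
```

In iter_elements, a call like product = finish_product(product) could stand in for the slice assignment before each yield. The pattern would still misread a name whose last word happens to look like an element symbol (a lone "H" or "Na", say), and it does not cover formulas written with parentheses.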
On Oct 23, 5:03 pm, Paul McGuire <pt... at austin.rr.com> wrote:
> On Oct 22, 5:29 pm, patrick.wa... at gmail.com wrote:
>
>
>
> > Hi,
>
> > I'm trying to learn regular expressions, but I am having trouble with
> > this. I want to search a document that has mixed data; however, the
> > last line of every entry has something like C5H4N4O3 or CH5N3.ClH.
> > All of the letters are upper case and there will always be numbers and
> > possibly one .
>
> > However, the code below only gave me None.
>
> > import os, codecs, re
>
> > text = 'C:\\text_samples\\sample.txt'
> > text = codecs.open(text,'r','utf-8')
>
> > test = re.compile('\u+\d+\.')
>
> > for line in text:
> >     print test.search(line)
>
> If those are chemical symbols, then I guarantee that there will be
> lower case letters in the expression (like the "l" in "ClH").
>
> -- Paul
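
To see what that means for a pattern, here is a small sketch using one of the formulas from the quoted post. Python's re module has no \u shorthand for upper case letters, so [A-Z] is used as the character class here; the variable names are made up for the example.

```python
import re

formula = 'CH5N3.ClH'  # example formula from the quoted post

# Upper case letters only: the "l" of "Cl" is skipped, so Cl is read as C.
upper_only = re.findall(r'[A-Z]\d*', formula)

# Capital letter plus optional lowercase letter, then an optional count.
symbols = re.findall(r'([A-Z][a-z]?)(\d*)', formula)

print(upper_only)
print(symbols)
```

The second pattern keeps "Cl" together as one symbol, which is why any test for a formula needs to allow a lower case letter after each capital.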