Simple Text Processing Help

Sun Oct 14 10:01:32 EDT 2007

On Sun, 14 Oct 2007 13:48:51 +0000, patrick.waldo wrote:

> Essentially I need to take a text document with some chemical
> information in Czech and organize it into another text file.  The
> information is always EINECS number, CAS, chemical name, and formula
> in tables.  I need to organize them into lines with | in between.  So
> it goes from:
> 
> 200-763-1                     71-73-8
> nátrium-tiopentál           C11H18N2O2S.Na           to:

Is that in *one* line in the input file or two lines like shown here?

> 200-763-1|71-73-8|nátrium-tiopentál|C11H18N2O2S.Na
> 
> but if I have a chemical like: kyselina močová
> 
> I get:
> 200-720-7|69-93-2|kyselina|močová
> |C5H4N4O3|200-763-1|71-73-8|nátrium-tiopentál
> 
> and then it is all off.
> 
> How can I get Python to realize that a chemical name may have a space
> in it?

If the two elements before the name and the one element after it can't
contain spaces, it is easy:  take the first two and the last as they are,
and for the name take everything from the third up to the next-to-last
element and join those with a space.

In [202]: parts = '123 456 a name with spaces 789'.split()

In [203]: parts[0]
Out[203]: '123'

In [204]: parts[1]
Out[204]: '456'

In [205]: ' '.join(parts[2:-1])
Out[205]: 'a name with spaces'

In [206]: parts[-1]
Out[206]: '789'

This works too if the name doesn't have a space in it:

In [207]: parts = '123 456 name 789'.split()

In [208]: parts[0]
Out[208]: '123'

In [209]: parts[1]
Out[209]: '456'

In [210]: ' '.join(parts[2:-1])
Out[210]: 'name'

In [211]: parts[-1]
Out[211]: '789'
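
The slicing shown in the sessions above could be wrapped in a small
function.  This is only a sketch: the name `split_record` is made up for
illustration, and it assumes that a whole record sits on *one* line and
that only the chemical name may contain spaces.

```python
def split_record(line):
    """Split one whitespace-separated record into four fields.

    Assumes the layout: EINECS number, CAS number, name, formula,
    where only the name may contain spaces.
    """
    parts = line.split()
    if len(parts) < 4:
        raise ValueError("expected at least 4 fields: %r" % (line,))
    einecs = parts[0]
    cas = parts[1]
    # Everything between the second and the last element is the name.
    name = " ".join(parts[2:-1])
    formula = parts[-1]
    return [einecs, cas, name, formula]


record = "200-720-7 69-93-2 kyselina močová C5H4N4O3"
print("|".join(split_record(record)))
# prints: 200-720-7|69-93-2|kyselina močová|C5H4N4O3
```

If the records really are spread over two lines in the input, the lines
would have to be paired up first before calling something like this.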

> #read and enter into a list
> chem_file = []
> chem_file.append(input.read())

This reads the whole file as one string and appends that string to the
list.  This list will *always* contain just *one* element.  So why a list
at all!?

> #split words and store them in a list
> for word in chem_file:
>     words = word.split()

*If* the list contained more than one element, all of them would be
processed, but only the result for the last one would stay bound to
`words`.  You could leave out `chem_file` and the loop and simply do:

words = input.read().split()

Same effect but less chatty.  ;-)

The rest of the source seems to indicate that you don't really want to read
in the whole input file at once but process it line by line, i.e. one
chemical record at a time.
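
Processing line by line might look roughly like the following.  It is a
sketch under assumptions, not your actual program: the function name
`convert` is invented, each record is assumed to be on a single line, and
`io.StringIO` objects stand in for the real input and output files so the
example runs on its own.

```python
import io


def convert(infile, outfile):
    """Write each input record as one '|'-separated line."""
    for line in infile:
        parts = line.split()
        if not parts:
            continue  # Skip blank lines.
        if len(parts) < 4:
            raise ValueError("expected at least 4 fields: %r" % (line,))
        # First two fields, the (possibly multi-word) name, last field.
        fields = parts[:2] + [" ".join(parts[2:-1]), parts[-1]]
        outfile.write("|".join(fields) + "\n")


# Stand-ins for real files opened with open(...).
source = io.StringIO(
    "200-763-1 71-73-8 nátrium-tiopentál C11H18N2O2S.Na\n"
    "\n"
    "200-720-7 69-93-2 kyselina močová C5H4N4O3\n"
)
target = io.StringIO()
convert(source, target)
print(target.getvalue(), end="")
```

Iterating over the file object reads one line at a time, so the whole file
never has to be held in memory.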

Ciao,
	Marc 'BlackJack' Rintsch