Simple Text Processing Help

Sun Oct 14 10:25:13 EDT 2007

On Oct 14, 2:48 pm, patrick.wa... at gmail.com wrote:
> Hi all,
>
> I started Python just a little while ago and I am stuck on something
> that is really simple, but I just can't figure out.
>
> Essentially I need to take a text document with some chemical
> information in Czech and organize it into another text file.  The
> information is always EINECS number, CAS, chemical name, and formula
> in tables.  I need to organize them into lines with | in between.  So
> it goes from:
>
> 200-763-1                     71-73-8
> nátrium-tiopentál           C11H18N2O2S.Na           to:
>
> 200-763-1|71-73-8|nátrium-tiopentál|C11H18N2O2S.Na
>
> but if I have a chemical like: kyselina močová
>
> I get:
> 200-720-7|69-93-2|kyselina|močová
> |C5H4N4O3|200-763-1|71-73-8|nátrium-tiopentál
>
> and then it is all off.
>
> How can I get Python to realize that a chemical name may have a space
> in it?

In the original file, is every chemical on a line of its own? I assume
it is here.

You might use a regexp (look at the re module), but I think here you
can use the fact that only chemical names have spaces in them. You can
split each line on whitespace (as you're doing now), then join back
together all the words from the 3rd (i.e. index 2) up to, but not
including, the last (i.e. index -1) using
tokens[2:-1] = [u' '.join(tokens[2:-1])]. This uses the somewhat
unusual Python syntax for replacing a section of a list with another
list.
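
To make the slice assignment concrete, here is a small sketch run on
the uric acid line from your example. merge_name is just a name I've
made up for illustration, and I'm assuming all four fields sit on one
line.

```python
# -*- coding: utf-8 -*-

def merge_name(line):
    """Split a line on whitespace and merge the middle words into one name."""
    tokens = line.split()
    # Replace the slice from index 2 up to (not including) the last item
    # with a one-element list holding the re-joined chemical name.
    tokens[2:-1] = [u' '.join(tokens[2:-1])]
    return tokens

tokens = merge_name(u"200-720-7   69-93-2   kyselina močová   C5H4N4O3")
joined = u'|'.join(tokens)  # EINECS|CAS|name|formula
print(joined)
```

A name with no spaces goes through the same code unchanged, because
the slice then holds a single word.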

The approach you took involves reading the whole file, and building a
list of all the chemicals which you don't seem to use: I've changed it
to a per-line version and removed the big lists.

import codecs

path = "c:\\text_samples\\chem_1_utf8.txt"
path2 = "c:\\text_samples\\chem_2.txt"
infile = codecs.open(path, 'r', 'utf8')
outfile = codecs.open(path2, 'w', 'utf8')

for line in infile:
    tokens = line.strip().split()
    if not tokens:
        continue  # skip blank lines
    # Merge everything between the CAS number and the formula into one name.
    tokens[2:-1] = [u' '.join(tokens[2:-1])]
    chemical = u'|'.join(tokens)
    print(chemical + u'\n')
    outfile.write(chemical + u'\r\n')

infile.close()
outfile.close()

Obviously, this isn't tested because I don't have your chem_1_utf8.txt
file.
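
For comparison, the regexp route I mentioned could look roughly like
this. The pattern and the names FIELDS and split_fields are my own
guesses at what fits your data (four fields on one line, with only the
chemical name containing spaces), so treat it as a sketch rather than
something checked against your file.

```python
# -*- coding: utf-8 -*-
import re

# EINECS number, CAS number, name (may contain spaces), formula.
FIELDS = re.compile(r'^\s*(\S+)\s+(\S+)\s+(.+?)\s+(\S+)\s*$', re.UNICODE)

def split_fields(line):
    """Return the four fields joined with |, or None if the line doesn't fit."""
    match = FIELDS.match(line)
    if match is None:
        return None
    return u'|'.join(match.groups())
```

Unlike the slice version, this returns None for a line with fewer than
four fields, so you can decide what to do with such lines instead of
writing them out.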

--
Paul Hankin