Simple Text Processing Help

John Machin sjmachin at lexicon.net
Sun Oct 14 17:17:12 EDT 2007


On Oct 14, 11:48 pm, patrick.wa... at gmail.com wrote:
> Hi all,
>
> I started Python just a little while ago and I am stuck on something
> that is really simple, but I just can't figure out.
>
> Essentially I need to take a text document with some chemical
> information in Czech and organize it into another text file.  The
> information is always EINECS number, CAS, chemical name, and formula
> in tables.  I need to organize them into lines with | in between.  So
> it goes from:
>
> 200-763-1                     71-73-8
> nátrium-tiopentál           C11H18N2O2S.Na           to:
>
> 200-763-1|71-73-8|nátrium-tiopentál|C11H18N2O2S.Na
>
> but if I have a chemical like: kyselina močová
>
> I get:
> 200-720-7|69-93-2|kyselina|močová
> |C5H4N4O3|200-763-1|71-73-8|nátrium-tiopentál
>
> and then it is all off.
>
> How can I get Python to realize that a chemical name may have a space
> in it?
>

Your input file could be in one of THREE formats:
(1) fields are separated by TAB characters (represented in Python by
the escape sequence '\t', and equivalent to '\x09')
(2) fields are fixed width and padded with spaces
(3) fields are separated by a variable number of whitespace characters
(and the fields themselves can contain spaces).
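
To make the difference concrete, here is a rough sketch of how each of
the three cases above would be split. The function names and the column
widths are made up for illustration, and the format (3) splitter only
works under the assumption that fields are separated by at least two
whitespace characters while a name contains only single spaces. If your
file does not satisfy that, format (3) is ambiguous and cannot be split
reliably.

```python
import re

def split_tabbed(line):
    # Format 1: exactly one TAB between fields.
    return line.rstrip('\r\n').split('\t')

def split_fixed(line, widths):
    # Format 2: slice at known column widths, then drop the padding.
    # The widths are hypothetical; they must be read off the real file.
    fields = []
    start = 0
    for width in widths:
        fields.append(line[start:start + width].strip())
        start += width
    return fields

def split_on_gaps(line):
    # Format 3, ASSUMING a gap of two or more whitespace characters
    # between fields and only single spaces inside a field.
    return re.split(r'\s{2,}', line.strip())
```

With any of these, '|'.join(fields) then gives one output line, and a
name such as "kyselina močová" stays in one piece.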

What makes you sure that you have format 3? You might like to try
something like
    # Look at the raw content of the first four lines.
    lines = open('your_file.txt').readlines()[:4]
    print(lines)
    print([len(line) for line in lines])
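This will print a *precise* representation of what is in the first
four lines (printing a list shows the repr() of each string, so a TAB
appears as '\t' and the trailing newline as '\n'), plus their lengths.
Please show us the output.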
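
If you want a first guess before posting, the same two clues (TABs and
line lengths) can be checked mechanically. This is only a heuristic
sketch with a made-up function name, not a reliable detector: a
fixed-width file whose trailing spaces were stripped by an editor, for
example, would be misclassified.

```python
def guess_format(lines):
    # Ignore blank lines and drop the line terminators before measuring.
    stripped = [line.rstrip('\r\n') for line in lines if line.strip()]
    # Any TAB at all suggests format (1).
    if any('\t' in line for line in stripped):
        return 'tab-separated'
    # Identical lengths on every line suggest padded, fixed-width fields.
    if len(stripped) > 1 and len(set(len(line) for line in stripped)) == 1:
        return 'fixed-width'
    # Otherwise assume fields separated by runs of whitespace.
    return 'whitespace-separated'
```

Treat the result as a hint and still look at the printed lines yourself.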



