Simple Text Processing Help

patrick.waldo at gmail.com patrick.waldo at gmail.com
Mon Oct 15 06:43:19 EDT 2007


>     lines = open('your_file.txt').readlines()[:4]
>     print lines
>     print map(len, lines)

gave me:
['\xef\xbb\xbf200-720-7        69-93-2\n',
 'kyselina mo\xc4\x8dov\xc3\xa1      C5H4N4O3\n', '\n', '200-001-8\t50-00-0\n']
[28, 32, 1, 18]

I think it means that I'm still at option 3.  I got the line by line
part.  My code is a lot cleaner now:

import codecs

path = "c:\\text_samples\\chem_1_utf8.txt"
path2 = "c:\\text_samples\\chem_2.txt"
input = codecs.open(path, 'r','utf8')
output = codecs.open(path2, 'w', 'utf8')

for line in input:
    tokens = line.strip().split()
    # this doesn't seem to combine the files correctly
    tokens[2:-1] = [u' '.join(tokens[2:-1])]
    # this does put '|' in between
    file = u'|'.join(tokens)
    print file + u'\n'
    output.write(file + u'\r\n')

input.close()
output.close()
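
For reference, the repr output at the top of this message shows fields separated either by a tab or by a run of several spaces, while the space inside the chemical name is a single one. If that holds for the whole file (an assumption, not something the sample proves), splitting on "a tab or two or more blanks" instead of on any whitespace keeps names in one piece. A minimal Python 3 sketch; the names FIELD_SEP and split_fields are made up for illustration:

```python
import re

# Assumption: fields are separated by a tab or by two or more blanks,
# and a single space only ever occurs inside a chemical name.
FIELD_SEP = re.compile(r'[ \t]{2,}|\t')

def split_fields(line):
    """Split one line into fields, keeping single-spaced names intact."""
    stripped = line.strip()
    return FIELD_SEP.split(stripped) if stripped else []

print(split_fields('kyselina močová      C5H4N4O3\n'))
print(split_fields('200-001-8\t50-00-0\n'))
```

This only handles splitting within one line; records that span several lines still have to be stitched together separately.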

My sample input file looks like this (not organized, as you can see):
200-720-7        69-93-2
kyselina močová      C5H4N4O3

200-001-8	50-00-0
formaldehyd      CH2O

200-002-3
50-01-1
guanidínium-chlorid      CH5N3.ClH

etc...

and after the program I get:

200-720-7|69-93-2|
kyselina|močová||C5H4N4O3

200-001-8|50-00-0|
formaldehyd|CH2O|

200-002-3|
50-01-1|
guanidínium-chlorid|CH5N3.ClH|

etc...
So, I am sort of back at the start again.

If I add:

tokens = line.strip().split()
for token in tokens:
    print token

I get all the single tokens, which I thought I could then put
together, except when I did:

for token in tokens:
     s = u'|'.join(token)
     print s

I got ?|2|0|0|-|7|2|0|-|7, etc...
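
For reference, a small sketch of what is probably happening here. str.join iterates over whatever it is given, so passing a single token (a string) joins its characters, while passing the list of tokens joins the tokens. The leading "?" is most likely the UTF-8 byte order mark that appears as '\xef\xbb\xbf' in the repr output earlier; decoding with the 'utf-8-sig' codec is one way to drop it. The bytes below are copied from that output, and Python 3 is assumed:

```python
# Python 3 sketch; raw is the first line from the repr output above.
raw = b'\xef\xbb\xbf200-720-7        69-93-2\n'

# 'utf-8' keeps the byte order mark as U+FEFF; 'utf-8-sig' strips it.
with_bom = raw.decode('utf-8')
without_bom = raw.decode('utf-8-sig')

tokens = without_bom.split()

# Joining ONE token iterates over its characters.
per_char = '|'.join(tokens[0])
# Joining the LIST of tokens gives the intended result.
per_token = '|'.join(tokens)

print(repr(with_bom[0]))  # the byte order mark
print(per_char)           # 2|0|0|-|7|2|0|-|7
print(per_token)          # 200-720-7|69-93-2
```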

How can I join these together into nice neat little lines?  When I try
to store the tokens in a list, the tokens double and I don't know
why.  I can work on getting the chemical names together after...baby
steps, or maybe I am just missing something obvious.  The first two
numbers will always follow the same pattern: three digits-three
digits-one digit, and then two digits-two digits-one digit.
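
The two number patterns described above can be written as regular expressions. This is a sketch that takes the description literally (Python 3 assumed, helper names made up for illustration). Note that CAS numbers in general can have a longer first group, so the second pattern may need widening for other data:

```python
import re

# Patterns exactly as described in the post: ddd-ddd-d, then dd-dd-d.
EINECS = re.compile(r'\d{3}-\d{3}-\d')
CAS = re.compile(r'\d{2}-\d{2}-\d')

def is_einecs(token):
    """True if the whole token looks like an EINECS number."""
    return EINECS.fullmatch(token) is not None

def is_cas(token):
    """True if the whole token looks like a CAS number (as described)."""
    return CAS.fullmatch(token) is not None

for token in ['200-720-7', '69-93-2', 'kyselina']:
    print(token, is_einecs(token), is_cas(token))
```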

My intuition tells me that I need to add an if statement that says: if
the first two numbers follow the pattern, then continue; if they don't
(i.e. a chemical name was accidentally split apart), then the third
entry needs to be put together.  Something like
if tokens.startswith('pattern') == true
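
A note on the idea above: startswith is a string method (lists do not have it) and it compares literal prefixes, so a regular expression is closer to the intent. One possible sketch, not necessarily the best approach: ignore line breaks entirely, walk through all tokens, and start a new record whenever a token looks like an EINECS number. It assumes every record has all four fields, that the formula is the last token and contains no spaces, and that any byte order mark has already been stripped. Python 3 is assumed and group_records is a made-up name:

```python
import re

# Assumption: an EINECS number (ddd-ddd-d) always starts a new record,
# and the last token of each record is the formula.
EINECS = re.compile(r'^\d{3}-\d{3}-\d$')

def group_records(text):
    """Return one 'EINECS|CAS|name|formula' string per record in text."""
    records, current = [], []
    for token in text.split():
        if EINECS.match(token) and current:
            records.append(current)
            current = []
        current.append(token)
    if current:
        records.append(current)
    # Everything between the CAS number and the formula is the name.
    return ['|'.join([r[0], r[1], ' '.join(r[2:-1]), r[-1]])
            for r in records]

sample = (
    '200-720-7        69-93-2\n'
    'kyselina močová      C5H4N4O3\n'
    '\n'
    '200-002-3\n'
    '50-01-1\n'
    'guanidínium-chlorid      CH5N3.ClH\n'
)
for line in group_records(sample):
    print(line)
```

Records with fewer than four tokens would come out wrong or raise an IndexError, so real data probably needs a length check before joining.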


Again, thanks so much.  I've gone to http://gnosis.cx/TPiP/ and I have
a couple O'Reilly books, but they don't seem to have a straightforward
example for this kind of text manipulation.

Patrick


On Oct 14, 11:17 pm, John Machin <sjmac... at lexicon.net> wrote:
> On Oct 14, 11:48 pm, patrick.wa... at gmail.com wrote:
>
>
>
> > Hi all,
>
> > I started Python just a little while ago and I am stuck on something
> > that is really simple, but I just can't figure out.
>
> > Essentially I need to take a text document with some chemical
> > information in Czech and organize it into another text file.  The
> > information is always EINECS number, CAS, chemical name, and formula
> > in tables.  I need to organize them into lines with | in between.  So
> > it goes from:
>
> > 200-763-1                     71-73-8
> > nátrium-tiopentál           C11H18N2O2S.Na           to:
>
> > 200-763-1|71-73-8|nátrium-tiopentál|C11H18N2O2S.Na
>
> > but if I have a chemical like: kyselina močová
>
> > I get:
> > 200-720-7|69-93-2|kyselina|močová
> > |C5H4N4O3|200-763-1|71-73-8|nátrium-tiopentál
>
> > and then it is all off.
>
> > How can I get Python to realize that a chemical name may have a space
> > in it?
>
> Your input file could be in one of THREE formats:
> (1) fields are separated by TAB characters (represented in Python by
> the escape sequence '\t', and equivalent to '\x09')
> (2) fields are fixed width and padded with spaces
> (3) fields are separated by a random number of whitespace characters
> (and can contain spaces).
>
> What makes you sure that you have format 3? You might like to try
> something like
>     lines = open('your_file.txt').readlines()[:4]
>     print lines
>     print map(len, lines)
> This will print a *precise* representation of what is in the first
> four lines, plus their lengths. Please show us the output.




