Simple Text Processing Help

patrick.waldo at gmail.com
Sun Oct 14 12:57:06 EDT 2007


Thank you both for helping me out.  I am still rather new to Python
and so I'm probably trying to reinvent the wheel here.

When I try Paul's suggestion, I get an empty list:
>>> tokens = line.strip().split()
[]

So I am not quite sure how to read the file line by line.
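
A minimal sketch of what may be going on, assuming `line` happened to
hold a blank line (an assumption; the full session is not shown):
strip().split() on an empty or whitespace-only string returns [].
Iterating over the file object yields one line at a time, so blank
lines can simply be skipped. The sample text and the io.StringIO
stand-in for the real file are made up for illustration.

```python
import io

# Hypothetical sample standing in for the real input file (assumption).
sample = io.StringIO(
    u"200-763-1 71-73-8 n\u00e1trium-tiopent\u00e1l C11H18N2O2S.Na\n"
    u"\n"
    u"200-720-7 69-93-2 kyselina mo\u010dov\u00e1 C5H4N4O3\n"
)

rows = []
for line in sample:              # a file object yields one line per iteration
    tokens = line.strip().split()
    if not tokens:               # a blank line splits to [], so skip it
        continue
    rows.append(tokens)

print(len(rows))
```

With a real file, the same loop should work over the object returned
by codecs.open().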

tokens = input.read().split() gets me all the information from the
file as one list.  tokens[2:-1] = [u' '.join(tokens[2:-1])] works just
fine, like in the example; however, how can I apply it to every record
in the document?  Also, when I try output.write(tokens), I get
"TypeError: coercing to Unicode: need string or buffer, list found".
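
A rough sketch of both points, assuming one chemical per line as Paul
does: the loop body runs once per line, and write() receives the
joined string rather than the token list (passing the list is what
raises the TypeError). format_line is a hypothetical helper name, and
io.StringIO objects stand in for the real input and output files.

```python
import io

def format_line(line):
    """Return 'EINECS|CAS|name|formula' for one line, or None if incomplete."""
    tokens = line.strip().split()
    if len(tokens) < 4:          # blank or incomplete line: nothing to format
        return None
    # Merge everything between the CAS number and the formula into the name.
    tokens[2:-1] = [u' '.join(tokens[2:-1])]
    return u'|'.join(tokens)

# Hypothetical sample data standing in for the real file (assumption).
source = io.StringIO(
    u"200-720-7 69-93-2 kyselina mo\u010dov\u00e1 C5H4N4O3\n"
    u"200-763-1 71-73-8 n\u00e1trium-tiopent\u00e1l C11H18N2O2S.Na\n"
)
target = io.StringIO()

for line in source:
    record = format_line(line)
    if record is not None:
        target.write(record + u'\n')   # write a string, not the token list

result = target.getvalue()
```

Joining the tokens before writing is the key step; the length check
is only a guard added for this sketch.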

Any ideas?

On Oct 14, 4:25 pm, Paul Hankin <paul.han... at gmail.com> wrote:
> On Oct 14, 2:48 pm, patrick.wa... at gmail.com wrote:
>
> > Hi all,
>
> > I started Python just a little while ago and I am stuck on something
> > that is really simple, but I just can't figure out.
>
> > Essentially I need to take a text document with some chemical
> > information in Czech and organize it into another text file.  The
> > information is always EINECS number, CAS, chemical name, and formula
> > in tables.  I need to organize them into lines with | in between.  So
> > it goes from:
>
> > 200-763-1                     71-73-8
> > nátrium-tiopentál           C11H18N2O2S.Na
>
> > to:
>
> > 200-763-1|71-73-8|nátrium-tiopentál|C11H18N2O2S.Na
>
> > but if I have a chemical like: kyselina močová
>
> > I get:
> > 200-720-7|69-93-2|kyselina|močová
> > |C5H4N4O3|200-763-1|71-73-8|nátrium-tiopentál
>
> > and then it is all off.
>
> > How can I get Python to realize that a chemical name may have a space
> > in it?
>
> In the original file, is every chemical on a line of its own? I assume
> it is here.
>
> You might use a regexp (look at the re module), or I think here you
> can use the fact that only chemical names have spaces in them. Then
> you can split each line on whitespace (like you're doing) and join
> back together all the words from the 3rd (i.e. index 2) up to, but
> not including, the last (i.e. index -1) using
> tokens[2:-1] = [u' '.join(tokens[2:-1])]. This uses the somewhat
> unusual Python syntax for replacing a section of a list with another
> list.
>
> The approach you took involves reading the whole file and building a
> list of all the chemicals, which you don't seem to use. I've changed
> it to a per-line version and removed the big lists.
>
> import codecs
>
> path = "c:\\text_samples\\chem_1_utf8.txt"
> path2 = "c:\\text_samples\\chem_2.txt"
> input = codecs.open(path, 'r', 'utf8')
> output = codecs.open(path2, 'w', 'utf8')
>
> for line in input:
>     # Split on whitespace, then merge the words of the chemical name
>     # (everything between the CAS number and the formula) into one token.
>     tokens = line.strip().split()
>     tokens[2:-1] = [u' '.join(tokens[2:-1])]
>     chemical = u'|'.join(tokens)
>     print(chemical + u'\n')
>     output.write(chemical + u'\r\n')
>
> input.close()
> output.close()
>
> Obviously, this isn't tested because I don't have your chem_1_utf8.txt
> file.
>
> --
> Paul Hankin
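
The regexp route Paul mentions could look roughly like the sketch
below. It assumes each line holds exactly one record whose first two
fields and last field contain no spaces; the pattern, the RECORD name
and the split_record helper are illustrative and not taken from the
thread.

```python
import re

# Two space-free fields, a name that may contain spaces, one space-free field.
RECORD = re.compile(r'^(\S+)\s+(\S+)\s+(.+?)\s+(\S+)$')

def split_record(line):
    """Return [einecs, cas, name, formula], or None if the line does not fit."""
    match = RECORD.match(line.strip())
    if match is None:
        return None
    return list(match.groups())

fields = split_record(u"200-720-7 69-93-2 kyselina mo\u010dov\u00e1 C5H4N4O3\n")
joined = u'|'.join(fields)
```

Compared with the slice assignment, the pattern rejects lines that
have fewer than four fields instead of silently reshaping them.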




