[Tutor] Iterating over a long list with regular expressions and changing each item?
Dan Liang
danliang20 at gmail.com
Mon May 4 16:15:35 CEST 2009
Hi Spir and tutors,
Thank you Spir for your response. I went ahead and tried your code after
adding a couple of dictionary entries, as below:
-----------Code Begins---------------
#!usr/bin/python
tags = {
'case_def_gen':['case_def','gen','null'],
'nsuff_fem_pl':['nsuff','null', 'null'],
'abbrev': ['abbrev, null, null'],
'adj': ['adj, null, null'],
'adv': ['adv, null, null'],} # tag dict
TAB = '\t'
def newlyTaggedWord(line):
    (word,tag) = line.split(TAB) # separate parts of line, keeping data only
    new_tags = tags['tag'] # read in dict--Index by string
    tagging = TAB.join(new_tags) # join with TABs
    return word + TAB + tagging # formatted result
def replaceTagging(source_name, target_name):
    source_file = file(source_name, 'r')
    source = source_file.read() # not really necessary
    target_file = open(target_name, "w")
    # replacement loop
    for line in source:
        new_line = newlyTaggedWord(line) + '\n'
        target_file.write(new_line)
    source_file.close()
    target_file.close()
if __name__ == "__main__":
    source_name = sys.argv[1]
    target_name = sys.argv[2]
    replaceTagging(source_name, target_name)
-----------Code Ends---------------
The file I am working on looks like this:
word \t case_def_gen
word \t nsuff_fem_pl
word \t adj
word \t abbrev
word \t adv
I get the following error when I try to run it, and I cannot figure out
where the problem lies:
-----------Error Begins---------------
Traceback (most recent call last):
  File "tag.formatter.py", line 36, in ?
    replaceTagging(source_name, target_name)
  File "tag.formatter.py", line 28, in replaceTagging
    new_line = newlyTaggedWord(line) + '\n'
  File "tag.formatter.py", line 16, in newlyTaggedWord
    (word,tag) = line.split(TAB) # separate parts of line, keeping data only
ValueError: unpack list of wrong size
-----------Error Ends---------------
Any ideas?
Thank you!
--dan
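
A possible reading of the traceback above, offered as a sketch rather than a definitive fix: source_file.read() returns the whole file as one string, so "for line in source" hands newlyTaggedWord one character at a time, and splitting a single character on TAB gives a one-item list that cannot be unpacked into (word, tag). Two further things would likely fail next: tags['tag'] looks up the literal string 'tag' instead of the variable tag, and entries such as ['abbrev, null, null'] hold one string rather than three. The version below assumes Python 3 (open() and "with" instead of the file() builtin used in the script above), strips the spaces around the tab shown in the sample file, and skips blank lines; the snake_case names are illustrative.

```python
TAB = '\t'

# Tag table: each value is a list of three separate strings
# (the entries for abbrev, adj and adv are split into three items here).
tags = {
    'case_def_gen': ['case_def', 'gen', 'null'],
    'nsuff_fem_pl': ['nsuff', 'null', 'null'],
    'abbrev': ['abbrev', 'null', 'null'],
    'adj': ['adj', 'null', 'null'],
    'adv': ['adv', 'null', 'null'],
}

def newly_tagged_word(line):
    """Turn one 'word<TAB>tag' line into 'word<TAB>sub1<TAB>sub2<TAB>sub3'."""
    word, tag = line.rstrip('\n').split(TAB)
    # Assumption: spaces around the tab are not significant, so drop them.
    word, tag = word.strip(), tag.strip()
    new_tags = tags[tag]  # index by the variable, not the literal 'tag'
    return word + TAB + TAB.join(new_tags)

def replace_tagging(source_name, target_name):
    """Rewrite every non-blank line of source_name into target_name."""
    with open(source_name) as source_file, open(target_name, 'w') as target_file:
        for line in source_file:  # iterating the file yields lines, not characters
            if line.strip():      # assumption: blank lines can be skipped
                target_file.write(newly_tagged_word(line) + '\n')
```

It could be called as replace_tagging(sys.argv[1], sys.argv[2]) after "import sys". A tag that is missing from the table raises KeyError, which at least points at the offending line instead of silently passing it through.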
From: Dan Liang <danliang20 at gmail.com>
Subject: [Tutor] Iterating over a long list with regular expressions and changing each item?
To: tutor at python.org
Message-ID: <a0e59afb0905031859k1d54bddck91955eb5b90ae501 at mail.gmail.com>
>
> Hi tutors,
>
> I am working on a file and need to replace each occurrence of a certain
> label (part of speech tag in this case) by a number of sub-labels. The file
> has the following format:
>
> word1 \t Tag1
> word2 \t Tag2
> word3 \t Tag3
>
> Now the tags are complex and I wanted to split them in a tab-delimited
> fashion to have this:
>
> word1 \t Tag1Part1 \t Tag2Part2 \t Tag3Part3
>
> I searched online for a solution and found the code below, which uses a
> dictionary to store the tags that I want to replace as keys and the sub-tags
> as values. The problem with this is that it sometimes replaces tags that are
> not surrounded by spaces, which I do not want to happen. Also, I wanted each
> new sub-tag to be followed by a tab, so that the new items that I end up
> having in my file are tab-delimited. For this, I put tabs between the items
> of each value in the dictionary. I started thinking that this is not the
> best solution to the problem and that perhaps a script that uses regular
> expressions would be better. Since I am new to Python, I thought I should
> ask you for your thoughts on the best solution. There are about 150 items I
> want to replace, and I did not know how to iterate over them with regular
> expressions. Below is my previous code:
>
>
> #!usr/bin/python
>
> import re, sys
> f = file(sys.argv[1])
> readed= f.read()
>
> def replace_words(text, word_dic):
>     for k, v in word_dic.iteritems():
>         text = text.replace(k, v)
>     return text
>
> # the dictionary has target_word:replacement_word pairs
>
> word_dic = {
> 'abbrev': 'abbrev null null',
> 'adj': 'adj null null',
> 'adv': 'adv null null',
> 'case_def_acc': 'case_def acc null',
> 'case_def_gen': 'case_def gen null',
> 'case_def_nom': 'case_def nom null',
> 'case_indef_acc': 'case_indef acc null',
> 'verb_part': 'verb_part null null'}
>
>
> # call the function and get the changed text
>
> myString = replace_words(readed, word_dic)
>
>
> fout = open(sys.argv[2], "w")
> fout.write(myString)
> fout.close()
>
> --dan
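
On the regular-expression question in the quoted message, one option is a single compiled pattern that matches a tag only when it stands alone between whitespace (or at the very start or end of the text), so that 'adv' inside a longer token is left untouched. This is a minimal sketch assuming Python 3; word_dic here is a shortened, hypothetical subset with tab-separated values, and the lookarounds (?<!\S) and (?!\S) are one design choice among several, not part of the recipe quoted above.

```python
import re

# Hypothetical subset of the roughly 150 entries; values are tab-separated.
word_dic = {
    'abbrev': 'abbrev\tnull\tnull',
    'adj': 'adj\tnull\tnull',
    'adv': 'adv\tnull\tnull',
    'case_def_gen': 'case_def\tgen\tnull',
}

# Build one alternation from all keys, longest first, escaped so that any
# special characters in a tag are matched literally.
# (?<!\S) = not preceded by a non-space character; (?!\S) = not followed by one.
_keys = sorted(word_dic, key=len, reverse=True)
pattern = re.compile(
    r'(?<!\S)(?:' + '|'.join(re.escape(k) for k in _keys) + r')(?!\S)'
)

def replace_words(text):
    """Replace every stand-alone tag in text with its tab-separated sub-tags."""
    return pattern.sub(lambda match: word_dic[match.group(0)], text)
```

One caveat: a word in the first column that happens to be spelled exactly like a tag would also be replaced, so splitting each line on the tab and looking up only the second field may be the safer route.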