[Tutor] Iterating over a long list with regular expressions and changing each item?

Dan Liang danliang20 at gmail.com
Mon May 4 16:15:35 CEST 2009


Hi Spir and tutors,

Thank you Spir for your response. I went ahead and tried your code after
adding a couple of dictionary entries, as below:
-----------Code Begins---------------
#!/usr/bin/python

import sys

tags = {
 'case_def_gen': ['case_def', 'gen', 'null'],
 'nsuff_fem_pl': ['nsuff', 'null', 'null'],
 'abbrev': ['abbrev', 'null', 'null'],
 'adj': ['adj', 'null', 'null'],
 'adv': ['adv', 'null', 'null'],
}  # tag dict: each value is a list of three sub-tags
TAB = '\t'

def newlyTaggedWord(line):
       (word,tag) = line.split(TAB)    # separate parts of line, keeping data only
       new_tags = tags[tag]     # read in dict, indexed by the tag just split off

       tagging = TAB.join(new_tags)    # join with TABs
       return word + TAB + tagging     # formatted result

def replaceTagging(source_name, target_name):
       source_file = file(source_name, 'r')
       source = source_file.read()       # not really necessary
       target_file = open(target_name, "w")
       # replacement loop
       for line in source:
               new_line = newlyTaggedWord(line) + '\n'
               target_file.write(new_line)
       source_file.close()
       target_file.close()

if __name__ == "__main__":
       source_name = sys.argv[1]
       target_name = sys.argv[2]
       replaceTagging(source_name, target_name)

-----------Code Ends---------------

The file I am working on looks like this:


  word      \t     case_def_gen
  word      \t     nsuff_fem_pl
  word      \t     adj
  word      \t     abbrev
  word      \t     adv
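
To make the goal concrete, here is a minimal sketch of what should happen to one such line. It assumes the padding around the tab is plain spaces and that every tag is present in the dictionary; `expand_line` is a made-up name and not part of Spir's code, and the two-entry dict is only a trimmed stand-in for the real one.

```python
TAB = '\t'

# Trimmed stand-in for the full tag dictionary (assumption for illustration).
tags = {
    'case_def_gen': ['case_def', 'gen', 'null'],
    'adj': ['adj', 'null', 'null'],
}

def expand_line(line):
    # Split on the first TAB only, then trim padding and the trailing newline.
    word, tag = [part.strip() for part in line.split(TAB, 1)]
    # Put the word in front of its sub-tags and join everything with TABs.
    return TAB.join([word] + tags[tag])

sample = "word      \t     case_def_gen\n"
print(repr(expand_line(sample)))
```

A tag that is missing from the dictionary raises a KeyError here, which at least makes gaps in the dictionary visible instead of silently skipping them.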

I get the following error when I try to run it, and I cannot figure out
where the problem lies:

-----------Error Begins---------------

Traceback (most recent call last):
  File "tag.formatter.py", line 36, in ?
    replaceTagging(source_name, target_name)
  File "tag.formatter.py", line 28, in replaceTagging
    new_line = newlyTaggedWord(line) + '\n'
  File "tag.formatter.py", line 16, in newlyTaggedWord
    (word,tag) = line.split(TAB)    # separate parts of line, keeping data only
ValueError: unpack list of wrong size

-----------Error Ends---------------

Any ideas?

Thank you!

--dan
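
A note on the traceback above, as a minimal sketch of one likely cause, assuming the script is run as posted: `source_file.read()` returns the whole file as a single string, and looping over a string yields one character at a time. Splitting a single character on TAB gives a one-item list, so the two-name unpacking fails. Looping over the file object itself yields whole lines. The in-memory file and its contents below are stand-ins for the real input.

```python
import io

TAB = '\t'
# Stand-in for the real input file (hypothetical contents).
source_file = io.StringIO(u"word\tadj\nword\tadv\n")

# 1) What the posted loop does: read() gives one string, so the loop
#    variable is a single character, not a line.
source = source_file.read()
first = next(iter(source))
parts = first.split(TAB)
print(repr(first), parts)        # one character, one-item list

try:
    word, tag = parts            # same unpacking as in newlyTaggedWord
except ValueError as err:
    print("ValueError:", err)

# 2) Iterating over the file object itself yields whole lines.
source_file.seek(0)
pairs = []
for line in source_file:
    word, tag = line.rstrip('\n').split(TAB)
    pairs.append((word, tag))
print(pairs)
```

The exact wording of the ValueError depends on the Python version; the traceback above shows the older message.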


From: Dan Liang <danliang20 at gmail.com>
Subject: [Tutor] Iterating over a long list with regular expressions and changing each item?
To: tutor at python.org
>
> Hi tutors,
>
> I am working on a file and need to replace each occurrence of a certain
> label (part of speech tag in this case) by a number of sub-labels. The file
> has the following format:
>
> word1  \t    Tag1
> word2  \t    Tag2
> word3  \t    Tag3
>
> Now the tags are complex and I wanted to split them in a tab-delimited
> fashion to have this:
>
> word1   \t   Tag1Part1   \t   Tag1Part2   \t   Tag1Part3
>
> I searched online for a solution and found the code below, which uses a
> dictionary to store the tags that I want to replace as keys and the
> sub-tags as values. The problem with this is that it sometimes replaces
> tags that are not surrounded by spaces, which I do not want to happen.
> Also, I wanted each new sub-tag to be followed by a tab, so that the new
> items that I end up having in my file are tab-delimited. For this, I put
> tabs between the items of each value in the dictionary. I started thinking
> that this is not the best solution to the problem and that perhaps a
> script that uses regular expressions would be better. Since I am new to
> Python, I thought I should ask you for your thoughts on the best solution.
> There are about 150 items I want to replace, and I did not know how to
> iterate over them with regular expressions. Below is my previous code:
>
>
> #!/usr/bin/python
>
> import re, sys
> f = file(sys.argv[1])
> readed= f.read()
>
> def replace_words(text, word_dic):
>    for k, v in word_dic.iteritems():
>        text = text.replace(k, v)
>    return text
>
> # the dictionary has target_word:replacement_word pairs
>
> word_dic = {
> 'abbrev': 'abbrev    null    null',
> 'adj': 'adj    null    null',
> 'adv': 'adv    null    null',
> 'case_def_acc': 'case_def    acc    null',
> 'case_def_gen': 'case_def    gen    null',
> 'case_def_nom': 'case_def    nom    null',
> 'case_indef_acc': 'case_indef    acc    null',
> 'verb_part': 'verb_part    null    null'}
>
>
> # call the function and get the changed text
>
> myString = replace_words(readed, word_dic)
>
>
> fout = open(sys.argv[2], "w")
> fout.write(myString)
> fout.close()
>
> --dan
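
On the quoted question of replacing only whole tags with regular expressions: a minimal sketch, assuming tags consist of letters, digits and underscores so that `\b` word boundaries apply. All keys are combined into one compiled alternation, so the text is scanned once no matter how many tags there are. `replace_tags` is a made-up name, and the three-entry dict is a trimmed stand-in for the full one.

```python
import re

# Trimmed stand-in for the ~150-entry dictionary in the quoted message.
word_dic = {
    'adj': 'adj\tnull\tnull',
    'adv': 'adv\tnull\tnull',
    'case_def_gen': 'case_def\tgen\tnull',
}

# Longest keys first so that a longer tag wins over any shorter prefix.
keys = sorted(word_dic, key=len, reverse=True)
pattern = re.compile(r'\b(?:' + '|'.join(re.escape(k) for k in keys) + r')\b')

def replace_tags(text):
    # Each match is a whole tag; look its replacement up in the dict.
    return pattern.sub(lambda m: word_dic[m.group(0)], text)

print(repr(replace_tags("word\tadj\nadjective\tadv\n")))
```

Because the substitution is a single pass, replacement text is never scanned again. A word in the first column that happens to equal a tag would still be replaced, so splitting each line on the tab first is the safer route when that can occur.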