[Tutor] Iterating over a long list with regular expressions and changing each item?

Mon May 4 11:09:16 CEST 2009

On Sun, 3 May 2009 21:59:23 -0400,
Dan Liang <danliang20 at gmail.com> wrote:

> Hi tutors,
> 
> I am working on a file and need to replace each occurrence of a certain
> label (part of speech tag in this case) by a number of sub-labels. The file
> has the following format:
> 
> word1  \t    Tag1
> word2  \t    Tag2
> word3  \t    Tag3
> 
> Now the tags are complex and I wanted to split them in a tab-delimited
> fashion to have this:
> 
> word1   \t   Tag1Part1   \t   Tag2Part2   \t   Tag3Part3
> 
> I searched online for some solution and found the code below which uses a
> dictionary to store the tags that I want to replace in keys and the sub-tags
> as values. The problem with this is that it sometimes replaces tags that are
> not surrounded by spaces, which I do not like to happen*1*. Also, I wanted
> each new sub-tag to be followed by a tab, so that the new items that I end
> up having in my file are tab-delimited. For this, I put tabs between the
> items of each key in the dictionary*2*. I started thinking that this will
> not be the best solution of the problem and perhaps a script that uses
> regular expressions would be better*3*. Since I am new to Python, I thought
> I should ask you for your thoughts for a best solution. The items I want to
> replace are about 150 and I did not know how to iterate over them with
> regular expressions.

*3* I think regular expressions are not the proper tool here. Partly because you are new to Python and regexes get really hairy, but above all because they help with parsing, not rewriting. Here the input is very simple to parse, while most of the work lies in the replacement function.

*1* If the source really looks like the above, then as I understand it, "tags that are not surrounded by spaces" can only occur inside words (e.g. the word 'noun'). One more reason for not using regexes. You just need to read each line, keep the left part unchanged, and deal with the tag. The issue is that you replace tags "blindly", without taking into account the simple structure of the source, which would help you.
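
To make the "blind" replacement problem concrete, here is a small sketch. The sample line and the one-entry dict are made up for the demonstration; the point is only that str.replace() also hits a tag that happens to occur inside the word, while splitting on the tab does not.

```python
# Hypothetical sample: the word itself happens to contain the tag 'adj'.
line = "adjust\tadj"
subtags = {"adj": "adj\tnull\tnull"}

# Blind replacement: every occurrence is rewritten, even inside the word.
blind = line
for tag, replacement in subtags.items():
    blind = blind.replace(tag, replacement)

# Structure-aware replacement: split on the tab and leave the word alone.
word, tag = line.split("\t")
aware = word + "\t" + subtags[tag]

print(repr(blind))
print(repr(aware))
```

In the second version the word column is never searched at all, so no word boundaries or surrounding spaces need to be checked.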

*2* I would rather have a dict whose values are lists of (sub)tags. Then let a replacement function deal with output formatting.
word_dic = {
'abbrev': ['abbrev', 'null', 'null'],
'adj': ['adj', 'null', 'null'],
'adv': ['adv', 'null', 'null'],
...
}
It's not only cleaner, it lets you modify the formatting at will. The dict is only constant *data*. Separating data from process is good practice.
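
As a rough illustration of why lists are more flexible than pre-formatted strings: the same entry can be written out with any separator. The helper name format_tagging and its sep parameter are invented for this sketch.

```python
# One illustrative entry; the real dict would hold all the tags.
tags = {"case_def_acc": ["case_def", "acc", "null"]}

def format_tagging(tag, sep="\t"):
    """Join the sub-tags of `tag` with the given separator."""
    return sep.join(tags[tag])

tabbed = format_tagging("case_def_acc")            # tab-delimited output
comma = format_tagging("case_def_acc", sep=", ")   # same data, other format

print(repr(tabbed))
print(repr(comma))
```

Changing the output format then means changing one argument, not editing every value in the dict.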

I would do something like (untested):

import sys

tags = {
    # ... other entries ...
    'foo': ['foo1', 'foo2', 'foo3'],
    # ... other entries ...
}   # tag dict
TAB = '\t'

def newlyTaggedWord(line):
    # separate the two parts of the line, dropping the trailing newline
    (word, tag) = line.rstrip('\n').split(TAB)
    new_tags = tags[tag]            # look up the sub-tags in the dict
    tagging = TAB.join(new_tags)    # join with TABs
    return word + TAB + tagging     # formatted result

def replaceTagging(source_name, target_name):
    source_file = open(source_name, 'r')
    target_file = open(target_name, 'w')
    # replacement loop: iterating over the file yields one line at a time
    for line in source_file:
        new_line = newlyTaggedWord(line) + '\n'
        target_file.write(new_line)
    source_file.close()
    target_file.close()

if __name__ == "__main__":
    source_name = sys.argv[1]
    target_name = sys.argv[2]
    replaceTagging(source_name, target_name)
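
The sketch above assumes that every line holds exactly one tab and that every tag is in the dict. If the real file has blank lines or tags you have not listed, a more forgiving variant might look like the following. The name retag_lines and the choice to pass such lines through unchanged are my assumptions; you may prefer to report them instead.

```python
TAB = "\t"
tags = {"adj": ["adj", "null", "null"]}  # illustrative entry only

def retag_lines(lines):
    """Yield retagged lines; pass through anything that cannot be retagged."""
    for line in lines:
        line = line.rstrip("\n")
        parts = line.split(TAB)
        if len(parts) != 2:
            # blank or malformed line: keep it as it is
            yield line
            continue
        word, tag = parts
        new_tags = tags.get(tag.strip())
        if new_tags is None:
            # unknown tag: keep the line as it is
            yield line
        else:
            yield word + TAB + TAB.join(new_tags)

result = list(retag_lines(["nice\tadj\n", "\n", "dog\tnoun\n"]))
print(result)
```

Since it accepts any iterable of lines, the same function works on an open file object as well as on a list used for testing.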

> Below is my previous code:
> 
> 
> #!usr/bin/python
> 
> import re, sys
> f = file(sys.argv[1])
> readed= f.read()
> 
> def replace_words(text, word_dic):
>     for k, v in word_dic.iteritems():
>         text = text.replace(k, v)
>     return text
> 
> # the dictionary has target_word:replacement_word pairs
> 
> word_dic = {
> 'abbrev': 'abbrev    null    null',
> 'adj': 'adj    null    null',
> 'adv': 'adv    null    null',
> 'case_def_acc': 'case_def    acc    null',
> 'case_def_gen': 'case_def    gen    null',
> 'case_def_nom': 'case_def    nom    null',
> 'case_indef_acc': 'case_indef    acc    null',
> 'verb_part': 'verb_part    null    null'}
> 
> 
> # call the function and get the changed text
> 
> myString = replace_words(readed, word_dic)
> 
> 
> fout = open(sys.argv[2], "w")
> fout.write(myString)
> fout.close()
> 
> --dan

------
la vita e estrany