[Tutor] Replacing fields in lines of various lengths

Tue May 5 12:06:48 CEST 2009

On Tue, 5 May 2009 00:22:45 -0400,
Dan Liang <danliang20 at gmail.com> wrote:

> -------------Begin data----------------------------
> 
> w1    \t   case_def_acc   \t          yes
> w2    \t   noun_prop   \t               no
> w3    \t   case_def_gen   \t
> w4    \t   dem_pron_f   \t             no
> w3    \t   case_def_gen   \t
> w4    \t   dem_pron_f   \t             no
> w1    \t   case_def_acc   \t          yes
> w3    \t   case_def_gen   \t
> w3    \t   case_def_gen   \t
> 
> -------------End data----------------------------

> I tried to  make changes to the code above by changing the function where we
> read the dictionary, but it did not work. While it is ugly, I include it as
> a proof that I have worked on the problem. I am sure you will have various
> nice ideas.
> 
> 
> -------------Begin code----------------------------
> def newlyTaggedWord(line):
>        tagging = ""
>        line = line.split(TAB)    # separate parts of line, keeping data only
>        if len(line)==3:
>            word = line[-3]
>            tag = line[-2]
>            new_tags = tags[tag]
>            decision = line[-1]
> 
> # in decision I wanted to store either yes or no if one of these existed
> 
>        elif len(line)==2:
>            word = line[-2]
>            tag = line[-1]
>            decision = TAB
> 
> # I thought if it is a must to put sth in decision while decision is really
> # absent in line, I would put a tab. But I really want to avoid putting
> # anything there.
> 
>            new_tags = tags[tag]          # read in dict
>            tagging = TAB.join(new_tags)    # join with TABs
>            return word + TAB + tagging + TAB + decision
> -------------End code----------------------------
> 

For simplicity, it would be nice if the file had some placeholder in place of the absent yes/no 'decisions', so that you know there are always 3 fields. With most languages you would need that. But Python is rather flexible about such border cases. Watch the example below:

s1, s2 = "1\t2\t3", "1\t2\t"
items1, items2 = s1.split('\t'), s2.split('\t')
print items1, items2
==>
['1', '2', '3'] ['1', '2', '']

So you always get 3 items, the 3rd one possibly being the empty string (as long as the line keeps its trailing TAB). Right?
This means:
* You can safely write "(word,tag,decision) = line.split(TAB)"
[Beware of misleading naming like "line = line.split(TAB)", for after this the name 'line' actually refers to field values.]
* You can have a single process.
* The elif branch in your code above will never run, I guess ;-)
[place a print instruction inside to check that]
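
To make the points above concrete, here is a minimal sketch of the single-process version. The content of 'tags' is made up for illustration (your real dictionary is read from a file and is not shown in this thread), and I assume that fields are separated by exactly one TAB with no spaces around it. Note that only the newline is stripped: a bare strip() would also remove the trailing TAB, and you would be back to 2 fields.

```python
TAB = "\t"

# Hypothetical tag dictionary, for illustration only; the real one is
# read from a file in the original program.
tags = {
    "case_def_acc": ["case", "def", "acc"],
    "case_def_gen": ["case", "def", "gen"],
    "noun_prop": ["noun", "prop"],
    "dem_pron_f": ["dem", "pron", "f"],
}

def newly_tagged_word(line):
    # Strip only the newline: strip() with no argument would also eat the
    # trailing TAB and leave 2 fields on lines that have no decision.
    word, tag, decision = line.rstrip("\n").split(TAB)
    tagging = TAB.join(tags[tag])      # replace the tag by its new tags
    return word + TAB + tagging + TAB + decision

print(repr(newly_tagged_word("w1\tcase_def_acc\tyes\n")))
print(repr(newly_tagged_word("w3\tcase_def_gen\t\n")))
```

When the decision is absent, the result simply ends with a TAB followed by the empty string, so there is no need to put anything in its place.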

Denis

PS: I noticed that in your final version, for the case of files with only 2 fields, you misplaced the file closings. They fit better inside the function.
------
la vita e estrany