[Tutor] Problems with partial string matching

Dave Angel davea at ieee.org
Mon Nov 1 14:38:03 CET 2010


On 11/1/2010 8:48 AM, Josep M. Fontana wrote:
> Thanks a lot Dave and Joel,
>
>
>> You call re.sub(), but don't do anything with the result.
>>
>> Where do you call os.rename() ?
>
> Yes, indeed, as you suggested what was missing was the use of
> os.rename() to apply the substitution to the actual file names. I
> incorporated that and I changed the loop that I had produced in my
> first version because it wasn't doing what it was supposed to do.
>
> Doing that definitely gets me closer to my goal but I'm encountering a
> strange problem. Well, strange to me, that is. I'm sure that more
> experienced programmers like the people who hang out in this list will
> immediately see what is going on. First, here's the code:
>
> ------------------------------
> import os, sys, glob, re
> #What follows creates a dictionary with the form {'name':'year'} out
> #of a csv file called FileNameYear.txt which has a string of the form
> #'A-01,1374' on each line. The substring before the comma is the code
> #for the text that appears at the beginning of the name of the file
> #containing the given text and the substring after the comma indicates
> #the year in which the text was written.
> fileNameYear = open(r'/Volumes/DATA/Documents/workspace/MyCorpus/CORPUS_TEXT_LATIN_1/FileNameYear.txt',
> "U").readlines()
> name_year = {}
> for line in fileNameYear: #File objects have built-in iteration
>      name, year = line.strip().split(',')
>      name_year[name] = year #effectively creates the dictionary by
>      #creating keys with the element 'name' returned by the loop and
>      #assigning them values corresponding to the element 'year' -->
>      #"d[key] = value" means: set d[key] to value.
> os.getcwd()
> os.chdir('/Volumes/DATA/Documents/workspace/MyCorpus/CORPUS_TEXT_LATIN_1')
> file_names = glob.glob('*.txt')
> for name in name_year:
>      for name_of_file in file_names:
>          if name_of_file.startswith(name):
>              os.rename(name_of_file, re.sub('__', '__' + year, name_of_file))
> ---------------
>
> What this produces is a change in the names of the files which is not
> exactly the desired result. The new names of the files have the
> following structure:
>
> 'A-01-name1__1499.txt' , 'A-02-name2__1499.txt',
> 'A-05-name3__1499.txt', ... 'I-01-name14__1499.txt',
> ...Z-30-name1344__1499.txt'
>
> That is, only the year '1499' of the many possible years has been
> added in the substitution. I can understand that I've done something
> wrong in the loop and the iteration over the values of the dictionary
> (i.e. the strings representing the years) is not working properly.
> What I don't understand is why precisely '1499' is the string that is
> obtained in all the cases.
>
> I've been trying to figure out how the loop proceeds and this doesn't
> make sense to me because the year '1499' appears as the value for
> dictionary item number 34. Because of the order of the dictionary
> entries and the way I've designed the loop (which I admit might not be
> the most efficient way to process these data), the first match would
> correspond to a file that starts with the initial code 'I-02'. The
> dictionary value for this key is '1399', not '1499'. '1499' is not
> even the value that would correspond to key 'A-01' which is the first
> file in the directory according to the alphabetical order ('A-02', the
> second file in the directory does correspond to value '1499', though).
>
> So besides being able to explain why '1499' is the string that winds
> up added to the file name, my question is, how do I set up the loop so
> that the string representing the appropriate year is added to each
> file name?
>
> Thanks a lot in advance for your help (since it usually takes me a
> while to answer).
>
> Josep M.
>
(You top-posted, so I had to remove the out-of-order earlier portion.)

I've not tried to run the code, but I think I can see the problem.
Since you never assign 'year' inside the renaming loop(s), it's always
the same: whatever value it last had in the earlier loop, i.e. the year
from the last line of FileNameYear.txt.
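
To see the effect in isolation, here is a minimal sketch. The three
sample lines are made up and merely stand in for the contents of
FileNameYear.txt; the names 'leftover_year' and 'seen' are likewise
only for illustration.

```python
# Hypothetical sample data standing in for the lines of FileNameYear.txt.
lines = ["A-01,1374\n", "A-02,1499\n", "I-02,1399\n"]

name_year = {}
for line in lines:
    name, year = line.strip().split(',')
    name_year[name] = year

# The first loop has finished, but 'year' still exists and holds the
# value taken from the last line that was processed.
leftover_year = year

# A second loop that never reassigns 'year' sees that same value on
# every pass, no matter which key it is looking at.
seen = []
for name in name_year:
    seen.append(year)

print(leftover_year)
print(seen)
```

A Python 'for' loop does not create a new scope, so the names it binds
stay around after the loop ends.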

The simplest cure would be to fix the outer loop:

for name, year in name_year.items():

Alternatively, and maybe easier to read:

for name in name_year:
      year = name_year[name]
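
Putting it together, the renaming step might look roughly like the
sketch below. It is untested against your real data: the function name
add_years, the temporary directory and the two sample file names are
all assumptions made for the demo. I've also swapped re.sub() for
str.replace() with a count of 1, since the pattern is a plain string
and only the first '__' should be touched.

```python
import os
import tempfile

def add_years(directory, name_year):
    """Insert each file's year after the '__' marker in its name.

    'add_years' is a hypothetical helper, not part of the original script.
    """
    renamed = []
    # sorted() takes a snapshot of the listing before anything is renamed.
    for file_name in sorted(os.listdir(directory)):
        if not file_name.endswith('.txt'):
            continue
        # 'year' is rebound on every pass, so it always matches 'name'.
        for name, year in name_year.items():
            if file_name.startswith(name):
                new_name = file_name.replace('__', '__' + year, 1)
                os.rename(os.path.join(directory, file_name),
                          os.path.join(directory, new_name))
                renamed.append(new_name)
                break  # one code per file, so stop looking
    return renamed

# Throwaway demo in a temporary directory with made-up file names.
demo_dir = tempfile.mkdtemp()
for fn in ('A-01-name1__.txt', 'A-02-name2__.txt'):
    open(os.path.join(demo_dir, fn), 'w').close()

result = add_years(demo_dir, {'A-01': '1374', 'A-02': '1499'})
print(result)
```

Try it on a copy of the directory first, since os.rename() changes the
files in place.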

HTH
DaveA
