[Tutor] how to extract data only after a certain condition is met

bob gailer bgailer at gmail.com
Sun Oct 10 23:28:09 CEST 2010


  Emile beat me to it, but here goes anyway...

On 10/10/2010 3:35 PM, Josep M. Fontana wrote:
> Hi,
>
> First let me apologize for taking so long to acknowledge your answers 
> and to thank you (Eduardo, Peter, Greg, Emile, Joel and Alan, sorry if 
> I left anyone out) for your help and your time.
>
> One of the reasons I took so long in responding (besides having gotten 
> busy with some urgent matters related to my work) is that I was a bit 
> embarrassed at realizing how poorly I had defined my problem.
> As Alan said, I should at least have told you which operations were 
> giving me a headache. So I went back to my Python reference books to 
> try to write some code and thus be able to define my problems more 
> precisely. Only after I did that, I said to myself, I would come back 
> to the list with more specific questions.
>
> The only problem is that doing this made me painfully aware of how 
> little Python I know. Well, actually my problem is not so much that I 
> don't know Python as that I have very little experience programming in 
> general. Some years ago I learned a little Perl and basically I used 
> it to do some text manipulation using regular expressions but that's 
> all my experience. In order to learn Python, I read a book called 
> "Beginning Python: From Novice to Professional" and I was hoping that 
> just by starting to use the knowledge I had supposedly acquired by 
> reading that book to solve real problems related to my project I would 
> learn. But this turned out to be much more difficult than I had 
> expected. Perhaps if I had worked through the excellent book/tutorial 
> Alan has written (of which I was not aware when I started), I would be 
> better prepared to confront this problem.
>
> Anyway (sorry for the long intro), since Emile laid out the problem 
> very clearly, I will use his outline to point out the problems I'm having:
>
> Emile says:
> --------------
> Conceptually, you'll need to:
>
>   -a- get the list of file names to change then for each
>   -b- determine the new name
>   -c- rename the file
>
> For -a- you'll need glob. For -c- use os.rename.  -b- is a bit more
> involved.  To break -b- down:
>
>   -b1- break out the x-xx portion of the file name
>   -b2- look up the corresponding year in the other file
>   -b3- convert the year to the century-half structure
>   -b4- put the pieces together to form the new file name
>
> For -b2- I'd suggest building a dictionary from your second file's
> contents as a first step to facilitate the subsequent lookups.
>
> ---------------------
>
> OK. Let's start with -b- . My first problem is that I don't really 
> know how to go about building a dictionary from the file with the 
> comma separated values. I've discovered that if I use a file method 
> called 'readlines' I can create a list whose elements would be each of 
> the lines contained in the document with all the codes followed by 
> comma followed by the year. Thus if I do:
>
> fileNameCentury = open(r'/Volumes/DATA/Documents/workspace/GCA/CORPUS_TEXT_LATIN_1/FileNamesYears.txt').readlines()
>
>
> Where 'FileNamesYears.txt' is the document with the following info:
>
> A-01, 1278
> A-02, 1501
> ...
> N-09, 1384
>
> I get a list of the 
> form ['A-01,1374\rA-02,1499\rA-05,1449\rA-06,1374\rA-09, ...]
>

I'm guessing that you are running on a Linux system and that the file 
came from a Mac. This is based on the fact that \r appears in the string 
instead of acting as a line separator.

Regardless -
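dct = {}
# readlines() gave you a list, not a string, so join it before splitting on '\r'
pairs = ''.join(fileNameCentury).split('\r')
# gives you ['A-01,1374', 'A-02,1499', 'A-05,1449', 'A-06,1374', ...]
for pair in pairs:
    if not pair.strip():
        continue  # skip an empty entry, e.g. after a trailing '\r'
    key, value = pair.split(',')
    dct[key.strip()] = value.strip()  # strip() drops the space after the comma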

> Greg mentioned the csv module. I checked the references but I could 
> not see any way in which I could create a dictionary using that module.
>
True - the csv reader is just another way to get the list of pairs.
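As a rough sketch of that idea (not run against your actual file, and assuming every row looks like "A-01, 1278"), the rows the csv reader yields can be fed straight into a dictionary. The sample text and the name build_year_lookup are made up for illustration:

```python
import csv

def build_year_lookup(text):
    """Map each code (e.g. 'A-01') to its year as an int."""
    # splitlines() treats '\r', '\n' and '\r\n' all as line breaks,
    # so files with old Mac-style '\r' endings are handled too.
    rows = csv.reader(text.splitlines(), skipinitialspace=True)
    lookup = {}
    for row in rows:
        if len(row) != 2:
            continue  # skip blank or malformed lines
        code, year = row
        lookup[code.strip()] = int(year)
    return lookup

# Made-up sample in the same shape as FileNamesYears.txt
sample = "A-01, 1278\rA-02, 1501\rN-09, 1384\r"
years = build_year_lookup(sample)
print(years["A-01"])
```

Converting the year to an int here is a choice, not a requirement; it just saves a conversion later when you do the century arithmetic.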
>
>
> Once I have the dictionary built, what I would have to do is use the 
> os module (or would it be the glob module?) to get a list of the file 
> names I want to change and build another loop that would iterate over 
> those file names and, if the first part of the name (possibly 
> represented by a regular expression of the form r'[A-Z]-[0-9]+') 
> matches one of the keys in the dictionary, then a) it would get the 
> value for that key, b) would do the numerical calculation to determine 
> whether it is the first part of the century or the second part and c) 
> would insert the string representing this result right before the 
> extension .txt.
>
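For step b), here is a minimal sketch of the calculation as you described it in your first message (first two digits + 1 is the century, last two digits above 50 mean second half). The function name century_half is just an illustration, and how to treat a year ending in exactly 50 is my assumption since your rule does not say:

```python
def century_half(year):
    """Return a label such as '13-2' for the year 1278."""
    year = int(year)            # dictionary values may still be strings
    century = year // 100 + 1   # 1278 -> 12 + 1 = 13
    # Assumption: a year ending in exactly 50 counts as the first half.
    half = 2 if year % 100 > 50 else 1
    return "%d-%d" % (century, half)

print(century_half("1278"))   # 13-2
print(century_half(1501))     # 16-1
```

This only makes sense for four-digit years like the ones in your list.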
> In the abstract it sounds easy, but I don't even know how to start. 
>  Doing some testing with glob I see that it returns a list of strings 
> representing the whole paths to all the files whose names I want to 
> manipulate. But in the reference documents that I have consulted, I 
> see no way to change those names. How do I go about inserting the 
> information about the century right before the substring '.txt'?
>
Suppose fn = "blah.txt"
fn2 = f
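One possible way to do the insertion and the renaming, offered as a sketch rather than the only approach: os.path.splitext separates the name from its extension, and os.rename does the actual change. The names add_suffix and rename_all, and the labels dictionary (code -> string such as '13-2'), are hypothetical; the regular expression is the one you proposed.

```python
import glob
import os
import re

CODE_RE = re.compile(r"[A-Z]-[0-9]+")   # the pattern suggested in the question

def add_suffix(path, label):
    """Insert '_' + label just before the extension of path."""
    root, ext = os.path.splitext(path)   # ('dir/A-01-namex', '.txt')
    return "%s_%s%s" % (root, label, ext)

def rename_all(directory, labels):
    """Rename every .txt file whose leading code is a key in labels.

    labels is a hypothetical dict such as {'A-01': '13-2'}.
    Returns the list of new paths.
    """
    renamed = []
    for path in sorted(glob.glob(os.path.join(directory, "*.txt"))):
        match = CODE_RE.match(os.path.basename(path))
        if match is None or match.group() not in labels:
            continue                     # leave unrelated files alone
        new_path = add_suffix(path, labels[match.group()])
        os.rename(path, new_path)
        renamed.append(new_path)
    return renamed

print(add_suffix("A-01-namex.txt", "13-2"))   # A-01-namex_13-2.txt
```

Try it on a copy of the directory first: running rename_all twice would append the label a second time, since the sketch does not check whether a file has already been renamed.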
>
> As you see, I am very green. My embarrassment at realizing how basic 
> my problems were made me delay writing another message but I decided 
> that if I don't do it, I will never learn.
>
> Again, thanks so much for all your help.
>
> Josep M.
>
>     Message: 2
>     Date: Sat, 2 Oct 2010 17:56:53 +0200
>     From: "Josep M. Fontana" <josep.m.fontana at gmail.com>
>     To: tutor at python.org
>     Subject: [Tutor] Using contents of a document to change file names
>     Message-ID:
>     <AANLkTikjOFYhieL70E=-BaE_PEdc0nG+iGY3j+qO+FMZ at mail.gmail.com>
>     Content-Type: text/plain; charset="iso-8859-1"
>
>     Hi,
>
>     This is my first posting to this list. Perhaps this has a very easy
>     answer, but before deciding to post this message I consulted a bunch of
>     Python manuals and on-line reference documents to no avail. I would be
>     very grateful if someone could lend me a hand with this.
>
>     Here's the problem I want to solve. I have a lot of files with the
>     following name structure:
>
>     A-01-namex.txt
>     A-02-namey.txt
>     ...
>     N-09-namez.txt
>
>     These are different text documents that I want to process for an NLP
>     project I'm starting. Each one of the texts belongs to a different
>     century and it is important to be able to include the information about
>     the century in the name of the file as well as inside the text.
>
>     Then I have another text file containing information about the century
>     in which each one of the texts was written. This document has the
>     following structure:
>
>     A-01, 1278
>     A-02, 1501
>     ...
>     N-09, 1384
>
>     What I would like to do is to write a little script that would do the
>     following:
>
>     . Read each row of the text containing information about the century
>     in which each one of the texts was written
>     . Change the name of the file whose name starts with the code in the
>     first column in the following way
>
>            A-01-namex.txt --> A-01-namex_13-2.txt
>
>        Where 13-2 means: 13th century, 2nd half. Obviously this information
>     would come from the second column in the text: 1278 (the first two
>     digits + 1 = century; if the 3rd and 4th digits > 50, then 2; if < 50
>     then 1)
>
>     Then in the same script or in a new one, I would need to open each one
>     of the texts and add information about the century they were written in
>     on the first line, preceded by some symbol (e.g. @13-2)
>
>     I've found a lot of information about changing file names (so I know
>     that I should be importing the os module), but none of the examples
>     that were cited involved getting the information for the file changing
>     operation from the contents of a document.
>
>     As you can imagine, I'm pretty green in Python programming and I was
>     hoping the learn-by-doing method would work. I need to get on with this
>     project, though, and I'm kind of stuck. Any help you guys can give me
>     will be very helpful.
>
>     Josep M.
>


-- 
Bob Gailer
919-636-4239
Chapel Hill NC
