[Tutor] how to extract data only after a certain condition is met
bob gailer
bgailer at gmail.com
Sun Oct 10 23:28:09 CEST 2010
Emile beat me to it, but here goes anyway...
On 10/10/2010 3:35 PM, Josep M. Fontana wrote:
> Hi,
>
> First let me apologize for taking so long to acknowledge your answers
> and to thank you (Eduardo, Peter, Greg, Emile, Joel and Alan, sorry if
> I left anyone) for your help and your time.
>
> One of the reasons I took so long in responding (besides having gotten
> busy with some urgent matters related to my work) is that I was a bit
> embarrassed at realizing how poorly I had defined my problem.
> As Alan said, I should at least have told you which operations were
> giving me a headache. So I went back to my Python reference books to
> try to write some code and thus be able to define my problems more
> precisely. Only after I did that, I said to myself, I would come back
> to the list with more specific questions.
>
> The only problem is that doing this made me painfully aware of how
> little Python I know. Well, actually my problem is not so much that I
> don't know Python as that I have very little experience programming in
> general. Some years ago I learned a little Perl and basically I used
> it to do some text manipulation using regular expressions but that's
> all my experience. In order to learn Python, I read a book called
> "Beginning Python: From Novice to Professional" and I was hoping that
> just by starting to use the knowledge I had supposedly acquired by
> reading that book to solve real problems related to my project I would
> learn. But this turned out to be much more difficult than I had
> expected. Perhaps if I had worked through the excellent book/tutorial
> Alan has written (of which I was not aware when I started), I would be
> better prepared to confront this problem.
>
> Anyway (sorry for the long intro), since Emile laid out the problem
> very clearly, I will use his outline to point out the problems I'm having:
>
> Emile says:
> --------------
> Conceptually, you'll need to:
>
> -a- get the list of file names to change then for each
> -b- determine the new name
> -c- rename the file
>
> For -a- you'll need glob. For -c- use os.rename. -b- is a bit more
> involved. To break -b- down:
>
> -b1- break out the x-xx portion of the file name
> -b2- look up the corresponding year in the other file
> -b3- convert the year to the century-half structure
> -b4- put the pieces together to form the new file name
>
> For -b2- I'd suggest building a dictionary from your second files
> contents as a first step to facilitate the subsequent lookups.
>
> ---------------------
>
> OK. Let's start with -b- . My first problem is that I don't really
> know how to go about building a dictionary from the file with the
> comma separated values. I've discovered that if I use a file method
> called 'readlines' I can create a list whose elements would be each of
> the lines contained in the document with all the codes followed by
> comma followed by the year. Thus if I do:
>
> fileNameCentury = open(r'/Volumes/DATA/Documents/workspace/GCA/CORPUS_TEXT_LATIN_1/FileNamesYears.txt').readlines()
>
>
> Where 'FileNamesYears.txt' is the document with the following info:
>
> A-01, 1278
> A-02, 1501
> ...
> N-09, 1384
>
> I get a list of the
> form ['A-01,1374\rA-02,1499\rA-05,1449\rA-06,1374\rA-09, ...]
>
I'm guessing that you are running on a Linux system and that the file
came from a Mac. This is based on the fact that \r appears in the string
instead of acting as a line separator.
Regardless -
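dct = {}
# readlines() returned a one-item list here, so split its only string on '\r'
pairs = fileNameCentury[0].split('\r') # gives you ['A-01,1374', 'A-02,1499', 'A-05,1449', ...]
for pair in pairs:
    if pair.strip(): # skip the empty piece left by a trailing '\r'
        key, value = pair.split(',')
        dct[key.strip()] = value.strip()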
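If you would rather not care which system the file came from, here is a minimal sketch of the same idea using str.splitlines(), which breaks on '\r', '\n' and '\r\n' alike. The function name build_year_dict and the sample string are made up for illustration; in practice you would pass in the result of open(...).read().

```python
def build_year_dict(text):
    """Map codes such as 'A-01' to year strings such as '1278'."""
    dct = {}
    for line in text.splitlines():   # handles \r, \n and \r\n endings
        line = line.strip()
        if not line:
            continue                 # ignore blank lines
        key, value = line.split(',')
        dct[key.strip()] = value.strip()
    return dct

# hypothetical sample in the same shape as FileNamesYears.txt
sample = 'A-01, 1278\rA-02, 1501\rN-09, 1384\r'
years = build_year_dict(sample)
```

Note that the values stay strings; convert with int() when you need to do arithmetic on the year.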
> Greg mentioned the csv module. I checked the references but I could
> not see any way in which I could create a dictionary using that module.
>
True - the csv reader is just another way to get the list of pairs.
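For comparison, a small sketch of how the csv route could look. csv.reader accepts any iterable of lines, and skipinitialspace=True drops the blank after each comma. The function name build_year_dict_csv and the sample lines are assumptions for the example, not something from your files.

```python
import csv

def build_year_dict_csv(lines):
    """Build {code: year} from an iterable of 'code, year' lines."""
    reader = csv.reader(lines, skipinitialspace=True)
    # each non-empty row is a list such as ['A-01', '1278']
    return dict((row[0], row[1]) for row in reader if row)

# hypothetical sample lines in the shape of FileNamesYears.txt
years_csv = build_year_dict_csv(['A-01, 1278', 'A-02, 1501', 'N-09, 1384'])
```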
>
>
> Once I have the dictionary built, what I would have to do is use the
> os module (or would it be the glob module?) to get a list of the file
> names I want to change and build another loop that would iterate over
> those file names and, if the first part of the name (possibly
> represented by a regular expression of the form r'[A-Z]-[0-9]+')
> matches one of the keys in the dictionary, then a) it would get the
> value for that key, b) would do the numerical calculation to determine
> whether it is the first part of the century or the second part and c)
> would insert the string representing this result right before the
> extension .txt.
>
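The matching step might be sketched like this, using the regular expression you proposed. The helper name year_for and the sample paths are hypothetical; the paths stand in for what glob.glob() would hand you.

```python
import os
import re

CODE = re.compile(r'[A-Z]-[0-9]+')   # the pattern suggested in the question

def year_for(path, dct):
    """Return the year for a path like '/x/A-01-namex.txt', or None."""
    # match only against the file name, not the directories before it
    match = CODE.match(os.path.basename(path))
    if match is None:
        return None
    return dct.get(match.group())    # None if the code is not a key

years = {'A-01': '1278', 'A-02': '1501'}
found = year_for('/some/dir/A-01-namex.txt', years)
missing = year_for('/some/dir/notes.txt', years)
```

Files that return None can simply be skipped in the loop.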
> In the abstract it sounds easy, but I don't even know how to start.
> Doing some testing with glob I see that it returns a list of strings
> representing the whole paths to all the files whose names I want to
> manipulate. But in the reference documents that I have consulted, I
> see no way to change those names. How do I go about inserting the
> information about the century right before the substring '.txt'?
>
Suppose fn = "blah.txt"
fn2 = f
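One way to do the insertion, as a sketch: os.path.splitext() separates the extension from the rest of the name, so the new piece can be glued in between. The helper name insert_before_ext is made up, and the underscore separator is taken from the A-01-namex_13-2.txt example in the original post.

```python
import os

def insert_before_ext(fn, tag):
    """'A-01-namex.txt', '13-2' -> 'A-01-namex_13-2.txt'"""
    root, ext = os.path.splitext(fn)   # ('A-01-namex', '.txt')
    return root + '_' + tag + ext

fn2 = insert_before_ext('A-01-namex.txt', '13-2')
```

This works on full paths as well, since only the final extension is split off. The actual renaming would then be os.rename(fn, fn2).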
>
> As you see, I am very green. My embarrassment at realizing how basic
> my problems were made me delay writing another message but I decided
> that if I don't do it, I will never learn.
>
> Again, thanks so much for all your help.
>
> Josep M.
>
> Message: 2
> Date: Sat, 2 Oct 2010 17:56:53 +0200
> From: "Josep M. Fontana" <josep.m.fontana at gmail.com>
> To: tutor at python.org
> Subject: [Tutor] Using contents of a document to change file names
> Message-ID: <AANLkTikjOFYhieL70E=-BaE_PEdc0nG+iGY3j+qO+FMZ at mail.gmail.com>
> Content-Type: text/plain; charset="iso-8859-1"
>
> Hi,
>
> This is my first posting to this list. Perhaps this has a very easy answer
> but before deciding to post this message I consulted a bunch of Python
> manuals and on-line reference documents to no avail. I would be very
> grateful if someone could lend me a hand with this.
>
> Here's the problem I want to solve. I have a lot of files with the following
> name structure:
>
> A-01-namex.txt
> A-02-namey.txt
> ...
> N-09-namez.txt
>
> These are different text documents that I want to process for an NLP project
> I'm starting. Each one of the texts belongs to a different century and it is
> important to be able to include the information about the century in the
> name of the file as well as inside the text.
>
> Then I have another text file containing information about the century each
> one of the texts was written. This document has the following structure:
>
> A-01, 1278
> A-02, 1501
> ...
> N-09, 1384
>
> What I would like to do is to write a little script that would do the
> following:
>
> . Read each row of the text containing information about the centuries each
>   one of the texts was written
> . Change the name of the file whose name starts with the code in the first
>   column in the following way
>
> A-01-namex.txt --> A-01-namex_13-2.txt
>
> Where 13-2 means: 13th century, 2nd half. Obviously this information would come
> from the second column in the text: 1278 (the first two digits + 1 =
> century; if the 3rd and 4th digits > 50, then 2; if < 50 then 1)
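The rule just described can be sketched in a few lines. The function name century_half is made up, and the rule as stated does not say what to do with a year ending in exactly 50, so counting it as the second half (00-49 versus 50-99) is an assumption here.

```python
def century_half(year):
    """1278 -> '13-2' (13th century, second half)."""
    year = int(year)                     # accepts '1278' or 1278
    century = year // 100 + 1            # first two digits + 1
    # assumption: a year ending in exactly 50 goes to the second half
    half = 2 if year % 100 >= 50 else 1
    return '%d-%d' % (century, half)

tags = [century_half(y) for y in ('1278', '1501', '1384')]
```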
>
> Then in the same script or in a new one, I would need to open each one of
> the texts and add information about the century they were written on the
> first line preceded by some symbol (e.g. @13-2)
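Adding the tag line could look roughly like the sketch below: read the whole text, then write it back with the new first line in front. The function name prepend_tag is hypothetical, and the iso-8859-1 encoding is an assumption (guessed from the message's charset header and the CORPUS_TEXT_LATIN_1 directory name); change it if your texts are stored differently.

```python
def prepend_tag(path, tag, encoding='iso-8859-1'):
    """Rewrite the file at path with '@<tag>' as its new first line."""
    # newline='' keeps the file's existing line endings untouched
    with open(path, encoding=encoding, newline='') as f:
        text = f.read()
    with open(path, 'w', encoding=encoding, newline='') as f:
        f.write('@' + tag + '\n' + text)
```

Since this overwrites the file in place, it is worth trying it on a copy of the corpus first.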
>
> I've found a lot of information about changing file names (so I know that I
> should be importing the os module), but none of the examples that were cited
> involved getting the information for the file changing operation from the
> contents of a document.
>
> As you can imagine, I'm pretty green in Python programming and I was hoping
> the learn by doing method would work. I need to get on with this project,
> though, and I'm kind of stuck. Any help you guys can give me will be very
> helpful.
>
> Josep M.
>
--
Bob Gailer
919-636-4239
Chapel Hill NC