[Tutor] a quick Q: how to use for loop to read a series of files with .doc end

Wed Oct 5 14:21:58 CEST 2011

On 10/05/2011 02:51 AM, lina wrote:
> On Wed, Oct 5, 2011 at 1:42 PM, Dave Angel<d at davea.name>  wrote:
>
>> On 10/04/2011 11:13 PM, lina wrote:
>>
>>> On Wed, Oct 5, 2011 at 10:45 AM, Dave Angel<d at davea.name>   wrote:
>>>
>>>   On 10/04/2011 10:22 PM, lina wrote:
>>>>   On Wed, Oct 5, 2011 at 1:30 AM, Prasad, Ramit<ramit.prasad at jpmorgan.com>
>>>>>
>>>>>> w
>>>>>>
>>>>> <SNIP>
>>>>> SyntaxError: invalid syntax
>>>>>
>>>>> for fileName in os.listdir("."):
>>>>>      if os.path.isfile(fileName) and os.path.splitext(fileName)[1]==".xpm":
>>>>>          filedata = open(fileName)
>>>>>          text=filedata.readlines()
>>>>>          cols = len(text[0])
>>>>>          except IndexError:
>>>>>              print ("Index Error.")
>>>>>          result=[]
>>>>>          for idx in xrange(cols):
>>>>>              results.append(0)
>>>>>          for line in text:
>>>>>              for col_idx, field in enumerate(line):
>>>>>                  if token in field:
>>>>>                      results[col_idx]+=1
>>>>>              for index in col_idx:
>>>>>                  print results[index]
>>>>>
>>>>> it showed up:
>>>>>
>>>>>      print results[]
>>>>>                  ^
>>>>> SyntaxError: invalid syntax
>>>>>
>>>>> Sorry, I still lack a deep understanding of some basics. Thanks for
>>>>> your patience.
>>>>>
>>>>>
>>>>   Simplest answer here is you might have accidentally run this under
>>>> Python 3.x.  That would explain the syntax error on the print function.
>>>> Pick a single version and stick to it.  In fact, you might even put a
>>>> version test at the beginning of the code to give an immediate error.
>>>>
>>>>   choose python3.
>>>   Then change that last print to use parentheses.  print() is a function
>> call in Python 3.x, while it was a statement in earlier Python versions.
>>
>>   <SNIP>
>>>>   This example illustrates one reason why it's a mistake to write all the
>>>> code at top level.  This code should probably be at least 4 functions,
>>>> with each one handling one abstraction.
>>>>
>>>   It's frustrating. Seriously. (I think I need to read some good,
>>> relevant code first.)
>>>
>>>   Is Python your first programming language?  It was approximately my 30th.
> Not exactly. Ha ... I didn't know there were so many languages out there.
>
>> I learned "programming" from a Fortran book in 1967.  I had no access to a
>> computer, though there was at least one in the state, at the Yale campus.  I
>> saw it in a field trip by the (advanced) students that were taking
>> programming.  They weren't allowed to take it till finishing 2nd year
>> calculus, which I didn't do till I got to college.  However, when I went to
>> college the following year, I ran across another student who knew how to
>> access the mainframe (via punch-cards), and could tell me how to do it.
>>   (Security was very light).  For a few months, I hacked daily, and learned a
>> lot.  Then the following year, I actually took an electrical engineering
>> class that introduced the concepts of programming, and I spent my time doing
>> experiments that barely resembled the assignments.  I ended up with an
>> incomplete in the course, which I made up by writing a linear circuit
>> analysis program.  Punched card input, graphical output to a line printer
>> using rows of asterisks.
>>
> Where to start: I learned C 10 years ago, but for the whole semester I
> never wrote a serious program, though I did attend every lecture.
> At that time, I was addicted to literature. But later I realized that lots
> of writers (especially the ones I like) ended up committing suicide,
> something too heavy to handle, so I changed to something like physics. I
> noticed that lots of people doing physics live really long and happy lives
> (long live the physicist). Then four years of (applied) physics, three
> years of (theoretical) physics, then (bio-) physics in the following years.
> (It's a joke.)
> During those years I used Maple, MATLAB and some basic awk and bash, but
> all of it was very basic. A shame... I never did anything serious.
>
>> Point is, it takes a lot of time, and usually a one-on-one mentor to get
>> the concepts nailed down.  Seldom did anyone tell me "write these lines
>> down, and it'll solve the problem."  instead they told me where my problem
>> was, and where in those manuals (chained to tables in the lab) to find more
>> information.
>>
>> It wasn't till my fourth language that I found out about local variables,
>> and how a function should encapsulate one concept.  The first three didn't
>> have such things.
>>
>>
>>
>>>>   Further, while you're developing, you should probably put the test data
>>>> into a literal (probably a multiline literal using triple quotes), so you
>>>> can experiment easily with changes to the data, and see what results.
>>>>
>>>>
>>>   #!/bin/python
>>>
>>> import os.path
>>>
>>> tokens=['B','E']
>>>
>>> for fileName in os.listdir("."):
>>>      if os.path.isfile(fileName) and os.path.splitext(fileName)[1]==".xpm":
>>>          filedata = open(fileName)
>>>          text=filedata.readlines()
>>>          results={}
>>>          numcolumns=len(text.strip())
>>>          for ch in tokens:
>>>              results[ch]=[0]*numcolumns
>>>          for line in text:
>>>              for col, ch in enumerate(line):
>>>                  if ch in tokens:
>>>                      results[ch][col]+=1
>>>          for item in results:
>>>                  print item
>>>
>>> $ python3 counter-vertically.py
>>>    File "counter-vertically.py", line 20
>>>      print item
>>>               ^
>>> SyntaxError: invalid syntax
>>>
>>>   As I said above, Python 3 needs parentheses around print's argument list.
>> As for splitting into functions, consider:
>>
>>
>> import os
>>
>> # these four are capitalized because they're intended to be constant
>> TOKENS = "BE"
>> LINESTOSKIP = 43
>> INFILEEXT = ".xpm"
>> OUTFILEEXT = ".txt"
>>
>> def dofiles(topdirectory):
>>     for filename in os.listdir(topdirectory):
>>         processfile(filename)
>>
>> def processfile(infilename):
>>     base, ext = os.path.splitext(infilename)
>>     if ext == INFILEEXT:
>>         text = fetchonefiledata(infilename)
>>         numcolumns = len(text[0])
>>         results = {}
>>         for ch in TOKENS:
>>             results[ch] = [0] * numcolumns
>>         for line in text:
>>             line = line.strip()
>>             for col, ch in enumerate(line):
>>                 if ch in TOKENS:
>>                     results[ch][col] += 1
>>         writeonefiledata(base+OUTFILEEXT, results)
>>
>> def fetchonefiledata(inname):
>>     infile = open(inname)
>>     text = infile.readlines()
>>     return text[LINESTOSKIP:]
>>
>> def writeonefiledata(outname, results):
>>     outfile = open(outname, "w")
>>     ...process the results as appropriate...
>>     ....(since you didn't tell us how multiple tokens were to be displayed)
>>
>> if __name__ == "__main__":
>>     # or get the top directory from the sys.argv variable,
>>     # which is set from the command line.
>>     dofiles(".")
>>
>>
> You split the earlier version you suggested into 4 functions.
> A small question: why the name ext? Why is the splitext result also
> called ext here?
>
>
>
Try the following, perhaps in the interpreter:

mytuple = ("one thing", "Another thing")
base, extension = mytuple

Now look and see what base and extension have for values.

Previously we just needed the second element of the splitext return 
value.  This time we'll need both, so might as well put them in 
variables that have  useful names.
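
A minimal sketch of the same unpacking applied to os.path.splitext(). The file name "matrix.xpm" is made up for illustration only:

```python
import os.path

# "matrix.xpm" is a hypothetical file name, used only for illustration.
base, ext = os.path.splitext("matrix.xpm")
print(repr(base), repr(ext))

# Indexing picks out a single element of the returned tuple,
# which is all the earlier version of the script needed.
only_ext = os.path.splitext("matrix.xpm")[1]
print(repr(only_ext))
```

The names on the left of the assignment are arbitrary; base and ext are simply descriptive choices.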

>> Now this is totally untested.  I just typed it without even trying any of
>> it.
>
>
> import os.path
>
>
> TOKENS="E"
> LINESTOSKIP=0
> INFILEEXT=".xpm"
> OUTFILEEXT=".txt"
>
> def dofiles(topdirectory):
>      for filename in os.listdir(topdirectory):
>          processfile(filename)
>
> def processfile(infilename):
>      base, ext =os.path.splitext(infilename)
>      if ext == INFILEEXT:
>          text = fetchonefiledata(infilename)
>          numcolumns=len(text[0])
>          results={}
>          for ch in TOKENS:
>
>              results[ch] = [0]*numcolumns
>          for line in text:
>              line = line.strip()
>
>              for col, ch in enumerate(line):
>                  if ch in TOKENS:
>                      results[ch][col]+=1
>          writeonefiledata(base+OUTFILEEXT,results)
>
> def fetchonefiledata(inname):
>      infile = open(inname)
>      text = infile.readlines()
>      return text[LINESTOSKIP:]
>
> def writeonefiledata(outname,results):
>      outfile = open(outname,"w")
>      for item in results:
>          return outfile.write(item)
>
>
> if __name__=="__main__":
>      dofiles(".")
>
> The result is just a bit unexpected.
>
>   $ more try.txt
> E
>
> I might have made a mistake in the part of writeonefiledata that you left out.
>
I'd be amazed if there weren't at least a couple of typos in my 
message.  But this is where you sprinkle a couple of prints.  What did 
results look like when you print it out?

I hope you'll find that results is a dictionary, so you probably don't 
want to just write() its keys.  You probably want to write() its values 
instead, perhaps with a heading showing what key you're printing.
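
One possible shape for that, as a sketch only. The output layout (one line per key, key first, then the counts) is an assumption, since the desired display was never specified, and format_results() is a hypothetical helper name:

```python
def format_results(results):
    """Return one text line per key: the key as a heading, then its counts."""
    lines = []
    for key in sorted(results):
        counts = " ".join(str(n) for n in results[key])
        lines.append("%s: %s" % (key, counts))
    return "\n".join(lines) + "\n"

def writeonefiledata(outname, results):
    # The with block closes the file even if write() raises.
    with open(outname, "w") as outfile:
        outfile.write(format_results(results))

sample = {"E": [2, 1], "B": [0, 3]}   # made-up counts for illustration
print(format_results(sample))
```

Note that a return inside the for loop, as in the version quoted above, leaves the function after the first key, so any remaining keys would never be written.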

> But it gives you a simple refactoring that splits the logic so each can be
>> visualized (and tested) independently.  I'd also split up processfile(),
>> once I realized how big it was.
>>
>> There are many shortcuts that can be applied. Some of them probably use
>> language features you're not comfortable with, like perhaps generators.  And
>> if  efficiency is important, there are optimizations to do, like using
>> islice directly on the infile object.  That one would eliminate having to
>> have the whole file stored in memory at one time.
>>
>> Likewise there are further things that could be done to decouple the
>> functions even more.
>>
>> But there's nothing in the above code which uses very advanced topics, so
>> you should be able to understand it and fix whatever typos I've undoubtedly
>> got.
>>
>> What are you using for debugging aids?  Besides this group, I mean.  print
>> statements?  An IDE ?  which one?
>>
> Debugging aids?
> I just run python3 script.py
> and it pops up some hints.
> In between, I would probably try print.
>
Once the code is refactored into small enough independent functions, you 
can do things like write multiple versions of a given function, for 
debugging purposes.  For example, you could write a second version of 
fetchonefiledata() that ignores its argument and returns a fixed list 
of strings.  It might be

def fetchonefiledata(dummy):
     buf = """EEDC
AAAC
F145
CCCA
"""
     return buf.split()

and then you wouldn't be dependent on an actual file being available.

Naturally, at that point, your top-level code would call processfile() 
instead of dofiles().
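
A minimal sketch of how such a stub can exercise the counting logic without any file on disk. It assumes the counting loop has been pulled out of processfile() into a helper called countcolumns(), which is a hypothetical name and not part of the code above:

```python
TOKENS = "E"

def fetchonefiledata(dummy):
    # Stub: ignores its argument and returns fixed test lines.
    buf = """EEDC
AAAC
F145
CCCA
"""
    return buf.split()

def countcolumns(text):
    # Hypothetical helper: the counting part of processfile().
    numcolumns = len(text[0])
    results = {}
    for ch in TOKENS:
        results[ch] = [0] * numcolumns
    for line in text:
        for col, ch in enumerate(line.strip()):
            if ch in TOKENS:
                results[ch][col] += 1
    return results

results = countcolumns(fetchonefiledata("ignored.xpm"))
print(repr(results))   # {'E': [1, 1, 0, 0]}
```

Because the test data is a literal, it is easy to change a line and rerun to see how the counts move.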

And remember the repr() and type() functions when trying to see just 
what type of thing something is.
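
A short sketch of what those two show, using made-up sample values and assuming Python 3:

```python
results = {"E": [1, 1, 0, 0]}   # made-up sample data
line = "EEDC\n"

print(type(results))            # shows that results is a dict
print(repr(line))               # the trailing newline shows up as \n
print(repr(line.strip()))       # after strip() it is gone
```

repr() is handy for strings in particular, because whitespace and newlines that a plain print hides become visible.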
> Thanks for your time,
>

You're certainly welcome.

-- 

DaveA