[Tutor] Can't process all my files (need to close?)

Mon Sep 20 17:41:28 CEST 2010

On 20/09/2010 16:19, aeneas24 at priest.com wrote:
> My Python script needs to process 45,000 files, but it seems to blow
> up after about 10,000. Note that I'm outputting
> bazillions of rows to a csv, so that may be part of the issue.
>
> Here's the error I get (I'm running it through IDLE on Windows 7):
>
> Microsoft Visual C++ Runtime Library Runtime Error! Program:
> C:\Python26\pythonw.exe This application has requested the Runtime to
> terminate it in an unusual way.

OK. Just for starters (and to eliminate other possible side-issues)
can you run the same code directly from the command line using
c:\python26\python.exe?

i.e. just open cmd.exe, cd to the directory where your code is and type
(substituting the name of your script for mystuff.py):

c:\python26\python.exe mystuff.py

I assume that the same thing will happen, but if it doesn't then
that points the finger at IDLE or, possibly, at pythonw.exe. (And
might also give you a workaround).

> I think this might be because I don't specifically close the files
> I'm reading. Except that I'm not quite sure where to put the close.

On this note, it's worth learning about context managers. Or, rather,
the fact that files can be context-managed. That means that you can
open a file using the with ...: construct and it will automatically
close:

<code>

with open ("blah.csv", "wb") as f:
   f.write ("blah")

# at this point f will have been closed

</code>
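If you want to convince yourself that the file really is closed, here is
a minimal sketch. The write_blah helper and the temporary directory are
made up purely for illustration; nothing here comes from your script.

```python
import os
import shutil
import tempfile

def write_blah(path):
    # Open, write and (implicitly) close the file, then hand the file
    # object back so the caller can inspect its state afterwards.
    with open(path, "wb") as f:
        f.write(b"blah")
        assert not f.closed  # still open inside the with block
    return f

tmp_dir = tempfile.mkdtemp()
try:
    f = write_blah(os.path.join(tmp_dir, "blah.csv"))
    was_closed = f.closed  # the with block closed the file on exit
finally:
    shutil.rmtree(tmp_dir)
```

The same holds if the body of the with block raises an exception: the
file is closed before the exception propagates.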

> 1) During the self.string here:
>
> class ReviewFile:
>     # In our movie corpus, each movie is one text file. That means
>     # that each text file has some "info" about the movie (genre,
>     # director, name, etc), followed by a bunch of reviews. This
>     # class extracts the relevant information about the movie,
>     # which is then attached to review-specific information.
>     def __init__(self, filename):
>         self.filename = filename
>         self.string = codecs.open(filename, "r", "utf8").read()
>         self.info = self.get_fields(self.get_field(self.string, "info")[0])
>         review_strings = self.get_field(self.string, "review")
>         review_dicts = map(self.get_fields, review_strings)
>         self.reviews = map(Review, review_dicts)

So that could become:

<code>
with codecs.open (filename, "r", "utf8") as f:
   self.string = f.read ()

</code>

>
> 2) Maybe here?
>
> def reviewFile ( file, args):
>     for file in glob.iglob("*.txt"):
>         print "  Reviewing...." + file
>         rf = ReviewFile(file)

Here, with that many files, I'd strongly recommend using the
FindFilesIterator exposed in the win32file module of the pywin32
extensions. In Python 2.6, glob.iglob isn't as lazy as it looks: it
calls os.listdir on the directory (creating a moderately large
in-memory list of every name), then filters that list and yields the
matches. The FindFilesIterator actually calls the underlying Windows
code to iterate lazily over the files in the directory.
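Something along these lines, as a rough sketch rather than tested
production code. The iter_txt_files name is invented for illustration,
and the fallback to glob.iglob when pywin32 isn't installed is my own
addition so that the snippet still runs on other machines.

```python
import glob
import os
import shutil
import tempfile

def iter_txt_files(dirname):
    """Yield the full path of each *.txt entry in dirname, one at a time."""
    pattern = os.path.join(dirname, "*.txt")
    try:
        import win32file  # part of pywin32; Windows only
    except ImportError:
        # Fallback so the sketch still works without pywin32
        for path in glob.iglob(pattern):
            yield path
        return
    for find_data in win32file.FindFilesIterator(pattern):
        # Each item is a WIN32_FIND_DATA tuple; item 8 is the file name
        yield os.path.join(dirname, find_data[8])

# Try it out on a throwaway directory
tmp_dir = tempfile.mkdtemp()
try:
    for name in ("a.txt", "b.txt", "c.dat"):
        with open(os.path.join(tmp_dir, name), "w") as f:
            f.write("x")
    found = sorted(os.path.basename(p) for p in iter_txt_files(tmp_dir))
finally:
    shutil.rmtree(tmp_dir)
```

Because it's a generator, your loop starts work on the first file
without waiting for the whole directory to be listed.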

>
> 3) Or maybe here?
>
> def reviewDirectory ( args, dirname, filenames ):
>     print 'Directory',dirname
>     for fileName in filenames:
>         reviewFile( dirname+'/'+fileName, args )
>
> def main(top_level_dir,csv_out_file_name):
>     csv_out_file  = open(str(csv_out_file_name), "wb")
>     writer = csv.writer(csv_out_file, delimiter=',')
>     os.path.walk(top_level_dir, reviewDirectory, writer )
>
> main(".","output.csv")

Again, here, you might use:

<code>
with open (str (csv_out_file_name), "wb") as csv_out_file:
   writer = csv.writer (csv_out_file, delimiter=",")

</code>

The default delimiter is already "," so you shouldn't need to pass
that, and since csv_out_file_name arrives in main as the literal
"output.csv" you almost certainly don't need to convert it
explicitly to a string.
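Putting those points together, a minimal sketch might look like this.
The open_csv_for_writing and write_rows helpers and the sample rows are
hypothetical, not taken from your code. The version check is there only
because Python 2's csv module wants the file opened in "wb" mode, while
Python 3's wants text mode with newline="".

```python
import csv
import os
import shutil
import sys
import tempfile

def open_csv_for_writing(path):
    # Python 2's csv module expects a binary-mode file; Python 3's
    # expects text mode with newline="" so it controls line endings.
    if sys.version_info[0] >= 3:
        return open(path, "w", newline="")
    return open(path, "wb")

def write_rows(path, rows):
    with open_csv_for_writing(path) as csv_out_file:
        writer = csv.writer(csv_out_file)  # delimiter defaults to ","
        writer.writerows(rows)
    # csv_out_file has been flushed and closed by this point

tmp_dir = tempfile.mkdtemp()
try:
    out_path = os.path.join(tmp_dir, "output.csv")
    write_rows(out_path, [["title", "rating"], ["Some Movie", 4]])
    with open(out_path, "rb") as f:
        written = f.read()
finally:
    shutil.rmtree(tmp_dir)
```

In your script the writing would happen inside the with block, so
everything that uses the writer needs to run before the block ends.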

Note, also, that the os.walk (*not* os.path.walk) function is often
an easier fit since it iterates lazily over directories, yielding a
(dirpath, dirnames, filenames) 3-tuple which you can then act upon.
However, you may prefer the callback style of the older os.path.walk.
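For comparison, here is roughly how the os.walk version might look.
This is a sketch only: iter_review_paths is a made-up name, and the
".txt" filter is my assumption based on the glob pattern in your code.

```python
import os
import shutil
import tempfile

def iter_review_paths(top_level_dir):
    # os.walk yields one (dirpath, dirnames, filenames) tuple per directory
    for dirpath, dirnames, filenames in os.walk(top_level_dir):
        for filename in filenames:
            if filename.endswith(".txt"):
                yield os.path.join(dirpath, filename)

# Build a tiny throwaway tree to walk
top = tempfile.mkdtemp()
try:
    os.mkdir(os.path.join(top, "sub"))
    for relative in ("a.txt", "notes.dat", os.path.join("sub", "b.txt")):
        with open(os.path.join(top, relative), "w") as f:
            f.write("x")
    found = sorted(os.path.relpath(p, top) for p in iter_review_paths(top))
finally:
    shutil.rmtree(top)
```

Each path it yields could then be handed straight to ReviewFile, with no
callback function needed.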

TJG