Request for Enhancement

Alex Martelli aleaxit at yahoo.com
Thu Aug 31 03:56:22 EDT 2000


"Samuel A. Falvo II" <kc5tja at garnet.armored.net> wrote in message
news:slrn8qrdnh.g2d.kc5tja at garnet.armored.net...
> I have need to process very large text files in Python, but I don't
> have any idea how long the files are going to be in real-world
> situations.  It is
> unfortunate that there is no F.eof() function, where F is a Python file.

It's not necessarily unfortunate, because eof's semantics are
ambiguous.  You appear to desire it to be *predictive*: "if I
try reading more stuff, will that work?".  In many popular
languages, eof is, instead, *historical*: "have I already tried
to read more stuff, and already found out there WAS no more?"

Thus, eof tests are often semantic bugs in C/C++ programs
written by people not fully conversant with this fine point;
conversely, languages that specify a predictive eof (besides
its slight inefficiency) will confuse people with some C/C++
background (it may be unfortunate, but it's a reality, that
many programmers are exposed to C and/or C++ early on).
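
In Python's own terms, a plain file object gives you only the
historical flavor: you find out about end-of-file from the read
that comes back empty, never in advance.  A minimal sketch (the
file name is just illustrative):

    F = open("data.txt")        # any text file
    line = F.readline()
    # only NOW, after trying, do we know: readline() returns ''
    # (the empty string) exactly when the file is exhausted
    hit_eof = (line == '')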

On a slightly more abstract basis, you appear to desire a
paradigm -- in pseudo-code:
    loop:
        if there_is_more_work_to_do():
            work_specification = get_more_work_data()
            do_some_more_work(work_specification)
        else:
            "terminate the loop"
which can be construed as a sub-case of the "getting
permission" pattern; what Python, C, C++, etc. offer is
a very slightly different paradigm:
    loop:
        work_specification = get_more_work_data()
        if this_spec_is_valid(work_specification):
            do_some_more_work(work_specification)
        else:
            "terminate the loop"

I hope that putting the two approaches in these very
similar forms illuminates the similarity as well as
the difference.  This second tack can be seen as a
case of the "getting forgiveness" pattern: you ask
for a "work_specification" token anyway, but are quite
ready to receive as a result a token that means "no
more work to do"; in the first approach, you want to
FIRST check there actually IS more work to do, then,
get that work_specification token only if you know
in advance it's valid.
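
Rendered as actual Python over a simple list of pending work
items, the two loops might look like this (a minimal sketch;
the pending list, process(), and the None token are
illustrative stand-ins):

    def process(spec):
        pass                        # stand-in for the real per-item work

    # "get permission": check first, fetch only if the check passes
    pending = ["job1", "job2", "job3"]
    while pending:                  # there_is_more_work_to_do()
        spec = pending.pop(0)       # get_more_work_data()
        process(spec)               # do_some_more_work(spec)

    # "get forgiveness": fetch anyway, then test what you got
    def get_more_work_data(queue):
        if queue:
            return queue.pop(0)
        return None                 # the token meaning "no more work"

    pending = ["job1", "job2", "job3"]
    while 1:
        spec = get_more_work_data(pending)
        if spec is None:            # this_spec_is_valid() failed
            break
        process(spec)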

"It's easier to get forgiveness than permission" is
a mantra to remember (Admiral Murray Grace Hopper,
perhaps best known as the inventor of Cobol, coined
it, referring mostly to dealing with bureaucracy).

The 'get forgiveness' pattern is also more general.
Suppose this were a multi-threaded case, for example:
several worker threads obtaining work specs from a
queue.  Here, the 'get permission' approach runs into
trouble -- what stops two threads from interleaving
their get-permission and obtain-work-spec requests, so
that both become convinced there is work to do... but
there is in fact only work enough for one?  In the
'get forgiveness' approach, get_more_work_data() can
be *atomic* -- the different threads obtain different
work-spec tokens, and if there's only one more token
of work to do in all, there's no problem in one thread
getting it while the other gets the token that encodes
"that's it, no more work".
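
Here is one way the thread-safe variant might look, with the
standard threading and Queue modules (a sketch under those
assumptions; the worker count, the job names, and the use of
None as the "no more work" token are all illustrative):

    import threading, Queue

    NO_MORE_WORK = None                    # token that encodes "that's it"

    def do_some_more_work(spec):
        pass                               # stand-in for real per-item work

    def worker(work_queue):
        while 1:
            # .get() is atomic: two threads can never obtain the same token
            spec = work_queue.get()
            if spec is NO_MORE_WORK:
                break
            do_some_more_work(spec)

    work_queue = Queue.Queue()
    for spec in ["job1", "job2", "job3"]:  # illustrative work items
        work_queue.put(spec)

    threads = []
    for i in range(2):
        t = threading.Thread(target=worker, args=(work_queue,))
        threads.append(t)
        t.start()
    for i in range(2):
        work_queue.put(NO_MORE_WORK)       # one termination token per thread
    for t in threads:
        t.join()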


Beyond this, it may be an issue of syntax-sugar or
of handy encapsulation of stuff into objects.  But
the choice of pattern is more fundamental.


> Here's what I *want*:
>
> while not F.eof():
>     l = F.readline()
>     ...process line...

You THINK you want that, maybe because you come from
languages where this is how it's done (so you want the
predictive eof).  If you came from C/C++, you would
probably think you want assignment-as-expression:

    while l = F.readline():
        process(l)

which, in those languages, is usable syntax sugar for
the get-forgiveness approach.
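
Python deliberately refuses assignment-as-expression, so the
closest spelling of that loop is the familiar test-and-break
idiom (a sketch; F is any already-opened file, and process()
again stands in for the real per-line work):

    while 1:
        l = F.readline()
        if not l:               # readline() returns '' only at eof
            break
        process(l)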


> As it is, I have to do the following:
>
> l_list = F.readlines()  # note plural
> for line in l_list:
>     ... process line ...
>
> While this is fine for my test cases, it could consume unacceptable
> amounts of memory when fed large text files.

If the input file can indeed be many hundreds of megabytes,
more than the available memory, this is indeed not an
acceptable approach.  Note that processing files as huge
as this is rather exceptional; browse through the several
gigabytes of disk on any typical machine and count how
many TEXT files it holds that are larger than physical
memory (prediction: none; I suspect a count of 0 or 1
will cover well over 99% of existing machines).

Still, it does happen, and it's well to be prepared even
for such exceptional contingencies.  Therefore, you don't
HAVE TO use .readlines() in Python.  If you start with
one or more file*names*, then:

import fileinput

for line in fileinput.input(filename):
    process(line)

is surely best.  Unfortunately, fileinput does not accept
already-opened file objects (maybe it should), so you get
to roll your own object encapsulation of your preferred
paradigm (maybe the standard Python library should provide
a convenience module with a few such design patterns
pre-cooked in it).

Perhaps most Pythonic is a sequence-protocol object, i.e.,
very minimally:

class lines_of:
    def __init__(self, fileobject):
        self.file = fileobject
    def __getitem__(self, index):
        # the index is ignored: each call just yields the next line
        line = self.file.readline()
        if not line: raise IndexError    # tells for...in "no more items"
        return line

and now:

for line in lines_of(F):
    process(line)

The __getitem__ is invoked by the for...in statement,
repeatedly, until an IndexError is raised, which means
"no more items", and is caught by the statement, which,
as a consequence, stops looping.

Getting the eof-test working does require more effort,
e.g. (untested code):

class file_with_eof:
    def __init__(self, fileobject):
        self.file = fileobject
        self.buff = None          # holds one read-ahead line, if any
    def eof(self):
        if self.buff is None:
            # "predict" by reading ahead and buffering what we get
            self.buff = self.file.readline()
        return self.buff == ''    # readline() returns '' only at eof
    def readline(self):
        if self.buff is None:
            result = self.file.readline()
        else:
            # hand out the buffered read-ahead line, emptying the buffer
            result = self.buff
            self.buff = None
        return result

And now you can write:

F = file_with_eof(F)

while not F.eof():
    process(F.readline())


Is it worth writing a 15-line wrapper class, in order
to implement a somewhat inferior paradigm, so that you
can express your application logic in 3 lines?

Or is it better to write a 7-line wrapper class, using
a slightly better paradigm, to express your application
logic in 2 lines?

Your call.  Personally, I much prefer the latter way
of doing things (particularly as I might well have some
subtle bug in the slightly-subtler buffer/unbuffer logic
of the former way).  But, if you're really keen on being
able to test "predictively" for eof, you'll have to
simulate the "prediction" some way or other, so, the
file_with_eof approach may be your best bet.

I *do* think it interesting that Python lets you express
one or the other approach (and others yet) so cleanly,
clearly, and rather concisely.


Alex
