how do I express this perl idiom in python?

Thu Aug 3 04:37:07 EDT 2000

"Michael Vanier" <mvanier at endor.bbb.caltech.edu> wrote in message
news:jl7l9z9s1z.fsf at endor.bbb.caltech.edu...
>
> Hi,
>
> Perl allows you to redefine the end-of-line marker (using the special
> variable $/, which is a newline by default) into an arbitrary string, so
> that you can grab everything up to and including that string.  For
> instance:
>
>     $/ = "foobar";
>     $line = <FILE>;
>
> will put all the file characters from the beginning of the file until the
> string "foobar" is reached into the variable "$line".  The "foobar" string
> is also put into "$line", I believe (correct me if I'm wrong; I'm not a perl

Yes, this is basically right, I think (i.e., it matches my recollection of
how Perl works, and I used to be quite experienced in Perl).

> programmer).  This can be useful when parsing HTML or XML files, for
> instance.

No, it would be a disaster to use it this way.  "Parsers" written on such
a basis would routinely trip over quoted strings that happen to contain
what looks like the closing tag they're looking for, etc.

> Is there some not-too-painful way to do this in python, or in a python
> module?  I know about the HTMLParser class, but I want something more
> general.

If the file you want to read this way fits into memory, it's close to a
snap.  Just wrap the 'real' file object in a class which implements a
readline method based on a settable string attribute, 'line_end'.  Only
for very large files, which might not fit into memory (or fit too tightly
for comfort), would you need to go to a more elaborate scheme based on
buffering chunks of the file at a time (slightly painful).

If the size of the file is not problematic...:

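class PerlishFile:
    def __init__(self, realfile, end='\n'):
        # slurp the whole file; fine as long as it fits in memory
        self.contents = realfile.read()
        self.line_end = end
    def readline(self):
        # split off everything up to the first line_end marker
        result_list = self.contents.split(self.line_end, 1)
        if len(result_list) == 1:
            # no marker left: return the remainder ('' at end of file)
            self.contents = ''
            return result_list[0]
        else:
            self.contents = result_list[1]
            return result_list[0] + self.line_end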

A stylistic alternative is to treat the end-of-file condition
as an exception:

    def readline(self):
        try:
            # unpacking raises ValueError when no marker is left to split on
            result, self.contents = self.contents.split(self.line_end, 1)
            return result + self.line_end
        except ValueError:
            result = self.contents
            self.contents = ''
            return result

If you also want readlines(), you'll have to implement that, too:
either call readline in a loop, which is simplest, or split
without the ,1 constraint at the end.  The latter is no doubt
faster, but you will then have to loop appending the line_end
string to each item in the result, ensure that line_end is or is
not appended to the last item depending on whether the file ends
that way, etc; I'd take the simpler, lazy loop way:-).
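
For what it's worth, the lazy loop might look roughly like this.  It's
only a sketch: it repeats the whole-file class so that it runs on its
own, uses io.StringIO as a stand-in for a real file, and assumes
line_end is a non-empty string.

```python
import io

class PerlishFile:
    def __init__(self, realfile, end='\n'):
        self.contents = realfile.read()
        self.line_end = end

    def readline(self):
        result_list = self.contents.split(self.line_end, 1)
        if len(result_list) == 1:
            self.contents = ''
            return result_list[0]
        self.contents = result_list[1]
        return result_list[0] + self.line_end

    def readlines(self):
        # Call readline until it returns '' (nothing left to read).
        lines = []
        while True:
            line = self.readline()
            if not line:
                break
            lines.append(line)
        return lines

# io.StringIO stands in for an open file here.
p = PerlishFile(io.StringIO('a;b;c'), ';')
print(p.readlines())   # ['a;', 'b;', 'c']
```

Note that the last item only carries the marker if the file actually
ends with it, which is exactly the bookkeeping the loop gives you for
free.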

Similarly, you can easily enhance __init__ to take either an
already-open file -- which is what I did here, but it does mean
you have to do:
    p=PerlishFile(open('foo.html'),'>')
-- or a string naming the file, in which case __init__ opens the
file itself.  Etc, etc.  Still, I would not call this too painful.
Incidentally, it's reasonably easy to generalize this to regular
expressions rather than strings as line-end markers, but I'll
leave that as an exercise...:-).
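
One possible take on that exercise, as a sketch only.  The class name
PerlishRegexFile is made up for illustration, the marker is assumed to
be a pattern for the standard re module, and __init__ accepts either a
filename or an already-open file as discussed above.

```python
import io
import re

class PerlishRegexFile:
    def __init__(self, source, end=r'\n'):
        # Accept either a filename (a str) or an already-open file object.
        if isinstance(source, str):
            with open(source) as f:
                self.contents = f.read()
        else:
            self.contents = source.read()
        self.line_end = re.compile(end)

    def readline(self):
        match = self.line_end.search(self.contents)
        if match is None or match.end() == 0:
            # No marker left, or the pattern matched nothing at the very
            # start: return the remainder so callers cannot loop forever.
            result, self.contents = self.contents, ''
            return result
        # Include the matched marker text in the returned "line".
        result = self.contents[:match.end()]
        self.contents = self.contents[match.end():]
        return result

# Records end with a semicolon plus any trailing whitespace.
p = PerlishRegexFile(io.StringIO('a;  b;c'), r';\s*')
print(repr(p.readline()))   # 'a;  '
print(repr(p.readline()))   # 'b;'
print(repr(p.readline()))   # 'c'
```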

A very slight measure of pain enters the picture if you do need
to handle truly huge files too.  In this case, you'll have to read
by chunks, etc.  But, ask if you do have that need, and I'm sure
somebody will give it a try!-)
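
In case it helps, here is a rough sketch of the read-by-chunks
approach.  The class name ChunkedPerlishFile and the chunk_size
parameter are my own inventions for illustration, and line_end is
assumed to be a non-empty string.

```python
import io

class ChunkedPerlishFile:
    def __init__(self, realfile, end='\n', chunk_size=8192):
        if not end:
            raise ValueError('line_end must be a non-empty string')
        self.file = realfile
        self.line_end = end
        self.chunk_size = chunk_size
        self.buffer = ''          # text read from the file but not yet returned

    def readline(self):
        search_from = 0
        while True:
            index = self.buffer.find(self.line_end, search_from)
            if index >= 0:
                cut = index + len(self.line_end)
                result, self.buffer = self.buffer[:cut], self.buffer[cut:]
                return result
            # The marker may straddle two chunks, so the next search starts
            # just early enough to catch a marker begun in the old buffer.
            search_from = max(0, len(self.buffer) - len(self.line_end) + 1)
            chunk = self.file.read(self.chunk_size)
            if not chunk:
                # End of file: return whatever is left ('' when exhausted).
                result, self.buffer = self.buffer, ''
                return result
            self.buffer += chunk

# A tiny chunk size forces the marker to straddle chunk boundaries.
p = ChunkedPerlishFile(io.StringIO('xxfoobaryyfoobarzz'), 'foobar', chunk_size=2)
print(p.readline())   # xxfoobar
print(p.readline())   # yyfoobar
print(p.readline())   # zz
```

Memory use is then bounded by the longest single "line" plus one
chunk, rather than by the size of the whole file; a file that never
contains the marker still ends up entirely in the buffer.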

Do be sure not to use this for HTML parsing or XML parsing
or anything like that, though... you really wouldn't want quoted
strings etc. to trip you up as easily as this!-)
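
To make the pitfall concrete, here is a small illustration.  The HTML
snippet is invented for the example, and the contrast uses the
html.parser module from the Python 3 standard library rather than the
HTMLParser class mentioned in the question.

```python
from html.parser import HTMLParser

html = '<a title="x > y" href="/home">home</a>'

# Naive approach: read up to and including the first '>'.
marker = '>'
first_chunk = html.split(marker, 1)[0] + marker
print(first_chunk)   # <a title="x >   (cut off inside the quoted attribute)

# A real parser knows that the '>' inside the quotes does not end the tag.
class TagCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append((tag, attrs))

collector = TagCollector()
collector.feed(html)
collector.close()
print(collector.start_tags)
```

The naive chunk stops in the middle of the title attribute, while the
parser reports a single 'a' tag with both attributes intact.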

Alex