[Python-Dev] MacPython and line-endings

Chris Barker chrishbarker@home.net
Tue, 18 Sep 2001 11:47:36 -0700


This is a multi-part message in MIME format.
--------------54C7C351FE079B794391A813
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

Jack Jansen wrote:

> - On input, unix line-endings are now acceptable for all text files. This
>   is an experimental feature (awaiting a general solution, for which a
>   PEP has been promised but not started yet, the guilty parties know who
>   they are:-), and it can be turned off with a preference.

Jack, 

I don't know if I qualify as one of the "guilty" parties, but I did
volunteer to help with a PEP about this, and I'd still like to. I do
have some ideas about what I'd like to see in that PEP.

The one thing I have done is write a prototype in pure Python for how I
would like platform neutral text files to work. I've enclosed it with
this message, and invite comments.

Has anyone started this PEP yet? If so, I'd like to help; if not, then
the following is a very early draft of my thoughts. Note that I am
writing this from memory, without going back to the archives to see
what all the comments were at the time. I will do that before I call
this a PEP.

Here are my quick thoughts:

This started (the recent thread, anyway) with the need for MacPython
(with the introduction of OS-X) to be able to read both traditional Mac
style text files and Unix style text files. An import hook was
suggested, but then it was brought up that a lot of Python code can be
read in other ways than an import, from execfile() and a whole lot of
others, so an import hook would not be enough. In general, the problem
stems from the fact that while Python knows what system it is running
on, a file that is being read may or may not be from that same system.
This is most egregious with OS-X, as you essentially have both Unix and
MacOS running on the same machine at the same time, often sharing a file
system. The issue also comes up with heterogeneous networks, where the
file might reside on a server running on a different system than Python,
and that file may be accessed by various systems. Some servers can do
line feed translation on the fly, but this is not universal or
foolproof.

In addition to Python code, many Python programs need to read and write
text files that are not in a native format, and the format may not be
known by the programmer when the code is written.

My proposed solution to these problems is to have a new type of file: a
"Universal" text file. This would be a text file that would do line-feed
translation to the internal representation on the fly as the file was
being read (like the current text file type), but it would translate any
of the known text file formats automatically (\r\n, \r, \n; any
others?). When the file was being written to, a single terminator
would have to be specified, defaulting to the native one or, in the case
of a file opened for appending, perhaps the one in the file when it is
opened. The user could specify a non-native terminator when opening a
file for writing.
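
As a minimal sketch of the translation described above (this is not the
enclosed prototype; it assumes a current Python 3 and works on bytes),
the read side maps all three endings to "\n" and the write side maps
"\n" to one chosen terminator. The names to_internal and to_external
are made up for illustration:

```python
import os

def to_internal(data: bytes) -> bytes:
    # Replace the two-byte DOS ending first, so that its carriage
    # return is not translated a second time as a Mac ending.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")

def to_external(data: bytes,
                terminator: bytes = os.linesep.encode("ascii")) -> bytes:
    # Write side: exactly one terminator, defaulting to the native one.
    return data.replace(b"\n", terminator)

# A file with mixed endings reads as three ordinary lines.
mixed = b"dos line\r\nmac line\runix line\n"
internal = to_internal(mixed)
assert internal.split(b"\n")[:3] == [b"dos line", b"mac line", b"unix line"]

# Writing it back out with an explicit, non-native terminator.
as_dos = to_external(internal, b"\r\n")
```

The order of the two replace() calls on the read side matters: doing
the "\r" replacement first would turn every DOS ending into two line
breaks.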

Issues:

The two big issues that came up in the discussion were backward
compatibility and performance:

1) The python open() function currently defaults to a text file type.
However, on Posix systems, there is no difference between a text file
and a binary file, so many programmers writing code that is designed to
run only on such systems left the "b" flag off when opening files for
binary reading and writing. If the behaviour of a file opened without
the binary flag were to change, a lot of code would break. 

2) In recent versions of Python, a lot of effort was put into improving
performance of line oriented text file reading. These optimisations
require the use of native line endings. In order to get similar
performance with non-native endings, some portions of the C stdio
library would have to be re-written. This is a major undertaking, and no
one has stepped up to volunteer.
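
To make issue 1 concrete, here is a small illustration (Python 3, with
invented data) of what universal translation would do to binary data
read through a file that was opened without the "b" flag. The translate
helper and the packed record are assumptions for this example only:

```python
import struct

def translate(data: bytes) -> bytes:
    # The read-side translation a universal text file would apply.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")

# Four little-endian 16-bit integers; several of their bytes happen to
# have the values 13 (carriage return) and 10 (line feed).
record = struct.pack("<4H", 13, 10, 0x0A0D, 7)

damaged = translate(record)

# One byte has been dropped and others changed, so the record can no
# longer be unpacked with the original format.
assert len(record) == 8
assert len(damaged) == 7
```

This is why the behaviour of a file opened without any new flag has to
stay exactly as it is today.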

The proposed solution to both of these problems is to introduce a new
flag to the open() function: "t". If the "t" flag is present, the
function returns a Universal Text File, rather than a standard text
file. As this is a new flag, no old code should be broken. The default
would return a standard text file with the current behaviour. This would
allow the implementation to be written in a way that was robust, but
perhaps without optimum performance. If performance were critical, a
programmer could always use the old style text file. If, at some point,
code is written that allows the performance of Universal Text Files to
approach that of standard text files, perhaps the two could be merged.
It is unfortunate that the default would be the performance-optimised
but less generally useful case, but that is a reasonable price to be
paid for backward compatibility. Perhaps the default could be changed at
some point in the future when other incompatibilities are introduced
(Python 3?).

In the case of Python code being read, performance of the file read is
unlikely to be critical to the performance of the application as a
whole.


Issues / questions:

Some systems (VMS?) store text files in the file system as a series of
lines, rather than just a string of bytes like most common systems
today. It would take a little more code to accommodate this, but it could
be done.

Should a file being read be required to have a single line termination
type, or could they be mixed and matched? The prototype code allows mix
and match, but I'm not married to that idea. If it requires a single
terminator, then some performance could be gained by checking the
terminator type when opening the file, and using the existing native
text file code when it is a native file.
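
A rough sketch of that check, assuming Python 3 and assuming that the
first terminator found in a leading sample is representative of the
whole file (which is exactly the single-terminator requirement in
question). The function names and the sample size are hypothetical:

```python
import os

def sniff_line_ending(path, sample_size=4096):
    # Return the first terminator found in the sample as bytes, or
    # None if the sample contains no line ending at all.
    with open(path, "rb") as f:
        sample = f.read(sample_size)
        cr = sample.find(b"\r")
        lf = sample.find(b"\n")
        if cr < 0 and lf < 0:
            return None
        if cr >= 0 and (lf < 0 or cr < lf):
            # A carriage return comes first: DOS if a line feed follows
            # it, otherwise Mac. If the carriage return is the last
            # byte of the sample, read one more byte to decide.
            following = sample[cr + 1:cr + 2] or f.read(1)
            if following == b"\n":
                return b"\r\n"
            return b"\r"
        return b"\n"

def uses_native_endings(path):
    # True when the existing native text file code could be used as is.
    return sniff_line_ending(path) == os.linesep.encode("ascii")
```

A file with no terminator in the sample reports None, so a caller would
still need a fallback, for example the slower mix and match reader.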


Other issues?


I'd love to hear all your feedback on this write-up, as well as my code.
Please CC either me or the MacPython list, as I'm not subscribed to
python-dev.


-Chris





-- 
Christopher Barker,
Ph.D.                                                           
ChrisHBarker@home.net                 ---           ---           ---
http://members.home.net/barkerlohmann ---@@       -----@@       -----@@
                                   ------@@@     ------@@@     ------@@@
Oil Spill Modeling                ------   @    ------   @   ------   @
Water Resources Engineering       -------      ---------     --------    
Coastal and Fluvial Hydrodynamics --------------------------------------
------------------------------------------------------------------------
--------------54C7C351FE079B794391A813
Content-Type: text/plain; charset=us-ascii;
 name="TextFile.py"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
 filename="TextFile.py"

#!/usr/bin/env python

"""

TextFile.py : a module that provides a UniversalTextFile class, and a
replacement for the native python "open" command that provides an
interface to that class.

It would usually be used as:

from TextFile import open

then you can use the new open just like the old one (with some added flags and arguments)

or

import TextFile

file = TextFile.open(filename,flags,[bufsize], [LineEndingType], [LineBufferSize])

please send bug reports, helpful hints,  and/or feature requests to:

Chris Barker

ChrisHBarker@home.net


Copyright/licence is the same as whatever version of python you are running.

"""
import os

## Re-map the open function
_OrigOpen = open

def open(filename,flags = "r",bufsize = -1, LineEndingType = "native", LineBufferSize = ""):
    """
    
    A new open function, that returns a regular python file object for
    the old calls, and returns a new nifty universal text file when
    required.

    This works just like the regular open command, except that a new
    flag and new parameters have been added.

    The new flag is "t" which indicates that the file to be opened is a
    universal text file. While the standard open() function defaults to
    a text file, on Posix systems there is no difference between a text
    file and a binary file, so there is a lot of code out there that opens
    files as text when a binary file is really required. This code
    currently works just fine on Posix systems, so it was necessary to
    introduce a new flag to maintain backward compatibility. The old
    style, line ending dependent text file will also provide better
    performance.
    

    To Call:

    file = open(filename, flags, bufsize, LineEndingType, LineBufferSize)

    - filename is the name of the file to be opened
    - flags is a string of one letter flags, the same as the standard open
      command, plus a "t" for universal text file.
    - - "b" means binary file, this returns the standard binary file object
    - - "t" means universal text file
    - - "r" for read only
    - - "w" for write. If there is both "w" and "t" then the user can
        specify a line ending type to be used with the LineEndingType
        parameter.
    - - "a" means append to existing file

    - bufsize specifies the buffer size to be used by the system. Same
      as the regular open function

    - LineEndingType is used only for writing (and appending) files, to specify a
      non-native line ending to be written.
    - - The options are: "native", "DOS", "Posix", "Unix", "Mac", or the
        characters themselves( "\r\n", etc. ). "native" will result in
        using the standard file object, which uses whatever is native
        for the system that python is running on.

    - LineBufferSize is the size of the buffer used to read data in
    a readline() operation. The default is currently set to 100
    characters. If you will be reading files with many lines over 100
    characters long, you should set this number to the largest expected
    line length.

    NOTE: I'm sure the flag checking could be more robust.
    
    """

    if "t" in flags: # this is a universal text file
        if ("w" in flags) and (not "w+" in flags) and LineEndingType == "native":
            return _OrigOpen(filename,flags.replace("t",""), bufsize)
        return UniversalTextFile(filename,flags,LineEndingType,LineBufferSize)
    else: # this is a regular old file
        return _OrigOpen(filename,flags,bufsize)
    
    
class UniversalTextFile:
    """
    
    A class that acts just like a python file object, but has a mode
    that allows the reading of arbitrarily formatted text files, i.e. with
    either Unix, DOS or Mac line endings. [\n , \r\n, or \r]

    To keep it truly universal, it checks for each of these line ending
    possibilities at every line, so it should work on a file with mixed
    endings as well.

    """
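    def __init__(self,filename,flags = "",LineEndingType = "native",LineBufferSize = ""):
        # always open the underlying file in binary mode, so that the
        # system does no line ending translation of its own
        mode = flags.replace("t","")
        if not mode or mode[0] not in "rwa":
            mode = "r" + mode # no access mode given: read, like the regular open
        self._file = _OrigOpen(filename,mode+"b")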

        LineEndingType = LineEndingType.lower()
        if LineEndingType == "native":
            self.LineSep = os.linesep # os.linesep is a string, not a function
        elif LineEndingType == "dos":
            self.LineSep = "\r\n"
        elif LineEndingType == "posix" or LineEndingType == "unix" :
            self.LineSep = "\n"
        elif LineEndingType == "mac":
            self.LineSep = "\r"
        else:
            self.LineSep = LineEndingType
        
        ## some attributes
        self.closed = 0
        self.mode = flags
        self.softspace = 0
        if LineBufferSize:
            self._BufferSize = LineBufferSize
        else:
            self._BufferSize = 100

    def readline(self):
        start_pos = self._file.tell()
        ##print "Current file position is:", start_pos
        line = ""
        TotalBytes = 0
        Buffer = self._file.read(self._BufferSize)
        while Buffer:
            ##print "Buffer = ",repr(Buffer)
            newline_pos = Buffer.find("\n")
            return_pos  = Buffer.find("\r")
            if return_pos == newline_pos-1 and return_pos >= 0: # we have a DOS line
                line = Buffer[:return_pos]+ "\n"
                TotalBytes = newline_pos+1
                break
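            elif ((return_pos < newline_pos) or newline_pos < 0 ) and return_pos >=0: # we have a Mac line
                line = Buffer[:return_pos]+ "\n"
                TotalBytes = return_pos+1
                # a "\r" at the very end of the buffer may be the first half
                # of a DOS "\r\n" pair split across two reads, so peek ahead
                if return_pos == len(Buffer)-1 and self._file.read(1) == "\n":
                    TotalBytes = TotalBytes+1
                break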
            elif newline_pos >= 0: # we have a Posix line
                line = Buffer[:newline_pos]+ "\n"
                TotalBytes = newline_pos+1
                break
            else: # we need a larger buffer
                NewBuffer = self._file.read(self._BufferSize)
                if NewBuffer:
                    Buffer = Buffer + NewBuffer
                else: # we are at the end of the file, without a line ending.
                    self._file.seek(start_pos + len(Buffer))
                    return Buffer

        self._file.seek(start_pos + TotalBytes)
        return line

    def readlines(self,sizehint = None):
        """

        readlines acts like the regular readlines, except that it
        understands any of the standard text file line endings ("\r\n",
        "\n", "\r").

        If sizehint is used, it will read a maximum of that many
        bytes. It will never round up, as the regular readlines sometimes
        does. This means that if sizehint is less than the
        length of the next line, you'll get an empty list, which could
        incorrectly be interpreted as the end of the file.

        """
        
        if sizehint:
            Data = self._file.read(sizehint)
        else:
            Data = self._file.read()

        if len(Data) == sizehint:
            #print "The buffer is full"
            FullBuffer = 1
        else:
            FullBuffer = 0
        Data = Data.replace("\r\n","\n").replace("\r","\n")
        Lines = [line + "\n" for line in Data.split('\n')]
        ## If the last line is only a linefeed it is an extra line
        if Lines[-1] == "\n":
            del Lines[-1]
        ## if it isn't then the last line didn't have a linefeed, so we need to remove the one we put on.
        else:
            ## or it's the end of the buffer
            if FullBuffer:
                self._file.seek(-(len(Lines[-1])-1),1) # reset the file position
                del(Lines[-1])
            else:
                Lines[-1] = Lines[-1][:-1]
        return Lines

    def readnumlines(self,NumLines = 1):
        """

        readnumlines is an extension to the standard file object. It
        returns a list containing the number of lines that are
        requested. I have found this to be very useful, and allows me
        to avoid the many loops like:

        lines = []
        for i in range(N):
            lines.append(file.readline())

        Also, if I ever get around to writing this in C, it will provide a speed improvement.

        """
        Lines = []
        while len(Lines) < NumLines:
            Lines.append(self.readline())
        return Lines

    def read(self,size = None):
        """
     
        read acts like the regular read, except that it translates any of
        the standard text file line endings ("\r\n", "\n", "\r") into a
        "\n"
        
        If size is used, it will read a maximum of that many bytes,
        before translation. This means that if the line endings have
        more than one character, the size returned will be smaller. This
        could be fixed, but it didn't seem worth it. If you want that
        much control, use a binary file.
      
        """
        
        if size:
            Data = self._file.read(size)
        else:
            Data = self._file.read()
            
        return Data.replace("\r\n","\n").replace("\r","\n")
    
    def write(self,string):
        """

        write is just like the regular one, except that it uses the line
          separator specified when the file was opened for writing or
          appending.


        """
        self._file.write(string.replace("\n",self.LineSep))

    def writelines(self,list):
        for line in list:
            self.write(line)
        

    # The rest of the standard file methods mapped
    def close(self):
        self._file.close()
        self.closed = 1
    def flush(self):
        self._file.flush()
    def fileno(self):
        return self._file.fileno()
    def seek(self,offset,whence = 0):
        self._file.seek(offset,whence)
    def tell(self):
        return self._file.tell()
    

--------------54C7C351FE079B794391A813--