Reading plain text file database tables

Li Dongfeng mavip5 at inet.polyu.edu.hk
Fri Sep 10 01:33:33 EDT 1999


I have come up with a rather ugly, slow, but
working RE solution for the problem. The two
functions below can read comma-delimited and
space-delimited tables. Quoting and backslash
escapes are handled. Python source below.
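(For anyone reading this with a current interpreter: here is a minimal sketch of the same idea in modern Python, using a hypothetical `split_row` helper. One regex finds each field of a comma-delimited row while respecting '...' and "..." quoting; unlike the full version below, it leaves backslash escapes in place rather than decoding them, and it drops a trailing empty field after a final comma.)

```python
import re

# One field per match: a double-quoted string, a single-quoted string,
# or a lazy run of non-comma characters, followed by a comma or end of line.
FIELD = re.compile(r"""\s*("(?:[^"\\]|\\.)*"|'(?:[^'\\]|\\.)*'|[^,]*?)\s*(?:,|$)""")

def split_row(line):
    parts = FIELD.findall(line)
    if parts and parts[-1] == "":
        parts.pop()  # drop the empty match produced by the final "$" alternative
    # strip the surrounding quotes, if any, without decoding escapes
    return [p[1:-1] if len(p) >= 2 and p[0] == p[-1] and p[0] in "\"'" else p
            for p in parts]

print(split_row('"Bond, James", 13, 45'))  # ['Bond, James', '13', '45']
```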

-----------------------------------------------------------
import re
import string

def cut_comma(lines):
    """This function is used to read comma-delimited files (.csv).

    Input: the rows of the data file in a list 'lines'.
    Output: a list of lists x, where x[i] is the i'th row of the input
            and x[i][j] is the j'th data item on the i'th row.

    The scanning respects quoting, e.g., the row
          "Bond, James", 13, 45
    will be recognized as three columns ["Bond, James", "13", "45"].
    Escape characters, as in
          'That\'s absolutely possible.', 1999, 8, 30
    are treated correctly.

    Shortcoming: very slow when reading large datasets. On a SUN Sparc 2,
    reading 10000 lines of input took 20 seconds. Suggestions are welcome.

    Author: Li Dongfeng, ldf at statms.stat.pku.edu.cn
    Last modified: 1999.9.10
    """
    r = re.compile(r"""
        \s*              # a data item can start with any number of spaces
        (?P<item>        # start of the data item we need to extract
          '' | ""        # empty string
         |
          '.*?[^\\]'     # anything enclosed in '...' (may contain commas);
                         # the closing "'" must have no '\' before it, and
                         # the first such ending "'" is taken
         |
          ".*?[^\\]"     # anything enclosed in "..." (may contain commas);
                         # the closing '"' must have no '\' before it
         |
          [^"',][^,]*?   # anything containing no comma and not starting
                         # with "'", '"', or a comma
         |
          "[^,]*?(?!.*?[^\\]")  # anything starting with '"', containing no
                                # comma, with no matching '"' after it
         |
          '[^,]*(?!.*?[^\\]')   # anything starting with "'", containing no
                                # comma, with no matching "'" after it
         |               # the item between two commas may also be empty
        )
        (?:\s*
         ,               # the item ends with a comma (spaces may precede it)
        )
        """, re.VERBOSE)
    r2 = re.compile(r"""^(['"]).*\1$""", re.MULTILINE)  # anything quoted
    def quote(match):
        return eval(match.group())
    if type(lines) is type(()) or type(lines) is type([]):
        # substitute each delimiting comma with a newline, respecting quoting
        x = map(lambda s, r=r: r.sub("\\g<item>\n", s), lines)
        # unquote everything that is quoted
        x = map(lambda s, r=r2, f=quote: r.sub(f, s), x)
        # split the rows
        x = map(lambda s: string.split(s, "\n"), x)
    else:
        x = r.sub("\\g<item>\n", lines)
        x = r2.sub(quote, x)
        x = string.split(x, "\n")
    return x

def cut_space(lines):
    """This function is used to read space-delimited files.

    Input: the rows of the data file in a list 'lines'.
    Output: a list of lists x, where x[i] is the i'th row of the input
            and x[i][j] is the j'th data item on the i'th row.

    Any amount of whitespace (spaces, tabs) can separate two data items.
    The scanning respects quoting, e.g., the row
          "James Bond"  13  45
    will be recognized as three columns ["James Bond", "13", "45"].
    Escape characters, as in
          'That\'s absolutely possible.'  1999  8  30
    are treated correctly.

    Shortcoming: very slow when reading large datasets. On a SUN Sparc 2,
    reading 10000 lines of input took 20 seconds. Suggestions are welcome.

    Author: Li Dongfeng, ldf at statms.stat.pku.edu.cn
    Last modified: 1999.9.10
    """
    r = re.compile(r"""
        (?P<item>        # start of the data item we need to extract
          '' | ""        # empty string
         |
          '.*?[^\\]'     # anything enclosed in '...' (may contain spaces);
                         # the closing "'" must have no '\' before it, and
                         # the first such ending "'" is taken
         |
          ".*?[^\\]"     # anything enclosed in "..." (may contain spaces);
                         # the closing '"' must have no '\' before it
         |
          [^"'\s][^\s]*?  # anything containing no space and not starting
                          # with "'", '"', or a space
         |
          "[^\s]*?(?!.*?[^\\]")  # anything starting with '"', containing no
                                 # space, with no matching '"' after it
         |
          '[^\s]*(?!.*?[^\\]')   # anything starting with "'", containing no
                                 # space, with no matching "'" after it
        )
        (?:\s+)          # the item ends with whitespace
        """, re.VERBOSE)
    r2 = re.compile(r"""^(['"]).*\1$""", re.MULTILINE)  # anything quoted
    def quote(match):
        return eval(match.group())
    if type(lines) is type(()) or type(lines) is type([]):
        # substitute the delimiting whitespace with newlines, respecting quoting
        x = map(lambda s, r=r: r.sub("\\g<item>\n", s), lines)
        # unquote everything that is quoted
        x = map(lambda s, r=r2, f=quote: r.sub(f, s), x)
        # split the rows
        x = map(lambda s: string.split(s, "\n"), x)
    else:
        x = r.sub("\\g<item>\n", lines)
        x = r2.sub(quote, x)
        x = string.split(x, "\n")
    return x

-------------------------------------------------------------------------
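As an aside for anyone reading this with a newer interpreter: the standard-library csv module (added in Python 2.3, well after this post) handles the comma-delimited case directly. A minimal sketch:

```python
import csv
import io

# Parse the docstring's example row; skipinitialspace=True drops the
# blanks that follow each comma. Note that csv expects doubled quotes
# ("") by default; pass doublequote=False and escapechar='\\' to get
# backslash-style escapes like the ones handled above.
data = '"Bond, James", 13, 45\n'
rows = list(csv.reader(io.StringIO(data), skipinitialspace=True))
print(rows)  # [['Bond, James', '13', '45']]
```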

Li Dongfeng wrote:
> 
> Do we have a module to read plain text file
> database tables?
> 
> All data management software, e.g. Excel,
> dBase, SAS, etc., supports reading/writing a table
> from/to a plain text file; fields can be separated
> by column position, by spaces, by tabs,
> by commas, etc.
> 
> How can we read this kind of file into a matrix-like
> structure (a list of lists)? I have written one for
> files with fixed-width fields. For delimited files,
> simply using string.split works most of the time, but
> it fails on lines like
> 
>   "Peter Thomson"  25  36
> 
> or even
> 
>   "Peter\" Thomson" 25 36
> 
> I think this is a common task, so maybe someone has
> already given a very good solution.
> 
> Li Dongfeng
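For the space-delimited case in the question, today's standard library also offers shlex.split (the shlex module existed in the 1.5-era stdlib, but the split function came later), which honors quoting and backslash escapes, so both problem lines above parse cleanly:

```python
import shlex

# shlex.split tokenizes on whitespace, honors "..."/'...' quoting, and
# decodes backslash escapes inside double quotes.
print(shlex.split('"Peter Thomson"  25  36'))   # ['Peter Thomson', '25', '36']
print(shlex.split(r'"Peter\" Thomson" 25 36'))  # ['Peter" Thomson', '25', '36']
```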



