Converting a text data file from positional to tab delimited.

Alex Martelli aleaxit at yahoo.com
Tue Mar 13 10:40:48 EST 2001


"Lee Joramo" <lee at joramo.com> wrote in message
news:B6D37840.2124%lee at joramo.com...
> I am looking for suggestions to speed up the process of converting a large
> text data file from 'positional' layout to tab delimited. The data file is
> over 200MB in size containing over 40,000 lines which have over 600
> fields.
>
> I suspect that the 'for' loop that splits each line into tab delimited,
> could be optimized. Perhaps it could be replaced with a regex or other
> technique.
    [snip]
> layout = [
>     ['STUDY', 0, 7],
>     ['MDLNO', 8, 12],
    [snip]
>         for field in layout:
>             #
>             #can this loop be improved??
>             #
>             fieldValue = line[field[1]:field[2]]

First of all, there seems to be a serious bug here: it looks like the
layout has a pair of numbers, the upper one of which is meant to be
*included*, but the a:b slice notation *excludes* b -- so, you most
likely want line[field[1]:field[2]+1].
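
To make the off-by-one concrete, here is a tiny sketch.  The sample
record is made up for illustration; the two layout entries are the ones
quoted above, and I am assuming their upper bounds are meant to be
inclusive.

```python
# Hypothetical record: STUDY occupies columns 0..7, MDLNO columns 8..12.
line = "ABCD1234M0042"
layout = [['STUDY', 0, 7], ['MDLNO', 8, 12]]

name, lower, upper = layout[0]
exclusive = line[lower:upper]      # stops before index `upper`: one character short
inclusive = line[lower:upper + 1]  # includes the character at index `upper`
```

With this record the first slice loses the last character of the field,
while the second returns all eight.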

>             delimitedLine = delimitedLine + delimit + fieldValue
>             delimit = "\t"
>         outFile.write(delimitedLine+"\n")

A small optimization is to use a _list_ of pieces (strings) and
output them with a single .writelines call (which does not just
write _lines_, but arbitrary strings).
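
A quick way to convince yourself that .writelines adds nothing between
the pieces (no newlines, no separators) is the sketch below.  It uses
io.StringIO as a stand-in for outFile, which is my assumption for the
sake of a self-contained example.

```python
import io

out = io.StringIO()                     # stand-in for the real output file
out.writelines(["a", "\t", "b", "\n"])  # pieces are written back to back
result = out.getvalue()
```

Any tabs and newlines you want in the output have to be in the list
yourself.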

As you know beforehand the number of fields in your layout, you
can prepare the list-of-pieces in advance:

list_of_pieces = ['\t'] * (2*len(layout)-1) + ['\n']
indexed_fields = list(zip(layout, range(0, 2*len(layout), 2)))

and fill alternate pieces with other-than-tabs in the nested
loop:

    for field, index in indexed_fields:
        list_of_pieces[index] = line[field[1]:field[2]+1]
    outFile.writelines(list_of_pieces)

We can do a little more work in advance, outside of the loop:

indexed_fields = [ (i*2, layout[i][1], layout[i][2]+1)
    for i in range(len(layout)) ]

and the loop becomes the very-slightly-faster:

    for index, lower, upper in indexed_fields:
        list_of_pieces[index] = line[lower:upper]
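
Putting the pieces together, a minimal end-to-end sketch might look
like this.  The function name convert, the sample records, and the use
of io.StringIO in place of real files are all assumptions of mine; I
also append a newline piece so each record ends its own output line,
as the original outFile.write call did.

```python
import io

layout = [['STUDY', 0, 7], ['MDLNO', 8, 12]]   # upper bounds assumed inclusive

# Precompute (piece index, lower, exclusive upper) once, outside the loops.
indexed_fields = [(i * 2, layout[i][1], layout[i][2] + 1)
                  for i in range(len(layout))]

# Odd positions hold tabs, the last position a newline; even positions
# are overwritten with field values for every input line.
list_of_pieces = ['\t'] * (2 * len(layout) - 1) + ['\n']

def convert(in_file, out_file):
    """Hypothetical helper: copy fixed-width records out as tab-delimited."""
    for line in in_file:
        for index, lower, upper in indexed_fields:
            list_of_pieces[index] = line[lower:upper]
        out_file.writelines(list_of_pieces)

in_file = io.StringIO("ABCD1234M0042\nWXYZ5678M0043\n")   # made-up data
out_file = io.StringIO()
convert(in_file, out_file)
converted = out_file.getvalue()
```

The list of pieces is allocated once and reused, so the only per-line
work is the slicing itself and the single .writelines call.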

I don't know if a substantially faster approach exists to
avoid the hundreds of calls to line[lower:upper] for each
line.  You could break the line into a list of characters
once just before the inner-loop then slice the list, but
I don't know if that would gain you anything (you'd have
to join the sublists at some point, or have a much bigger
list-of-pieces, and either approach seems of doubtful
gain).
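
For what it's worth, the list-of-characters variant would look roughly
like the sketch below (sample record and bounds are made up).  It shows
the extra join step I mentioned, and makes no claim about speed; you
would have to time it on your own data.

```python
line = "ABCD1234M0042"        # hypothetical record
bounds = [(0, 8), (8, 13)]    # (lower, exclusive upper) per field

chars = list(line)            # break the line into characters once
# Each sublist must be joined back into a string before it can be written.
fields = [''.join(chars[lower:upper]) for lower, upper in bounds]

# The plain string slices, for comparison.
direct = [line[lower:upper] for lower, upper in bounds]
```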

Another possibility is avoiding some data-copy by storing
buffer objects rather than slices in the list of pieces.
You'll need to prepare the indexed_fields a bit differently,
with an offset and a size for each field:

indexed_fields = [ (i*2, layout[i][1], layout[i][2]-layout[i][1]+1)
    for i in range(len(layout)) ]

and the loop itself becomes:

    for index, offset, size in indexed_fields:
        list_of_pieces[index] = buffer(line, offset, size)

I'm not sure whether .writelines accepts a list of alternating
strings and buffers, or what the speed becomes then --
just suggesting alternatives for you to try...
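
The buffer builtin exists only in Python 2.  In Python 3 the nearest
equivalent is memoryview, which works on bytes rather than str, so the
sketch below assumes the files are handled in binary mode.  The sample
data and the io.BytesIO stand-ins are my own assumptions, and I have
not measured whether this is actually faster than plain slicing.

```python
import io

layout = [['STUDY', 0, 7], ['MDLNO', 8, 12]]   # upper bounds assumed inclusive

# (piece index, offset, size) for each field.
indexed_fields = [(i * 2, layout[i][1], layout[i][2] - layout[i][1] + 1)
                  for i in range(len(layout))]

# Bytes separators, since the output file is binary here.
list_of_pieces = [b'\t'] * (2 * len(layout) - 1) + [b'\n']

in_file = io.BytesIO(b"ABCD1234M0042\nWXYZ5678M0043\n")   # made-up data
out_file = io.BytesIO()

for line in in_file:
    view = memoryview(line)            # slicing a memoryview copies no data
    for index, offset, size in indexed_fields:
        list_of_pieces[index] = view[offset:offset + size]
    out_file.writelines(list_of_pieces)

converted = out_file.getvalue()
```

Here .writelines on a binary file accepts the mix of bytes and
memoryview pieces, since both are bytes-like objects.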


Alex





