splitting tables

Sat Feb 7 16:56:23 EST 2004

On Sat, 7 Feb 2004 20:08:50 +0000 (UTC), robsom <no.mail at no.mail.it> wrote:

>
>Hi, I have a problem with a small python program I'm trying to write
>and I hope somebody may help me. I'm working on tables of this kind:
>
>CGA 1988 06 21 13 48 G500-050 D   509.62 J.. R1 1993 01 28 00 00 880006
>CGA 1988 06 21 14 04 G500-051 D   550.62 J.. R1 1993 01 28 00 00 880007
>
>I have to read each line of the table and put it into comma-separated
>lists like these for later manipulation:
>
>CGA,1988,06,21,13,48,G500-050,D,509.62,J..,R1,1993,01,28,00,00,880006
>CGA,1988,06,21,14,04,G500-051,D,550.62,J..,R1,1993,01,28,00,00,880007
>
>The 'split' function works pretty well, except when there is an error in
>the original data table. For example if an element is missing in a line,
>like this:
>
>CGA 1990 08 15 13 16 G500-105 D   524.45 J.. R1 1993 01 29 00 00 900069
>CGA 1990 08 16 01 22          D   508.06 J.. R1 1993 01 27 00 00 900065
>
>This error happens quite often in my dataset and the tables are too
>large to check for it manually. In this case what I get splitting the
>line string is of course this:
>
>CGA,1990,08,15,13,16,G500-105,D,524.45,J..,R1,1993,01,29,00,00,900069
>CGA,1990,08,16,01,22,D,508.06,J..,R1,1993,01,27,00,00,900065
>
>And when the program tries to work on the second list it stops (of course!).
>Is there any way to avoid this problem? This kind of error happens quite
>often in my dataset and the tables are usually too large to check for it
>manually. Thanks a lot for any suggestions.
>
 >>> s = """\
 ... CGA 1990 08 15 13 16 G500-105 D   524.45 J.. R1 1993 01 29 00 00 900069
 ... CGA 1990 08 16 01 22          D   508.06 J.. R1 1993 01 27 00 00 900065
 ... """
 >>> import re
 >>> rxo = re.compile(
 ...     '(...) (....) (..) (..) (..) (..) (........) (.)   '
 ...     '(......) (...) (..) (....) (..) (..) (..) (..) (......)'
 ... )
 >>> import csv
 >>> import sys
 >>> writer = csv.writer(sys.stdout)
 >>> for line in s.splitlines(): writer.writerow(*rxo.findall(line))
 ...
 CGA,1990,08,15,13,16,G500-105,D,524.45,J..,R1,1993,01,29,00,00,900069
 CGA,1990,08,16,01,22,        ,D,508.06,J..,R1,1993,01,27,00,00,900065
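
The missing field comes through as eight spaces. If an empty string is easier to test for in the later manipulation, one possibility (a sketch only, not part of the session above) is to strip the padding from every group. The name clean_fields is made up for illustration, and it uses rxo.match rather than findall, so the pattern has to fit from the start of the line:

```python
import re

# Same fixed-width pattern as in the session above; each '.' is one character.
rxo = re.compile(
    '(...) (....) (..) (..) (..) (..) (........) (.)   '
    '(......) (...) (..) (....) (..) (..) (..) (..) (......)'
)

def clean_fields(line):
    """Split one fixed-width line and strip padding; a blank field becomes ''.

    Hypothetical helper: raises ValueError if the line does not fit the layout.
    """
    m = rxo.match(line)
    if m is None:
        raise ValueError('line does not fit the fixed-width layout: %r' % line)
    return [field.strip() for field in m.groups()]
```

With this, the second sample line gives '' in the seventh position instead of '        ', and the list still has all 17 elements.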

To write the csv lines to a file instead of sys.stdout, substitute (untested)
file('your_csv_output_file.csv', 'wb') in place of sys.stdout in the above (the file
has to be opened for writing, and the csv module wants binary mode), and get your
lines from something like (note chopping off the trailing newline)

    for line in file('your_table_file'):
        line = line.rstrip('\n')

instead of

    for line in s.splitlines():
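
Put together for a current Python 3, where open() replaces file() and the csv docs ask for newline='' on the output file, the file-to-file version might look roughly like this. The function name table_to_csv and both paths are placeholders, and like the session above it assumes every line matches the pattern:

```python
import csv
import re

# Same fixed-width pattern as in the interactive session.
rxo = re.compile(
    '(...) (....) (..) (..) (..) (..) (........) (.)   '
    '(......) (...) (..) (....) (..) (..) (..) (..) (......)'
)

def table_to_csv(src_path, dst_path):
    """Convert a fixed-width table file to CSV; return the number of rows written.

    Hypothetical helper: assumes every input line fits the pattern.
    """
    count = 0
    with open(src_path) as src, open(dst_path, 'w', newline='') as dst:
        writer = csv.writer(dst)
        for line in src:
            line = line.rstrip('\n')  # chop off the trailing newline
            # findall returns a one-element list holding the tuple of groups
            writer.writerow(*rxo.findall(line))
            count += 1
    return count
```

Usage would be something like table_to_csv('your_table_file', 'your_csv_output_file.csv').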

If you have possible short lines that create no match, rxo.findall(line) returns an
empty list for them and writer.writerow(*[]) raises a TypeError, so you'll need to
check for those before unpacking (by using the prefixed *) into writer.writerow's
arg list.
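
One way to do that check, sketched here with rxo.match instead of findall (match returns None when the line does not fit, and it anchors at the start of the line). The names split_fixed and split_all are invented for the example; lines that don't fit are collected with their line numbers so they can be looked at by hand afterwards:

```python
import re

# Same fixed-width pattern as in the interactive session.
rxo = re.compile(
    '(...) (....) (..) (..) (..) (..) (........) (.)   '
    '(......) (...) (..) (....) (..) (..) (..) (..) (......)'
)

def split_fixed(line):
    """Return the list of 17 fields, or None if the line does not fit the layout."""
    m = rxo.match(line)
    if m is None:
        return None
    return list(m.groups())

def split_all(lines):
    """Split every line; return (good_rows, bad_lines).

    bad_lines holds (line_number, text) pairs for lines that did not match.
    """
    good, bad = [], []
    for lineno, line in enumerate(lines, 1):
        line = line.rstrip('\n')
        fields = split_fixed(line)
        if fields is None:
            bad.append((lineno, line))
        else:
            good.append(fields)
    return good, bad
```

Each entry of good_rows can then go straight to writer.writerow(fields), with no * needed since it is already a single sequence.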

That's it for clp today ;-)

Regards,
Bengt Richter