[Tutor] write program to extract data

Fri Aug 14 22:18:47 CEST 2009

Michael Miesner wrote:
> Hi-
> I work in a research lab and part of the lab I'm not usually associated with
> uses a program that outputs data in a .txt file for each participant that is
> run.
> The participant # is the title of the text document (e.g. E00343456.txt),
> and I'd like to be able to take this and other data in the file and draw it
> into a spreadsheet.
> The first 3/4 of the output is the scenario. I've bolded the areas that I
> really want to be able to draw out. In case some people can't see the bold,
> they are the sections called "driver mistakes" and individual mistakes.
>
>
> Preferably, what I'd really like to do is make the script so that I execute
> it, and in doing so, tell it what folder to look in, and it takes all the
> .txt's out of that folder, and adds them to the spreadsheet. I'd like the
> title of the .txt to be the first column, and the data held in the
> spreadsheet to be the next columns.
>
> Below is output of 1 data file.
> ----------------------------------------------------------------
>  Date: August 13, 2009
>  Time: 12:16:18:151 PM
>  ID:
>  Scenario file: C:\Documents and Settings\APL02\Desktop\Driving Sim
> Files\Sim 10-21-08.txt
>  Configuration file:
>
> <Snip lots of lines>
>
>    7  0  0  0   204   130.55     0 999.00
>    8  0  0  0
>    9  0  0  0
>   10  0  0  0
>
>  *Driver mistakes:
>
>  Total number of off road accidents =      1
>  Total number of collisions =              3
>  Total number of pedestrians hit =         3
>  Total number of speed exceedances =       11
>  Total number of speeding tickets =        0
>  Total number of traffic light tickets =   1
>  Total number of stop signs missed =       0
>  Total number of centerline crossings =    5
>  Total number of road edge excursions =    4
>  Total number of stops at traffic lights = 2
>  Total number of correct DA responses =    0
>  Total number of incorrect DA responses =  0
>  Total number of DAs with no response =    0
>  Total run length (Drive T, X, Total T) =  761.92       34000         761.92
>
>  Total number of illegal turns =           0
>  Total number of low speed warnings =      0
>  Total number of high speed warnings =     0
>  Over speed limit (% Time, % Distance) =   27.15        47.77
>  Out of lane (% Time, % Distance) =        4.22         3.74
>
>  Individual mistakes (Time, Distance, Elapsed distance or object number,
> Elapsed time, Maximum value):
>
>  Centerline crossing       68.16     3311.11      110.69        2.60
> -3.46
>  Centerline crossing       94.26     4598.99      252.83        4.45
> -3.19
>  Centerline crossing      127.85     6758.39      109.70        1.92
> -2.78
>  Centerline crossing      162.14     9082.09      273.04        3.27
> -9.72
>  Speed exceedance         162.80     9135.05     4596.82       48.87
> 120.00
>  Road edge excursion      166.34     9438.89       77.08        0.83
> 13.45
>  Hit pedestrian           204.46    13731.87           1
>  Speed exceedance         221.51    13829.40      912.17       22.75
> 67.13
>  Hit pedestrian           237.05    14741.43           5
>  Speed exceedance         259.93    15171.25      746.18       13.35
> 60.35
>  Centerline crossing      317.69    16126.40       14.76        8.15
> -4.37
>  Vehicle collision (F)    318.63    16141.15          66
>  Vehicle collision (F)    341.91    16166.52          91
>  Red light ticket         452.04    16548.74           2
>  Speed exceedance         458.24    16802.82     1484.23       18.15
> 95.58
>  Speed exceedance         516.14    20362.78      685.87       10.13
> 68.67
>  Speed exceedance         551.24    21996.05     2111.93       24.31
> 91.76
>  Vehicle collision (F)    580.58    24459.77         163
>  Speed exceedance         606.49    25376.76     3929.19       42.84
> 120.00
>  Road edge excursion      622.00    26916.68       86.74        0.77
> 13.30
>  Road edge excursion      623.62    27100.59      251.81        2.17
> 14.42
>  Road edge excursion      641.31    29212.44       93.52        8.02
> 19.52
>  Off road accident        642.11    29305.93
>  Speed exceedance         664.99    30081.20      777.80       10.76
> 82.64
>  Hit pedestrian           695.29    31013.94          30
>  Speed exceedance         719.30    31513.55      915.64       14.16
> 71.88
>  Speed exceedance         742.46    32837.36       81.72        1.57
> 52.90
>  Speed exceedance         749.76    33199.52      801.52       12.16
> 76.04
>
>
> *
>   
Did you really have to quote all 1100 lines, when what you really wanted 
to say was that there's a bunch of junk at the beginning of the file, 
then the following stuff?

Anyway, the most important question is how rigid this file format is.  
Will there always be exactly one line starting "*Driver mistakes"?  
Will the lines between that and the final "*" always be the same ones, 
and in the same order?  Or is a line like "Road edge excursion" somehow 
optional?  Are the numbers identified by the column they start in, or 
by the spaces between them?  Is a zero sometimes stored as an empty 
field?

It already appears that some lines are appearing multiple times, and not 
in order.  Assuming that's deliberate, is there a maximum number of 
times for any given line?  If not, how do you plan to encode it in a 
spreadsheet, which generally has fixed columns, with a specific meaning 
for each one?
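One possible answer, if a count per mistake type is enough for the 
spreadsheet: collapse the repeated lines into one column per type.  
Here is a rough sketch in Python 3.  The name count_events and the 
regular expression are my own assumptions, based only on the sample 
above: a label, then two or more spaces, then a number.  It expects to 
be fed just the lines from the "Individual mistakes" section.

```python
import re
from collections import Counter

# Assumption: an event line is a label made of letters, spaces and
# parentheses, followed by at least two spaces and then a number.
EVENT_LINE = re.compile(r"^\s*([A-Za-z][A-Za-z() ]*?)\s{2,}[-\d]")


def count_events(lines):
    """Count how many times each mistake type appears in the given lines."""
    counts = Counter()
    for line in lines:
        match = EVENT_LINE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Each distinct label then becomes a fixed column holding its count.  If 
you need the individual times and distances as well, this won't do, and 
a second sheet with one row per event is probably the better layout.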

Once you've got a spec for the file requirements, you're left with three 
pieces to write.  And if the file requirements are tricky, you may want 
to write a validator for the file.  GIGO, you know.

1) loop through all the txt files in a given directory.  Check out  
glob, or os.path, or even  fileinput.
2) parse the lines of the file.  A simple loop, checking each line 
against "*Driver mistakes".  Then another loop, building a list of the 
fields from the lines that follow.  This should do some integrity 
checking, in case some line is short one of its numeric fields, for 
example.
3) write the results to a csv file, one row per input file.  Check out 
the csv module.
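
To make those three pieces concrete, here is a minimal sketch in 
Python 3.  It only pulls out the "Total number of ..." lines, and it 
assumes they look exactly like the sample above and sit between the 
"Driver mistakes" marker and the "Individual mistakes" header.  The 
names parse_totals and write_summary, and the "participant" column 
header, are made up for illustration.

```python
import csv
import glob
import os
import re

# Assumed line shape: " Total number of collisions =      3"
TOTAL_LINE = re.compile(r"^\s*(Total number of [^=]+?)\s*=\s*(\d+)\s*$")


def parse_totals(path):
    """Return {label: int} for the 'Total number of ...' lines in one file."""
    totals = {}
    in_section = False
    with open(path) as f:
        for line in f:
            text = line.strip()
            if not in_section:
                # Accept "*Driver mistakes:" with or without the asterisk.
                in_section = text.lstrip("* ").startswith("Driver mistakes")
                continue
            if text.startswith("Individual mistakes"):
                break
            match = TOTAL_LINE.match(line)
            if match:
                totals[match.group(1)] = int(match.group(2))
    return totals


def write_summary(folder, out_path):
    """Write one CSV row per .txt file: the file's title, then its totals."""
    rows = []
    for path in sorted(glob.glob(os.path.join(folder, "*.txt"))):
        participant = os.path.splitext(os.path.basename(path))[0]
        rows.append((participant, parse_totals(path)))

    # Column order: labels in the order they are first seen.
    labels = []
    for _, totals in rows:
        for label in totals:
            if label not in labels:
                labels.append(label)

    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["participant"] + labels)
        for participant, totals in rows:
            # A label missing from a file becomes an empty cell.
            writer.writerow([participant] + [totals.get(k, "") for k in labels])
    return len(rows)
```

A missing label is written as an empty cell rather than a zero, so a 
malformed file shows up in the spreadsheet instead of looking like a 
clean run.  Real integrity checking would need the file spec first.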

I'd suggest building a dummy set of files, with just a few lines in 
each, but the same general format.  Then when you have the program 
working for those, generalize it.
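
If it helps, a throwaway script along these lines could generate the 
dummy files (Python 3).  The template is a cut-down guess at the real 
layout, and the name make_dummy_files, the file names, and the values 
are all invented, so adjust them to match what the simulator really 
writes.

```python
import os

# A trimmed-down imitation of the real output; the layout is an assumption.
TEMPLATE = """\
 Date: August 13, 2009

 *Driver mistakes:

 Total number of off road accidents =      {off_road}
 Total number of collisions =              {collisions}

 Individual mistakes (Time, Distance):

 Centerline crossing       68.16     3311.11
"""


def make_dummy_files(folder, count=3):
    """Create `count` small test files in `folder` and return their paths."""
    paths = []
    for i in range(1, count + 1):
        path = os.path.join(folder, "E%07d.txt" % i)
        with open(path, "w") as f:
            f.write(TEMPLATE.format(off_road=i, collisions=2 * i))
        paths.append(path)
    return paths
```

Because the values are known in advance, you can check the program's 
output against them by eye before pointing it at the real data.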

DaveA