Picking apart a text line

Thu Feb 26 22:53:34 EST 2015

So... okay.  I've got a bunch of PDFs of tournament reports that I want
to sift through for information.  I ended up using 'pdftotext -layout
file.pdf file.txt' to extract the text from each PDF.  There are still a
few little glitches to iron out there, but I'm getting decent enough
results for the moment to move on.

I've got my script to where it opens the file, ignores the header lines 
at the top, then goes through the rest of the file line by line, 
skipping lines if they don't match (don't need the separator lines) and 
adding them to a list if they do (and stripping whitespace off the right 
side along the way).  So far, so good.

#  rstatPDF2csv.py

import re
import sys

HEADER_LINES = 9  # number of header lines at the top of each report


def convert(path):
    """Return the data lines of a pdftotext report, right-stripped."""
    lines = []
    with open(path) as data:
        # Skip the header lines at the top of the file
        for _ in range(HEADER_LINES):
            next(data, None)

        # Read remaining lines one at a time
        for line in data:
            # Keep only lines that begin with a capital letter
            # (this drops the separator lines)
            if re.match(r'[A-Z]', line):
                # Strip trailing whitespace, then add to the list
                lines.append(line.rstrip())

    return lines


if __name__ == '__main__':
    print(convert(sys.argv[1]))

What I end up with is a list full of strings that look something like
this (one string per competitor; it is wrapped here only by the email):

['JOHN DOE                    C   T   HM   445-20*MW*   199-11*MW* 
194-5 1HM     393-16*MW*   198-9 1HM    198-11*MW*    396-20*MW* 
789-36*MW*     1234-56 *MW*',

Basically... a certain number of characters allotted for the competitor
name, then four or five 1-2 char columns for things like classification,
age group, special categories, etc., then a score ('445-20'), then up to
4 chars for an award (if any), then another score, another award, and so
on.
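
A minimal sketch of pulling the score/award pairs out of one line with a regular expression. It assumes every score looks like digits-hyphen-digits and that an award, when present, is either wrapped in asterisks ('*MW*') or is a digit followed by letters ('1HM'), as in the sample above. The names SCORE_RE and parse_scores are made up for illustration, and reading the two numbers as (points, count) is a guess about what they mean.

```python
import re

# Assumed shape: "445-20", optionally followed (after at most one space)
# by an award such as "*MW*" or "1HM".
SCORE_RE = re.compile(r'(\d+)-(\d+)\s?(\*[A-Z]+\*|\d[A-Z]+)?')

SAMPLE = ('JOHN DOE                    C   T   HM   445-20*MW*   199-11*MW* '
          '194-5 1HM     393-16*MW*   198-9 1HM    198-11*MW*    396-20*MW* '
          '789-36*MW*     1234-56 *MW*')


def parse_scores(line):
    """Return a list of (points, count, award) tuples found in line."""
    results = []
    for points, count, award in SCORE_RE.findall(line):
        # award is '' when a score has no award next to it
        results.append((int(points), int(count), award))
    return results


if __name__ == '__main__':
    for entry in parse_scores(SAMPLE):
        print(entry)
```

This ignores column positions entirely, so it would not tell you which stage a score belongs to if a competitor skipped one; it only shows that the score tokens themselves are easy to recognize.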

Right now (in the PDF) the scores are batched by one criterion, then 
sorted within those groups.  Makes life easier for the person giving out 
awards at the end of the tournament, not so much for someone trying to 
see how their individual score ranks against the whole field, not just 
their group or sub-group.  I want to be able to pull all the scores out 
and then re-sort based on score - mainly the final aggregate score, but 
potentially also on stage or daily scores.  Eventually I'd like to be 
able to calculate standardized z-scores so as to be able to compare 
scores from one event/location against another.
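
Once the scores are out as numbers, the re-sorting and z-score steps could look roughly like the sketch below. The data is invented, rank and z_scores are hypothetical names, and two things are assumptions: that a higher second number breaks ties in favor of that competitor, and that the population standard deviation (statistics.pstdev) is the one wanted rather than the sample version (statistics.stdev).

```python
import statistics


def z_scores(values):
    """Return (value - mean) / stdev for each value, using population stdev."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        # Every score is identical, so no score stands out from the rest
        return [0.0 for _ in values]
    return [(v - mean) / stdev for v in values]


def rank(results):
    """Sort (name, points, count) tuples best-first: points, then count."""
    return sorted(results, key=lambda r: (r[1], r[2]), reverse=True)


if __name__ == '__main__':
    # Invented data, only to show the shape of the input
    field = [('COMPETITOR A', 1200, 40), ('COMPETITOR B', 1234, 56),
             ('COMPETITOR C', 1166, 31)]
    ranked = rank(field)
    zs = z_scores([points for _, points, _ in ranked])
    for (name, points, count), z in zip(ranked, zs):
        print(f'{name:<14}{points}-{count}  z={z:+.2f}')
```

Because a z-score is measured against the mean and spread of its own event, the values from two different events end up on a comparable scale.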

So back to the lines of text I have stored as strings in a list.  I
think I want to convert that to a list of lists, i.e. split each line
up, store that info in another list and ditch the whitespace.  Or would
I be better off using dicts?  Originally I was thinking of how to
process each line and split it up based on what information was
where - some sort of nested for/if mess.  Now I'm starting to think that
the lines of text are pretty uniform in structure, i.e. the same field is
always in the same location, and that list slicing might be the way to
go, if a bit tedious to set up initially...?
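
For what it's worth, the slicing idea can be sketched in a few lines if the column offsets live in one table. Everything about the layout here is hypothetical: the offsets in FIELDS are guesses that would have to be measured against the real pdftotext output, the field names (and which short column is classification versus age group) are assumptions, and split_fixed is a made-up name.

```python
# Hypothetical column layout: the start/end offsets are guesses and would
# need to be measured against the real pdftotext output.
FIELDS = {
    'name': slice(0, 28),
    'class': slice(28, 32),
    'age_group': slice(32, 36),
    'category': slice(36, 41),
    'scores': slice(41, None),   # everything after the short columns
}


def split_fixed(line):
    """Slice one fixed-width line into a dict of stripped field values."""
    return {field: line[span].strip() for field, span in FIELDS.items()}


if __name__ == '__main__':
    # Build a sample whose widths match the assumed layout above
    sample = ('JOHN DOE'.ljust(28) + 'C'.ljust(4) + 'T'.ljust(4)
              + 'HM'.ljust(5) + '445-20*MW*   199-11*MW*')
    print(split_fixed(sample))
```

Returning a dict per line answers the list-vs-dict question in favor of dicts: rec['name'] is easier to read later than rec[0], and a list of such dicts sorts fine with a key function. The tedious part is confined to filling in FIELDS once.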

Any thoughts or suggestions from people who've gone down this particular 
path would be greatly appreciated.  I think I have a general 
idea/direction, but I'm open to other ideas if the path I'm on is just 
blatantly wrong.

Thanks,

Monte

-- 
Shiny!  Let's be bad guys.

Reach me @ memilanuk (at) gmail dot com