Picking apart a text line

Fri Feb 27 02:39:58 EST 2015

On 02/26/2015 10:53 PM, memilanuk wrote:
> So... okay.  I've got a bunch of PDFs of tournament reports that I want
> to sift thru for information.  Ended up using 'pdftotext -layout
> file.pdf file.txt' to extract the text from the PDF.  Still have a few
> little glitches to iron out there, but I'm getting decent enough results
> for the moment to move on.
>
> I've got my script to where it opens the file, ignores the header lines
> at the top, then goes through the rest of the file line by line,
> skipping lines if they don't match (don't need the separator lines) and
> adding them to a list if they do (and stripping whitespace off the right
> side along the way).  So far, so good.
>
> #  rstatPDF2csv.py
>
> import sys
> import re
>
>
> def convert(file):
>      lines = []
>
>      # Open the file so that it gets closed again when we're done
>      with open(file) as data:
>
>          # Skip first n lines of headers
>          for i in range(9):
>              next(data)
>
>          # Read remaining lines one at a time
>          for line in data:
>
>              # If the line begins with a capital letter...
>              if re.match(r'^[A-Z]', line):
>
>                  # Strip any trailing whitespace and then add to the list
>                  lines.append(line.rstrip())
>
>      return lines
>
> if __name__ == '__main__':
>      print(convert(sys.argv[1]))
>
>
>
> What I'm ending up with is a list full of strings that look something
> like this:
>
> ['JOHN DOE                    C   T   HM   445-20*MW*   199-11*MW* 194-5
> 1HM     393-16*MW*   198-9 1HM    198-11*MW*    396-20*MW*
> 789-36*MW*     1234-56 *MW*',
>
> Basically... a certain number of characters allotted for competitor
> name, then four or five 1-2 char columns for things like classification,
> age group, special categories, etc., then a score ('445-20'), then up to
> 4 char for award (if any), then another score, another award, etc. etc.
> etc.
>
> Right now (in the PDF) the scores are batched by one criterion, then
> sorted within those groups.  Makes life easier for the person giving out
> awards at the end of the tournament, not so much for someone trying to
> see how their individual score ranks against the whole field, not just
> their group or sub-group.  I want to be able to pull all the scores out
> and then re-sort based on score - mainly the final aggregate score, but
> potentially also on stage or daily scores.  Eventually I'd like to be
> able to calculate standardized z-scores so as to be able to compare
> scores from one event/location against another.
>
> So back to the lines of text I have stored as strings in a list.  I
> think I want to convert that to a list of lists, i.e. split each line
> up, store that info in another list and ditch the whitespace.  Or would
> I be better off using dicts?  Originally I was thinking of how to
> process each line and split them up based on what information was
> where - some sort of nested for/if mess.  Now I'm starting to think that
> the lines of text are pretty uniform in structure i.e. the same field is
> always in the same location, and that list slicing might be the way to
> go, if a bit tedious to set up initially...?
>
> Any thoughts or suggestions from people who've gone down this particular
> path would be greatly appreciated.  I think I have a general
> idea/direction, but I'm open to other ideas if the path I'm on is just
> blatantly wrong.
>

Maintaining a list of lists is a big pain.  If the data is truly very 
uniform, you might want to do it, but I'd find it much more reasonable 
to have names for the fields of each line.  You can either do that with
a namedtuple, or with instances of a custom class of your own.

See 
https://docs.python.org/3.4/library/collections.html#namedtuple-factory-function-for-tuples-with-named-fields

You read a line, do some sanity checking on it, and construct an object.
Go to the next line, do the same, another object.  Those objects are
stored in a list.
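
As a rough sketch of that read / check / construct step, something like
the following might do.  The field names and the column offsets in
COLUMNS are made up for illustration; you'd have to measure the real
ones from your pdftotext -layout output, and they may not line up as
neatly as this assumes.

```python
from collections import namedtuple

# Hypothetical field names; rename them to match the real report.
Row = namedtuple("Row", "name classification age category score award")

# Hypothetical column positions, NOT taken from the actual report.
COLUMNS = {
    "name": slice(0, 28),
    "classification": slice(28, 32),
    "age": slice(32, 36),
    "category": slice(36, 41),
    "score": slice(41, 47),
    "award": slice(47, 51),
}


def parse_line(line):
    """Slice one fixed-width line into a Row, or return None if it looks wrong."""
    fields = {key: line[cols].strip() for key, cols in COLUMNS.items()}

    # Sanity check: there must be a name, and the score must look like 445-20
    points, sep, x_count = fields["score"].partition("-")
    if not fields["name"] or not (points.isdigit() and sep and x_count.isdigit()):
        return None
    return Row(**fields)


# A sample line built to match the assumed layout above
sample = ("JOHN DOE".ljust(28) + "C".ljust(4) + "T".ljust(4)
          + "HM".ljust(5) + "445-20*MW*")
row = parse_line(sample)
print(row.name, row.classification, row.score, row.award)
```

Lines that fail the check come back as None, so the caller can skip
them or log them rather than ending up with half-filled rows in the list.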

Everything else accesses the fields of the object something like:

for row in mylist:
    print(row.name, row.classification, row.age)
    if row.name == "Doe":
        ...
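
The re-sorting you mentioned works the same way, since the sort key can
just pull out a named field.  A minimal sketch, assuming a score string
like '789-36' is points followed by a tie-breaker count, and using a
hypothetical "aggregate" field with invented sample rows:

```python
from collections import namedtuple

# Hypothetical, cut-down row type with only the fields needed here
Row = namedtuple("Row", "name aggregate")


def score_key(text):
    """Turn '789-36' into (789, 36) so scores compare numerically."""
    points, _, tie_breaker = text.partition("-")
    return int(points), int(tie_breaker)


# Invented sample data, standing in for the parsed list of rows
mylist = [
    Row("JOHN DOE", "789-36"),
    Row("JANE ROE", "792-41"),
    Row("SAM POE", "789-40"),
]

# Highest aggregate first; the second number breaks ties on points
ranked = sorted(mylist, key=lambda row: score_key(row.aggregate), reverse=True)
for place, row in enumerate(ranked, start=1):
    print(place, row.name, row.aggregate)
```

Converting the score to a tuple of ints matters here; sorting the raw
strings would put '99-5' ahead of '789-36'.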

-- 
DaveA