Picking apart a text line

Russell Owen rowen at uw.edu
Mon Mar 2 15:25:34 EST 2015


On 2/26/15 7:53 PM, memilanuk wrote:
> So... okay.  I've got a bunch of PDFs of tournament reports that I want
> to sift thru for information.  Ended up using 'pdftotext -layout
> file.pdf file.txt' to extract the text from the PDF.  Still have a few
> little glitches to iron out there, but I'm getting decent enough results
> for the moment to move on.
>
...
> So back to the lines of text I have stored as strings in a list.  I
> think I want to convert that to a list of lists, i.e. split each line
> up, store that info in another list and ditch the whitespace.  Or would
> I be better off using dicts?  Originally I was thinking of how to
> process each line and split it them up based on what information was
> where - some sort of nested for/if mess.  Now I'm starting to think that
> the lines of text are pretty uniform in structure i.e. the same field is
> always in the same location, and that list slicing might be the way to
> go, if a bit tedious to set up initially...?
>
> Any thoughts or suggestions from people who've gone down this particular
> path would be greatly appreciated.  I think I have a general
> idea/direction, but I'm open to other ideas if the path I'm on is just
> blatantly wrong.

It sounds to me as if the best way to handle all this is to keep the 
information in a database, preferably one available from the network 
and centrally managed, so whoever enters the information in the first 
place enters it there. But I admit that setting such a thing up requires 
some overhead.

Simpler alternatives include using SQLite, a simple file-based database 
system, or numpy structured arrays (arrays with named fields). Python's 
standard library includes the sqlite3 module, and numpy is easy to install.
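
As a rough sketch of the SQLite route, combined with the slicing idea from 
the original question: the column offsets, field names, table name and 
sample lines below are all made up for illustration, since I have not seen 
the actual pdftotext output. The real offsets would have to be read off 
the extracted text.

```python
import sqlite3

# Hypothetical layout: (field name, start, stop, converter).
# The real character offsets depend on the pdftotext -layout output.
FIELDS = [
    ("place", 0, 6, int),
    ("name", 6, 30, str),
    ("score", 30, 38, int),
]


def parse_line(line):
    """Slice one fixed-width line into a dict, stripping the padding."""
    return {
        name: convert(line[start:stop].strip())
        for name, start, stop, convert in FIELDS
    }


def load_lines(conn, lines):
    """Create the (hypothetical) results table and insert the parsed lines."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results "
        "(place INTEGER, name TEXT, score INTEGER)"
    )
    rows = [parse_line(line) for line in lines if line.strip()]
    conn.executemany(
        "INSERT INTO results (place, name, score) "
        "VALUES (:place, :name, :score)",
        rows,
    )
    conn.commit()


# Invented sample lines, padded to match the offsets in FIELDS.
sample = [
    "{:<6}{:<24}{:<8}".format(1, "Jane Doe", 589),
    "{:<6}{:<24}{:<8}".format(2, "John Smith", 575),
]

conn = sqlite3.connect(":memory:")  # use a file name to keep the data
load_lines(conn, sample)
top = conn.execute(
    "SELECT name, score FROM results ORDER BY score DESC"
).fetchall()
```

Once the data is in a table, questions such as "who scored highest" become 
a single query instead of a loop over nested lists.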

-- Russell



