Picking apart a text line
Russell Owen
rowen at uw.edu
Mon Mar 2 15:25:34 EST 2015
On 2/26/15 7:53 PM, memilanuk wrote:
> So... okay. I've got a bunch of PDFs of tournament reports that I want
> to sift thru for information. Ended up using 'pdftotext -layout
> file.pdf file.txt' to extract the text from the PDF. Still have a few
> little glitches to iron out there, but I'm getting decent enough results
> for the moment to move on.
>
...
> So back to the lines of text I have stored as strings in a list. I
> think I want to convert that to a list of lists, i.e. split each line
> up, store that info in another list and ditch the whitespace. Or would
> I be better off using dicts? Originally I was thinking of how to
> process each line and split it up based on what information was
> where - some sort of nested for/if mess. Now I'm starting to think that
> the lines of text are pretty uniform in structure i.e. the same field is
> always in the same location, and that list slicing might be the way to
> go, if a bit tedious to set up initially...?
>
> Any thoughts or suggestions from people who've gone down this particular
> path would be greatly appreciated. I think I have a general
> idea/direction, but I'm open to other ideas if the path I'm on is just
> blatantly wrong.
It sounds to me as if the best way to handle all this is to keep the
information in a database, preferably one available from the network
and centrally managed, so whoever enters the information in the first
place enters it there. But I admit that setting such a thing up requires
some overhead.
Simpler alternatives include SQLite, a simple file-based database
system, or numpy structured arrays (arrays with named fields). Python's
standard library includes the sqlite3 module, and numpy is easy to install.
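
As a rough sketch of the SQLite route, assuming the lines really are
fixed-width as you describe: slice each line by column offsets and insert
the pieces into a table. The field names, offsets, table name and sample
lines below are all made up for illustration; the real ones depend on what
pdftotext -layout produces for your reports.

```python
import sqlite3

# Hypothetical column layout: (field name, start, end) offsets into a line.
# These are placeholders; measure the real offsets from your own text files.
FIELDS = [
    ("place", 0, 4),
    ("name", 4, 24),
    ("score", 24, 30),
]


def parse_line(line):
    """Slice one fixed-width line into a dict of whitespace-stripped strings."""
    return {name: line[start:end].strip() for name, start, end in FIELDS}


def load_lines(conn, lines):
    """Create the (hypothetical) results table and insert one row per line."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results "
        "(place INTEGER, name TEXT, score INTEGER)"
    )
    # Skip blank lines; each remaining line becomes a dict of named fields.
    rows = [parse_line(line) for line in lines if line.strip()]
    conn.executemany(
        "INSERT INTO results (place, name, score) "
        "VALUES (:place, :name, :score)",
        rows,
    )
    conn.commit()


# Made-up sample lines, built with format() so the widths match FIELDS.
sample = [
    "{:<4}{:<20}{:>6}".format(1, "Jane Example", 198),
    "{:<4}{:<20}{:>6}".format(2, "John Placeholder", 195),
]

conn = sqlite3.connect(":memory:")  # use a file name to keep the data
load_lines(conn, sample)
top = conn.execute(
    "SELECT name, score FROM results ORDER BY score DESC"
).fetchall()
print(top)
```

Once the data is in a table, questions like "who scored highest" become
one-line queries instead of more list juggling. Passing a file name to
sqlite3.connect() instead of ":memory:" keeps the data between runs.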
-- Russell