Python, Perl & PDF files

Cameron Laird claird at lairds.us
Wed Apr 27 19:09:36 EDT 2005


In article <d4mme5$2ej$1 at solaris.cc.vt.edu>,
rbt  <rbt at athop1.ath.vt.edu> wrote:
>Cameron Laird wrote:
>> In article <d4m9hl$8br$1 at solaris.cc.vt.edu>,
>> rbt  <rbt at athop1.ath.vt.edu> wrote:
>> 			.
>> 			.
>> 			.
>> 
>>>Read and search them for strings. If I could do that on windows, linux 
>>>and mac with the *same* bit of Python code, I'd be very happy ;)
>> 
>> 
>> Textual content, right?  Without regard to font funniness, or
>> whether the string is in or out of a table, and so on?
>
>That's right. More specifically, I've written a script that uses a RE to search 
>through documents for social security numbers. You can see it here:
>
>http://filebox.vt.edu/users/rtilley/public/find_ssns/find_ssns.html
>
>This works on Word, Excel, html, rtf or any ANSI based text. I need the
>ability to read and make sense of PDF files as well so I can apply the RE
>to their content. It's been frustrating to say the least. Nothing at all
>against Python... mostly just sick of hearing about the 'Portable'
>document format that isn't string or RE searchable... at least not
>easily anyway.
			.
			.
			.
PDF is NOT easy to search.  In fact, many times it's not even feasible
in any automated sense.
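
To make one of the obstacles concrete: page content in a PDF is commonly
stored as a Flate-compressed (zlib) stream, so an RE run over the raw
bytes of the file never sees the text a reader sees.  The sketch below
only simulates that with zlib; the pattern and the sample text are
made up for illustration and are not taken from the find_ssns script.
Real PDFs add font encodings and strings split across drawing operators
on top of this, which the sketch does not attempt to model.

```python
import re
import zlib

# Illustrative SSN-shaped pattern (an assumption, not rbt's actual RE).
SSN_RE = re.compile(rb"\b\d{3}-\d{2}-\d{4}\b")

# Made-up "page" text: twenty identical records.
page_text = b"Employee record: SSN 123-45-6789. " * 20

# Stand-in for how a PDF writer typically stores a content stream:
# Flate (zlib) compressed, so the bytes on disk are not the page text.
stored = zlib.compress(page_text)

# Searching the stored bytes directly finds nothing useful...
raw_hits = SSN_RE.findall(stored)

# ...while searching the decompressed text finds every record.
decoded_hits = SSN_RE.findall(zlib.decompress(stored))

print(len(raw_hits), len(decoded_hits))
```

Decompressing is the easy part; turning the decompressed drawing
instructions back into readable strings is where it gets hard.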

When I can make time, I want to look into your Word and Excel searching;
there are several tricks for doing these in full generality.

Unless I've missed late-breaking news, Perl does NOT help, despite the
flashy appearance of the CPAN search page you referenced.  None of that
stuff gets at content in a sense that'll serve you well.

Neither does anything open-sourced in Python.  The best I know is what
I'm slowly documenting at <URL:
http://phaseit.net/claird/comp.text.pdf/PDF_converters.html#pdf2txt >,
as David mentioned earlier.
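
A minimal sketch of the convert-then-search approach, for what it's
worth.  It assumes a command-line converter is installed; I use the
`pdftotext` command from Xpdf/Poppler as the example, and that choice is
mine for illustration, not a recommendation taken from the page above.
The helper names `pdf_to_text` and `find_ssns` and the pattern are
hypothetical as well.  The same Python runs on Windows, Linux, and Mac
only as long as the converter itself is available there.

```python
import re
import subprocess

# Illustrative pattern only; not the RE from the find_ssns script.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def pdf_to_text(pdf_path):
    """Run an external converter and return whatever text it extracts.

    Assumes the `pdftotext` command is on PATH; passing "-" as the
    output file name asks it to write the text to stdout.
    """
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    # Replace undecodable bytes rather than abort the whole search.
    return result.stdout.decode("utf-8", errors="replace")


def find_ssns(text):
    """Return every SSN-shaped string in already-extracted text."""
    return SSN_RE.findall(text)


# Typical use (needs pdftotext and a real file):
#     hits = find_ssns(pdf_to_text("some_document.pdf"))
```

How much of the content survives conversion depends entirely on the
PDF; scanned pages, for instance, hold images and yield no text at all.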


