[Half-off] How to get textboxes (text blocks) from ps/pdf files?

durumdara durumdara at gmail.com
Wed Jan 3 07:30:22 EST 2007


Hi!

I need to get the text boxes (text blocks) from pdf files. I can convert
them into ps.
Does anyone know a method, trick, or routine I can use to get the text
boxes from ps or pdf?
(I need a Pythonic, COM, or command-line solution.)

I need to redraw them in my application so that the user can reorder
them; after that I concatenate all the text and process it.

I need this information for each text box:
x, y, w, h, text

Example:
page1
textbox1{x:100,y:100,w:600,h:27,text:"TextBox1 /xfc /xfa"}
textbox2{x:100,y:180,w:600,h:27,text:"TextBox2"}
page2
textbox1{x:100,y:100,w:600,h:27,text:"TextBox1"}
textbox2{x:100,y:180,w:600,h:27,text:"TextBox2"}
...
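
To make the requirement concrete, here is a minimal Python sketch of such records and of the "reorder, then concatenate" step. The names TextBox and concat_text are hypothetical, chosen only for illustration, and the default reading order (page, then y, then x) is an assumption, not something a PDF guarantees.

```python
from dataclasses import dataclass


@dataclass
class TextBox:
    """One positioned block of text (hypothetical record, for illustration)."""
    page: int
    x: int
    y: int
    w: int
    h: int
    text: str


def concat_text(boxes, order=None):
    """Join the box texts with newlines.

    order is an optional list of indices into boxes (the order chosen by
    the user). Without it, boxes are sorted by page, then top to bottom,
    then left to right, which is only an assumed reading order.
    """
    if order is None:
        ordered = sorted(boxes, key=lambda b: (b.page, b.y, b.x))
    else:
        ordered = [boxes[i] for i in order]
    return "\n".join(b.text for b in ordered)


boxes = [
    TextBox(page=1, x=100, y=180, w=600, h=27, text="TextBox2"),
    TextBox(page=1, x=100, y=100, w=600, h=27, text="TextBox1"),
]
default_text = concat_text(boxes)               # sorted by position
user_text = concat_text(boxes, order=[0, 1])    # order picked by the user
```

Keeping the geometry next to the text is what lets the application redraw each box and still rebuild the full text afterwards.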

Any solution?

Thanks in advance!
    dd

ps1:
    I tried every pdf2text and pdf2html application, and all of them
failed the test.
    Only one gives good information: pdftohtml, because it creates divs
with absolute positions and sizes, together with the text.
    But this program does not handle ISO-8859-2 characters, so I lose them.
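
For reference, a hedged sketch of how positioned text could be read from an XML dump instead of the HTML divs. It assumes a pdftohtml build that supports the -xml option and writes page elements containing text elements with top, left, width and height attributes; the sample document below is hand-written to match that assumed shape, not real program output, and extract_boxes is a hypothetical helper name. Whether the ISO-8859-2 characters survive in that output is not verified here.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the assumed shape of "pdftohtml -xml" output.
SAMPLE_XML = """<pdf2xml>
  <page number="1" top="0" left="0" height="1188" width="918">
    <text top="100" left="100" width="600" height="27" font="0">TextBox1 <b>bold</b></text>
    <text top="180" left="100" width="600" height="27" font="0">TextBox2</text>
  </page>
</pdf2xml>"""


def extract_boxes(xml_text):
    """Return one dict per text element: page, x, y, w, h, text."""
    root = ET.fromstring(xml_text)
    boxes = []
    for page in root.iter("page"):
        page_no = int(page.get("number"))
        for node in page.iter("text"):
            boxes.append({
                "page": page_no,
                "x": int(node.get("left")),
                "y": int(node.get("top")),
                "w": int(node.get("width")),
                "h": int(node.get("height")),
                # itertext() also collects text nested in <b>/<i> children
                "text": "".join(node.itertext()),
            })
    return boxes


parsed = extract_boxes(SAMPLE_XML)
```

Only the standard library is used here, so the same code should run unchanged on Windows.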

ps2:
    The program must run under Windows XP, so the solution can be OS-specific.
   



