[Tutor] extracting information (images and text) from a PDF and creating a database from it

Shashwat Anand anand.shashwat at gmail.com
Tue Dec 29 10:51:57 CET 2009


I used PDFMiner and was pretty satisfied with the text portions. I
retrieved all the text and was able to manipulate it as I wished.
However, I failed on the image part. So technically my question reduces to:
'If there is a PDF document in which every page follows the same pattern,
i.e. one image with some verbose text below it, how can I retrieve both the
images and the text without loss?'
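
To make the question concrete, here is a minimal sketch of one possible
approach. It is not PDFMiner's API: it scans the raw PDF bytes for streams
that start with the JPEG marker, which only finds images stored as plain
JPEG (DCTDecode) data and will miss anything compressed another way. It then
pairs the images with per-page text by position, relying on the
one-image-per-page pattern described above, and stores both in SQLite. The
names extract_jpegs, build_database and the entries table are made up for
illustration, and page_texts is assumed to be the list of per-page strings
already obtained from PDFMiner.

```python
import sqlite3

JPEG_START = b"\xff\xd8"  # JPEG start-of-image marker
JPEG_END = b"\xff\xd9"    # JPEG end-of-image marker


def extract_jpegs(pdf_bytes):
    """Return the bytes of every JPEG-looking stream, in file order.

    Assumption: images are embedded as unmodified JPEG data. Streams
    using other filters are skipped, so this is lossy for such PDFs.
    """
    images = []
    pos = 0
    while True:
        s = pdf_bytes.find(b"stream", pos)
        if s == -1:
            break
        e = pdf_bytes.find(b"endstream", s)
        if e == -1:
            break
        body = pdf_bytes[s + len(b"stream"):e]
        start = body.find(JPEG_START)
        end = body.rfind(JPEG_END)
        # The "stream" keyword is followed by a line break (1 or 2 bytes),
        # so a JPEG payload must begin within the first few bytes.
        if 0 <= start <= 2 and end > start:
            images.append(body[start:end + len(JPEG_END)])
        pos = e + len(b"endstream")
    return images


def build_database(conn, page_texts, images):
    """Store one row per page: page number, text, and image bytes.

    Assumption: the n-th image belongs to the n-th page. zip() stops at
    the shorter list, so unmatched pages or images are silently dropped.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entries ("
        "page INTEGER PRIMARY KEY, info TEXT, logo BLOB)"
    )
    for page, (text, image) in enumerate(zip(page_texts, images), start=1):
        conn.execute(
            "INSERT INTO entries VALUES (?, ?, ?)", (page, text, image)
        )
    conn.commit()
```

Usage would be along the lines of reading the file with
open(path, "rb").read(), passing the bytes to extract_jpegs(), and calling
build_database(sqlite3.connect("logos.db"), page_texts, images). Logos that
are rendered as text rather than as pictures would not be found this way.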

On Tue, Dec 29, 2009 at 2:59 PM, Alan Gauld <alan.gauld at btinternet.com> wrote:

> "Shashwat Anand" <anand.shashwat at gmail.com> wrote
>
>
>> I need to make a database from some PDFs. I need to extract logos as well
>> as the information (i.e. name, address) beneath each logo and put it into
>> a database. The logo can be text as well as a picture, as shown in two
>> screenshots of one of the sample PDF files:
>> http://imagebin.org/77378
>> http://imagebin.org/77379
>>
>
> You could try PDFMiner to extract direct from the PDF using Python.
>
>
>> Will converting to HTML be a good option? Later on I need to apply some
>> image processing too. What would be the ideal way to go about it?
>>
>
> Converting to HTML (assuming you have a tool to do that!) may be better
> since there is a wider choice of tools and more experience to help you.
> Or there are various commercial tools for converting PDF into Word etc.
>
> I've never personally had to extract data from a PDF; I've always had
> access to the source documents, so I can't comment on how effective each
> approach is...
>
> --
> Alan Gauld
> Author of the Learn to Program web site
> http://www.alan-g.me.uk/