pdf library.

Wed Jan 2 06:23:12 EST 2008

On Jan 1, 5:38 pm, Marc 'BlackJack' Rintsch <bj_... at gmx.net> wrote:
> On Tue, 01 Jan 2008 04:21:29 -0800, Shriphani wrote:
> > On Jan 1, 4:28 pm, Piet van Oostrum <p... at cs.uu.nl> wrote:
> >> >>>>> Shriphani <shripha... at gmail.com> (S) wrote:
> >> >S> I tried pyPdf for this and decided to get the pagelinks. The trouble
> >> >S> is that I don't know how to determine whether a particular page is the
> >> >S> first page of a chapter. Can someone tell me how to do this ?
>
> >> AFAIK PDF doesn't have the concept of "Chapter". If the document has an
> >> outline, you could try to use the first level of that hierarchy as the
> >> chapter starting points. But you don't have a guarantee that they really
> >> are chapters.
>
> > How would a PDF-to-HTML conversion work? I've seen Google's search
> > engine do it loads of times. It's just that running a 500-odd-page ebook
> > through one of those scripts might not be such a good idea.
>
> Heuristics?  Neither PDF nor HTML knows "chapters".  So it might be
> guesswork or just in your head.
>
> Ciao,
>         Marc 'BlackJack' Rintsch

I could parse the HTML and check for the words "unit" or "chapter" at
the beginning of a page. I am using pdftohtml on Debian, and it seems
to generate the HTML versions of PDFs quite fast. I have yet to run
a 500-page PDF through it, though.
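
A rough sketch of that check in Python, assuming the HTML for each page
is already available as a separate string (how the converter's output
gets split into pages is not covered here). The names page_text and
chapter_start_pages are made up for illustration, and this is only the
guesswork heuristic discussed above, not a reliable chapter detector.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text content of an HTML fragment, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


# "chapter" or "unit" as a whole word at the very start of the page text.
HEADING_RE = re.compile(r"^(chapter|unit)\b", re.IGNORECASE)


def page_text(page_html):
    """Return the page's text with tags dropped and whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(page_html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def chapter_start_pages(pages_html):
    """Return 1-based page numbers whose text begins with "chapter" or "unit"."""
    return [
        number
        for number, html in enumerate(pages_html, start=1)
        if HEADING_RE.match(page_text(html))
    ]
```

Pages that start with a running header or a page number would be missed,
and a page that merely opens with the word "Unit" would be a false hit,
so the result is best treated as a list of candidates to review.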
Regards,
Shriphani