I wonder if I would be able to collect data from such page using Python

Piet van Oostrum piet at vanoostrum.org
Thu Aug 22 00:54:36 EDT 2013


Comment Holder <commentholder at gmail.com> writes:

> Hi,
> I am totally new to Python. I noticed that there are many videos showing how to collect data with Python, but I am not sure whether I can accomplish my goal with it, so I would like to check before I start learning.
>
> Here is the example of the target page:
> http://and.medianewsonline.com/hello.html
> In this example, there are 10 articles.
>
> What exactly I need is the following:
> 1- Collect the article title, date, source, and contents.
> 2- Export the final results to Excel or a database client. That is, I need everything from step 1 for one article in a single row, with each item saved in a separate column. For example:
>
> Title1    Date1   Source1   Contents1
> Title2    Date2   Source2   Contents2
>
> I appreciate any advice regarding my case.
>
> Thanks & Regards//

Here is an attempt for you. It uses BeautifulSoup 4. It is written for Python 3.3, so if you want to use Python 2.x you will have to make some small changes, such as
    from urllib import urlopen
and probably something with the print statements.
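
The Python 2 adjustment mentioned above could be handled with a guarded import, so the same script runs on both versions. This is only a sketch and not taken from the attached extract.py: it tries the Python 3 location of urlopen first and falls back to the Python 2 one. Calling print with a single parenthesised argument works on both versions.

```python
# Sketch: import urlopen in a way that works on Python 3 and Python 2.
try:
    from urllib.request import urlopen  # Python 3 location
except ImportError:
    from urllib import urlopen  # Python 2 location

# A single parenthesised argument prints the same on both versions.
print("urlopen is available: %s" % callable(urlopen))
```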

The formatting in columns is left as an exercise for you. I wonder how you would want that with multi-paragraph contents.
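
One possible way to get one row per article with separate columns is to write CSV, which Excel can open. The sketch below uses only the standard csv module (Python 3). The record layout, the field names and the helper write_articles are assumptions made for illustration, and the sample data is invented, not scraped from the page. Multi-paragraph contents stay in a single cell because the csv module quotes fields that contain newlines.

```python
import csv
import io

# Hypothetical column names, matching the four items requested.
FIELDS = ["title", "date", "source", "contents"]

# Invented sample records standing in for the scraped articles.
articles = [
    {"title": "Title1", "date": "Date1", "source": "Source1",
     "contents": "First paragraph.\n\nSecond paragraph."},
    {"title": "Title2", "date": "Date2", "source": "Source2",
     "contents": "A single paragraph."},
]


def write_articles(records, stream):
    """Write one CSV row per article, one column per field."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)


# Write to an in-memory buffer here so the example has no side effects.
buffer = io.StringIO()
write_articles(articles, buffer)
print(buffer.getvalue())
```

To produce a real file, the buffer could be replaced by something like open("articles.csv", "w", newline="", encoding="utf-8"), where the file name is again just an example.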

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: extract.py
URL: <http://mail.python.org/pipermail/python-list/attachments/20130822/5602f876/attachment.ksh>
-------------- next part --------------

-- 
Piet van Oostrum <piet at vanoostrum.org>
WWW: http://pietvanoostrum.com/
PGP key: [8DAE142BE17999C4]

