[Tutor] parsing
Carlo Capuano
Carlo.Capuano at iter.org
Thu Jul 13 12:00:01 CEST 2006
Hi!
Take a look at http://www.crummy.com/software/BeautifulSoup/
BeautifulSoup is a Python module designed for parsing HTML.
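To make the idea concrete, here is a rough sketch of grouping a page's text by the tag that encloses it. It is not BeautifulSoup: to keep the example self-contained it uses the standard library's html.parser module (Python 3) as a stand-in. The names TextByTag and text_by_tag are hypothetical, the sample page is a shortened version of the one in the question quoted below, and the key rule (id_<id>, <tag>_<class>, otherwise the bare tag name) is my guess from the result shown in that question. It also returns one plain dictionary rather than a list of one-key dictionaries.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TextByTag(HTMLParser):
    """Group each piece of text under a key built from its enclosing tag."""

    def __init__(self):
        super().__init__()
        self.stack = []                  # (tag, key) pairs of currently open elements
        self.result = defaultdict(list)  # key -> list of text fragments

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        # Assumed key rule, inferred from the question's desired output.
        if attrs.get("id"):
            key = "id_" + attrs["id"]
        elif attrs.get("class"):
            key = tag + "_" + attrs["class"]
        else:
            key = tag
        self.stack.append((tag, key))

    def handle_endtag(self, tag):
        # Close the nearest open element with this name,
        # plus anything that was left open inside it.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            # Text belongs to the innermost open element only.
            self.result[self.stack[-1][1]].append(text)


def text_by_tag(html):
    parser = TextByTag()
    parser.feed(html)
    parser.close()
    return dict(parser.result)


page = """
<title>TITLE</title>
<body>
body_1
<h1>1_1</h1>
<div id=one>div_one_1</div>
<span class=sp_1>
sp_text
<div id=one>div_one_2</div>
</span>
<p>p_1</p>
body_2
<table>
<tr><td>td_1</td>
<td class=sp_2>td_2</td></tr>
</table>
</body>
"""

result = text_by_tag(page)
for key, texts in result.items():
    print(key, texts)
```

With BeautifulSoup the walk over the page would look different, since it builds a tree you can search instead of firing events, but the grouping logic would be much the same.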
Carlo
what is ITER? www.iter.org
>>
>> First, please excuse my English. It is not my native language, but I
>> hope that I will be able to describe my problem.
>>
>> I am new to Python for the web, and I want to do the following:
>>
>> Suppose I have an HTML page like this:
>> """
>> <title>TITLE</title>
>> <body>
>> body_1
>> <h1>1_1</h1>
>> <h2>2_1</h2>
>> <div id=one>div_one_1</div>
>> <p>p_1</p>
>> <p>p_2</p>
>> <div id=one>div_one_2</div>
>> <span class=sp_1>
>> sp_text
>> <div id=one>div_one_2</div>
>> <div id=one>div_one_3</div>
>> </span>
>> <h3>3_1</h3>
>> <h2>2_2</h2>
>> <p>p_3</p>
>> body_2
>> <h1>END</h1>
>> <table>
>> <tr><td>td_1</td>
>> <td class=sp_2>td_2</td>
>> <td>td_3</td>
>> <td>td_4</td></tr>
>> ...
>> </body>
>>
>> """
>>
>> I want to get all the info from this HTML into a dictionary that
>> looks like this:
>>
>> rezult = [{'title':['TITLE']},
>> {'body':['body_1', 'body_2']},
>> {'h1':['1_1', 'END']},
>> {'h2':['2_1', '2_2']},
>> {'h3':['3_1']},
>> {'p':['p_1', 'p_2']},
>> {'id_one':['div_one_1', 'div_one_2', 'div_one_3']},
>> {'span_sp_1':['sp_text']},
>> {'td':['td_1', 'td_3', 'td_4']},
>> {'td_sp_2':['td_2']},
>> ....
>> ]
>>
>> I hope you understand what I need.
>> Can you advise me on what approaches exist for solving tasks of this
>> type, and maybe show some practical examples?
>> Thanks in advance for any kind of help.
>>
>>
>>
>> Try ElementTree or Amara.
>> http://effbot.org/zone/element-index.htm
>> http://uche.ogbuji.net/tech/4suite/amara/
>>
>> If you only care about the contents, BeautifulSoup is the answer.
>>
>> Ismael
>> _______________________________________________
>> Tutor maillist - Tutor at python.org
>> http://mail.python.org/mailman/listinfo/tutor
>>