[Tutor] parsing

Thu Jul 13 12:00:01 CEST 2006

Hi!

Take a look at http://www.crummy.com/software/BeautifulSoup/

BeautifulSoup is a Python module designed for parsing HTML.
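
To give an idea of the approach, here is a rough sketch of grouping text by tag. It deliberately uses the standard library's html.parser instead of BeautifulSoup, so it runs without installing anything; BeautifulSoup would cope better with really messy pages. The names TagTextCollector and collect_text, and the key scheme ('id_<id>', '<tag>_<class>', else the tag name), are my own guesses based on the output you describe, and the result is a single dictionary rather than a list of one-key dictionaries.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class TagTextCollector(HTMLParser):
    """Group stripped text chunks under a key derived from the enclosing tag."""

    def __init__(self):
        super().__init__()
        self.result = defaultdict(list)
        self._stack = []  # (tag, key) pairs for the currently open elements

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        # Assumed key scheme: id wins over class, class wins over bare tag.
        if attrs.get("id"):
            key = "id_" + attrs["id"]
        elif attrs.get("class"):
            key = tag + "_" + attrs["class"]
        else:
            key = tag
        self._stack.append((tag, key))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self._stack:
            # Text belongs to the innermost open element.
            self.result[self._stack[-1][1]].append(text)


def collect_text(html):
    parser = TagTextCollector()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return dict(parser.result)


sample = """
<title>TITLE</title>
<body>
body_1
<h1>1_1</h1>
<div id=one>div_one_1</div>
<span class=sp_1>
sp_text
<div id=one>div_one_2</div>
</span>
<p>p_1</p>
body_2
<table><tr><td>td_1</td><td class=sp_2>td_2</td></tr></table>
</body>
"""

result = collect_text(sample)
for key, texts in result.items():
    print(key, texts)
```

Text that sits directly inside an element is filed under that element only, so 'sp_text' goes to 'span_sp_1' while the nested div text goes to 'id_one'.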

Carlo

what is ITER? www.iter.org

>>
>>		First, excuse my English... English is not my native language,
>>		but I hope that I will be able to describe my problem.
>>
>>		I am new to Python for the web, and I want to do something like this:
>>
>>		Suppose I have an HTML page like this:
>>		"""
>>		<title>TITLE</title>
>>		<body>
>>		body_1
>>		<h1>1_1</h1>
>>		<h2>2_1</h2>
>>		<div id=one>div_one_1</div>
>>		<p>p_1</p>
>>		<p>p_2</p>
>>		<div id=one>div_one_2</div>
>>		<span class=sp_1>
>>		sp_text
>>		<div id=one>div_one_2</div>
>>		<div id=one>div_one_3</div>
>>		</span>
>>		<h3>3_1</h3>
>>		<h2>2_2</h2>
>>		<p>p_3</p>
>>		body_2
>>		<h1>END</h1>
>>		<table>
>>		<tr><td>td_1</td>
>>		<td class=sp_2>td_2</td>
>>		<td>td_3</td>
>>		<td>td_4</td></tr>
>>		...
>>		</body>
>>
>>		"""
>>
>>		I want to get all the info from this HTML into a dictionary
>>		that looks like this:
>>
>>		result = [{'title':['TITLE']},
>>		{'body':['body_1', 'body_2']},
>>		{'h1':['1_1', 'END']},
>>		{'h2':['2_1', '2_2']},
>>		{'h3':['3_1']},
>>		{'p':['p_1', 'p_2', 'p_3']},
>>		{'id_one':['div_one_1', 'div_one_2', 'div_one_3']},
>>		{'span_sp_1':['sp_text']},
>>		{'td':['td_1', 'td_3', 'td_4']},
>>		{'td_sp_2':['td_2']},
>>		....
>>		]
>>
>>		I hope you understand what I need.
>>		Can you advise me on what approaches exist for solving tasks
>>		of this type, and maybe show some practical examples?
>>		Thanks in advance for any kind of help.
>>
>>
>>
>>	Try ElementTree or Amara.
>>	http://effbot.org/zone/element-index.htm
>>	http://uche.ogbuji.net/tech/4suite/amara/
>>
>>	If you only care about the contents, BeautifulSoup is the answer.
>>
>>	Ismael
>>	_______________________________________________
>>	Tutor maillist  -  Tutor at python.org
>>	http://mail.python.org/mailman/listinfo/tutor
>>