HTML parser in Python

Tue Jul 4 10:38:12 CEST 2006

Cesar Ortiz wrote:
> One option I have used is tidy + HTMLParser. However, as others have
> already pointed out, a less strict parser is preferable.
> I have not tried Beautiful Soup, but I have tried libxml2. libxml2 (which is
> written in C) ships Python bindings in its own distribution:
> http://xmlsoft.org/python.html.
> Another binding for libxml2 is http://codespeak.net/lxml/.

lxml also includes an HTML parser (it is the one from libxml2):

>>> from lxml import etree
>>> from urllib2 import urlopen
>>> sock = urlopen('http://www.google.com/')
>>> doc = etree.HTML(sock.read())
>>> etree.tostring(doc)
'<html><head><meta http-equiv="content-type" content="text/html;
charset=ISO-8859-1"/><title>Google</title>
(...)
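
The session above uses urllib2, so it is Python 2. As a rough sketch of the
same idea on Python 3, without touching the network, the snippet below feeds
a deliberately broken string to etree.HTML() and inspects the repaired tree.
The input string and the variable names are made up for illustration, and it
assumes lxml is installed.

```python
from lxml import etree

# Hypothetical, deliberately malformed input: unclosed <p> tags and
# no closing </body> or </html>.
broken_html = "<html><head><title>Test</title></head><body><p>Hello<p>World"

# etree.HTML() runs libxml2's HTML parser and returns the root <html> element.
doc = etree.HTML(broken_html)

# The parser closes the dangling tags, so the tree can be queried normally.
title = doc.findtext(".//title")
paragraphs = [p.text for p in doc.xpath("//p")]

# On Python 3, tostring() returns bytes by default, so decode before printing.
print(etree.tostring(doc, pretty_print=True).decode("ascii"))
```

Here title and paragraphs should come back as the text of the <title> element
and of each <p> element, even though the input never closed them.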


More documentation:
http://codespeak.net/svn/lxml/trunk/doc/api.txt


Regards,


-- 
Mikel Larreategi
mlarreategi at codesyntax.com

CodeSyntax
Azitaingo Industrialdea 3 K
E-20600 Eibar
Tel: (+34) 943 82 17 80