HTML parsing confusion

M.-A. Lemburg mal at egenix.com
Wed Jan 23 03:54:14 EST 2008


On 2008-01-23 01:29, Gabriel Genellina wrote:
> On Tue, 22 Jan 2008 19:20:32 -0200, Alnilam <alnilam at gmail.com> wrote:
> 
>> On Jan 22, 11:39 am, "Diez B. Roggisch" <de... at nospam.web.de> wrote:
>>> Alnilam wrote:
>>>> On Jan 22, 8:44 am, Alnilam <alni... at gmail.com> wrote:
>>>>>> Pardon me, but the standard issue Python 2.n (for n in range(5, 2,
>>>>>> -1)) doesn't have an xml.dom.ext ... you must have the
>>>>>> mega-monstrous 200-modules PyXML package installed. And you don't
>>>>>> want the 75Kb BeautifulSoup?
>>>> Ugh. Found it. Sorry about that, but I still don't understand why
>>>> there isn't a simple way to do this without using PyXML, BeautifulSoup
>>>> or libxml2dom. What's the point in having sgmllib, htmllib,
>>>> HTMLParser, and formatter all built in if I have to use someone
>>>> else's modules to write a couple of lines of code that achieve the
>>>> simple thing I want. I get the feeling that this would be easier if I
>>>> just broke down and wrote a couple of regular expressions, but it
>>>> hardly seems a 'pythonic' way of going about things.
>>> This is simply a gross misunderstanding of what BeautifulSoup or lxml
>>> accomplish. Dealing with mal-formatted HTML whilst trying to make _some_
>>> sense of it is by no means trivial. And just because you can come up
>>> with a few lines of code using regexes that work for your current
>>> use-case doesn't mean that they serve as a general html-fixing routine.
>>> Or do you think the rather long history and 75Kb of code for BS are
>>> because its creator wasn't aware of regexes?
>> I am, by no means, trying to trivialize the work that goes into
>> creating the numerous modules out there. However as a relatively
>> novice programmer trying to figure out something, the fact that these
>> modules are pushed on people with such zealous devotion that you take
>> offense at my desire to not use them gives me a bit of pause. I use
>> non-included modules for tasks that require them, when the capability
>> to do something clearly can't be done easily another way (eg.
>> MySQLdb). I am sure that there will be plenty of times where I will
>> use BeautifulSoup. In this instance, however, I was trying to solve a
>> specific problem which I attempted to lay out clearly from the
>> outset.
>>
>> I was asking this community if there was a simple way to use only the
>> tools included with Python to parse a bit of html.

There are many ways to parse HTML in Python. A common one is to use
mxTidy to convert the HTML into valid XHTML and then use ElementTree
to parse the data.

http://www.egenix.com/files/python/mxTidy.html
http://docs.python.org/lib/module-xml.etree.ElementTree.html
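Once the markup has been cleaned up into well-formed XHTML (the mxTidy
step above; the snippet below simply assumes that step has already
produced valid markup), the ElementTree half of the pipeline is just a
few lines:

```python
# Parse well-formed XHTML with the std-lib ElementTree module.
# Assumes the markup has already been tidied into valid XML.
import xml.etree.ElementTree as ET

xhtml = """<html>
  <body>
    <p class="intro">Hello, <b>world</b>!</p>
    <p>Second paragraph.</p>
  </body>
</html>"""

root = ET.fromstring(xhtml)

# Find all <p> elements anywhere in the tree.
for p in root.findall(".//p"):
    # itertext() yields the text of the element and all its children.
    print("".join(p.itertext()))
```

This prints "Hello, world!" and "Second paragraph.". Note that this
only works on valid XML/XHTML input; feed it raw real-world HTML and
ET.fromstring() will raise a ParseError, which is exactly why the tidy
step comes first.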

For simple tasks you can also use the HTMLParser module that's part
of the Python standard library.

http://docs.python.org/lib/module-HTMLParser.html
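For instance, a minimal sketch of a parser subclass that collects the
href of every link on a page (the try/except import also covers the
module's later Python 3 name, html.parser):

```python
# Collect the href attribute of every <a> tag using only the
# standard library's HTMLParser class.
try:
    from HTMLParser import HTMLParser      # Python 2 module name
except ImportError:
    from html.parser import HTMLParser     # renamed in Python 3

class LinkExtractor(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

parser = LinkExtractor()
parser.feed('<p>See <a href="http://python.org">Python</a> and '
            '<a href="http://www.egenix.com/">eGenix</a>.</p>')
print(parser.links)
```

Unlike the ElementTree route, HTMLParser tolerates somewhat sloppy
real-world HTML, but you only get a stream of start-tag/data/end-tag
events, not a tree, so you have to track any state yourself.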

Which tool to use really depends on the problem you are
trying to solve.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Jan 23 2008)
>>> Python/Zope Consulting and Support ...        http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ...             http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/
________________________________________________________________________

:::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,MacOSX for free ! ::::


   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
    D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
           Registered at Amtsgericht Duesseldorf: HRB 46611



More information about the Python-list mailing list