HTML parsing confusion

Tue Jan 22 16:20:32 EST 2008

On Jan 22, 11:39 am, "Diez B. Roggisch" <de... at nospam.web.de> wrote:
> Alnilam wrote:
> > On Jan 22, 8:44 am, Alnilam <alni... at gmail.com> wrote:
> >> > Pardon me, but the standard issue Python 2.n (for n in range(5, 2,
> >> > -1)) doesn't have an xml.dom.ext ... you must have the mega-monstrous
> >> > 200-modules PyXML package installed. And you don't want the 75Kb
> >> > BeautifulSoup?
>
> >> I wasn't aware that I had PyXML installed, and can't find a reference
> >> to having it installed in pydocs. ...
>
> > Ugh. Found it. Sorry about that, but I still don't understand why
> > there isn't a simple way to do this without using PyXML, BeautifulSoup
> > or libxml2dom. What's the point in having sgmllib, htmllib,
> > HTMLParser, and formatter all built in if I have to use someone
> > else's modules to write a couple of lines of code that achieve the
> > simple thing I want. I get the feeling that this would be easier if I
> > just broke down and wrote a couple of regular expressions, but it
> > hardly seems a 'pythonic' way of going about things.
>
> This is simply a gross misunderstanding of what BeautifulSoup or lxml
> accomplish. Dealing with mal-formatted HTML whilst trying to make _some_
> sense is by no means trivial. And just because you can come up with a few
> lines of code using rexes that work for your current use-case doesn't mean
> that they serve as general html-fixing-routine. Or do you think the rather
> long history and 75Kb of code for BS are because its creator wasn't aware
> of rexes?
>
> And it also makes no sense stuffing everything remotely useful into the
> standard lib. This would force development and release cycles to align,
> resulting in far fewer features and less stability than one could wish for.
>
> And to be honest: I fail to see where your problem is. BeautifulSoup is a
> single Python file. So whatever you carry with you from machine to machine,
> if it's capable of holding a file of your own code, you can simply put
> BeautifulSoup beside it - even if it was a floppy disk.
>
> Diez

I am, by no means, trying to trivialize the work that goes into
creating the numerous modules out there. However, as a relatively
novice programmer trying to figure out something, the fact that these
modules are pushed on people with such zealous devotion that you take
offense at my desire to not use them gives me a bit of pause. I use
non-included modules for tasks that require them, when the task
clearly can't be done easily another way (e.g. MySQLdb). I am sure
that there will be plenty of times where I will
use BeautifulSoup. In this instance, however, I was trying to solve a
specific problem which I attempted to lay out clearly from the
outset.

I was asking this community if there was a simple way to use only the
tools included with Python to parse a bit of HTML.

If the answer is no, that's fine. Confusing, but fine. If the answer
is yes, great. I look forward to learning from someone's example. If
you don't have an answer, or a positive contribution, then please
don't interject your angst into this thread.
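
For reference, the kind of standard-library-only answer asked for above might look roughly like the sketch below. It is an illustration under assumptions, not a solution to the exact problem from earlier in the thread: the task chosen here (collecting link targets and link text) and the names LinkCollector and extract_links are made up for the example. It uses the built-in HTMLParser class, which lives in the HTMLParser module on the Python 2.x versions discussed here and in html.parser on Python 3.

```python
try:
    from html.parser import HTMLParser  # Python 3 location
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.x location


class LinkCollector(HTMLParser):
    """Hypothetical helper: collect (href, text) pairs for <a href=...> elements."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text while inside an <a> that has an href
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return a list of (href, text) tuples found in the given HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

Calling extract_links() on a string of HTML returns a list of (href, text) tuples. This approach is fine for reasonably well-formed markup; it makes no attempt to repair badly broken tag soup, which is the job BeautifulSoup and similar packages take on.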