ElementTree: can't figure out a mismatched-tag error

Thu Jul 11 08:49:56 EDT 2013

On Thursday, July 11, 2013 8:25:13 PM UTC+8, F.R. wrote:
> On 07/11/2013 10:59 AM, F.R. wrote:
> > Hi all,
> >
> > I haven't been able to get up to speed with XML. I do examples from
> > the tutorials and experiment with variations. Time and time again I
> > fail with error messages I can't make sense of. Here's the latest
> > one. The url is "http://finance.yahoo.com/q?s=XIDEQ&ql=0". Ubuntu
> > 12.04 LTS, Python 2.7.3 (default, Aug  1 2012, 05:16:07) [GCC 4.6.3]
> >
> > >>> import xml.etree.ElementTree as ET
> > >>> tree = ET.parse('q?s=XIDEQ')  # output of wget http://finance.yahoo.com/q?s=XIDEQ&ql=0
> > Traceback (most recent call last):
> >   File "<pyshell#69>", line 1, in <module>
> >     tree = ET.parse('q?s=XIDEQ')
> >   File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1183, in parse
> >     tree.parse(source, parser)
> >   File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 656, in parse
> >     parser.feed(data)
> >   File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1643, in feed
> >     self._raiseerror(v)
> >   File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1507, in _raiseerror
> >     raise err
> > ParseError: mismatched tag: line 9, column 2
> >
> > Below are the first nine lines. The line numbers and the following space are
> > hand-edited in. Three dots stand for sections cut out to fit long
> > lines. Line 6 is a bunch of "meta" statements, all of which I show on
> > a separate line each in order to preserve the angled brackets. On all
> > lines the angled brackets have been preserved. The mismatched
> > character is the slash of the closing tag </head>. What could be wrong
> > with it? And if it is, what about fault tolerance?
> >
> > 1 <!DOCTYPE html PUBLIC "-//W3C//DTD  . . . /strict.dtd">
> > 2 <html lang="en-US">
> > 3 <head><meta http-equiv="Content-Type" content="text/html; charset=utf-8">
> > 4 <title>XIDEQ: Summary for EXIDE TECH NEW- Yahoo! Finance</title>
> > 5 <meta name="description" xml:space="default" content="View the basic XIDEQ . . .
> > 6 . . . other companies."><meta name="keywords" content="XIDEQ, EXIDE TECH . . .">
> >   <meta property="fb:app_id" content="118155468215844">
> >   <meta property="fb:admins" content="503762770,100001149693905">
> >   <meta property="og:type" content="company">
> >   <meta property="og:site_name" content="Yahoo! Finance">
> >   <meta property="og:title" content="Exide Technologies">
> >   <meta property="og:image" content="http://l.yimg.com/a/p/fi/31/09/00.jpg">
> >   <meta property="og:url" content="http://finance.yahoo.com/q?s=XIDEQ">
> >   <meta property="og:description" content="View the basic XIDEQ . . .
> > 7 other companies."><link rel="canonical" href="http://finance.yahoo.com/q?s=XIDEQ">
> > 8 <link rel="stylesheet" href="http://l.yimg.com/zz/ . . . type="text/css">
> > 9 </head>
> >    ^
> >     Mismatch!
> >
> > Thanks for suggestions
> >
> > Frederic
>
> Thank you all!
>
> I was a little apprehensive it could be a silly mistake. And so it was.
> I have BeautifulSoup somewhere. Having had no urgent need for it I
> remember shirking the learning curve.
>
> lxml seems to be a package with these components (from help (lxml)):
>
> PACKAGE CONTENTS
>      ElementInclude
>      _elementpath
>      builder
>      cssselect
>      doctestcompare
>      etree
>      html (package)
>      isoschematron (package)
>      objectify
>      pyclasslookup
>      sax
>      usedoctest
>
> I would start with "from lxml import html" and see what comes out.
>
> Break time now. Thanks again!
>
> Frederic

from lxml.html import parse

# target_url is the page address; here, the URL from the original question
target_url = "http://finance.yahoo.com/q?s=XIDEQ&ql=0"
root = parse(target_url).getroot()

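This gets you the root node of the element tree parsed from the URL. Unlike xml.etree.ElementTree, which requires well-formed XML and so fails on the unclosed <meta> and <link> tags in that page, lxml's HTML parser tolerates ordinary HTML. Conveniently, it also fetches the page itself when given a URL. If you want to control things like the socket timeout, though, you'll have to request the URL with urllib yourself and then feed the response to the parser.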
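
For the timeout case, something along these lines should work. It is only a sketch: the helper name fetch_root and the 10-second default are made up for illustration, and it is written for Python 3, where the call lives in urllib.request (on Python 2.7, as in the original post, urllib2.urlopen takes the same timeout argument).

```python
import urllib.request

from lxml import html


def fetch_root(url, timeout=10.0):
    """Download url with a socket timeout and return the parsed root element.

    fetch_root is a hypothetical helper name, not part of lxml.
    """
    # urlopen takes the timeout in seconds and raises an error if it expires.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        data = response.read()
    # html.fromstring accepts the raw bytes and copes with unclosed tags
    # such as <meta> and <link>.
    return html.fromstring(data)


if __name__ == "__main__":
    # URL taken from the original question; the page may no longer exist.
    root = fetch_root("http://finance.yahoo.com/q?s=XIDEQ&ql=0", timeout=5)
    print(root.tag)
    print(root.findtext(".//title"))
```

Doing the download separately also gives you a place to set headers or handle errors before anything reaches the parser.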