Re: getroot() problem

水静流深 1248283536 at qq.com
Mon Oct 24 00:30:41 EDT 2011


My computer has two operating systems installed.

1. Windows XP + Python 3.2

import lxml.html
sfile='http://finance.yahoo.com/q/op?s=A+Options'
root=lxml.html.parse(sfile).getroot()

This works.

import lxml.html
sfile='http://frux.wikispaces.com/'
root=lxml.html.parse(sfile).getroot()

This fails:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python32\lib\site-packages\lxml\html\__init__.py", line 692, in parse
    return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
  File "lxml.etree.pyx", line 2942, in lxml.etree.parse (src/lxml/lxml.etree.c:54187)
  File "parser.pxi", line 1528, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:79485)
  File "parser.pxi", line 1557, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:79768)
  File "parser.pxi", line 1457, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:78843)
  File "parser.pxi", line 997, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:75698)
  File "parser.pxi", line 564, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:71739)
  File "parser.pxi", line 645, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:72614)
  File "parser.pxi", line 583, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:71927)
IOError: Error reading file 'b'http://frux.wikispaces.com/'': b'failed to load external entity "http://frux.wikispaces.com/"'

2. Ubuntu 11.04 + Python 2.6

import lxml.html
sfile='http://frux.wikispaces.com/'
root=lxml.html.parse(sfile).getroot()

This works.
I find this very strange and cannot understand why the same code fails only on the first system.

------------------ Original Message ------------------
From: "Dave Angel"<d at davea.name>;
Sent: Monday, October 24, 2011, 9:22 AM
To: "1248283536"<1248283536 at qq.com>;
Cc: "lxml"<lxml at lxml.de>; "python-list"<python-list at python.org>;
Subject: Re: getroot()   problem

 
On 10/23/2011 09:06 PM, 水静流深 wrote:
> C:\Documents and Settings\peng>cd c:\python32
>
>
>
> C:\Python32>python
>
> Python 3.2.2 (default, Sep  4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win
>
> 32
>
> Type "help", "copyright", "credits" or "license" for more information.
>
>>>> import lxml.html
>
>>>> sfile='http://finance.yahoo.com/q/op?s=A+Options'
>
>>>> root=lxml.html.parse(sfile).getroot()
> there is no problem to  parse  :
>
>
> http://finance.yahoo.com/q/op?s=A+Options'
>
>
>
>
> why  i can not parse
>
> http://frux.wikispaces.com/  ??
>
>>>> import lxml.html
>
>>>> sfile='http://frux.wikispaces.com/'
>
>>>> root=lxml.html.parse(sfile).getroot()
>
> Traceback (most recent call last):
>
>    File "<stdin>", line 1, in<module>
>
>    File "C:\Python32\lib\site-packages\lxml\html\__init__.py", line 692, in parse
>
>
>
>      return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
>
>    File "lxml.etree.pyx", line 2942, in lxml.etree.parse (src/lxml/lxml.etree.c:5
>
> 4187)
>
>    File "parser.pxi", line 1528, in lxml.etree._parseDocument (src/lxml/lxml.etre
>
> e.c:79485)
>
>    File "parser.pxi", line 1557, in lxml.etree._parseDocumentFromURL (src/lxml/lx
>
> ml.etree.c:79768)
>
>    File "parser.pxi", line 1457, in lxml.etree._parseDocFromFile (src/lxml/lxml.e
>
> tree.c:78843)
>
>    File "parser.pxi", line 997, in lxml.etree._BaseParser._parseDocFromFile (src/
>
> lxml/lxml.etree.c:75698)
>
>    File "parser.pxi", line 564, in lxml.etree._ParserContext._handleParseResultDo
>
> c (src/lxml/lxml.etree.c:71739)
>
>    File "parser.pxi", line 645, in lxml.etree._handleParseResult (src/lxml/lxml.e
>
> tree.c:72614)
>
>    File "parser.pxi", line 583, in lxml.etree._raiseParseError (src/lxml/lxml.etr
>
> ee.c:71927)
>
> IOError: Error reading file 'b'http://frux.wikispaces.com/'': b'failed to load e
>
> xternal entity "http://frux.wikispaces.com/"'
>
>>> >
Double-spacing makes your message much harder to read. I can only
comment in a general way, in any case. Most HTML is malformed, and not
legal HTML. Although I don't have any experience with parsing it, I do
with XML, which has similar problems.

The first thing I'd do is to separate the loading of the byte string
from the website from the parsing of those bytes. Further, I'd make a
local copy of those bytes, so you can test repeatably. For
example, you could run the wget utility to copy the bytes locally and
create a file.
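
As a rough illustration of that separation, here is a minimal sketch
using only the standard library for the download step. The function
names (fetch_bytes, save_bytes, parse_saved) and the 30-second timeout
are assumptions made up for this example, not part of lxml or of the
original code. If the download step fails, the problem is in fetching
the page; if only the last step fails, the problem is in parsing.

```python
import urllib.request


def fetch_bytes(url, timeout=30):
    """Step 1: download the raw bytes; no parsing happens here."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def save_bytes(data, path):
    """Step 2: keep a local copy so every test run sees identical input."""
    with open(path, "wb") as f:
        f.write(data)


def parse_saved(path):
    """Step 3: parse the local copy; an error here is a parsing problem."""
    # Imported inside the function (an illustrative choice) so that
    # steps 1 and 2 can be tried even where lxml is not installed.
    import lxml.html

    with open(path, "rb") as f:
        return lxml.html.fromstring(f.read())
```

One possible way to use it: call fetch_bytes('http://frux.wikispaces.com/'),
pass the result to save_bytes together with a file name of your choice,
and then call parse_saved on that file as often as needed without
touching the network again.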
-- 

DaveA