Re: getroot() problem
水静流深
1248283536 at qq.com
Mon Oct 24 00:30:41 EDT 2011
My computer has two operating systems installed.
1. Windows XP + Python 3.2
import lxml.html
sfile='http://finance.yahoo.com/q/op?s=A+Options'
root=lxml.html.parse(sfile).getroot()
This works.
import lxml.html
sfile='http://frux.wikispaces.com/'
root=lxml.html.parse(sfile).getroot()
This fails:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python32\lib\site-packages\lxml\html\__init__.py", line 692, in parse
    return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
  File "lxml.etree.pyx", line 2942, in lxml.etree.parse (src/lxml/lxml.etree.c:54187)
  File "parser.pxi", line 1528, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:79485)
  File "parser.pxi", line 1557, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:79768)
  File "parser.pxi", line 1457, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:78843)
  File "parser.pxi", line 997, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:75698)
  File "parser.pxi", line 564, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:71739)
  File "parser.pxi", line 645, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:72614)
  File "parser.pxi", line 583, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:71927)
IOError: Error reading file 'b'http://frux.wikispaces.com/'': b'failed to load external entity "http://frux.wikispaces.com/"'
2. Ubuntu 11.04 + Python 2.6
import lxml.html
sfile='http://frux.wikispaces.com/'
root=lxml.html.parse(sfile).getroot()
This works.
I find this very strange and cannot understand it.
------------------ Original Message ------------------
From: "Dave Angel"<d at davea.name>;
Sent: Monday, October 24, 2011, 9:22 AM
To: "1248283536"<1248283536 at qq.com>;
Cc: "lxml"<lxml at lxml.de>; "python-list"<python-list at python.org>;
Subject: Re: getroot() problem
On 10/23/2011 09:06 PM, 水静流深 wrote:
> C:\Documents and Settings\peng>cd c:\python32
>
>
>
> C:\Python32>python
>
> Python 3.2.2 (default, Sep 4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win
>
> 32
>
> Type "help", "copyright", "credits" or "license" for more information.
>
>>>> import lxml.html
>
>>>> sfile='http://finance.yahoo.com/q/op?s=A+Options'
>
>>>> root=lxml.html.parse(sfile).getroot()
> there is no problem to parse :
>
>
> http://finance.yahoo.com/q/op?s=A+Options'
>
>
>
>
> why i can not parse
>
> http://frux.wikispaces.com/ ??
>
>>>> import lxml.html
>
>>>> sfile='http://frux.wikispaces.com/'
>
>>>> root=lxml.html.parse(sfile).getroot()
>
> Traceback (most recent call last):
>
> File "<stdin>", line 1, in<module>
>
> File "C:\Python32\lib\site-packages\lxml\html\__init__.py", line 692, in parse
>
>
>
> return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
>
> File "lxml.etree.pyx", line 2942, in lxml.etree.parse (src/lxml/lxml.etree.c:5
>
> 4187)
>
> File "parser.pxi", line 1528, in lxml.etree._parseDocument (src/lxml/lxml.etre
>
> e.c:79485)
>
> File "parser.pxi", line 1557, in lxml.etree._parseDocumentFromURL (src/lxml/lx
>
> ml.etree.c:79768)
>
> File "parser.pxi", line 1457, in lxml.etree._parseDocFromFile (src/lxml/lxml.e
>
> tree.c:78843)
>
> File "parser.pxi", line 997, in lxml.etree._BaseParser._parseDocFromFile (src/
>
> lxml/lxml.etree.c:75698)
>
> File "parser.pxi", line 564, in lxml.etree._ParserContext._handleParseResultDo
>
> c (src/lxml/lxml.etree.c:71739)
>
> File "parser.pxi", line 645, in lxml.etree._handleParseResult (src/lxml/lxml.e
>
> tree.c:72614)
>
> File "parser.pxi", line 583, in lxml.etree._raiseParseError (src/lxml/lxml.etr
>
> ee.c:71927)
>
> IOError: Error reading file 'b'http://frux.wikispaces.com/'': b'failed to load e
>
> xternal entity "http://frux.wikispaces.com/"'
>
>>> >
Double-spacing makes your message much harder to read. I can only
comment in a general way, in any case. Most HTML is malformed and not
valid HTML. Although I don't have any experience with parsing it, I do
with XML, which has similar problems.
The first thing I'd do is separate loading the byte string from the
website from parsing those bytes. Further, I'd make a local copy of
those bytes so you can test repeatably. For example, you could run the
wget utility to copy the bytes locally and create a file.
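A rough sketch of that separation in Python 3, using urllib.request in place of wget. The function names and the file name frux.html are made up for illustration, and this only isolates the download step from the parse step; it is not a diagnosis of why the Windows run fails.

```python
import urllib.request
from pathlib import Path


def fetch_bytes(url, timeout=30):
    """Download the raw bytes of a page; no parsing happens here."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def save_local_copy(data, path):
    """Write the bytes to disk so later test runs see identical input."""
    path = Path(path)
    path.write_bytes(data)
    return path


def parse_local_copy(path):
    """Parse the saved bytes with lxml, independent of the network."""
    # lxml is third-party; importing it here keeps the download step
    # usable even on a machine where lxml is not installed.
    import lxml.html
    return lxml.html.fromstring(Path(path).read_bytes())
```

Typical use would be data = fetch_bytes('http://frux.wikispaces.com/'), then save_local_copy(data, 'frux.html'), then root = parse_local_copy('frux.html'). If the first call raises, the problem is in the download; if only the last one does, the problem is in the parsing, and the saved file can be re-parsed as often as needed.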
--
DaveA