Installing Parsers/Tree Builders to, and accessing these packages from Python2.7

Sun Nov 2 17:31:48 EST 2014

On 02/11/2014 21:59, Simon Evans wrote:
>
> Oh I don't mind quoting console output, I just thought I'd be sparing you
>
> unnecessary detail.
>
> output was going nicely as I input text from my 'Getting Started with
>
> Beautiful Soup' even when the author reckoned things would go wrong - due to
>
> lxml not being installed, things went right, because I had already installed
>
> it, re:
> ----------------------------------------------------------------------------
> page 17
> ----------------------------------------------------------------------------
> Python 2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)] on win
> 32
> Type "help", "copyright", "credits" or "license" for more information.
>>>> import urllib2
>>>> from bs4 import BeautifulSoup
>>>> url = "http://www.packtpub.com/books"
>>>> page = urllib2.urlopen(url)
>>>> soup_packtpage = BeautifulSoup(page)
>>>> with open("foo.html","r") as foo_file:
> ... soup_foo = Soup(foo_file)
>    File "<stdin>", line 2
>      soup_foo = Soup(foo_file)
>             ^
> IndentationError: expected an indented block
>>>> soup_foo= BeautifulSoup("foo.html")
> ----------------------------------------------------------------------------
> page 18
> ----------------------------------------------------------------------------
>>>> print(soup_foo)
> <html><body><p>foo.html</p></body></html>
>>>> soup_url = BeautifulSoup("http://www.packtpub.com/books")
>>>> print(soup_url)
> <html><body><p>http://www.packtpub.com/books</p></body></html>
>>>> helloworld = "<p>Hello World</p>"
>>>> soup_string = BeautifulSoup(helloworld)
>>>> print(soup_string)
> <html><body><p>Hello World</p></body></html>
> ----------------------------------------------------------------------------
> page 19: no code in text on this page
> ----------------------------------------------------------------------------
> page 20
> ----------------------------------------------------------------------------
>>>> soup_xml = BeautifulSoup(helloworld,features= "xml")
>>>> soup_xml = BeautifulSoup(helloworld,"xml")
>>>> print(soup_xml)
> <?xml version="1.0" encoding="utf-8"?>
> <p>Hello World</p>
>>>> soup_xml = BeautifulSoup(helloworld,features = "xml")
>>>> print(soup_xml)
> <?xml version="1.0" encoding="utf-8"?>
> <p>Hello World</p>
>>>>
> ----------------------------------------------------------------------------
> Then at the bottom of page 20 it says 'we should install the required parsers using easy_install, pip or setup.py install', but as I can't get the downloads of the html or html5 parsers, the code from the text halfway down returns the usual response about the requisite parser needing to be installed, re:
> ----------------------------------------------------------------------------
> page 21
> ----------------------------------------------------------------------------
>>>> invalid_html = '<a invalid content'
>>>> soup_invalid_html = BeautifulSoup(invalid_html,'lxml')
>>>> print(soup_invalid_html)
> <html><body><a content="" invalid=""></a></body></html>
>>>> soup_invalid_html = BeautifulSoup(invalid_html,'html5lib')
> Traceback (most recent call last):
>    File "<stdin>", line 1, in <module>
>    File "C:\Python27\lib\site-packages\bs4\__init__.py", line 155, in __init__
>      % ",".join(features))
> ValueError: Couldn't find a tree builder with the features you requested: html5lib. Do you need to install a parser library?
>>>>

Have you tried this from the command prompt?

pip install html5lib
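
If you're not sure which parser libraries your Python can actually see, a quick check along these lines may help. It's only a sketch: available_parsers is a name made up for illustration, and it assumes each library's import name matches the feature name you pass to BeautifulSoup, which holds for lxml and html5lib. "html.parser" is listed unconditionally because it is built on the standard library.

```python
import importlib


def available_parsers(candidates=("lxml", "html5lib")):
    """Return the parser names from `candidates` that can be imported here.

    Hypothetical helper for illustration; it is not part of bs4.
    """
    # "html.parser" relies only on the standard library, so it needs no install.
    found = ["html.parser"]
    for name in candidates:
        try:
            importlib.import_module(name)
        except ImportError:
            # Not installed for this interpreter; skip it.
            continue
        found.append(name)
    return found


# Any name printed here should be accepted as the second argument to
# BeautifulSoup(markup, name) without the "Couldn't find a tree builder" error.
print(available_parsers())
```

If html5lib is still missing from the list after the pip command has run, pip has most likely installed it for a different Python than the one you are starting from the console.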

And please do something about the extra newlines and single-line
paragraphs above; there's no need for any of them.

-- 
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence