beautifulSoup 4.1

Denis McMahon denismfmcmahon at gmail.com
Fri Mar 20 03:41:24 EDT 2015


On Fri, 20 Mar 2015 00:18:33 -0700, Sayth Renshaw wrote:

> Just finding it odd that the next sibling is a "\n" and not the next
> <td> otherwise that would be the perfect solution.

Whitespace between elements creates a node in the parsed document. This 
is correct, because whitespace between elements will be interpreted as 
whitespace by a browser.

<a href="blah1">text1</a><a href="blah2">text2</a>

will be displayed differently to

<a href="blah1">text1</a> <a href="blah2">text2</a>

in a browser, because the space between the two <a> elements in the
second case is a text node in the DOM.

A newline has the same effect (to a browser, for display purposes, it's
just whitespace), but in the DOM the text node will contain the newline
rather than a space.
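
As a rough illustration of the three cases, here is a minimal sketch. It
assumes bs4 is installed and uses the stdlib "html.parser" backend; the
helper name sibling_after_first_a is made up for this example.

```python
from bs4 import BeautifulSoup


def sibling_after_first_a(markup):
    """Return whatever node directly follows the first <a> element."""
    soup = BeautifulSoup(markup, "html.parser")
    return soup.a.next_sibling


# No whitespace between the elements: the next sibling is the second <a> tag.
adjacent = sibling_after_first_a(
    '<a href="blah1">text1</a><a href="blah2">text2</a>')

# A space between the elements: the next sibling is a text node holding " ".
spaced = sibling_after_first_a(
    '<a href="blah1">text1</a> <a href="blah2">text2</a>')

# A newline between the elements: the text node holds "\n" instead.
newline = sibling_after_first_a(
    '<a href="blah1">text1</a>\n<a href="blah2">text2</a>')

print(type(adjacent).__name__, adjacent.name)
print(type(spaced).__name__, repr(str(spaced)))
print(type(newline).__name__, repr(str(newline)))
```

In the first case the sibling is a Tag; in the other two it is a
NavigableString, which is the text node described above.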

bs4 tries to parse the HTML the same way a browser does, so you get all
the text nodes, including the whitespace between elements, which includes
any newlines.
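
If what you actually want is the next <td>, one option is to ask for the
next sibling tag rather than the next sibling node. A small sketch,
assuming bs4 with the "html.parser" backend and a made-up table (the
markup here is not from the original page):

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical markup with newlines between the cells.
html = "<table><tr>\n<td>first</td>\n<td>second</td>\n</tr></table>"
soup = BeautifulSoup(html, "html.parser")

first_td = soup.find("td")

# .next_sibling returns the very next node, which here is the "\n" text node.
raw_next = first_td.next_sibling

# find_next_sibling("td") skips text nodes and returns the next <td> tag.
tag_next = first_td.find_next_sibling("td")

print(repr(str(raw_next)))
print(tag_next.get_text())
```

find_next_sibling() only matches tags when given a tag name, so the
whitespace nodes stay in the tree but no longer get in the way.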

-- 
Denis McMahon, denismfmcmahon at gmail.com


