[Tutor] beautiful soup raw text workarounds?

nathan Smith nathan-tech at hotmail.com
Wed Aug 25 14:39:11 EDT 2021


Hi! I sure can :)


So as stated previously:

tags = soup.find_all()


will get you a list of the tags in some HTML text. Raw text, however, 
i.e. text that is not itself a tag, is a different kind of object and is 
not included in that list.
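As a rough illustration of the difference (the sample HTML here is made up, and "html.parser" is just the parser assumed for the sketch), find_all() with no arguments returns Tag objects only, while the text nodes can be listed separately with find_all(string=True):

```python
from bs4 import BeautifulSoup

# Made-up sample document: two pieces of loose text directly under <body>
html = "<html><body>loose text<p>inside a paragraph</p>more loose text</body></html>"
soup = BeautifulSoup(html, "html.parser")

# find_all() with no arguments matches every tag, and only tags
tag_names = [tag.name for tag in soup.find_all()]

# Text nodes are separate objects in the tree; string=True matches them
text_nodes = [str(s) for s in soup.find_all(string=True)]

print(tag_names)
print(text_nodes)
```

So the raw text does sit under the body tag in the tree, but as its own node rather than as a tag.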


The method I used, and will explain below, is probably unnecessary, as 
BeautifulSoup arranges itself as a tree, so the body tag is available 
as soup.html.body. For my purposes, though, what I did was:


1. Walk the tree from the top downward, collecting the children along 
the way and compiling them into a list:


import bs4

def extract_tags(element):
    t = [element]  # include the parent object itself
    # Strings (including subclasses such as Comment, Stylesheet and
    # Doctype) are leaves: they do not and cannot have children
    if isinstance(element, bs4.element.NavigableString):
        return t
    for child in element.children:
        t.extend(extract_tags(child))  # recurse into each child
    return t


The function above recursively collects every element under a parent, so 
to get all the elements (elements being tags and raw strings) you simply do:


from bs4 import BeautifulSoup

soup = BeautifulSoup(your_html_code, "html.parser")  # name the parser explicitly

full_list = extract_tags(soup)
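For comparison, here is a sketch of what looks like a built-in way to get much the same list: the .descendants generator walks the tree in document order, so putting the soup object in front of it should give the parent followed by every tag and string. The sample HTML is made up for the example.

```python
from bs4 import BeautifulSoup

# Made-up sample document
html = "<p>one<b>two</b></p>"
soup = BeautifulSoup(html, "html.parser")

# The soup object itself, followed by everything beneath it in document order
full_list = [soup, *soup.descendants]

# Show what kind of object each entry is
kinds = [type(node).__name__ for node in full_list]
print(kinds)
```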


Then if you wanted to list only raw strings you could do:


for x in full_list:
    # exact type match, so subclasses such as Comment are skipped
    if type(x) == bs4.element.NavigableString:
        print(str(x.string))
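One detail worth spelling out: Comment is a subclass of NavigableString, so comparing the exact type leaves HTML comments out, whereas isinstance() would let them through. A small sketch with made-up HTML, using .descendants to walk the tree:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Made-up sample document containing a comment between two pieces of text
html = "<div>visible<!-- hidden note --><span>also visible</span></div>"
soup = BeautifulSoup(html, "html.parser")

# Exact type check: plain text nodes only, comments are skipped
raw = [str(n) for n in soup.descendants if type(n) is NavigableString]

# isinstance check: also matches subclasses such as Comment
all_strings = [str(n) for n in soup.descendants if isinstance(n, NavigableString)]

print(raw)
print(all_strings)
```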


You have to use str(x.string) because Beautiful Soup has its own 
subclass of str, NavigableString (I think that's the correct 
terminology), and from my experience today, Python will throw a fit if 
you try to combine it with a regular string. Converting with str() gives 
you a plain Python string that no longer carries a reference to the 
parse tree.
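A minimal sketch of that conversion, again with made-up HTML: the node taken from the tree is a NavigableString that still knows its parent tag, while the result of str() is an ordinary string with no link back to the tree.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<p>some text</p>", "html.parser")

node = soup.p.string   # a NavigableString attached to the tree
plain = str(node)      # a plain Python str, detached from the tree

print(type(node).__name__, type(plain).__name__)
print(node.parent.name)          # the text node still knows its parent tag
print(hasattr(plain, "parent"))  # the plain string does not
```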


I hope this helps someone! :)

Nathan


On 25/08/2021 12:34, Alan Gauld via Tutor wrote:
> On 24/08/2021 21:15, nathan Smith wrote:
>> I actually fixed this myself.
> Good, but it would be useful to share how, for the
> benefit of future readers...
>
> Surely "raw text" is still inside a tag, even if
> it's only the top-level <body> tag?
>
>> tags=soup.find_all()
>> which returns tags only.
>>
>> Raw text are not tags.
> So how did you extract it?
>

