[Tutor] Get lines of web page properly segmented

Fri Oct 22 12:06:24 EDT 2021

Hey,

This is something I have been researching for a long time, and it’s
surprisingly challenging. I would really appreciate anybody who can help me
finally resolve this question.

I want to segment text into sentences, but the text sometimes contains lines
that don’t end in a period - for example, the title of a section, or lines
of code.

Usually when I retrieve text from a webpage, the sentences are broken up
with newlines, like this:

And Jeffrey went
to Paris that
weekend to meet
his family.

I cannot simply split the text on newlines when the sentences themselves are
broken by newlines, but I do need to preserve the separation between
different lines of code, like:

print(x)
x = 3
quit()

I think I can either focus on getting the text from the source in a cleaner
form, so that the sentences are already whole and not broken, or I have to
find an efficient way to automatically join broken sentences while leaving
everything else (titles, lines of code) untouched.

I thought I could get better quality text by using Beautiful Soup, but I
just tried the .get_text() method and I was surprised to find that the
sentences are still broken by newlines. Maybe there are newlines even in
the HTML, or maybe there were HTML tags embedding links in the text, and
Beautiful Soup adds newlines when it extracts text.
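
To illustrate the first possibility (newlines that are already in the HTML
source), here is a minimal sketch of what I imagine a cleaner extraction
could look like. It uses only the standard library's html.parser instead of
Beautiful Soup. The names BlockTextExtractor, extract_blocks and BLOCK_TAGS
are my own, and the list of block tags is an assumption about the pages
involved. It collapses whitespace inside each block element and keeps the
lines of a <pre> element separate. It does not skip <script> or <style>
content, so it is a starting point at best.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as one block of text. This set is an
# assumption; extend it for the pages being scraped.
BLOCK_TAGS = {"p", "h1", "h2", "h3", "h4", "h5", "h6", "li", "div"}


class BlockTextExtractor(HTMLParser):
    """Collect one string per block element; keep <pre> lines separate."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished blocks of text
        self._buffer = []      # text pieces of the block being read
        self._in_pre = False   # True while inside a <pre> element

    def _flush(self):
        text = "".join(self._buffer)
        self._buffer = []
        if self._in_pre:
            # Preserve line breaks inside <pre>: one entry per line.
            self.blocks.extend(ln for ln in text.splitlines() if ln.strip())
        else:
            # Collapse newlines and runs of spaces into single spaces.
            text = " ".join(text.split())
            if text:
                self.blocks.append(text)

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self._flush()
            self._in_pre = True
        elif tag in BLOCK_TAGS and not self._in_pre:
            self._flush()

    def handle_endtag(self, tag):
        if tag == "pre":
            self._flush()
            self._in_pre = False
        elif tag in BLOCK_TAGS and not self._in_pre:
            self._flush()

    def handle_data(self, data):
        # Inline tags such as <a> do not flush, so link text stays inline.
        self._buffer.append(data)


def extract_blocks(html):
    parser = BlockTextExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()  # emit any trailing text outside a block element
    return parser.blocks
```

With a paragraph whose source is wrapped over several lines, followed by a
<pre> block, extract_blocks(html) should return the paragraph as a single
string and each code line as its own entry.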

Can anyone provide a working example of extracting webpage content so that
sentences are not broken with newlines?

Or, if broken sentences are inevitable, what is an effective way to join
them automatically without joining anything else? I think I’ll need AI for
this.
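
Short of AI, the simplest rule-based version I can think of is sketched
below. It is only a heuristic under assumptions I am making up: a line
counts as code if it contains characters such as = ( ) { } [ ] ; and a
sentence counts as finished when a line ends in . ! or ?. The names
join_broken_lines, SENTENCE_END and CODE_HINT are my own. It will get prose
with parentheses wrong, and a title is only kept separate if a blank line
follows it.

```python
import re

# Assumption: a sentence ends with . ! or ?, optionally followed by a
# closing quote or bracket.
SENTENCE_END = re.compile(r"[.!?][\"')\]]*$")

# Assumption: these characters suggest a line of code rather than prose.
CODE_HINT = re.compile(r"[=(){}\[\];]")


def join_broken_lines(text):
    """Join wrapped prose lines; leave code-like lines on their own."""
    result = []
    buffer = []  # pieces of the sentence currently being assembled

    def flush():
        if buffer:
            result.append(" ".join(buffer))
            buffer.clear()

    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            # A blank line always ends the current sentence or title.
            flush()
        elif CODE_HINT.search(stripped):
            # Code-like line: never join it with its neighbours.
            flush()
            result.append(line.rstrip())
        else:
            buffer.append(stripped)
            if SENTENCE_END.search(stripped):
                flush()
    flush()
    return result
```

Calling join_broken_lines on the two examples above (the Jeffrey sentence
followed by the three lines of code) should give one joined sentence and
three separate code lines.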

I would really appreciate help from anyone who can offer it.

Thanks very much,
Julius