[Tutor] Coding question

Alan Gauld alan.gauld at yahoo.co.uk
Sat Nov 28 17:08:15 EST 2020


On 28/11/2020 16:31, Erin Anderson wrote:
> Hello, I am trying to figure out how to code for reading in text from a URL in python but in two chunks.

There are ways to do that but it's not normal. A URL delivers a
stream (or streams) of data and you usually have to read the
entire stream. If you start chunking it up you will break
the HTML formatting (start/end tags etc) and make parsing
much more difficult. HTML is not line oriented.

> I am looking at a transcript from a website and I want 
> to read in the text but I want the reading in of the text 
> to stop when the transcript says “Part 2”, 

The usual way is to read the entire stream then find the bits
you want within that using an HTML parser. It does mean all
the text is in memory but on any modern computer that's not
usually an issue!

> I then want to have this chunk of information as one entity 
> and then create another entity filled with the text that 
> occurs after the words “Part 2”. 

In that case it's all in memory anyway so you might as well
use the standard tools and save yourself a world of pain
and anguish!


> Im thinking one way to do this is using a while loop, 

That's almost certainly wrong. Use an HTML parser - either
BeautifulSoup or the standard library html.parser module(*).
Use that to find the tag/class that you are looking for
and select the elements you need.

On the other hand if you are only interested in the text
content rather than the document structure then export
the HTML as plain text and use regular text processing
tools to search/slice it. But usually the parser approach
will be faster and easier.
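
As a rough illustration of the text-only route, here is a
minimal sketch using the standard library html.parser module.
The names TextExtractor and html_to_text, and the sample
markup, are invented for the example; the real page will
almost certainly be structured differently.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the visible text of a page, skipping script/style."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside script/style.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the text content of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Hypothetical markup standing in for the transcript page.
sample = (
    "<html><body><h1>Part 1</h1><p>Hello there.</p>"
    "<h1>Part 2</h1><p>Goodbye.</p></body></html>"
)
plain = html_to_text(sample)
```

The result is an ordinary string, so the usual string methods
can then be used to search or slice it.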

> Def text_chunk(url)
> 	webpage=web.urlopen(url)
> 	while text != “Part 2”:

You haven't defined text anywhere so this will fail
with an error. And if it passed you'd just read the
entire page into text each time round the loop.
Just do it once.
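
A minimal sketch of reading the page once, with no loop,
might look like the following. I am assuming that "web" in
your code is urllib.request. The function name fetch_text
and its opener parameter are my own additions (the latter
only so the function can be tried without a network
connection), and utf-8 is taken from your code rather than
from the page itself.

```python
import urllib.request


def fetch_text(url, opener=urllib.request.urlopen, encoding="utf-8"):
    """Read the whole page in one go and return it as a string."""
    with opener(url) as webpage:    # 'with' closes the connection for us
        raw_bytes = webpage.read()  # a single read returns the full body
    return raw_bytes.decode(encoding)
```

Calling fetch_text(url) with a real URL returns the complete
page source as one string.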

> 		rawbytes=webpage.read()

At this point you have already read the entire page into
memory so there is no point in trying to stop the read.

> 		webpage.close()
> 		text = rawBytes.decode('utf-8’)
> 	return text

Now you have text as a string you can try to find your
string within it.
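
For a plain string, one simple way to get the two pieces is
str.partition, which splits at the first occurrence of the
marker. This is only a sketch: split_at_marker and the sample
text are made up for illustration, and it assumes "Part 2"
does not also appear earlier on the page (in a table of
contents, say).

```python
def split_at_marker(text, marker="Part 2"):
    """Return (before, after) around the first occurrence of marker.

    If the marker is not found, 'before' is the whole text and
    'after' is an empty string.
    """
    before, _found, after = text.partition(marker)
    return before, after


# Hypothetical transcript text.
text = "Part 1 Good evening. Part 2 Welcome back."
first, second = split_at_marker(text)
```

Here first holds everything up to the marker and second holds
everything after it; the marker itself is dropped.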

But better would be to read the bytes into a parser
then use that to pull out the bits you need. Since
you don't describe what you want to do (other than
split the page) we can't really advise how to
proceed beyond that.

The more specific you are about what you are trying
to do (rather than how you are trying to do it!) the
more likely we are to be able to help.

(*) There is a basic intro on how to use the html.parser
module in the "Writing Web Clients" topic of my
tutorial (see below)

-- 
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos



