[Tutor] Removing Content from Lines....

Alan Gauld alan.gauld at btinternet.com
Thu Mar 24 21:12:41 EDT 2016


On 24/03/16 21:03, Sam Starfas via Tutor wrote:
> I have written some python code to gather a bunch
> of sentences from many files.
> These sentences contain the content:

OK, they are presumably lines and not sentences.
Lines end in a newline character and sentences end
with a period. They are fundamentally different and
need to be handled differently, so you need to be
precise about which one you mean.

> blah blah blah blah <uicontrol>1-up printing</uicontrol>blah blah blah blah blah blah blah blah
> blah blah <uicontrol>Preset</uicontrol>blah blah blah blah

Also, they look like they might be out of an XML or HTML file?
In that case the easiest way is probably to use a parser for
the original data type (etree for XML or Beautiful Soup for HTML,
for example). That's much easier than trying to do it yourself.
It may involve going back a step and not extracting the lines first...
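
As a rough sketch of the parser route, assuming the data really is
well-formed XML: the standard library's xml.etree.ElementTree can pull
out the text of every <uicontrol> element. The function name and the
wrapping <root> element are made up for illustration; with a real file
you would parse the whole document instead of one line at a time.

```python
import xml.etree.ElementTree as ET

def uicontrol_texts(fragment):
    """Return the text of each <uicontrol> element in an XML fragment."""
    # A bare line is not a complete XML document, so wrap it in a
    # hypothetical root element before parsing.
    root = ET.fromstring("<root>" + fragment + "</root>")
    # el.text is None for an empty element, so fall back to "".
    return [el.text or "" for el in root.iter("uicontrol")]

line = "blah blah <uicontrol>Preset</uicontrol>blah blah blah blah"
print(uicontrol_texts(line))
```

This raises xml.etree.ElementTree.ParseError if the fragment is not
well-formed, which is itself a useful check on the data.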

If it's not a recognised format like XML then you may need to
do it manually. In that case, if the formatting is as precise
as you show (no extra spaces etc.) then you can simply use string
methods to locate the ends of the tags.

opentag = '<uicontrol>'
endtag = '</uicontrol>'
start = my_string.find(opentag) + len(opentag)

find() returns the position of the opening < (or -1 if the tag is
not found, so real code should check the result for that).
Adding the length of the tag then gives the start of your wanted text.

Similarly

end = my_string.find(endtag)

locates the start of the end tag.

You can then use string slicing to get the bit in between.

data = my_string[start:end]
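
Putting those steps together, here is a minimal sketch that also copes
with a missing tag. The function name text_between is invented for
illustration, and it only returns the first tagged item in the string.

```python
def text_between(my_string, opentag="<uicontrol>", endtag="</uicontrol>"):
    """Return the text between the first opentag/endtag pair, or None."""
    pos = my_string.find(opentag)
    if pos == -1:          # no opening tag in this string
        return None
    start = pos + len(opentag)
    # Search for the end tag only after the opening tag.
    end = my_string.find(endtag, start)
    if end == -1:          # opening tag was never closed
        return None
    return my_string[start:end]

line = "blah blah <uicontrol>Preset</uicontrol>blah blah blah blah"
print(text_between(line))
```

Passing start as the second argument to find() avoids picking up a
stray end tag that appears before the opening tag.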

If the tags are not that clean then you might need to use regular
expressions to do it, and that's a whole new level of complexity.

Things you need to be clear about:
1) are there any irregularities in how the tags are spelled?
   (e.g. spaces, caps etc.)
2) do the tags ever have attributes?
3) can there be multiple tags in a single line/sentence?
4) can tags be nested?
5) can tags cross line/sentence boundaries?
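
To give a flavour of the regular expression route, here is a hedged
sketch that tolerates mixed case, attributes on the opening tag, and
several tags per line (questions 1 to 3). It does not handle nested
tags (question 4), and the pattern and function name are my own
guesses at what your data needs, not a general solution.

```python
import re

# Opening tag with optional attributes, lazily captured body, end tag.
# IGNORECASE covers odd capitalisation; DOTALL lets the body span
# newlines if the text has not been split into lines.
UICONTROL_RE = re.compile(
    r"<uicontrol\b[^>]*>(.*?)</uicontrol\s*>",
    re.IGNORECASE | re.DOTALL,
)

def all_uicontrols(text):
    """Return the body of every uicontrol tag found in text."""
    return UICONTROL_RE.findall(text)

sample = 'a <uicontrol>1-up printing</uicontrol> b <UIControl id="x">Preset</UIControl> c'
print(all_uicontrols(sample))
```

If the answer to question 4 is yes, a regular expression like this
will give wrong results and a proper parser is the safer choice.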

Without more detail that's the best I can offer.

hth
-- 
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos



