[Tutor] Removing Content from Lines....

Martin A. Brown martin at linux-ip.net
Thu Mar 24 21:59:57 EDT 2016


Greetings Sam,

>Hello, I am hoping you experts can help out a newbie. I have written 
>some python code to gather a bunch of sentences from many files. 
>These sentences contain the content:
>
>blah blah blah blah <uicontrol>1-up printing</uicontrol>blah blah blah blah blah blah blah blah
>blah blah <uicontrol>Preset</uicontrol>blah blah blah blah
>blah blah blah <uicontrol>Preset</uicontrol> blah blah blah
>
>What I want to do is remove the words before the <uicontrol> and 
>after the </uicontrol>. How do I do that in Python?  Thanks for any 
>and all help.

That looks like DocBook markup to me.

If you actually wish only to mangle the strings, you can do the 
following:

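  def show_just_uicontrol_elements(fin):
      start_tag, end_tag = '<uicontrol>', '</uicontrol>'
      for line in fin:
          s = line.find(start_tag)
          e = line.find(end_tag, s)
          if s == -1 or e == -1:
              continue  # no complete element on this line; skip it
          print(line[s:e + len(end_tag)])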

(Given your question, I'm assuming you can figure out how to open a 
file and read line by line in Python.)
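
The slicing approach above reports at most one element per line.  If 
a line might hold several, a regular expression is one possible way 
to pick them all out.  This is only a sketch: the function name and 
the sample lines (including 'Copies' and 'Duplex') are made up for 
illustration, and it assumes the tags never nest or carry attributes.

```python
import re

# Non-greedy match, so several elements on one line are found separately.
UICONTROL = re.compile(r'<uicontrol>.*?</uicontrol>')

def uicontrol_elements(lines):
    """Yield every <uicontrol>...</uicontrol> substring found in lines."""
    for line in lines:
        for match in UICONTROL.finditer(line):
            yield match.group(0)

# Made-up sample input; any iterable of strings (e.g. an open file) works.
sample = [
    'blah blah <uicontrol>Preset</uicontrol>blah blah\n',
    'no markup on this line\n',
    'a <uicontrol>Copies</uicontrol> b <uicontrol>Duplex</uicontrol> c\n',
]
found = list(uicontrol_elements(sample))
for item in found:
    print(item)
```

Because the function accepts any iterable of lines, you can hand it 
an open file object in place of the sample list.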

If it's XML, though, you might consider more carefully Alan's 
admonitions and his suggestion of lxml, as a bit of a safer choice 
for handling XML data.

  import sys
  import lxml.etree

  def listtags(filename, tag=None):
      if tag is None:
          sys.exit("Provide a tag name to this function, e.g. uicontrol")
      doc = lxml.etree.parse(filename)
      # iter(tag) walks the whole tree and yields every matching element
      for element in doc.getroot().iter(tag):
          print(element.tag, element.text)

If you were to call the above with:

  listtags('/path/to/the/data/file.xml', 'uicontrol')

You should see the name 'uicontrol' and the contents of each tag, 
stripped of all surrounding context.
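
If you want to try the idea without a data file or an lxml install, 
here is a small sketch that uses the standard library's 
xml.etree.ElementTree instead (a swap on my part; lxml.etree offers a 
largely compatible interface).  The <p> wrapper is an assumption: an 
XML parser needs a single root element, and I do not know what 
encloses your sentences in the real files.

```python
import xml.etree.ElementTree as ET

# Made-up sample: two of the quoted fragments wrapped in a hypothetical
# <p> root element so that the text is well-formed XML.
xml_text = (
    '<p>blah blah <uicontrol>1-up printing</uicontrol>blah blah '
    'and <uicontrol>Preset</uicontrol> blah</p>'
)

root = ET.fromstring(xml_text)

# element.text is the text inside the tag; the surrounding words live
# in root.text and in each element's .tail, so they are left behind.
labels = [element.text for element in root.iter('uicontrol')]
print(labels)
```

Note that element.text only returns the text up to the first child 
element, which is fine here as long as nothing is nested inside 
uicontrol.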

The above snippet is really just an example to show you how easy it 
is (from a coding perspective) to use lxml.  You still have to make 
the investment to understand how lxml processes XML and what your 
data processing needs are.

Good luck in your explorations,

-Martin

-- 
Martin A. Brown
http://linux-ip.net/

