Parsing markup.

Fri Nov 26 07:58:29 EST 2010

On Nov 26, 4:03 am, MRAB <pyt... at mrabarnett.plus.com> wrote:
> On 26/11/2010 03:28, Joe Goldthwaite wrote:
>  > I’m attempting to parse some basic tagged markup.  The TinyMCE
>  > editor returns a string that looks something like this:
>  >
>  > <p>This is a paragraph with <b>bold</b> and <i>italic</i> elements in
>  > it</p><p>It can be made up of multiple lines separated by paragraph
>  > tags.</p>
>  >
>  > I’m trying to render the paragraph into a bitmapped image.  I need
>  > to parse it out into the various paragraph and bold/italic pieces.
>  > I’m not sure of the best way to approach it.  ElementTree and lxml
>  > seem to want a fully formatted page, not a small segment like this
>  > one.  When I tried to feed a line similar to the above to lxml I got
>  > an error: “XMLSyntaxError: Extra content at the end of the document”.
>  >

lxml works fine for me if you use its HTML parser (lxml.html), which
accepts fragments, rather than the XML one - have you tried:

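from lxml import html

# Adjacent string literals are joined, so the markup is one logical string.
text = ("<p>This is a paragraph with <b>bold</b> and <i>italic</i> "
        "elements in it</p><p>It can be made up of multiple lines separated by "
        "paragraph tags.</p>")
# With several top-level elements, fromstring() returns them under one
# wrapper element, so findall('p') finds both paragraphs.
tree = html.fromstring(text)
print(tree.findall('p'))
# should print two <p> elements, e.g.
# [<Element p at 2b7b458>, <Element p at 2b7b3e8>]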
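If you would rather stay in the standard library, the "Extra content"
error most likely comes from an XML parser insisting on a single root
element, and your fragment has two top-level <p> tags. Below is a rough
sketch, not a tested recipe for real TinyMCE output: it assumes the
markup is well-formed XML (every tag closed, no HTML-only entities such
as &nbsp;). The "root" wrapper tag and the styled_runs helper are names
I made up for illustration. It also splits each paragraph into plain,
bold and italic runs, which is roughly what you need for rendering.

```python
import xml.etree.ElementTree as ET

text = ("<p>This is a paragraph with <b>bold</b> and <i>italic</i> "
        "elements in it</p><p>It can be made up of multiple lines "
        "separated by paragraph tags.</p>")


def styled_runs(elem, styles=()):
    """Yield (text, styles) pairs for elem's content in document order.

    Hypothetical helper: styles is a tuple of the enclosing inline tag
    names, e.g. ('b',) or ('b', 'i').
    """
    if elem.text:
        yield elem.text, styles
    for child in elem:
        # Text inside <b>/<i> carries that tag as an extra style.
        for run in styled_runs(child, styles + (child.tag,)):
            yield run
        # Text after the child's closing tag belongs to elem again.
        if child.tail:
            yield child.tail, styles


# "root" is a made-up wrapper so the fragment has a single root element.
root = ET.fromstring("<root>" + text + "</root>")
paragraphs = [list(styled_runs(p)) for p in root.findall("p")]
for runs in paragraphs:
    print(runs)
```

Each paragraph comes back as a list of (text, styles) tuples, so the
drawing code can pick a font per run. If the editor ever emits markup
that is not valid XML, this will raise a ParseError and the lxml.html
route above is the safer bet.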

hth

Jon