Regular expression problem

Asheesh Laroia pan-news at asheeshenterprises.com
Thu Feb 28 21:14:16 EST 2002


Actually, I think this is the most elegant solution I've seen so far.

Good thinking; I forgot to "Use the Source," as some put it.

Only one problem: the parser still balks on embedded tags, like:

	<@Trap Body text:<P><I><B>>
becomes
	>

It leaves an extra '>' character at the end.  Any suggestions?  I can
write a simple workaround for something like this, but it seems like
it should work "the right way."
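
For reference, here is a minimal sketch of the kind of simple workaround meant above. It does not use sgmllib at all: it scans the text and counts angle-bracket depth, so that the '>' closing the outer <@ ... > tag is consumed along with the embedded tags. The function name strip_at_tags is hypothetical, and the sketch assumes that every '<' inside the tag has a matching '>'.

```python
def strip_at_tags(text):
    """Remove every <@ ... > tag, including any <...> tags nested inside it."""
    out = []
    i = 0
    n = len(text)
    while i < n:
        if text.startswith("<@", i):
            # Walk forward, tracking nesting depth, until the outer tag closes.
            depth = 0
            j = i
            while j < n:
                if text[j] == "<":
                    depth += 1
                elif text[j] == ">":
                    depth -= 1
                    if depth == 0:
                        break
                j += 1
            if j < n:
                # Balanced tag found: skip it entirely, closing '>' included.
                i = j + 1
                continue
            # Unterminated tag: fall through and keep it as plain text.
        out.append(text[i])
        i += 1
    return "".join(out)


print(repr(strip_at_tags("<@Trap Body text:<P><I><B>>")))
```

Ordinary tags outside a <@ ... > block are left alone, so the rest of the tag stripping could still be handed to the parser afterwards.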

Thanks for everything!

-- Asheesh.

On Thu, 28 Feb 2002 01:17:11 -0500, Sean 'Shaleh' Perry wrote:


> On 28-Feb-2002 Asheesh Laroia wrote:
>> I've been trying to use sgmllib, actually, to delete all the other
>> tags.
>> 
>> It just doesn't handle the <@ [...] > condition well.  It refuses to
>> parse it, treating it as text.
>> 
>> 
> The reason is this:
> 
> starttagopen = re.compile('<[>a-zA-Z]')
> tagfind = re.compile(r'[a-zA-Z][-_.a-zA-Z0-9]*')
> 
> near the top of sgmllib.py.
> 
> Changing them in your code will allow the parser to understand the tag.
> However there is another problem which requires more work.  When a tag
> is found the parser tries to run 'start_' + tag.  start_@Trap() is not a
> valid python name.  You could redefine the function which calls the
> handlers so that it looks for perhaps start_atTrap().  This would allow
> you to use the SGMLParser for all of your parsing needs, but may also be
> overkill for the problem.
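
The two points in the quoted message can be illustrated with a short sketch that uses only the re module. The first two patterns are the ones quoted above; the variants ending in _at and the handler_name helper are hypothetical, shown only to suggest what "changing them" and "start_atTrap()" might look like. The sketch also ignores any case normalisation the real parser may apply to tag names.

```python
import re

# Patterns as quoted above from sgmllib.py.
starttagopen = re.compile('<[>a-zA-Z]')
tagfind = re.compile(r'[a-zA-Z][-_.a-zA-Z0-9]*')

# Hypothetical loosened variants that also accept a leading '@'.
starttagopen_at = re.compile('<[>@a-zA-Z]')
tagfind_at = re.compile(r'@?[a-zA-Z][-_.a-zA-Z0-9]*')


def handler_name(tag):
    """Map a tag such as '@Trap' to a valid Python method name."""
    return 'start_' + tag.replace('@', 'at')


sample = '<@Trap Body text>'

# The stock pattern rejects '<@', so the tag is treated as text.
print(starttagopen.match(sample))

# The loosened patterns recognise the tag and extract its name.
print(starttagopen_at.match(sample) is not None)
tag = tagfind_at.match(sample, 1).group()
print(tag, handler_name(tag))
```

Replacing '@' with 'at' is just one possible mapping; any scheme that yields a legal identifier would do.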


