Regular expression problem
Asheesh Laroia
pan-news at asheeshenterprises.com
Thu Feb 28 21:14:16 EST 2002
Actually, I think this is the most elegant solution I've seen so far.
Good thinking; I forgot to "Use the Source," as some put it.
Only one problem: the parser still balks on embedded tags, like:
<@Trap Body text:<P><I><B>>
becomes
>
It leaves an extra '>' character at the end. Any suggestions? I can
write a simple workaround for something like this, but it seems like
it should work "the right way."
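
For reference, the simple workaround could look something like the sketch below. It does not touch sgmllib; it just counts angle-bracket depth so that a tag containing other tags is dropped as one unit. The function name strip_nested_tags is made up for illustration, and the sketch assumes every '<' in the input opens a tag.

```python
def strip_nested_tags(text):
    """Drop <...> spans, treating nested angle brackets as part of the outer tag."""
    out = []
    depth = 0  # how many '<' are currently open
    for ch in text:
        if ch == '<':
            depth += 1
        elif ch == '>' and depth > 0:
            depth -= 1
        elif depth == 0:
            # Outside any tag: keep the character (including a stray '>').
            out.append(ch)
    return ''.join(out)


print(repr(strip_nested_tags('<@Trap Body text:<P><I><B>>')))
print(repr(strip_nested_tags('plain <B>bold</B> text')))
```

With this approach the trailing '>' is consumed as the close of the outer <@Trap ...> tag. An unbalanced '<' would swallow the rest of the input, so it is only a stopgap.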
Thanks for everything!
-- Asheesh.
On Thu, 28 Feb 2002 01:17:11 -0500, Sean 'Shaleh' Perry wrote:
> On 28-Feb-2002 Asheesh Laroia wrote:
>> I've been trying to use sgmllib, actually, to delete all the other
>> tags.
>>
>> It just doesn't handle the <@ [...] > condition well. It refuses to
>> parse it, treating it as text.
>>
>>
> The reason is this:
>
> starttagopen = re.compile('<[>a-zA-Z]')
> tagfind = re.compile(r'[a-zA-Z][-_.a-zA-Z0-9]*')
>
> near the top of sgmllib.py.
>
> Changing them in your code will allow the parser to understand the tag.
> However, there is another problem, which requires more work. When a tag
> is found, the parser tries to run 'start_' + tag. start_@Trap() is not a
> valid Python name. You could redefine the function which calls the
> handlers so that it looks for perhaps start_atTrap(). This would allow
> you to use the SGMLParser for all of your parsing needs, but may also be
> overkill for the problem.
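
To make the two points in the quoted explanation concrete, here is a small sketch using only the re module. The first two patterns are the ones quoted from sgmllib.py; the "_at" variants and the handler_name helper are hypothetical, one possible way of widening the patterns and of mapping '@Trap' onto a legal method name, not anything sgmllib itself provides.

```python
import re

# Patterns as quoted from sgmllib.py above.
starttagopen = re.compile('<[>a-zA-Z]')
tagfind = re.compile(r'[a-zA-Z][-_.a-zA-Z0-9]*')

# Hypothetical widened variants that also accept a leading '@'.
starttagopen_at = re.compile('<[>@a-zA-Z]')
tagfind_at = re.compile(r'@?[a-zA-Z][-_.a-zA-Z0-9]*')


def handler_name(tag):
    """Map a tag such as '@Trap' to a valid method name such as 'start_atTrap'."""
    return 'start_' + tag.replace('@', 'at')


sample = '<@Trap Body text:>'

# The original pattern does not see '<@' as the start of a tag.
print(starttagopen.match(sample))

# The widened patterns recognise it and extract the tag name.
print(starttagopen_at.match(sample) is not None)
tag = tagfind_at.match(sample, 1).group()
print(tag, handler_name(tag))
```

This only shows the pattern and naming changes in isolation; wiring them into an SGMLParser subclass would still mean overriding the method that looks up the handlers.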