html parser , unexpected '<' char in declaration

Tim Roberts timr at probo.com
Tue Feb 21 02:54:35 EST 2006


"Jesus Rivero - (Neurogeek)" <jrivero at latinux.org> wrote:
>
>Hmmm, that's kind of a different issue then.
>
>I can guess, from the error you pasted earlier, that the problem shown
>is due to the fact Python is interpreting a "<" as an expression and not
>as a char. Review your code or try to figure out the exact input you're
>receiving within the MTA.

Well, Jesus, you are 0 for 2.  Sakcee pointed out what the exact problem
was in his original message.  The HTML he is being given is ill-formed; the
<!DOCTYPE declaration is not closed.  The SGML parser finds an <html> tag
which it thinks is inside the <!DOCTYPE, and that's illegal.

>> Well, probably I should explain more.  This is part of an email.  After
>> the MTA delivers the email, it is stored in a local dir.
>> After that, the email is parsed by the parser inside a web-based
>> IMAP client at display time.
>>
>> I don't think I have the choice of rewriting the message, and I don't
>> want to reject the message altogether.
>>
>> I can either (1) fix the incoming HTML by tidying it up,
>> (2) strip out only the plain text and display that you have spam, or
>> (3) ignore that malformed tag and display the rest.

If this is happening with more than one message, you could check for it
rather easily with a regular expression, or even just ''.find, and then
either insert a closing '>' or delete everything up to the <html> before
parsing it.
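
Something along these lines might do it.  This is only a rough sketch: the
function names and the pattern are my own, and it assumes the DOCTYPE has no
internal subset (no '[' ... ']' block containing its own '<'), which is the
usual case for HTML mail.

```python
import re

# A <!DOCTYPE declaration that runs into the next '<' without ever
# reaching its closing '>'.  The lookahead leaves that '<' untouched.
_UNCLOSED_DOCTYPE = re.compile(r'<!DOCTYPE[^<>]*(?=<)', re.IGNORECASE)

# Start of the opening <html tag, in any letter case.
_HTML_TAG = re.compile(r'<html', re.IGNORECASE)


def close_doctype(html):
    """Insert the missing '>' after an unclosed <!DOCTYPE declaration."""
    def add_bracket(match):
        # Trim trailing whitespace so the '>' sits right after the text.
        return match.group(0).rstrip() + '>\n'
    return _UNCLOSED_DOCTYPE.sub(add_bracket, html, count=1)


def drop_before_html(html):
    """Discard everything ahead of the first <html tag, if there is one."""
    match = _HTML_TAG.search(html)
    if match is None:
        return html          # no <html> tag found, leave the input alone
    return html[match.start():]


broken = ('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"\n'
          '<html><body>hi</body></html>')
fixed = close_doctype(broken)       # option 1: repair the declaration
stripped = drop_before_html(broken)  # option 2: throw the declaration away
```

Either result can then be handed to the parser.  A declaration that is
already closed does not match the pattern, so well-formed messages pass
through close_doctype unchanged.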
-- 
- Tim Roberts, timr at probo.com
  Providenza & Boekelheide, Inc.


