[Python-Dev] Is this a bug of the HTMLParser?

Michael Foord fuzzyman at voidspace.org.uk
Wed Nov 11 17:24:02 CET 2009


Hello Zhang Chiyuan,

Can you file a bug on the Python issue tracker please:

    http://bugs.python.org

Thanks

Michael Foord

Zhang Chiyuan wrote:
> Hi all,
>
> I'm using BeautifulSoup to parse an HTML page, and it refuses to
> parse the page. Looking at the backtrace, I found that the problem
> is in Python's built-in HTMLParser.py. The web page I'm parsing
> contains some Chinese characters. There is a tag like <img
> src=/foo/bar.png alt=中文>; note that this is a legacy HTML page where
> the attributes are not quoted. However, the regexp defined in
> HTMLParser.py is:
>
>  attrfind = re.compile(
>      r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
>      r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~@]*))?')
>
> Note that the character class for unquoted values does not include
> Chinese characters (or any other non-ASCII characters), so the parser
> raises an error on this tag. I'm not sure whether the HTML standard
> allows unquoted non-ASCII characters in attribute values. If it does,
> this seems to be a bug, and the regexp would be better as [^>\s] IMHO.
>
> BTW: it seems that something like:
>
> <script>
> var st = "<a></";
> </script>
>
> cannot be parsed. :-/
>
> --
> pluskid
> http://blog.pluskid.org

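For anyone following along, here is a minimal sketch of the mismatch described in the quoted message. It applies the attrfind pattern exactly as quoted, next to a variant that swaps in the suggested [^>\s] class for unquoted values. The names ATTRFIND_STRICT, ATTRFIND_RELAXED and scan_attrs are hypothetical helpers made up for this illustration; they are not part of HTMLParser.py, and the loop only approximates how the parser walks the attributes of a start tag.

```python
import re

# Pattern as quoted in the message above.
ATTRFIND_STRICT = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~@]*))?')

# Hypothetical variant: unquoted values may contain anything except
# '>' and whitespace, as suggested in the message. Not a tested fix.
ATTRFIND_RELAXED = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[^>\s]*))?')


def scan_attrs(pattern, text):
    """Collect (name, value) pairs and return them with the unparsed rest."""
    attrs, pos = [], 0
    while True:
        m = pattern.match(text, pos)
        if m is None:
            break
        # group(3) is None when the attribute has no "=value" part.
        attrs.append((m.group(1), m.group(3)))
        pos = m.end()
    return attrs, text[pos:]


# The part of the example tag that follows "<img".
tag_body = ' src=/foo/bar.png alt=中文>'

strict_attrs, strict_rest = scan_attrs(ATTRFIND_STRICT, tag_body)
relaxed_attrs, relaxed_rest = scan_attrs(ATTRFIND_RELAXED, tag_body)

print(strict_attrs, repr(strict_rest))
print(relaxed_attrs, repr(relaxed_rest))
```

With the quoted pattern, the value of alt matches as an empty string and the Chinese text is left over in front of the closing '>', which is presumably the point where the parser gives up. With the relaxed class, the whole value is consumed and only '>' remains. One caveat: the relaxed class would also swallow the '/' of a self-closing '/>' that directly follows an unquoted value, so it would need more thought before going into a patch.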

-- 
http://www.ironpythoninaction.com/


