Unicode -> String problem
Michael Ströder
michael at stroeder.com
Tue Jul 10 09:21:05 EDT 2001
Jay Parlar wrote:
>
> My task is to create an HTML parser that will pull full text from HTML
> documents,
Basically, your parser has to honour the charset declared in the HTTP
header or in the <meta> tag.
HTTP header:
Content-Type: text/html;charset=utf-8
HTML <head>:
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
Your parser should use the declared charset to convert the raw
strings to Unicode objects. HTML character entities also have to be
converted to the corresponding Unicode characters and added to those
same Unicode objects.
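A minimal sketch of that approach in present-day Python, not a complete parser. The helper names (charset_from_header, decode_html), the 2048-byte search window for the <meta> tag, and the iso-8859-1 fallback are all assumptions made for illustration. The HTTP header is checked first, then the <meta> tag.

```python
import re
from email.message import Message
from html import unescape

# Hypothetical pattern: finds charset=... inside a <meta> tag in the raw bytes.
META_CHARSET_RE = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def charset_from_header(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def decode_html(raw, http_content_type=None, fallback="iso-8859-1"):
    """Convert raw bytes to a Unicode string using the declared charset.

    The fallback charset is an assumption, not something the source states.
    """
    charset = charset_from_header(http_content_type) if http_content_type else None
    if charset is None:
        # Only the start of the document is searched (assumed window size).
        match = META_CHARSET_RE.search(raw[:2048])
        if match:
            charset = match.group(1).decode("ascii")
    try:
        return raw.decode(charset or fallback, errors="replace")
    except LookupError:
        # The document declared a charset name Python does not know.
        return raw.decode(fallback, errors="replace")


raw = (
    '<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />'
    "<p>Str\u00f6der &amp; caf&eacute;</p>"
).encode("utf-8")
document = decode_html(raw)

# Entities are resolved on extracted text, not on the whole document,
# so that escaped markup such as &lt; does not turn into real tags.
fragment = "Str\u00f6der &amp; caf&eacute;"
plain = unescape(fragment)
```

Once decoding and entity resolution are both done, all extracted text is plain Unicode regardless of how the original document was encoded.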
> Now, whenever I'm given HTML from IE's cache, it is unicode. There is no doubt
> about that.
Are you sure? Which encoding of Unicode? UTF-16, UTF-8, ...
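One way to narrow this down is to look for a byte order mark at the start of the raw data. The sketch below assumes the cached file carries a BOM at all, which is not guaranteed; the function name sniff_bom is made up for illustration. If no BOM is found, the encoding still has to come from the HTTP header or the <meta> tag.

```python
import codecs

# UTF-32 is checked before UTF-16 because the UTF-32 little-endian BOM
# begins with the same two bytes as the UTF-16 little-endian BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(raw):
    """Return a codec name if raw starts with a known BOM, else None."""
    for bom, codec_name in BOMS:
        if raw.startswith(bom):
            return codec_name
    return None


sample = "Str\u00f6der".encode("utf-16")  # Python prepends a BOM here
codec_name = sniff_bom(sample)
text = sample.decode(codec_name) if codec_name else None
```

The codec names returned here consume the BOM while decoding, so the resulting string does not start with a stray U+FEFF character.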
Ciao, Michael.