Unicode -> String problem

Michael Ströder michael at stroeder.com
Tue Jul 10 09:21:05 EDT 2001


Jay Parlar wrote:
> 
> My task is to create an HTML parser that will pull full text from HTML 
> documents,

Basically your parser has to honour the charset defined in the HTTP
header or in the <meta> tag.

HTTP header:
Content-Type: text/html;charset=utf-8

HTML <head>:
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />

Your parser should use the declared charset for converting the raw
strings to Unicode objects. HTML character entities (e.g. &auml; or
&#228;) also have to be resolved to the characters they stand for and
added to the same Unicode objects.
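
A minimal sketch of that approach, written for current Python 3 rather
than the Python of this thread. The names find_charset and
html_bytes_to_text are made up for illustration, and the iso-8859-1
fallback is only an assumption for documents that declare no charset.
It does not strip tags; it only shows the charset lookup, the decoding
step and the entity handling.

```python
import html
import re

# Matches "charset=utf-8", "charset = 'utf-8'" and similar spellings.
_CHARSET_RE = re.compile(r'charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE)


def find_charset(raw, content_type=None, default="iso-8859-1"):
    """Return the charset from the HTTP header, else from a <meta> tag,
    else the (assumed) default."""
    if content_type:
        match = _CHARSET_RE.search(content_type)
        if match:
            return match.group(1)
    # Only look at the start of the document. latin-1 accepts every byte
    # value, so this temporary decode cannot fail.
    head = raw[:2048].decode("latin-1")
    for tag in re.findall(r"<meta[^>]*>", head, re.IGNORECASE):
        match = _CHARSET_RE.search(tag)
        if match:
            return match.group(1)
    return default


def html_bytes_to_text(raw, content_type=None):
    """Decode raw HTML bytes to str and resolve character entities."""
    charset = find_charset(raw, content_type)
    try:
        text = raw.decode(charset, errors="replace")
    except LookupError:
        # Unknown charset label: fall back to the assumed default.
        text = raw.decode("iso-8859-1")
    # &ouml;, &#246; and friends become the characters they stand for.
    return html.unescape(text)
```

The header is checked first because HTTP gives it precedence over the
<meta> tag. A real parser would also want to handle charset labels that
Python does not know under the same name.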

> Now, whenever I'm given HTML from IE's cache, it is unicode. There is no doubt 
> about that.

Are you sure? Which encoding of Unicode? UTF-16, UTF-8, ...
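
One way to check is to look for a byte order mark at the start of the
data. This is only a rough sketch in current Python 3: sniff_bom is a
hypothetical helper, and data without a BOM can still be UTF-8 or
UTF-16, so a None result proves nothing.

```python
import codecs


def sniff_bom(raw):
    """Return a codec name suggested by a leading byte order mark, or None."""
    # UTF-32 is tested before UTF-16 because the UTF-32-LE BOM begins
    # with the same two bytes as the UTF-16-LE BOM.
    boms = [
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF32_LE, "utf-32"),
        (codecs.BOM_UTF32_BE, "utf-32"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    ]
    for bom, name in boms:
        if raw.startswith(bom):
            return name
    return None
```

The returned names are codecs that consume the BOM themselves, so
raw.decode(sniff_bom(raw)) yields text without a stray U+FEFF in front.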

Ciao, Michael.



More information about the Python-list mailing list