email, unicode, HTML, and removal thereof

Thu Oct 31 04:58:53 EST 2002

Andrew Dalke <adalke at mindspring.com> writes:

> This didn't work because I get complaints about having characters
> with ordinal value > 127.  I needed to change the "cStringIO" to
> use the following MyStringIO and change the htmllib.HTMLParser
> to MyHTMLParser.

Yes, cStringIO does not support Unicode.

> I interpret the problem to HTMLParser reading a hex
> escape and converting it to a string.  The way I do
> things above can create characters >127.  Then when
> it converts the string to unicode, it throws the exception.
> 
> My workaround solves this by forcing the string to be
> interpreted in latin-1 context.

This comes from this fragment in sgmllib:

    def handle_charref(self, name):
        """Handle character reference, no need to override."""
        try:
            n = int(name)
        except ValueError:
            self.unknown_charref(name)
            return
        if not 0 <= n <= 255:
            self.unknown_charref(name)
            return
        self.handle_data(chr(n))

> This solution doesn't feel correct.  For example, I assume
> Latin-1 but it could be in window's cp1252, so I'm not
> doing the charset correctly.

Actually, it is correct as far as it goes. All cp1252 data are already
converted to Unicode in your code, so there are no traces of cp1252
left.

If the document also contains a character reference, such as &#160;,
then handle_charref will convert it to chr(160). If you now interpret
this string as Latin-1, your interpretation is correct: it so happens
that the first 256 Unicode characters coincide with Latin-1.
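
The coincidence between Latin-1 and the first 256 Unicode code points can be checked directly. This is a minimal sketch written for Python 3, which is an assumption on my part: the code in this thread is Python 2, where chr returns a byte string. The function names are made up for illustration.

```python
def latin1_matches_unicode():
    """True if every byte 0..255, decoded as Latin-1, has the same ordinal."""
    return all(bytes([n]).decode("latin-1") == chr(n) for n in range(256))


def cp1252_byte_0x80():
    """cp1252 departs from Latin-1 in the 0x80..0x9F range; decode one such byte."""
    return bytes([0x80]).decode("cp1252")


print(latin1_matches_unicode())
print(ord(cp1252_byte_0x80()))
```

So a chr(n) result for n below 256 can safely be read as Latin-1, whereas raw cp1252 bytes could not.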

Of course, the code will break if somebody comes along with &#8364;:
it will invoke unknown_charref, which will discard the data.
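
To see which references survive the range check, here is a standalone imitation of the sgmllib fragment quoted above. It is only a sketch in Python 3 (an assumption; sgmllib itself is a Python 2 module), and stock_charref is a hypothetical name that returns None where the original would call unknown_charref.

```python
def stock_charref(name):
    """Return the text the quoted fragment would pass on, or None if discarded."""
    try:
        n = int(name)
    except ValueError:
        return None          # not a decimal number
    if not 0 <= n <= 255:
        return None          # outside the range the fragment accepts
    return chr(n)


print(repr(stock_charref("160")))
print(repr(stock_charref("8364")))
```

The reference 160 passes the check, while 8364 is dropped.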

So I would recommend overriding handle_charref:

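  def handle_charref(self, name):
    try:
      # unichr accepts code points above 255, unlike chr
      c = unichr(int(name))
    except (ValueError, OverflowError):
      # not a number, or outside the range unichr supports
      c = u'?'
    self.handle_data(c)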
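
For readers on Python 3, the same conversion can be tried outside a parser class. This is a sketch under assumptions: chr stands in for unichr (Python 3 strings are Unicode), charref_to_text is a hypothetical helper name, and catching OverflowError for very large numbers is my addition.

```python
def charref_to_text(name):
    """Map a decimal character reference to its character, or '?' on failure."""
    try:
        return chr(int(name))
    except (ValueError, OverflowError):
        # Not a decimal number, or not a valid code point.
        return "?"


print(repr(charref_to_text("160")))
print(repr(charref_to_text("8364")))
```
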

So you won't get any non-ASCII byte strings from character references
anymore. However, you will still get them from entity references, as
htmlentitydefs has, e.g.

'Aacute':   '\301'

If it instead had

'Aacute':   u'\301'

you would not need to redefine StringIO. htmlentitydefs has other
problems, e.g. it contains

    'Alpha':    '&#913;',

This would be copied literally into the output, instead of expanding
the character reference.

So I guess you should modify entitydefs as follows:

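from htmlentitydefs import entitydefs

# items() returns a list in Python 2, so the dict can be updated in the loop
for k, v in entitydefs.items():
  if v.startswith('&#'):
    v = int(v[2:-1])    # character reference: strip '&#' and ';'
  else:
    v = ord(v)          # single Latin-1 byte
  entitydefs[k] = unichr(v)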

With these changes, you should not need a specialized StringIO
anymore.
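
The entitydefs conversion can also be tried on a small sample. This is a sketch in Python 3 (an assumption; htmlentitydefs is the Python 2 module name), using a two-entry dict shaped like the entries quoted above instead of the real table. sample_entitydefs and normalize_entitydefs are hypothetical names.

```python
# Illustrative entries only, in the two shapes the old table uses.
sample_entitydefs = {
    "Aacute": "\301",     # one Latin-1 character, written as an octal escape
    "Alpha": "&#913;",    # a character reference stored as literal text
}


def normalize_entitydefs(defs):
    """Return a new dict mapping each entity name to one Unicode character."""
    result = {}
    for name, value in defs.items():
        if value.startswith("&#"):
            codepoint = int(value[2:-1])   # strip "&#" and the trailing ";"
        else:
            codepoint = ord(value)
        result[name] = chr(codepoint)
    return result


print(normalize_entitydefs(sample_entitydefs))
```

Building a new dict rather than updating in place avoids relying on items() returning a copy, which it no longer does in Python 3.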

Regards,
Martin