xhtml encoding question

Wed Feb 1 03:26:15 EST 2012

Tim Arnold, 31.01.2012 19:09:
> I have to follow a specification for producing xhtml files.
> The original files are in cp1252 encoding and I must reencode them to utf-8.
> Also, I have to replace certain characters with html entities.
> 
> I think I've got this right, but I'd like to hear if there's something I'm
> doing that is dangerous or wrong.
> 
> Please see the appended code, and thanks for any comments or suggestions.
> 
> I have two functions, translate (replaces high characters with entities)
> and reencode (um, reencodes):
> ---------------------------------
> import codecs, StringIO
> from lxml import etree
> high_chars = {
>    0x2014:'&mdash;', # 'EM DASH',
>    0x2013:'&ndash;', # 'EN DASH',
>    0x0160:'&Scaron;',# 'LATIN CAPITAL LETTER S WITH CARON',
>    0x201d:'&rdquo;', # 'RIGHT DOUBLE QUOTATION MARK',
>    0x201c:'&ldquo;', # 'LEFT DOUBLE QUOTATION MARK',
>    0x2019:"&rsquo;", # 'RIGHT SINGLE QUOTATION MARK',
>    0x2018:"&lsquo;", # 'LEFT SINGLE QUOTATION MARK',
>    0x2122:'&trade;', # 'TRADE MARK SIGN',
>    0x00A9:'&copy;',  # 'COPYRIGHT SYMBOL',
>    }
> def translate(string):
>    s = ''
>    for c in string:
>        if ord(c) in high_chars:
>            c = high_chars.get(ord(c))
>        s += c
>    return s

I hope you are aware that building the result one character at a time with
+= is about the slowest possible algorithm (well, the slowest one that
doesn't do anything unnecessary). Since none of this is required when
parsing or generating XHTML, I assume your spec tells you that you should
do these replacements?
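
If the replacements really are required, a faster variant is possible. The following is only a minimal sketch, assuming Python 3, where str.translate() accepts a dict that maps code points to replacement strings. The name HIGH_CHARS and the entity strings are assumptions taken from the quoted table, not anything the spec is known to prescribe.

```python
# Hypothetical table mirroring the quoted high_chars mapping:
# code point -> named HTML entity.
HIGH_CHARS = {
    0x2014: '&mdash;',   # EM DASH
    0x2013: '&ndash;',   # EN DASH
    0x0160: '&Scaron;',  # LATIN CAPITAL LETTER S WITH CARON
    0x201d: '&rdquo;',   # RIGHT DOUBLE QUOTATION MARK
    0x201c: '&ldquo;',   # LEFT DOUBLE QUOTATION MARK
    0x2019: '&rsquo;',   # RIGHT SINGLE QUOTATION MARK
    0x2018: '&lsquo;',   # LEFT SINGLE QUOTATION MARK
    0x2122: '&trade;',   # TRADE MARK SIGN
    0x00A9: '&copy;',    # COPYRIGHT SIGN
}


def translate(text):
    # str.translate() walks the string once in C and looks up each
    # code point in the table; unmapped characters pass through as-is.
    return text.translate(HIGH_CHARS)
```

This avoids the repeated string concatenation in the Python-level loop while keeping the same mapping.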

> def reencode(filename, in_encoding='cp1252',out_encoding='utf-8'):
>    with codecs.open(filename,encoding=in_encoding) as f:
>        s = f.read()
>    sio = StringIO.StringIO(translate(s))
>    parser = etree.HTMLParser(encoding=in_encoding)
>    tree = etree.parse(sio, parser)

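Yes, you are doing something dangerous and wrong here. For one, you are
decoding the data twice: codecs.open() already decodes the file, and then
you pass encoding=in_encoding to the parser as well. Then, didn't you say
XHTML? Why do you use the HTML parser to parse XML?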
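
To illustrate why decoding twice is dangerous, here is a small standard-library sketch. It is not a reproduction of what lxml does internally with the StringIO object. It only shows, under the assumption that already-decoded text gets serialized as UTF-8 somewhere along the way, what a second cp1252 decode does to it.

```python
# A single cp1252 byte: LEFT DOUBLE QUOTATION MARK.
raw = b'\x93'

# First decode: correct, yields the one character U+201C.
text = raw.decode('cp1252')

# Assumed failure mode: the text is turned into UTF-8 bytes and those
# bytes are then decoded as cp1252 a second time. The three UTF-8 bytes
# E2 80 9C become three unrelated characters.
garbled = text.encode('utf-8').decode('cp1252')

print(repr(text), repr(garbled))
```

The safe pattern is to decode exactly once: either hand the parser raw bytes and let it decode them, or decode yourself and do not tell the parser an encoding.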

>    result = etree.tostring(tree.getroot(), method='html',
>                            pretty_print=True,
>                            encoding=out_encoding)
>    with open(filename,'wb') as f:
>        f.write(result)

Use tree.write(f, ...) instead of tostring() followed by a manual write.

Assuming you really meant XHTML and not HTML, I'd just drop your entire
code and do this instead:

  tree = etree.parse(in_path)
  tree.write(out_path, encoding='utf-8', pretty_print=True)

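Note that I didn't provide an input encoding. XML is safe in that regard:
the parser takes the encoding from the XML declaration and falls back to
UTF-8 (or UTF-16 with a BOM) if there is none.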
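
The point about not needing an input encoding can be tried out without lxml. The sketch below swaps in the standard library's xml.etree.ElementTree as a stand-in, so it has no pretty_print option and is not the lxml call shown above. The function name reencode_xml and the sample document are hypothetical, and it assumes the input file carries an XML declaration naming its encoding.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def reencode_xml(in_path, out_path):
    # No input encoding is passed: the parser reads it from the
    # XML declaration at the top of the file.
    tree = ET.parse(in_path)
    tree.write(out_path, encoding='utf-8', xml_declaration=True)


# Throwaway cp1252 input with curly quotes (hypothetical content).
source = ('<?xml version="1.0" encoding="windows-1252"?>'
          '<p>\u201cquoted\u201d</p>')
workdir = tempfile.mkdtemp()
in_path = os.path.join(workdir, 'in.xhtml')
out_path = os.path.join(workdir, 'out.xhtml')
with open(in_path, 'wb') as f:
    f.write(source.encode('cp1252'))

reencode_xml(in_path, out_path)

with open(out_path, 'rb') as f:
    output = f.read()
```

If the cp1252 files lack an encoding declaration, the parser will treat them as UTF-8 and most likely reject them, so that case would need the declaration added first.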

Stefan