encoding problems

Wed Aug 29 04:48:25 EDT 2007

tool69 wrote:

> Hi,
> 
> I would like to transform reST contents to HTML, but got problems
> with accented chars.
> 
> Here's a rather simplified version using SVN Docutils 0.5:
> 
> %-------------------------------------------------------------
> 
> #!/usr/bin/env python
> # -*- coding: utf-8 -*-

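This declaration only tells Python how to decode the source file, so in
practice it only affects unicode literals. Plain string literals stay
byte strings.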

> from docutils.core import publish_parts
> 
> class Post(object):
>      def __init__(self, title='', content=''):
>          self.title = title
>          self.content = content
> 
>      def _get_html_content(self):
>          return publish_parts(self.content,
>              writer_name="html")["html_body"]
>      html_content = property(_get_html_content)

Did you know that you can also write this with the property decorator:

@property
def html_content(self):
    ...

?
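
A minimal sketch of the decorator form, for illustration only. Here
render_html is a hypothetical stand-in for the publish_parts call, so the
snippet runs without Docutils installed:

```python
def render_html(source):
    # Hypothetical stand-in for publish_parts(...)["html_body"].
    return "<p>%s</p>" % source


class Post(object):
    def __init__(self, title='', content=''):
        self.title = title
        self.content = content

    @property
    def html_content(self):
        # Same effect as html_content = property(_get_html_content):
        # the value is computed on every attribute access.
        return render_html(self.content)


post = Post(content="This is the first.")
print(post.html_content)
```

The decorator keeps the getter and the attribute name in one place, and no
separate _get_html_content name is left on the class.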

> # Instanciate 2 Post objects
> p1 = Post()
> p1.title = "First post without accented chars"
> p1.content = """This is the first.
> ...blabla
> ... end of post..."""
> 
> p2 = Post()
> p2.title = "Second post with accented chars"
> p2.content = """Ce poste possède des accents : é à ê è"""

This needs to be a unicode literal:

p2.content = u"""Ce poste possède des accents : é à ê è"""

Note the u in front.
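
To make the difference concrete, here is a small sketch (the sample text
and variable names are just for illustration, and it assumes the file is
saved as UTF-8). A unicode string holds characters, while its encoded form
holds bytes:

```python
# -*- coding: utf-8 -*-
text = u"é à ê è"            # unicode literal: a sequence of code points
data = text.encode("utf-8")  # byte string: the UTF-8 encoding of that text

print(len(text))  # 7 characters: 4 accented letters and 3 spaces
print(len(data))  # 11 bytes: each accented letter takes 2 bytes in UTF-8
```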

> for post in [p1,p2]:
>      print post.title, "\n" +"-"*30
>      print post.html_content
> 
> %-------------------------------------------------------------
> 
> The output gives me :
> 
> First post without accented chars
> ------------------------------
> <div class="document">
> <p>This is the first.
> ...blabla
> ... end of post...</p>
> </div>
> 
> Second post with accented chars
> ------------------------------
> Traceback (most recent call last):
> File "C:\Documents and
> Settings\kib\Bureau\Projets\python\dbTest\rest_error.py", line 30, in
> <module>
> print post.html_content
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xe8' in
> position 39:
> ordinal not in range(128)

You need to encode a unicode string explicitly into the encoding you
want for the output. Otherwise the default codec (ascii) is used, and it
cannot represent the accented characters.

So 

print post.html_content.encode("utf-8")

should work.
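
A small sketch of why the explicit encode matters (the sample string is
only an illustration, and the file is assumed to be saved as UTF-8).
Encoding to ASCII fails the same way as the print in your traceback, while
UTF-8 succeeds:

```python
# -*- coding: utf-8 -*-
html = u"<p>Ce poste possède des accents : é à ê è</p>"

try:
    # This is what happens implicitly when the output codec is ascii.
    html.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Explicit encoding gives a plain byte string that can be written out.
encoded = html.encode("utf-8")

print(ascii_ok)
print(encoded.decode("utf-8") == html)
```

If your console expects something other than UTF-8 (a Windows console
often does), you would probably want to encode to that encoding instead.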

Diez