How do I encode and decode this data to write to a file?

Mon Apr 29 06:33:16 EDT 2013

cl at isbd.net wrote:

> I am debugging some code that creates a static HTML gallery from a
> directory hierarchy full of images. It's this package:-
>     https://pypi.python.org/pypi/Gallery2.py/2.0
> 
> 
> It's basically working and does pretty much what I want so I'm happy to
> put some effort into it and fix things.
> 
> The problem I'm currently chasing is that it can't cope with directory
> names that have accented characters in them, it fails when it tries to
> write the HTML that creates the page with the thumbnails on.
> 
> The code that's failing is:-
> 
>         raw = os.path.join(directory, self.getNameNoExtension()) + ".html"
>         file = open(raw, "w")
>         file.write("".join(html).encode('utf-8'))
>         file.close()
> 
> The variable html is a list containing the lines of HTML to write to the
> file.  It fails when it contains accented characters (an é in this
> case).  Here's the traceback:-
> 
> Traceback (most recent call last):
>   File "/usr/local/lib/python2.7/dist-packages/gallery/galleries.py", line 41, in run
>     self._recurse()
>   File "/usr/local/lib/python2.7/dist-packages/gallery/galleries.py", line 272, in _recurse
>     os.path.walk(self.props["sourcedir"], self.processDir, None)
>   File "/usr/lib/python2.7/posixpath.py", line 246, in walk
>     walk(name, func, arg)
>   File "/usr/lib/python2.7/posixpath.py", line 246, in walk
>     walk(name, func, arg)
>   File "/usr/lib/python2.7/posixpath.py", line 246, in walk
>     walk(name, func, arg)
>   File "/usr/lib/python2.7/posixpath.py", line 238, in walk
>     func(arg, top, names)
>   File "/usr/local/lib/python2.7/dist-packages/gallery/galleries.py", line 263, in processDir
>     self.createGallery()
>   File "/usr/local/lib/python2.7/dist-packages/gallery/galleries.py", line 215, in createGallery
>     self.picturemanager.createPictureHTMLs(self.footer)
>   File "/usr/local/lib/python2.7/dist-packages/gallery/picturemanager.py", line 84, in createPictureHTMLs
>     curPic.createPictureHTML(self.galleryDirectory, self.getStylesheet(), self.fullsize, footer)
>   File "/usr/local/lib/python2.7/dist-packages/gallery/picture.py", line 361, in createPictureHTML
>     file.write("".join(html).encode('utf-8'))
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 783: ordinal not in range(128)
> 
> 
> 
> If I understand correctly the encode() is saying that it can't
> understand the data in the html because there's a character 0xc3 in it.
> I *think* this means that the é is encoded in UTF-8 already in the
> incoming data stream (should be as my system is wholly UTF-8 as far as I
> know and I created the directory name).
> 
> So how do I change the code so I don't get the error?  Do I just
> decode() the data first and then encode() it?
> 

Note that you are getting a *UnicodeDecodeError*, not a UnicodeEncodeError.
Try omitting the encode() step, i.e. instead of

>         file.write("".join(html).encode('utf-8'))

use

file.write("".join(html))
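A minimal sketch of that suggestion, assuming html holds byte strings that
are already UTF-8 encoded. The sample lines and the temporary output path
are invented for illustration and are not taken from the gallery package.
Opening the file in binary mode makes the intent explicit, and the snippet
should behave the same on Python 2 and Python 3:

```python
import os
import tempfile

# Hypothetical sample data: byte strings that are already UTF-8 encoded,
# as they would be if built from a UTF-8 directory name.
html = [b"<html><body>\n", b"<h1>caf\xc3\xa9</h1>\n", b"</body></html>\n"]

# Hypothetical output location: a throwaway temporary directory.
out_dir = tempfile.mkdtemp()
raw = os.path.join(out_dir, "index") + ".html"

# The data is bytes already, so write it unchanged: no encode() call.
with open(raw, "wb") as f:
    f.write(b"".join(html))

# Read the file back to confirm the bytes arrived untouched.
with open(raw, "rb") as f:
    written = f.read()
```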

Background (applies to Python 2 only): the str type holds bytes, not
code points. The right thing to do is to use .decode(...) to convert from
str to unicode and .encode(...) to convert from unicode to str. In Python 2,
however, the str type also has an encode(...) method, which is basically
equivalent to

class str:
   # imaginary python implementation of python2's str
   ...
   def encode(self, encoding):
       return self.decode("ascii").encode(encoding)

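and is almost never called intentionally. The implicit decode("ascii") is
what fails on the byte 0xc3 in your traceback: the data is already UTF-8
encoded, so it contains bytes outside the ASCII range.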
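The implicit ASCII decode can be reproduced by spelling it out. This is only
a sketch: the byte string is a made-up stand-in for a UTF-8 encoded directory
name, and the explicit decode("ascii") call mimics what Python 2 does behind
the scenes, so the snippet also runs on Python 3:

```python
# Hypothetical sample: the UTF-8 encoding of u"caf\xe9"; the e-acute is the
# two bytes 0xc3 0xa9.
raw = b"caf\xc3\xa9"

# Spell out what Python 2's str.encode('utf-8') effectively does.
try:
    raw.decode("ascii").encode("utf-8")
    error_name = None
except UnicodeDecodeError as exc:
    # The ASCII codec cannot handle the byte 0xc3.
    error_name = type(exc).__name__
    print(exc)

# Decoding with the encoding the bytes really use gives text ...
text = raw.decode("utf-8")

# ... and encoding that text again returns the original bytes.
round_tripped = text.encode("utf-8")
```

So decoding first and then encoding does work, provided the decode step names
the real encoding. For data that is already UTF-8 it is a round trip that
yields the same bytes, which is why simply dropping encode() is enough here.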

PS: Python 3 has renamed unicode to str and thus uses unicode by default.
The old str was renamed to bytes, and the annoying bytes.encode() method is
gone.
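For comparison, here is a rough sketch of the text-only style that Python 3
encourages: keep everything as text inside the program and let the file
object encode on the way out. The sample name and the temporary path are
assumptions made for illustration; io.open is used because it accepts an
encoding argument on Python 2.6+ as well as Python 3.

```python
import io
import os
import tempfile

# Hypothetical directory name, held as text (str in Python 3).
name = u"caf\xe9"
html = [u"<h1>", name, u"</h1>"]

# Hypothetical output location: a throwaway temporary directory.
path = os.path.join(tempfile.mkdtemp(), "index.html")

# The file object does the encoding; the program only handles text.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"".join(html))

# Read the raw bytes back to see what ended up on disk.
with io.open(path, "rb") as f:
    data = f.read()

# False on Python 3, where bytes objects have no encode() method.
bytes_has_encode = hasattr(b"", "encode")
```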