utf8 and ftplib

Richard Lewis richardlewis at fastmail.co.uk
Fri Jun 17 05:40:42 EDT 2005


On Thu, 16 Jun 2005 12:06:50 -0600, "John Roth"
<newsgroups at jhrothjr.com> said:
> "Richard Lewis" <richardlewis at fastmail.co.uk> wrote in message 
> news:mailman.540.1118935910.10512.python-list at python.org...
> > Hi there,
> >
> > I'm having a problem with unicode files and ftplib (using Python 2.3.5).
> >
> > I've got this code:
> >
> > xml_source = codecs.open("foo.xml", 'w+b', "utf8")
> > #xml_source = file("foo.xml", 'w+b')
> >
> > ftp.retrbinary("RETR foo.xml", xml_source.write)
> > #ftp.retrlines("RETR foo.xml", xml_source.write)
> >
> 
> It looks like there are at least two problems here. The major one
> is that you seem to have a misconception about utf-8 encoding.
> 
Who doesn't? ;-)

> 
> Whatever program you are using to read it has to then decode
> it from utf-8 into unicode. Failure to do this is what is causing
> the extra characters on output.
> 
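To make the point about decoding concrete, here is a minimal sketch. It assumes the program reading the file treats the bytes as Latin-1, which the thread does not actually say, and it uses the b"" literal syntax from Python 2.6 and later rather than 2.3.5.

```python
# The small ae ligature (U+00E6) takes two bytes when encoded as UTF-8.
raw = b"\xc3\xa6"

# Decoding with the matching codec gives back a single character.
decoded = raw.decode("utf-8")

# Reading the same bytes as Latin-1 (an assumed wrong codec) gives two
# characters, which is the "extra characters" effect.
misread = raw.decode("latin-1")

print(repr(decoded), len(decoded))
print(repr(misread), len(misread))
```

If the output file shows two odd characters wherever one accented character is expected, the file itself is probably fine and the viewer is the thing picking the wrong encoding.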

> 
> Amusingly, this would have worked:
> 
> xml_source = codecs.EncodedFile("foo.xml", "utf-8", "utf-8")
> 
> It is, of course, an expensive way of doing nothing, but
> it at least has the virtue of being good documentation.
> 
OK, I've fiddled around a bit more but I still haven't managed to get it
to work. I get that it's not the FTP operation that's causing the
problem, so it must be either the xml.dom.minidom.parse() function (and
whatever sort of file I give it) or the way that I write my results to
output files after I've done my DOM processing. I'll post some more
detailed code:

def open_file(self, file_name):
    ftp = ftplib.FTP(self.host)
    ftp.login(self.login, self.passwd)

    content_file = file(file_name, 'w+b')
    ftp.retrbinary("RETR " + self.path, content_file.write)
    ftp.quit()
    content_file.close()

    ## Case 1:
    #self.document = parse(file_name)

    ## Case 2:
    #self.document = parse(codecs.open(file_name, 'r+b', "utf-8"))

    # Case 3:
    content_file = codecs.open(file_name, 'r', "utf-8")
    self.document = parse(codecs.EncodedFile(content_file, "utf-8",
                                             "utf-8"))
    content_file.close()

In Case 1 I get the incorrectly encoded characters.

In Case 2 I get the exception:
"UnicodeEncodeError: 'ascii' codec can't encode character u'\xe6' in
position 5208: ordinal not in range(128)"
when it calls the xml.dom.minidom.parse() function.

In Case 3 I get the exception:
"UnicodeEncodeError: 'ascii' codec can't encode character u'\xe6' in
position 5208: ordinal not in range(128)"
when it calls the xml.dom.minidom.parse() function.
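One possible reading of Cases 2 and 3, offered as an assumption rather than a diagnosis: the codecs wrapper hands the parser already-decoded unicode text, and something then tries to turn it back into bytes with the default ascii codec. The sketch below reproduces that kind of error with an explicit encode, then parses the raw bytes directly. The sample document is made up, and the syntax needs Python 2.6 or later.

```python
from xml.dom import minidom

# Hypothetical sample document; the bytes \xc3\xa6 are UTF-8 for U+00E6.
xml_bytes = b'<?xml version="1.0" encoding="utf-8"?><p>\xc3\xa6</p>'

# Encoding already-decoded text with the ascii codec fails with the same
# kind of error as the exceptions quoted above.
text = xml_bytes.decode("utf-8")
try:
    text.encode("ascii")
    error = None
except UnicodeEncodeError as exc:
    error = exc

# Handing the parser the raw bytes lets it do the decoding itself.
document = minidom.parseString(xml_bytes)
content = document.documentElement.firstChild.data

print(error)
print(repr(content))
```

If this reading is right, a file opened in plain binary mode and passed to parse() would be worth comparing against Case 1.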

The character at position 5208 is an 'a' (assuming Emacs' goto-char
function has the same idea about file positions as
xml.dom.minidom.parse()?). When I first tried these two new cases it came
up with an unencodable character at another position. By replacing the
large dash at that position with an ordinary minus sign I stopped it
from raising the exception at that point in the file. I checked the
character \xe6 and (assuming I know what I'm doing) it's a small ae
ligature.

Anyway, later on in the program I create a *very* large unicode string
after doing some playing with the DOM tree. I then write this to a file
using:
html_file = codecs.open(file_name, "w+b", "utf8")
html_file.write(very_large_unicode_string)

Could the problem be here?
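One way to check the writing step in isolation is to write a short unicode string the same way and read the file back as raw bytes. This is only a sketch: the sample string and the scratch path are hypothetical stand-ins, not the real data or file name.

```python
import codecs
import os
import tempfile

# Hypothetical stand-in for very_large_unicode_string.
text = u"<p>\xe6</p>"

# Hypothetical scratch path, not the file name used in the program.
path = os.path.join(tempfile.mkdtemp(), "out.html")

# Same call shape as in the program: codecs.open encodes on write.
html_file = codecs.open(path, "w+b", "utf8")
html_file.write(text)
html_file.close()

# Read the file back as raw bytes to see what was actually stored.
raw_file = open(path, "rb")
raw = raw_file.read()
raw_file.close()

print(repr(raw))
```

If the bytes on disk are the UTF-8 form of the string, the writing step is doing its job and the trouble lies either earlier in the pipeline or in whatever displays the file.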

Cheers,
Richard


