utf8 and ftplib

John Roth newsgroups at jhrothjr.com
Fri Jun 17 09:42:47 EDT 2005


"Richard Lewis" <richardlewis at fastmail.co.uk> wrote in message 
news:mailman.568.1119001245.10512.python-list at python.org...
>
> On Thu, 16 Jun 2005 12:06:50 -0600, "John Roth"
> <newsgroups at jhrothjr.com> said:
>> "Richard Lewis" <richardlewis at fastmail.co.uk> wrote in message
>> news:mailman.540.1118935910.10512.python-list at python.org...
>> > Hi there,
>> >
>> > I'm having a problem with unicode files and ftplib (using Python 
>> > 2.3.5).
>> >
>> > I've got this code:
>> >
>> > xml_source = codecs.open("foo.xml", 'w+b', "utf8")
>> > #xml_source = file("foo.xml", 'w+b')
>> >
>> > ftp.retrbinary("RETR foo.xml", xml_source.write)
>> > #ftp.retrlines("RETR foo.xml", xml_source.write)
>> >
>>
>> It looks like there are at least two problems here. The major one
>> is that you seem to have a misconception about utf-8 encoding.
>>
> Who doesn't? ;-)

Lots of people. It's not difficult to understand; it just takes a
bit of attention to the messy details.

The basic concept is that Unicode text is _always_ processed as
a unicode string _in the program_. On disk or across the internet,
it's _always_ stored in an encoded form, frequently but not always
utf-8. A regular (byte) string _never_ holds raw unicode; it always
holds text in some encoding.

When you read text data from the internet, it's _always_ in some
encoding. If that encoding is one of the UTF encodings, it needs
to be decoded to unicode before it can be processed, but it does
not need to be changed at all just to write it to disk.
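
For instance, the whole round trip in a few lines (a throwaway
sketch; the file names are made up):

    # bytes as they arrive from disk or the network: utf-8 encoded
    raw = open("foo.xml", "rb").read()

    # decode to a unicode string for processing _in the program_
    text = raw.decode("utf-8")

    # encode back to utf-8 bytes before writing out again
    open("copy.xml", "wb").write(text.encode("utf-8"))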


>> Whatever program you are using to read it has to then decode
>> it from utf-8 into unicode. Failure to do this is what is causing
>> the extra characters on output.
>>
>
>>
>> Amusingly, this would have worked:
>>
>> xml_source = codecs.EncodedFile(file("foo.xml", 'w+b'), "utf-8", "utf-8")
>>
>> It is, of course, an expensive way of doing nothing, but
>> it at least has the virtue of being good documentation.
>>
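
Spelled out, the "doing nothing" is easy to see: EncodedFile wraps an
already-open file object, decodes everything you write from its data
encoding, and re-encodes it to its file encoding, so with utf-8 on
both sides valid bytes pass straight through (a small sketch, not
something you'd actually want to keep):

    import codecs

    raw = open("foo.xml", "w+b")
    xml_source = codecs.EncodedFile(raw, "utf-8", "utf-8")
    xml_source.write("caf\xc3\xa9")   # already utf-8 encoded bytes
    xml_source.close()

It only earns its keep when the two encodings differ.
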
> OK, I've fiddled around a bit more but I still haven't managed to get it
> to work. I get the fact that it's not the FTP operation that's causing the
> problem, so it must be either the xml.minidom.parse() function (and
> whatever sort of file I give that) or the way that I write my results to
> output files after I've done my DOM processing. I'll post some more
> detailed code:

Please post _all_ of the relevant code. It wastes people's time
when you post incomplete examples. The critical issue is frequently
in the part that you didn't post.

>
> def open_file(file_name):
>    ftp = ftplib.FTP(self.host)
>    ftp.login(self.login, self.passwd)
>
>    content_file = file(file_name, 'w+b')
>    ftp.retrbinary("RETR " + self.path, content_file.write)
>    ftp.quit()
>    content_file.close()
>
>    ## Case 1:
>    #self.document = parse(file_name)
>
>    ## Case 2:
>    #self.document = parse(codecs.open(file_name, 'r+b', "utf-8"))
>
>    # Case 3:
>    content_file = codecs.open(file_name, 'r', "utf-8")
>    self.document = parse(codecs.EncodedFile(content_file, "utf-8",
>    "utf-8"))
>    content_file.close()
>
> In Case1 I get the incorrectly encoded characters.
>
> In Case 2 I get the exception:
> "UnicodeEncodeError: 'ascii' codec can't encode character u'\xe6' in
> position 5208: ordinal not in range(128)"
> when it calls the xml.minidom.parse() function.
>
> In Case 3 I get the exception:
> "UnicodeEncodeError: 'ascii' codec can't encode character u'\xe6' in
> position 5208: ordinal not in range(128)"
> when it calls the xml.minidom.parse() function.

That's exactly what you should expect. In the first case, the file
on disk is encoded as utf-8, and that is apparently what minidom
is expecting.

The documentation shows a simple read; it does not show any
kind of encoding or decoding.
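
Put concretely, the arrangement that should work is your Case 1:
keep the bytes untouched on both sides and let minidom/expat do the
utf-8 decoding itself. A sketch built from your snippet (the
host/login/passwd/path attributes are whatever your class already
holds):

    import ftplib
    from xml.dom.minidom import parse

    def open_file(self, file_name):   # presumably a method of your class
        ftp = ftplib.FTP(self.host)
        ftp.login(self.login, self.passwd)

        # write the raw utf-8 bytes exactly as they arrive over the wire
        content_file = file(file_name, 'w+b')
        ftp.retrbinary("RETR " + self.path, content_file.write)
        ftp.quit()
        content_file.close()

        # no codecs layer here: the parser wants the encoded bytes
        self.document = parse(file_name)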

> Anyway, later on in the program I create a *very* large unicode string
> after doing some playing with the DOM tree. I then write this to a file
> using:
> html_file = codecs.open(file_name, "w+b", "utf8")
> html_file.write(very_large_unicode_string)
>
> The problem could be here?

That should work. The problem, as I said in the first post,
is that whatever program you are using to render the file
to the screen or printer is _not_ treating the file as utf-8
encoded. It either needs to be told that the file is in utf-8
encoding, or you need to get a better rendering program.
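
One way to convince yourself the file really is utf-8, independent
of whatever viewer you use, is to read the raw bytes back and decode
them by hand (a quick check, not production code):

    raw = open(file_name, 'rb').read()
    print repr(raw[:80])      # the actual bytes sitting on disk
    raw.decode('utf-8')       # raises UnicodeDecodeError if it isn't valid utf-8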

Many renderers, including most of the renderers inside
programming tools like file inspectors and debuggers,
assume that the encoding is latin-1 or windows-1252.
They will show funny characters if you read a utf-8
(or any other multi-byte encoded) file with them.
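
The effect is easy to reproduce at the interpreter prompt: take the
same character that appears in your traceback, encode it as utf-8,
and then mis-read the result as latin-1:

    >>> u'\xe6'.encode('utf-8')
    '\xc3\xa6'
    >>> u'\xe6'.encode('utf-8').decode('latin-1')
    u'\xc3\xa6'

Those two bytes render as 'Ã¦', which is exactly the sort of garbage
a latin-1 viewer makes of utf-8 text.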

One trick that sometimes works is to ensure that the first
character is the BOM (byte order mark, or unicode signature).
Properly written Windows programs will use this as an
encoding signature. Unixoid programs frequently won't,
but that's arguably a violation of the Unicode standard.
The BOM is a single unicode character which is three bytes
in utf-8 encoding.
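
If you want to try that, the signature is just u'\ufeff' written
first through the same utf-8 writer (a sketch; whether your viewer
honours it is another question):

    import codecs

    html_file = codecs.open(file_name, 'w+b', 'utf8')
    html_file.write(u'\ufeff')   # BOM / unicode signature: EF BB BF on disk
    html_file.write(very_large_unicode_string)
    html_file.close()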

John Roth

>
> Cheers,
> Richard 



