[Python-Dev] Should ftplib use UTF-8 instead of latin-1 encoding?

Alexey G. python-3000 at udmvt.ru
Sat Jan 24 14:28:12 CET 2009


On Fri, Jan 23, 2009 at 11:18:37PM +0100, "Martin v. Löwis" wrote:
> > I don't see how starting with an empty directory helps.  The filename
> > comes from the client, and the FTP server can't know what the actual
> > encoding of that filename is.
> 
> Sure it can. If the client supports RFC 2640, it will send file names
> in UTF-8. If the client does not support RFC 2640, the client must
> restrict itself to 7-bit file names (i.e. ASCII). If the client violates
> the protocol, the server must respond with error 501.

  Perhaps that is true, but only in the world of standards. I remember a
situation where users uploaded files from Windows with names in CP866
encoding to a UNIX-based ftp server, which itself had KOI8-R as the encoding
for LC_CTYPE. Since the administrator was unhappy about being unable to read
the file names correctly, he found and installed a specialized ("russified")
version of the ftp daemon, which had configuration settings specifying the
network encoding and the filesystem encoding.
  So both the ftp daemon and the ftp clients violated the RFC, but the users
and the administrator were happy.
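
  To make the problem concrete, here is a minimal sketch of what the
administrator saw and what such a daemon's two settings amount to. The file
name and the recode() helper are hypothetical, made up for illustration; only
the standard cp866 and koi8_r codecs are real.

```python
# Hypothetical file name typed by a Windows user.
name = "файл.txt"

# Bytes the Windows client puts on the wire (CP866).
wire = name.encode("cp866")

# What the administrator sees when a KOI8-R system shows those bytes.
seen_by_admin = wire.decode("koi8_r")


def recode(raw, network_encoding, filesystem_encoding):
    """Hypothetical helper: translate a name from wire to disk encoding."""
    return raw.decode(network_encoding).encode(filesystem_encoding)


# What a "russified" daemon would store instead.
on_disk = recode(wire, "cp866", "koi8_r")

print(repr(seen_by_admin))       # unreadable characters
print(on_disk.decode("koi8_r"))  # the original name again
```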

  I think we should preserve the ability of an ftp client to download every
file it sees in the listing from the server.
What should be done with user-specified filenames when they cannot be encoded
into ASCII and the server does not support UTF-8, but violates the RFC and
allows 8-bit bytes in file names?

  The ideal ftp client will ask the user which encoding he thinks the filenames
are stored in on the server side and then recode from the user's encoding. It
will also allow the user to try several variants if the first doesn't work. It
will allow the user to download files with names in several different encodings
from the same server in a single ftp session.
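
  The "try several variants" part could look roughly like the sketch below.
The function name candidate_wire_names() and the list of encodings are my
assumptions, not part of any existing library; a real client would send each
candidate to the server in turn until one is accepted.

```python
def candidate_wire_names(name, encodings):
    """Return (encoding, bytes) pairs for each encoding that can hold name.

    Encodings that cannot represent the name are skipped, and encodings
    that produce identical bytes are listed only once.
    """
    candidates = []
    seen = set()
    for enc in encodings:
        try:
            raw = name.encode(enc)
        except UnicodeEncodeError:
            continue  # this encoding cannot express the name at all
        if raw not in seen:
            seen.add(raw)
            candidates.append((enc, raw))
    return candidates


# Hypothetical user input and guesses about the server's encoding.
guesses = ["ascii", "utf-8", "cp866", "koi8_r"]
for enc, raw in candidate_wire_names("файл.txt", guesses):
    print(enc, raw)
```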
  A dumb client will send the filename from the user as bytes,
and will succeed if the user was able to specify the filename verbatim.
Anything in between will make the idea of using Unicode as the character
encoding for filenames absurd, since it will only break the i18n capabilities
of the library.

  If the Python library has the file name encoding hardwired to latin-1 while
arguments can only be Unicode strings, well, a lot of people will not even
notice, since they use only the ASCII subset of UTF-8. But then there will
again be numerous "russification"-like patches to Python and its modules, which
are incompatible with everything but work well in some very specific
situations. This is the evil that was supposed to be defeated by i18n and the
total adoption of Unicode.
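
  A small sketch of why ASCII users would not notice and everyone else would.
It does not call ftplib; it only imitates a library that encodes str arguments
with latin-1, and the file names are hypothetical. The last part shows the
kind of workaround people would end up writing.

```python
ascii_name = "readme.txt"
cyrillic_name = "файл.txt"

# ASCII names give the same bytes under latin-1 and UTF-8,
# so these users never see the hardwired encoding.
same_for_ascii = ascii_name.encode("latin-1") == ascii_name.encode("utf-8")

# A Cyrillic name cannot be encoded with latin-1 at all.
try:
    cyrillic_name.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# The workaround: encode by hand into the server's real encoding
# (assumed here to be KOI8-R), then disguise the bytes as a latin-1
# string so that the library's own encoding step restores them.
smuggled = cyrillic_name.encode("koi8_r").decode("latin-1")
wire = smuggled.encode("latin-1")

print(same_for_ascii, latin1_ok)
print(wire == cyrillic_name.encode("koi8_r"))
```

The workaround functions only because latin-1 maps every byte value to exactly
one character, and the string passed to the library is no longer meaningful
text.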

Alexey G.

