[Tutor] Opening filenames with unicode characters

Thu Jun 28 23:47:23 CEST 2012

Thanks to everyone who responded on this thread; your time is greatly
appreciated.

It appears, however, that my problem is related to the environment. I
sent my original email right before leaving work and have since been
working on a physical machine without any problems. I've copied some
of that code to my remote virtual machine where I'm doing the dev work
and the same example that works on my physical Win7 machine fails on
my virtual Win2008 machine. The Win2008 guest runs under VirtualBox on
a Linux host.

The only remaining question is whether this is a one-off issue,
whether it's related to the virtual machine or whether it's related to
Windows 2008. I guess I'll find out tomorrow.
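
One way to narrow that down is to run the same small report on both
machines and compare the output. The sketch below is only a suggestion:
the function name environment_report is made up for illustration, and
the settings it prints are ones that commonly differ between Windows
installs, not a confirmed cause of the failure described here.

```python
import locale
import platform
import sys


def environment_report():
    """Collect settings that influence how filenames and text are encoded."""
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "filesystem_encoding": sys.getfilesystemencoding(),
        "preferred_encoding": locale.getpreferredencoding(False),
        # stdout may have no encoding when output is redirected
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
    }


for key, value in sorted(environment_report().items()):
    print("%s: %s" % (key, value))
```

If the two reports differ, the differing entries are a reasonable
place to start looking.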

Oh and Tim, you'll be happy to know that regex does not affect the
string in this case. Well, at least not the way I'm using it to
extract data.

--
James

At Thursday, 28/06/2012 on 21:17 Tim Golden wrote:

On 28/06/2012 20:48, James Chapman wrote:
> The name of the file I'm trying to open comes from a UTF-16 encoded
> text file, I'm then using regex to extract the string (filename) I
> need to open.

OK. Let's focus on that. For the moment -- although it might
well be very relevant -- I'm going to ignore the regex side
of things. It's always tricky trying to portray things like this
because there's such confusion between what characters I
write to represent the data and the data represented by those
characters themselves!

OK, let's adopt a convention whereby I represent the data as
the kind of thing you'd see in a hex editor. This obviously
isn't how it appears in a text file but hopefully it'll be
clear what's going on.

I have a filename £10.txt -- that is the characters:

POUND SIGN
DIGIT ONE
DIGIT ZERO
FULL STOP
LATIN SMALL LETTER T
LATIN SMALL LETTER X
LATIN SMALL LETTER T

I have -- prior to your getting there -- placed this in a text
file which I guarantee is UTF16-encoded. For the purposes of
illustration I shall do that in Python code here:

with open ("filedata.dat", "wb") as f:
   f.write (u"£10.txt".encode ("utf16"))

The file is named "filedata.dat" and looks like this (per our
convention):

ff fe a3 00 31 00 30 00 2e 00 74 00 78 00 74 00
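
That byte layout can be checked without a hex editor. The following is
a minimal sketch, not part of the original exchange. It assumes
little-endian output: the plain "utf16" codec uses the machine's native
byte order, so the byte order mark and the "utf-16-le" codec are spelled
out explicitly here to get the same bytes on any platform.

```python
import binascii
import codecs

# POUND SIGN is written as an escape so the encoding of this
# source file does not matter.
text = u"\u00a310.txt"

# BOM (ff fe) followed by the little-endian UTF-16 code units.
data = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# Render the bytes as space-separated hex pairs, hex-editor style.
hex_pairs = binascii.hexlify(data).decode("ascii")
dump = " ".join(hex_pairs[i:i + 2] for i in range(0, len(hex_pairs), 2))
print(dump)
```

Each character takes two bytes, low byte first, which is why every
second byte in the dump is 00 for these characters.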

I now want to read the contents of that file as a
filename and open the file in question. Here goes:

#
# Open the file and extract the data as a set of
# bytes into a Python (byte) string.
#
with open("filedata.dat", "rb") as f:
   data = f.read()

#
# Convert the data into a unicode object by decoding
# the UTF16 bytes
#
filename = data.decode("utf16")

# filename is now a unicode object which, depending on
# what your console offers, will either display as
# £10.txt or as \xa310.txt or as something else.

#
# Open that file by passing the unicode object directly
# to Python's file-opening mechanism
#
ten_pound_txt = open (filename, "rb")
print ten_pound_txt.read () # whatever
ten_pound_txt.close ()
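
For anyone who wants to try the whole sequence in one go, here is a
self-contained sketch of the same steps. It is an illustration under
assumptions rather than the code from the thread: it works in a
throwaway temporary directory, the payload b"hello" and the name
workdir are invented, and it assumes the filesystem can represent the
pound sign in a filename. It uses io.open and print() so that it should
run on Python 2.7 as well as Python 3.

```python
import io
import os
import tempfile

workdir = tempfile.mkdtemp()       # scratch directory for the experiment
target_name = u"\u00a310.txt"      # POUND SIGN, "10.txt"

# Create the file we eventually want to open.
with io.open(os.path.join(workdir, target_name), "wb") as f:
    f.write(b"hello")

# Store its name, UTF-16 encoded, in filedata.dat.
with io.open(os.path.join(workdir, "filedata.dat"), "wb") as f:
    f.write(target_name.encode("utf-16"))

# Read the raw bytes back and decode them; the "utf-16" codec
# consumes the byte order mark and returns a unicode string.
with io.open(os.path.join(workdir, "filedata.dat"), "rb") as f:
    filename = f.read().decode("utf-16")

# Pass the unicode filename straight to the file-opening call.
with io.open(os.path.join(workdir, filename), "rb") as f:
    contents = f.read()

print(repr(contents))
```

The point to notice is that the name is never re-encoded by hand
between decoding it and opening the file.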

I don't know if that makes anything clearer for you, but at
least it gives you something to try out.

The business with the regex clouds the issue: regex can play
a little awkwardly with Unicode, so you'd have to show some
code if you need help there.
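
In the meantime, one pattern that tends to avoid trouble is to decode
the bytes first and only then apply a unicode pattern to the resulting
text. The sketch below is hypothetical: the actual file layout and
regex were not shown in this thread, so the name="..." line format and
the pattern are invented purely for illustration.

```python
import re

# Hypothetical file contents: one line holding a quoted filename,
# encoded as UTF-16 the way the real file is said to be.
raw = u'name="\u00a310.txt"\r\n'.encode("utf-16")

# Searching the raw bytes finds nothing, because UTF-16 puts a
# NUL byte between the ASCII characters of 'name'.
bytes_match = re.search(b'name="([^"]+)"', raw)

# Decode first, then search the unicode text with a unicode pattern.
text = raw.decode("utf-16")
match = re.search(u'name="([^"]+)"', text)
filename = match.group(1)

print(bytes_match is None)
print(repr(filename))
```

The extracted filename is a unicode object and can be handed to open
as it is.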

TJG