[Tutor] Opening filenames with unicode characters

Tim Golden mail at timgolden.me.uk
Thu Jun 28 22:17:40 CEST 2012


On 28/06/2012 20:48, James Chapman wrote:
> The name of the file I'm trying to open comes from a UTF-16 encoded
> text file, I'm then using regex to extract the string (filename) I
> need to open.


OK. Let's focus on that. For the moment -- although it might
well be very relevant -- I'm going to ignore the regex side
of things. It's always tricky to portray things like this
because there's such confusion between what characters I
write to represent the data and the data represented by those
characters themselves!

OK, let's adopt a convention whereby I represent the data as
the kind of thing you'd see in a hex editor. This obviously
isn't how it appears in a text file but hopefully it'll be
clear what's going on.

I have a filename £10.txt -- that is the characters:

POUND SIGN
DIGIT ONE
DIGIT ZERO
FULL STOP
LATIN SMALL LETTER T
LATIN SMALL LETTER X
LATIN SMALL LETTER T
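
As a rough sketch, the character names listed above can be recovered with
the standard unicodedata module. The \xa3 escape stands in for the pound
sign so the snippet does not depend on the encoding of the source file;
the variable names are illustrative only.

```python
import unicodedata

# u"\xa3" is POUND SIGN; written as an escape so that the source
# file's own encoding does not matter.
filename = u"\xa310.txt"

# Look up the official Unicode name of each character in turn.
names = [unicodedata.name(ch) for ch in filename]
for name in names:
    print(name)
```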

I have -- prior to your getting there -- placed this in a text
file which I guarantee is UTF16-encoded. For the purposes of
illustration I shall do that in Python code here:

<code>
with open ("filedata.dat", "wb") as f:
   f.write (u"£10.txt".encode ("utf16"))

</code>

The file is named "filedata.dat" and looks like this (per our convention):

ff fe a3 00 31 00 30 00 2e 00 74 00 78 00 74 00
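
The hex-editor view above can be reproduced with a small helper. This is
only a sketch: hex_dump is a made-up name, and the byte order is pinned to
little-endian with an explicit byte order mark, on the assumption that the
dump above came from a little-endian machine (the plain "utf16" codec uses
the platform's native order).

```python
import binascii
import codecs

def hex_dump(data):
    """Return the bytes of `data` as space-separated two-digit hex values."""
    hexed = binascii.hexlify(data).decode("ascii")
    return " ".join(hexed[i:i + 2] for i in range(0, len(hexed), 2))

# Byte order mark (ff fe) followed by the little-endian UTF-16 bytes.
encoded = codecs.BOM_UTF16_LE + u"\xa310.txt".encode("utf-16-le")
print(hex_dump(encoded))
```

The first two bytes are the byte order mark; each character after that
takes two bytes, low byte first.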

I now want to read the contents of that file as a
filename and open the file in question. Here goes:

<code>
#
# Open the file and extract the data as a set of
# bytes into a Python (byte) string.
#
with open("filedata.dat", "rb") as f:
   data = f.read()

#
# Convert the data into a unicode object by decoding
# the UTF16 bytes
#
filename = data.decode("utf16")

# filename is now a unicode object which, depending on
# what your console offers, will either display as
# £10.txt or as \xa310.txt or as something else.

#
# Open that file by passing the unicode object directly
# to Python's file-opening mechanism
#
ten_pound_txt = open (filename, "rb")
print ten_pound_txt.read () # whatever
ten_pound_txt.close ()

</code>
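
For anyone who wants to try the whole round trip without touching real
files, here is a self-contained sketch that works in a temporary
directory. It is a variation rather than the code above: it lets io.open
do the UTF-16 decoding instead of calling .decode by hand, read_filename
is a hypothetical helper, and the .strip() assumes the data file holds
nothing but the filename plus perhaps a trailing newline.

```python
import io
import os
import tempfile

def read_filename(path):
    """Read a UTF-16 text file and return its contents as unicode text."""
    # The "utf-16" codec consumes the byte order mark while decoding.
    with io.open(path, "r", encoding="utf-16") as f:
        return f.read().strip()

workdir = tempfile.mkdtemp()
data_path = os.path.join(workdir, u"filedata.dat")

# Store the filename in a UTF-16 encoded file, as in the post.
with io.open(data_path, "wb") as f:
    f.write(u"\xa310.txt".encode("utf-16"))

# Create the file that the stored filename refers to.
with io.open(os.path.join(workdir, u"\xa310.txt"), "wb") as f:
    f.write(b"ten pounds")

# Read the name back and pass the unicode object straight to open.
filename = read_filename(data_path)
with io.open(os.path.join(workdir, filename), "rb") as f:
    contents = f.read()

print(repr(filename))
print(repr(contents))
```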

I don't know if that makes anything clearer for you, but at
least it gives you something to try out.

The business with the regex clouds the issue: regex can play
a little awkwardly with Unicode, so you'd have to show some
code if you need help there.
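
In the absence of the real code, here is a guess at the general shape:
decode the bytes first, then run the regex over the resulting unicode
text rather than over the raw UTF-16 bytes. The "name=" line format and
the pattern are invented for illustration and are not taken from the
original question.

```python
import re

# Hypothetical file contents: one "name=<filename>" line, UTF-16 encoded.
raw = u"name=\xa310.txt\r\n".encode("utf-16")

# Decode before matching; in the raw bytes every other byte is a zero,
# so a pattern written for text would not match them.
text = raw.decode("utf-16")

# A unicode pattern applied to unicode text; \S+ stops at the line ending.
match = re.search(u"name=(\\S+)", text, re.UNICODE)
filename = match.group(1) if match else None
print(repr(filename))
```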

TJG

