[Tutor] Struct and UTF-16

Mon Oct 3 12:18:46 CEST 2005

Hmm, looking at this, it seems I'm not the only one with this sort of problem.
http://downloads.egenix.com/python/LSM2005-Developing-Unicode-aware-applications-in-Python.pdf

Maybe I will just build a wall around these objects and declare
"none but unicode shall pass."
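
A minimal sketch of what such a wall could look like. It assumes the stored payloads are little-endian UTF-16 without a byte-order mark (which is what the x.payload dump quoted below suggests) and that byte strings coming from pyid3lib are ISO-8859-1. The names to_text, to_payload and from_payload are made up for illustration, and the text_type shim is only there so the snippet runs on both Python 2 and Python 3.

```python
import sys

# Hypothetical shim: the text type is `unicode` on Python 2 and `str` on Python 3.
text_type = unicode if sys.version_info[0] == 2 else str  # noqa: F821


def to_text(value, encoding="iso-8859-1"):
    """Gate in: return text, decoding byte strings with the assumed encoding."""
    if isinstance(value, text_type):
        return value
    return value.decode(encoding)


def to_payload(value):
    """Gate out: encode text as little-endian UTF-16 bytes, with no BOM."""
    return to_text(value).encode("utf-16-le")


def from_payload(raw):
    """Decode bytes read from the file back into text."""
    return raw.decode("utf-16-le")
```

Everything inside the wall then only ever sees unicode objects; the encoding assumptions live in these three functions and nowhere else.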

On 10/3/05, Liam Clarke <ml.cyresse at gmail.com> wrote:
> OK, one last kick.
>
> So, using
>
> val = unicode(value)
> self._slaveMap[attr].setPayload(val.encode("UTF-16"))
>
> I can stick normal strings in happily. Of course, as you mentioned,
> Kent, this leaves me vulnerable if the string's encoding differs from
> sys.getdefaultencoding().
>
> Other than directly from the user, the most likely source of data will
> be from pyid3lib, which for the time being assumes all strings are
> ISO-8859-1.
>
> http://pyid3lib.sourceforge.net
>
> Erk. Talk about big design up front. Would you recommend a different
> method of dealing with this? Basically, most of the strings in the
> database are UTF-16, and I just need to make them readable, and make
> sure any of the strings going in are UTF-16 as well.
>
> Alternatively, I've thought about just cycling through the 100 or so
> codecs until I don't get any UnicodeDecodeErrors, but that's no
> guarantee that it'll be human readable...oh dear.
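
To illustrate why a missing UnicodeDecodeError proves little: ISO-8859-1 maps every byte value to a character, so it never fails, and the same bytes can decode "successfully" under several codecs into different text. A small sketch with made-up sample bytes (runs on Python 2 and 3):

```python
# UTF-8 bytes for u"caf\xe9" (the word "cafe" with an acute accent on the e)
raw = u"caf\xe9".encode("utf-8")

# Decode the same bytes with three codecs; none of them raises.
results = {}
for codec in ("utf-8", "iso-8859-1", "cp1252"):
    results[codec] = raw.decode(codec)

# Only the UTF-8 result is the original text; the other two are mojibake.
# ISO-8859-1 accepts every possible byte value, so it can never signal a wrong guess.
all_bytes = bytes(bytearray(range(256)))
decoded_all = all_bytes.decode("iso-8859-1")
```

So the first codec that happens not to raise is not necessarily the right one; the encoding has to come from knowledge of the data source.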
>
> Thanks for any assistance offered.
>
> Liam Clarke
>
> On 10/3/05, Liam Clarke <ml.cyresse at gmail.com> wrote:
> > Hi,
> >
> > If I can just beat this horse one more time, can I just get
> > confirmation that I'm going about this the right way?
> >
> > I have a base object which reads the unicode string as bytes, like so
> > (this ignores all but the important bits).
> >
> > import struct
> > class Mhod:
> >     def __init__(self, f):
> >         # struct.unpack() always returns a tuple; take its single item
> >         self.payload = struct.unpack("36s", f.read(36))[0]
> >
> > Which in turn, is utilised in a Song object, which works like this -
> >
> > class Song(object):  # new-style class, so the property setter is honoured
> >     def __init__(self, mhod):
> >         self.mhod = mhod
> >     def gLoc(self):
> >         # Decode the raw payload on each read
> >         return unicode(self.mhod.payload, "UTF-16")
> >     def sLoc(self, value):
> >         #Need to coerce data into UTF-16 here
> >         self.mhod.payload = value.encode("UTF-16")
> >
> >     location = property(gLoc, sLoc)
> >
> > If I were to do a
> >
> > >>> x = Mhod(open("test", "rb"))
> > >>> y = Song(x)
> >
> > I get
> >
> > >>> x.payload
> > ':\x00i\x00P\x00o\x00d\x00_\x00C\x00o\x00n\x00t\x00r\x00o\x00l
> > \x00:\x00M\x00u\x00s\x00i\x00c\x00:\x00F\x004\x004\x00:\x00L
> > \x00W\x00B\x00R\x00.\x00m\x00p\x003\x00' #Line breaks added.
> >
> > >>> y.location
> > u':iPod_Control:Music:F44:LWBR.mp3'
> >
> > Which is what I'm after. What I'm struggling with is coercing the
> > string that's being passed to sLoc() into UTF-16, and actually
> > creating any form of unicode string at all without using
> >
> > >>> foo = u'Monkies!'
> >
> > Which I'm sure is going to be in UTF-8, just to spite me.
> >
> > So far, the best I've come up with is -
> >
> > >>> foo = unicode("Hi Bob!".encode("UTF-16"), "UTF-16")
> >
> > Which, as you mention above, is likely to cause me errors. And
> > apparently "Hi Bob!" is an 8 bit string encoded in UTF-16...
> >  *sigh* I suppose I could go the XP route and expect any further users
> > to just deal with it and pass in a UTF-16 string, but there's got to
> > be a simple way to handle it, and I'm not having too much luck with
> > this.
> >
> > I've been working from the below document, if anyone can recommend
> > something further, I'd much appreciate it.
> >
> > http://www.amk.ca/python/howto/unicode
> >
> > Regards,
> >
> > Liam Clarke
> > On 10/3/05, Liam Clarke <ml.cyresse at gmail.com> wrote:
> > > Thanks Kent,
> > >
> > > My first time dealing with Python and unicode vs 'normal' strings, I
> > > do look forward to Python 3.0... at the moment I'm just trying to
> > > understand how to use UTF-16.
> > >
> > > Basically, I have data which is coming straight from struct.unpack()
> > > and it's a UTF-16 string, and I'm just trying to get my head around
> > > dealing with the data coming in from struct, and putting my data out
> > > through struct.
> > >
> > > It doesn't help overly that struct considers all strings to consist of
> > > one byte per char, whereas UTF-16 is two. And I was having trouble as
> > > to how to write UTF-16 stuff out properly.
> > >
> > > But, if I understand it correctly, I could use
> > >
> > > j = u"some unicode string"
> > > out = j.encode("UTF-16")
> > > pattern = "%ds" % len(out)
> > > struct.pack(pattern, out)
> > >
> > > without too much difficulty.
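
A rough sketch of that round trip. struct's "s" format counts bytes, so the count has to come from the length of the encoded data, not from the number of characters. The 4-byte length prefix and the names pack_utf16 and unpack_utf16 are assumptions for illustration, not the real file layout, and "utf-16-le" is used in place of "UTF-16" so that no byte-order mark is written.

```python
import struct


def pack_utf16(text):
    """Pack text as a 4-byte little-endian length followed by UTF-16-LE bytes."""
    out = text.encode("utf-16-le")
    return struct.pack("<I%ds" % len(out), len(out), out)


def unpack_utf16(data):
    """Reverse of pack_utf16: read the byte count, then decode that many bytes."""
    (size,) = struct.unpack_from("<I", data, 0)
    (raw,) = struct.unpack_from("%ds" % size, data, 4)
    return raw.decode("utf-16-le")


packed = pack_utf16(u"Hi")   # b'\x04\x00\x00\x00H\x00i\x00'
```

Note that two characters became four payload bytes; the format string is built from len(out) for exactly that reason.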
> > >
> > > Regards,
> > >
> > > Liam Clarke
> > >
> > > On 10/3/05, Kent Johnson <kent37 at tds.net> wrote:
> > > > Liam Clarke wrote:
> > > > > What's the difference between
> > > > >
> > > > > x = "Hi"
> > > > > y = x.encode("UTF-16")
> > > > >
> > > > > and
> > > > >
> > > > > y = unicode(x, "UTF-16")
> > > >
> > > > They are more-or-less opposite.
> > > >
> > > > encode() converts away from unicode. (Think of unicode as the 'normal' format, anything else is 'encoded'.) Normally it is used on a unicode string, not a byte string. It means, "interpret this string as unicode, then convert it to an encoded byte string using the given encoding".
> > > >
> > > > When you encode a non-unicode string (like "Hi"), the string is first converted to unicode (decoded) using sys.getdefaultencoding(), then encoded using the supplied encoding. So
> > > > 'Hi'.encode('utf-16')
> > > > is the same as
> > > > 'Hi'.decode(sys.getdefaultencoding()).encode('utf-16')
> > > >
> > > > In either case, the result is a string in UTF-16 encoding:
> > > >  >>> 'Hi'.encode('UTF-16')
> > > > '\xff\xfeH\x00i\x00'
> > > >  >>> 'Hi'.decode(sys.getdefaultencoding()).encode('utf-16')
> > > > '\xff\xfeH\x00i\x00'
> > > >
> > > > Note that the utf-16 codec puts a byte-order mark ('\xff\xfe') in the output; then 'H' becomes 'H\x00' and 'i' becomes 'i\x00'.
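
The byte-order mark matters when a file format expects bare UTF-16 data, as the payload dump elsewhere in this thread (no leading '\xff\xfe') suggests. A short sketch of the difference; the explicit-endian codecs write no BOM, while the byte order chosen by plain "utf-16" depends on the machine:

```python
import codecs

text = u"Hi"

le = text.encode("utf-16-le")      # b'H\x00i\x00', no BOM
be = text.encode("utf-16-be")      # b'\x00H\x00i', no BOM
with_bom = text.encode("utf-16")   # BOM first, then native byte order

# Decoding with "utf-16" consumes the BOM ...
round_trip = with_bom.decode("utf-16")

# ... but decoding BOM-prefixed data with an explicit-endian codec keeps it
# as a U+FEFF character at the front of the string.
kept_bom = (codecs.BOM_UTF16_LE + le).decode("utf-16-le")
```

If the data on disk has no BOM, encoding with "utf-16-le" (or "utf-16-be") and decoding with the same codec keeps the bytes symmetrical in both directions.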
> > > >
> > > > Because sys.getdefaultencoding() is used to convert to unicode, you will get an error if the original string cannot be decoded with this encoding:
> > > >
> > > >  >>> '\xe3'.encode('utf-16')
> > > > Traceback (most recent call last):
> > > >   File "<stdin>", line 1, in ?
> > > > UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0: ordinal not in range(128)
> > > >
> > > >
> > > > What about unicode('Hi', 'utf-16')? This doesn't do anything useful:
> > > >  >>> unicode('Hi', 'UTF-16')
> > > > u'\u6948'
> > > >
> > > > unicode('Hi', 'utf-16') means the same as 'Hi'.decode('utf-16'). In this case we are saying, "Interpret this string as an encoded byte string in the given encoding, and convert it to a unicode string." Since 'Hi' is not, in fact, a byte string encoded in UTF-16, the results are not very useful.
> > > >
> > > >
> > > > To summarize:
> > > > If you have an encoded byte string and you want a unicode string, use str.decode() or unicode()
> > > >
> > > > If you have a unicode string and you want an encoded byte string, use unicode.encode().
> > > >
> > > > If you are using str.encode() you probably haven't thought through your problem completely and you will likely get UnicodeDecodeErrors when you have non-ASCII data.
> > > >
> > > >
> > > > If you are writing a unicode-aware application, a good strategy is to keep all strings internally as unicode and to convert to and from the required encodings at the boundaries.
> > > >
> > > > Kent
> > > >
> > > >
> > >
> >
>