[Tutor] how to struct.pack a unicode string?

Albert-Jan Roskam fomcl at yahoo.com
Sun Dec 2 14:27:16 CET 2012


<snip>

> 
> * some encodings are more compact than others (e.g. Latin-1 uses
>   one byte per character, while UTF-32 uses four bytes per
>   character).

I read that UTF-32 performs better for some operations ("UTF-32 advantage: you don't need to decode stored data to the 32-bit Unicode code point for e.g. character by character handling. The code point is already available right there in your array/vector/string.").
http://stackoverflow.com/questions/496321/utf8-utf16-and-utf32
But given that UTF-32 is a memory hog, should one conclude that it's usually not a good idea to use it (especially in Python)?
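The memory cost is easy to measure. A minimal sketch (Python 3 syntax, where every str is already Unicode; the sample strings are taken from the greetings list below):

```python
# Compare how many bytes the same text costs in different Unicode
# encodings. Note that utf-16/utf-32 output includes a 2/4-byte BOM.
samples = {
    "English": "Greetings",
    "Russian": "\u0417\u0434\u0440\u0430\u0432\u0441\u0442\u0432\u0443\u0439\u0442\u0435",
    "Thai": "\u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35",
}
for name, text in samples.items():
    sizes = {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16", "utf-32")}
    print(name, sizes)
```

For plain ASCII text, UTF-32 is four times the size of UTF-8 (plus the BOM); for Thai or Russian, the gap shrinks but UTF-8 still wins. Note also that since Python 3.3 (PEP 393), the interpreter picks the narrowest internal representation per string anyway, so the UTF-32 indexing advantage comes for free internally.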
 
>>  but this does not work (it yields mojibake and tofu output for
>>  some of the languages).
> 
> It would be useful to see an example of this.
> 
> But if you do your encoding/decoding correctly, using the right
> codecs, you should never get mojibake. You only get that when
> you have a mismatch between the encoding you think you have and
> the encoding you actually have.
> 
> 
>>  It's annoying if one needs to know the encoding in which each
>>  individual language should be represented. I was hoping
>>  "unicode-internal" was the way to do it, but this does not
>>  reproduce the original string when I unpack it.. :-(
> 
> Yes, encodings are annoying. The sooner that all encodings other
> than UTF-8 and UTF-32 disappear the better :)

So true ;-)

> The beauty of using UTF-8 instead of one of the many legacy
> encodings is that UTF-8 can represent any character, so you don't
> need to care about the individual language, and it is compact (at
> least for Western European languages).

Later you write "You need a variable-length struct, of course.". Is this because ASCII is a subset of UTF-8?
The thing is, the binary format I am writing (SPSS .sav) uses *fixed* column widths. This means that, even
when I only use the ASCII subset of UTF-8, I still need to assume the worst-case scenario, namely up to four bytes per symbol, right?
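One way to handle a fixed-width column is to size the field for the UTF-8 worst case and pad the encoded bytes. This is only a sketch of the idea (the `pack_fixed` helper and the space padding are my own invention, not the .sav spec), in Python 3 syntax:

```python
import struct

def pack_fixed(text, n_chars, pad=b" "):
    """Pack `text` as UTF-8 into a fixed-width field, reserving the
    worst case of 4 bytes per character (hypothetical helper, not
    the actual .sav record layout)."""
    width = n_chars * 4                      # UTF-8 worst case: 4 bytes/char
    raw = text.encode("utf-8")
    if len(raw) > width:
        raise ValueError("text does not fit in the field")
    # "%ds" packs exactly `width` bytes; pre-pad so we control the filler.
    return struct.pack("%ds" % width, raw.ljust(width, pad))

field = pack_fixed(u"Gr\xfcezi", 8)          # 8-char column -> 32 bytes
assert len(field) == 32
assert field.rstrip(b" ").decode("utf-8") == u"Gr\xfcezi"
```

So yes: with fixed columns and arbitrary Unicode content, you pay for the worst case whether or not the actual text is pure ASCII.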
 
> Why are you using struct for this? If you want to convert Unicode
> strings into a sequence of bytes, that's exactly what the encode
> method does. There's no need for struct.
 
I am using struct to read/write binary data. I created the 'greetings' code to test my program (and my knowledge).
As I said to Peter Otten, both were/are imperfect ;-). struct needs a byte string, not a unicode string, hence I needed to encode
my unicode strings first. I used these languages because I suspected I often get away with errors because 'my' encoding
(cp1252) is fairly forgiving.
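That mismatch is exactly where mojibake comes from: cp1252 happily decodes almost any byte sequence, so a wrong guess produces garbage instead of an error. A small demonstration (Python 3 syntax):

```python
# Mojibake demo: decoding with a codec that doesn't match the one
# used for encoding silently produces wrong characters.
greet = u"Gr\xfcezi"
raw = greet.encode("utf-8")        # \xfc (u-umlaut) becomes two bytes, C3 BC
wrong = raw.decode("cp1252")       # cp1252 misreads them as two characters
print(wrong)                       # "Gr\xc3\xbcezi" rendered, i.e. classic mojibake
assert wrong != greet
assert raw.decode("utf-8") == greet  # the matching codec round-trips fine
```

A one-byte-per-character codec like cp1252 never raises on most inputs, which is why errors can go unnoticed until a non-Western-European string shows up.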
 
> greetings = [
>     ('Arabic', u'\u0627\u0644\u0633\u0644\u0627\u0645\u0020\u0639\u0644\u064a\u0643\u0645', 'cp1256'),
>     ('Assamese', u'\u09a8\u09ae\u09b8\u09cd\u0995\u09be\u09f0', 'utf-8'),
>     ('Bengali', u'\u0986\u09b8\u09b8\u09be\u09b2\u09be\u09ae\u09c1 \u0986\u09b2\u09be\u0987\u0995\u09c1\u09ae', 'utf-8'),
>     ('English', u'Greetings and salutations', 'ascii'),
>     ('Georgian', u'\u10d2\u10d0\u10db\u10d0\u10e0\u10ef\u10dd\u10d1\u10d0', 'utf-8'),
>     ('Kazakh', u'\u0421\u04d9\u043b\u0435\u043c\u0435\u0442\u0441\u0456\u0437 \u0431\u0435', 'utf-8'),
>     ('Russian', u'\u0417\u0434\u0440\u0430\u0432\u0441\u0442\u0432\u0443\u0439\u0442\u0435', 'utf-8'),
>     ('Spanish', u'\xa1Hola!', 'cp1252'),
>     ('Swiss German', u'Gr\xfcezi', 'cp1252'),
>     ('Thai', u'\u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35', 'cp874'),
>     ('Walloon', u'Bondjo\xfb', 'cp1252'),
>     ]
> for language, greet, encoding in greetings:
>     print u"Hello in %s: %s" % (language, greet)
>     for enc in ('utf-8', 'utf-16', 'utf-32', encoding):
>         bytestring = greet.encode(enc)
>         print "encoded as %s gives %r" % (enc, bytestring)
>         if bytestring.decode(enc) != greet:
>             print "*** round-trip encoding/decoding failed ***"
> 
> 
> Any of the byte strings can then be written directly to a file:
> 
> f.write(bytestring)
> 
> or embedded into a struct. You need a variable-length struct, of course.
 
See above. I believe I've got it working for character data already; now I still need to check whether I can also store 
e.g. Chinese metadata in my spss file.
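For anyone following along: Steven's "variable-length struct" remark can be sketched as a length-prefixed record, where each string is stored as a byte count followed by that many UTF-8 bytes. The record layout here is purely illustrative (Python 3 syntax, not the .sav format):

```python
import struct

def pack_record(language, greet):
    """Length-prefixed record: <count><bytes><count><bytes>.
    Hypothetical layout for illustration only."""
    lraw = language.encode("utf-8")
    graw = greet.encode("utf-8")
    return struct.pack("<H%dsH%ds" % (len(lraw), len(graw)),
                       len(lraw), lraw, len(graw), graw)

def unpack_record(data):
    """Read the counts back to know how many bytes to decode."""
    (n,) = struct.unpack_from("<H", data, 0)
    language = data[2:2 + n].decode("utf-8")
    (m,) = struct.unpack_from("<H", data, 2 + n)
    greet = data[4 + n:4 + n + m].decode("utf-8")
    return language, greet

rec = pack_record(u"Thai", u"\u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35")
assert unpack_record(rec) == (u"Thai", u"\u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35")
```

The key point is that the length prefix counts *bytes after encoding*, not characters, since the two differ for any multi-byte encoding.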

> My advice: stick to Python unicode strings internally, and always write
> them to files as UTF-8.


Thanks Steven, I appreciate it! 

